Real-time streaming data aggregation

apache-storm bigdata apache-spark

Keshore Durairaj · 8 years ago

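Can someone explain how to aggregate real-time streaming data using big data technologies such as Storm and Spark? Since the data is continuously flowing, computing over a stream seems meaningless.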

2 answers · most recent 8 years ago

Jungtaek Lim · 8 years ago

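Most streaming frameworks support "windowing", which collects tuples (events) within a window and presents them as aggregated state. Tumbling and sliding windows are widely supported, with the window measured either by count (of tuples) or by time.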
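To make the window types concrete, here is a minimal sketch in plain Scala collections rather than in Storm or Spark. The Event type and the WindowSketch helpers are hypothetical names introduced only for illustration, and the time-based helper assumes non-negative timestamps in milliseconds.

```scala
// Hypothetical event type for illustration: event time in milliseconds plus a value.
final case class Event(timestampMs: Long, value: Int)

object WindowSketch {
  // Count-based tumbling window: non-overlapping groups of `size` tuples.
  // The last group may hold fewer than `size` tuples.
  def countTumbling(events: Seq[Event], size: Int): List[Int] =
    events.grouped(size).map(_.map(_.value).sum).toList

  // Count-based sliding window: `size` tuples per window, advancing by `slide` tuples,
  // so consecutive windows overlap whenever slide < size.
  def countSliding(events: Seq[Event], size: Int, slide: Int): List[Int] =
    events.sliding(size, slide).map(_.map(_.value).sum).toList

  // Time-based tumbling window: each event falls into exactly one window,
  // keyed by the window's start time (assumes non-negative timestamps).
  def timeTumbling(events: Seq[Event], sizeMs: Long): Map[Long, Int] =
    events
      .groupBy(e => e.timestampMs / sizeMs * sizeMs)
      .map { case (start, es) => start -> es.map(_.value).sum }
}
```

A real streaming framework evaluates these windows continuously as tuples arrive; the sketch only shows which tuples end up in which window.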

You can refer to the following links to understand the concept of windows:

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

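With a window you can compute an aggregate over the tuples from the last N minutes (or seconds, hours, and so on). You may feel this looks like batch processing, and it does: you could also push the tuples to external storage and run the aggregation with a batch framework.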

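In general, aggregation in a batch framework will be more efficient (aggregation is a batch-oriented operation), but on-the-fly aggregation in a streaming framework needs no external storage (provided the window fits in memory) and no additional batch framework.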
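One way to picture on-the-fly aggregation is to keep only a running aggregate per open window instead of the raw tuples. The sketch below is plain Scala, not the API of Storm or Spark; Agg, IncrementalAggregator, add and flush are hypothetical names, and it assumes tumbling windows with non-negative millisecond timestamps.

```scala
import scala.collection.mutable

// Hypothetical running aggregate: only a count and a sum are kept per window.
final case class Agg(count: Long, sum: Long)

// Keeps one Agg per open tumbling window instead of buffering raw tuples.
final class IncrementalAggregator(windowMs: Long) {
  // Open windows keyed by start time, kept in ascending order.
  private val open = mutable.TreeMap.empty[Long, Agg]

  // Fold a single event into the aggregate of the window it falls into.
  def add(timestampMs: Long, value: Long): Unit = {
    val start = timestampMs / windowMs * windowMs
    val prev = open.getOrElse(start, Agg(0L, 0L))
    open.update(start, Agg(prev.count + 1, prev.sum + value))
  }

  // Emit and drop every window that ended at or before `nowMs`.
  def flush(nowMs: Long): List[(Long, Agg)] = {
    val closed = open.takeWhile { case (start, _) => start + windowMs <= nowMs }.toList
    closed.foreach { case (start, _) => open.remove(start) }
    closed
  }
}
```

Memory use here grows with the number of open windows rather than with the number of tuples, which is why no external storage is needed as long as that state fits in memory.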

Mahesh Chand · 8 years ago

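In Spark, you can aggregate streaming data with window. We first group the data by window, specifying the time column and the window duration. Spark accumulates the data for that duration, and we then apply the aggregation to the grouped data. For example: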

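import org.apache.spark.sql.functions.window // needed for window(...) outside spark-shell
import spark.implicits._ // assumes a SparkSession named spark is in scope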

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 seconds"),
  $"word"
).count()
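
What the grouping above computes can be sketched over a finite batch in plain Scala, without Spark. WordEvent and WindowedCountSketch are hypothetical names for illustration; the sketch assumes 10-second tumbling windows aligned to multiples of 10 seconds and non-negative millisecond timestamps.

```scala
// Hypothetical record mirroring the schema above: event time in ms plus a word.
final case class WordEvent(timestampMs: Long, word: String)

object WindowedCountSketch {
  // 10-second tumbling window, matching the "10 seconds" duration in the query.
  val WindowMs: Long = 10000L

  // Returns (windowStartMs, word) -> count for a finite batch of events.
  def windowedCounts(events: Seq[WordEvent]): Map[(Long, String), Int] =
    events
      .groupBy(e => (e.timestampMs / WindowMs * WindowMs, e.word))
      .map { case (key, es) => key -> es.size }
}
```

The streaming query maintains the same kind of per-(window, word) counts, but updates them incrementally as new rows arrive instead of recomputing them from a full batch.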

For a complete understanding of streaming aggregation, refer to [link not preserved].
