Paper Insights - The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
This influential paper from Google, presented at VLDB 2015, is a landmark in data engineering. It was authored by Tyler Akidau, a distinguished engineer at Snowflake and a former Google employee. Akidau's work on MillWheel significantly influenced Apache Flink, while his Dataflow Model laid the groundwork for Apache Beam. Notably, Google Cloud Dataflow is a managed service for running Apache Beam pipelines.
For a deeper understanding, I recommend a valuable YouTube playlist that effectively explains the core ideas presented in this paper.
Motivating Example
This paper focuses on streaming systems. For a better understanding of the context, I recommend reviewing my previous post, Paper Insights - Apache Flink™: Stream and Batch Processing in a Single Engine, which explains the key differences between batch and stream processing.
In a streaming system, events arrive continuously and are processed in an ongoing manner. The core concept of this paper is to process these events based on their "event time" – the actual time when the event occurred.
Challenges of Event Time Processing
Processing events based on their event time often leads to out-of-order processing. This is because events may not arrive at the processing system in the same order as they were generated.
Example:
Consider a stream of logs generated by end-user devices. To process these logs in batches, we can define a window (e.g., 2 minutes). All events arriving within this window are collected and processed together. However, we have multiple notions of time:
- Processing Time: The time when a log reaches the processing system.
- Event Time: The actual time when the log was generated.
Processing events based on event-time windows is complex. With billions of devices generating logs, it is difficult to determine when all events within a specific event-time window have arrived at the processing system. For instance, if we want to process all events generated before 10:00 AM, we need to know when all events from that timeframe have been received.
Watermarks:
To address this challenge, the concept of "watermarks" is introduced. A watermark is a control message sent to the system to indicate that all events generated before a certain time are likely to have arrived. For example, a watermark at 10:01 AM may suggest that all events generated before 10:00 AM have probably been received, allowing the system to proceed with processing the batch.
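To make this concrete, here is a minimal Python sketch of how a watermark can drive window completion. The stream format, the function name, and the 2-minute window width are illustrative assumptions, not the paper's API:

```python
from collections import defaultdict

def process_with_watermark(stream):
    """Buffer events into 2-minute event-time windows and emit a window
    only once a watermark passes its end. Timestamps are in minutes.
    (Illustrative sketch; names and window width are assumptions.)"""
    WINDOW = 2
    windows = defaultdict(list)  # window start -> buffered events
    emitted = []
    for kind, t, payload in stream:
        if kind == "event":
            start = (t // WINDOW) * WINDOW
            windows[start].append(payload)
        else:  # kind == "watermark": events before time t have (probably) arrived
            for start in sorted(w for w in windows if w + WINDOW <= t):
                emitted.append((start, windows.pop(start)))
    return emitted
```

Note that in this sketch, an event with event time 1 arriving after the `("watermark", 2, ...)` message would simply be dropped; this is exactly the late-event limitation discussed next.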
Limitations of Watermarks:
Even with watermarks, there's still a possibility of "late" events – events that arrive after their expected event-time based processing window has closed. For instance, a log generated at 9:59 AM might arrive at the system at 10:05 AM. Since the watermark for the 10:00 AM window has already passed, the system will be forced to ignore this late event.
Apache Flink also relies on watermarks and shares a similar limitation: by default, events that arrive after the watermark has passed their window are dropped.
The Dataflow Model:
To address the issue of late events, the Dataflow model provides mechanisms to handle them appropriately. This ensures that the impact of late events is accurately reflected in the system's processing, even though they arrive after their designated batch has been processed.
Dataflow
The Dataflow model represents a novel approach to stream processing, offering a unique set of advantages:
- Event Time Processing: Processes events based on their actual occurrence time, not just their arrival time.
- Fault Tolerance: Resilient to system failures, ensuring data integrity and continuous processing.
- Eventual Correctness: A novel idea that allows for controlled trade-offs between processing latency and accuracy.
- Scalability: Designed to handle massive volumes of data with high throughput.
- Low Latency: Achieves minimal processing delays for time-sensitive applications.
The core of the Dataflow model lies in its ability to flexibly adjust the balance between latency and correctness.
Event Time vs. Processing Time
- Event Time: The time at which the event actually occurred.
- Processing Time: The time at which the event is processed by the system.
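A tiny Python illustration of the two notions of time (the field names are assumptions): ordering a stream by arrival does not yield event-time order.

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_time: int       # minute at which the event occurred on the device
    processing_time: int  # minute at which it reached the processing system

# Arrival order (by processing time) differs from occurrence order (by event time).
events = [
    Event(event_time=3, processing_time=10),
    Event(event_time=1, processing_time=11),  # occurred first, arrived later
    Event(event_time=2, processing_time=12),
]

arrival_order = [e.event_time for e in events]     # out of event-time order
occurrence_order = sorted(arrival_order)           # the order events happened
```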
Challenges of Out-of-Order and Late Events
Network and device delays mean events often arrive out of order relative to their event times. Late events, those arriving significantly after their expected time, complicate the situation further, making it challenging to maintain the correct processing sequence.
For most real-world applications, it's common to encounter out-of-order, late events. Effectively handling these situations is critical for ensuring the accuracy and reliability of the stream processing system.
Watermarks
Watermarks are mechanisms (either internal or external to the system) that signal the system's knowledge of event arrival. Watermarks always correspond to event time. They represent specific points in the event time space, indicating that all events occurring before that point are believed to have arrived at the system.
It's important to note that watermarks may not always be accurate reflections of event arrival. Events can arrive late, leading to out-of-order arrivals. Despite this, the watermark itself must always increase monotonically, ensuring that the system's understanding of event arrival progresses forward in event time.
The watermark line progresses monotonically as events arrive. Events that arrive after their corresponding watermark is established are considered "late" events.
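A short sketch of this behavior (the function name and input format are assumptions): the watermark only moves forward, and events whose timestamps fall behind it are labeled late.

```python
def classify_events(arrivals):
    """Track a monotonically increasing watermark and flag late events.
    Each arrival is (event_time, reported_watermark). An event is late
    if the watermark has already passed its event time when it arrives.
    Illustrative sketch only."""
    watermark = float("-inf")
    labels = []
    for event_time, reported in arrivals:
        labels.append("late" if event_time < watermark else "on-time")
        # The watermark never regresses, even if a source reports a lower value.
        watermark = max(watermark, reported)
    return labels
```

In the last arrival of the test below, a reported watermark of 7 does not pull the watermark back from 8; monotonicity is preserved.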
Windows and Window Panes
Windows are defined as segments or regions within the event time space. They can be:
- Overlapping: Windows can partially overlap with each other.
- Non-overlapping: Windows have no overlap.
- Aligned: All windows cover the same event time bounds across all event keys.
- Unaligned: Window event time bounds may vary across different event keys.
Types of Windows:
- Fixed Windows (Tumbling Windows): Non-overlapping windows with a fixed duration.
- Sliding Windows: Overlapping windows with a fixed duration, sliding forward at regular intervals.
- Session Windows: Capture bursts of activity separated by gaps of inactivity for a particular subset of the data (e.g., per key); inherently unaligned.
The paper provides a detailed algorithm for creating windows from event streams in section 2.2. I will not delve into the specifics here, as the paper itself offers a clear explanation with illustrative examples.
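As a flavor of that algorithm, here is a hedged Python sketch of window assignment for the two aligned window types (function names and integer timestamps are assumptions; session windows additionally require a merge step, which is omitted here):

```python
def fixed_windows(event_time, size):
    """Assign an event to its single fixed (tumbling) window [start, start + size)."""
    start = (event_time // size) * size
    return [(start, start + size)]

def sliding_windows(event_time, size, period):
    """Assign an event to every sliding window of width `size` that contains it,
    with a new window starting every `period`. Sketch of window assignment."""
    windows = []
    # Start from the last window beginning at or before the event,
    # then step backwards by `period` while the window still covers the event.
    start = (event_time // period) * period
    while start + size > event_time:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)
```

With `size=4` and `period=2`, every event lands in exactly `size / period = 2` overlapping windows.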
Window Panes and Processing
A window of events in event time space may be processed multiple times due to the arrival of late events. Each processing instance of a window is referred to as a "pane".
Triggering Window Pane Processing (Vertical Axis Bounds):
To initiate the processing of a window pane, specific triggers are required:
- Processing Time Intervals: Trigger processing at regular intervals (e.g., every minute).
- Event Count: Trigger processing when a certain number of events have arrived.
Triggering Window Processing (Horizontal Axis Bounds):
To determine when a window is complete in event time, divide the event-time space into intervals and use watermarks to track the progress of event time, identifying when a specific point in event time has likely been passed.
The processing machine knows the processing time from its system clock, but it lacks direct knowledge of event-time progression. Watermarks are therefore crucial for tracking the advancement of event time within the system.
Triggering the Rectangle
By combining triggers in both the horizontal (processing time) and vertical (event time) dimensions, the system can effectively determine when a rectangular window is ready for processing.
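A minimal sketch of combining the two trigger axes (the message format and the "early"/"final" pane labels are assumptions): a processing-time tick fires an early pane with whatever has arrived so far, and a watermark past a window's end fires its final pane and closes it.

```python
from collections import defaultdict

def run_with_triggers(stream, size):
    """Fire panes for fixed event-time windows of width `size` on two triggers:
    ("tick",) is a processing-time trigger, ("watermark", t) an event-time one.
    Illustrative sketch; the stream format is an assumption."""
    windows = defaultdict(list)   # window start -> events buffered since last pane
    panes = []                    # (window_start, label, events)
    for msg in stream:
        if msg[0] == "event":
            _, t, value = msg
            windows[(t // size) * size].append(value)
        elif msg[0] == "tick":    # processing-time trigger (vertical axis)
            for start in sorted(windows):
                if windows[start]:
                    panes.append((start, "early", windows[start]))
                    windows[start] = []   # discarding mode: clear after firing
        else:                     # ("watermark", t): event-time trigger (horizontal axis)
            _, t = msg
            for start in sorted(s for s in windows if s + size <= t):
                panes.append((start, "final", windows.pop(start)))
    return panes
```

Here the tick fires an early pane for window [0, 2), and the watermark later fires the final pane containing only the events that arrived after the tick.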
The paper provides great examples in section 2.4, along with API usage, to explain how the windowing mechanism works for different window types.
Incremental Processing
Since the same window may have multiple panes due to late events, the system needs an efficient mechanism to handle these incremental updates. Three common approaches are:
- Discarding: When a pane fires, discard the window's buffered contents; each new pane processes only the events that arrived since the previous trigger. This is the most efficient method, but it leaves downstream consumers to combine panes themselves.
- Accumulating: Accumulate all events across all panes of the window and process them together. This approach is suitable only if the processing logic overwrites previous results with the latest known value.
- Accumulating and Retraction: This two-step process involves first retracting (or reversing the effect) of the previously processed panes and then processing the new pane, which now includes all accumulated events for the window. This method is powerful but requires that the consumers of the processed results can effectively reverse the impact of previous processing. This is not feasible if the processing results have already been committed as part of a transaction.
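The three modes can be illustrated with a simple sum over one window's successive panes (the function and mode names are assumptions): with raw pane values [3, 4] followed by a late pane [5], the three modes emit different output sequences.

```python
def panes_output(pane_values, mode):
    """Return what each pane firing emits for a sum aggregation under the
    three incremental-processing modes described above. Illustrative sketch."""
    outputs = []
    total = 0
    for values in pane_values:
        pane_sum = sum(values)
        if mode == "discarding":
            outputs.append(pane_sum)       # only the new events' contribution
        elif mode == "accumulating":
            total += pane_sum
            outputs.append(total)          # a refined running total
        elif mode == "accumulating_retraction":
            if outputs:                    # a previous pane was emitted: retract it
                outputs.append(-total)
            total += pane_sum
            outputs.append(total)          # then emit the corrected total
    return outputs
```

Under retraction, a consumer summing the emitted stream (7, -7, 12) ends up with the correct value 12, even though the first pane's result had already been consumed.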
Related Systems
Google Dataflow: The concepts presented in this paper have been materialized in Google Cloud Dataflow, a powerful stream and batch processing service built upon the foundations of FlumeJava and MillWheel.
- FlumeJava (2010): An early batch processing system built on the MapReduce framework. It provides a set of APIs, specifically immutable parallel collections, for writing MapReduce programs. FlumeJava acts as a compiler framework, optimizing user-defined MapReduce applications.
- MillWheel (2013): A low-latency stream processing engine where user-defined logic is represented as a directed graph. Each node in the graph represents a processing step. MillWheel guarantees exactly-once processing semantics.
- Dataflow (2013): A unified framework that seamlessly integrates batch processing (leveraging MapReduce) and stream processing (utilizing MillWheel). It offers a simplified programming model and high-level abstractions. Dataflow is open-sourced as Apache Beam and is also available as a managed service on GCP as Cloud Dataflow.
Google Query Engines: Google also has a suite of powerful query engines that use these data processing systems under the hood.
Key Technologies Outside Google:
- Hadoop: A foundational framework for distributed storage and processing of large datasets.
- Pig: A high-level language for expressing dataflow programs that execute on Hadoop. (Pig is to Hadoop as FlumeJava is to MapReduce.)
- Hive: Built on top of Hadoop, enabling SQL-like data querying and analysis.
- Spark: A fast and general-purpose cluster computing system that extends the MapReduce model.
- Spark Streaming: An extension of Spark for stream processing, primarily focusing on micro-batching.
- Apache Storm: A distributed stream processing platform that provides at-least-once processing guarantees.
Paper Review
This paper does not introduce a novel system architecture for data processing. Instead, it leverages existing systems and introduces a new concept for achieving a balance between correctness and latency in data processing.
I initially found the paper's concepts challenging to grasp. The most difficult aspect was comprehending the problem statement. However, once I fully understood the problem, the rest of the paper became much easier to follow. I highly recommend this paper to anyone interested in Big Data.