CS294 Paper Review Blog: Review of "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing"

As with the other two papers we've read this week, this one addresses a real problem: the need for better streaming systems as stream processing becomes increasingly important and increasingly distributed.

This paper focuses on the programming model (the "Dataflow Model") rather than on a specific implementation. The main idea is what the Dataflow Model is able to express: primarily windowing (including session windowing) and triggering, which lets the application decide how to handle late data and when to output results.
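To make session windowing concrete, here is a minimal sketch in plain Python. It is my own illustration, not the paper's API: the function name `session_windows`, the `gap` parameter, and the `(key, event_time)` input shape are all assumptions. Each event starts out in its own small window `[t, t + gap)`, and overlapping windows for the same key are merged into one session.

```python
from collections import defaultdict


def session_windows(events, gap):
    """Group (key, event_time) pairs into per-key session windows.

    Hypothetical helper for illustration only. Each event is first given
    a proto-window [t, t + gap); overlapping proto-windows for the same
    key are then merged into a single session.
    """
    by_key = defaultdict(list)
    for key, t in events:
        by_key[key].append(t)

    sessions = {}
    for key, times in by_key.items():
        merged = []
        # Sorting by event time means arrival order does not affect the result.
        for t in sorted(times):
            if merged and t < merged[-1][1]:
                # Overlaps the current session: extend its end if needed.
                merged[-1][1] = max(merged[-1][1], t + gap)
            else:
                # Gap exceeded: start a new session.
                merged.append([t, t + gap])
        sessions[key] = [tuple(w) for w in merged]
    return sessions


# Events arrive out of order; key "a" has two bursts of activity.
events = [("a", 10), ("a", 50), ("a", 12), ("b", 11)]
result = session_windows(events, gap=5)
```

Note that this batch-style version sees all events at once. The harder part in a streaming setting, and the part triggers are meant to address, is deciding when a session can be reported even though more data might still arrive.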

This model was motivated by the desire for clear semantics that can handle a very wide variety of use cases. The authors had encountered cases where existing models were not well suited to expressing certain pipelines, and in particular found that no single model was able to express all of the pipelines they wanted.

The paper very clearly addresses fundamental trade-offs among correctness, latency, and cost. Specifically, it does not resolve them in one direction within the model, but leaves them tunable for different pipelines. The authors design the model to support as much of the spectrum as possible: emit results very greedily and fix them later if needed, wait until all data has arrived before emitting anything, or anywhere in between. One trade-off I see that they do not discuss is the abstraction level of the model itself. They specifically design it to ignore the distinction between batch, micro-batch, and streaming execution, which gives the programmer the simplicity of treating them all the same. As with any higher-level abstraction, though, this may come with penalties, e.g. in performance and optimizability.
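The two ends of that spectrum can be sketched with a toy simulation. This is only an illustration under my own assumptions, not the paper's API: `run_window`, the trigger names `"every_element"` and `"at_watermark"`, and the input encoding are all made up for the example. A single window sums values; one strategy emits a refined result after every record, the other emits once when the watermark passes the end of the window and drops anything that shows up later.

```python
def run_window(arrivals, window_end, trigger):
    """Sum values for one window [0, window_end) under a given trigger.

    Hypothetical helper for illustration only.
    arrivals: list of (kind, payload) in processing-time order, where
      kind "data" has payload (event_time, value) and
      kind "watermark" has payload = event time the input is believed
      to be complete up to.
    trigger: "every_element" emits the running sum after each record;
             "at_watermark" emits once when the watermark reaches
             window_end, then drops late records.
    Returns the list of emitted results, in order.
    """
    total = 0
    emitted = []
    closed = False
    for kind, payload in arrivals:
        if kind == "data":
            event_time, value = payload
            if event_time >= window_end or closed:
                # Either belongs to a later window, or is late data
                # arriving after this window was already finalized.
                continue
            total += value
            if trigger == "every_element":
                emitted.append(total)
        elif kind == "watermark" and trigger == "at_watermark" and not closed:
            if payload >= window_end:
                emitted.append(total)
                closed = True
    return emitted


# The last record has event time 2 but arrives after the watermark passed 10.
arrivals = [
    ("data", (1, 5)),
    ("data", (3, 7)),
    ("watermark", 10),
    ("data", (2, 4)),
]
greedy = run_window(arrivals, window_end=10, trigger="every_element")
patient = run_window(arrivals, window_end=10, trigger="at_watermark")
```

Here the greedy strategy produces three outputs and eventually reaches the full sum, at the cost of downstream consumers having to handle revisions. The watermark strategy produces a single output but misses the late record. A real pipeline would likely sit somewhere in between, which is the tunability the model is after.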

This paper does present what seems to be a very well thought-out dataflow model, influenced by a lot of real-world experience. Most notably, the ideas of windows and triggers are quite general and could be incorporated into any future system. I am tempted to say that this paper could be influential in the future, especially since it comes out of Google, which has a history of producing very influential research papers.


Tuesday, September 15, 2015
