Flink addresses a real problem that all three of this week's papers tackle: the growing need for fast stream processing. Its central idea is distributed snapshots, a method derived from the Chandy-Lamport algorithm that captures a consistent snapshot of the entire distributed system without pausing the computation.
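To make the idea concrete, here is a minimal sketch (all names and the two-channel setup are hypothetical, not Flink's actual API) of barrier-aligned snapshotting in the spirit of the Chandy-Lamport-derived mechanism: a marker flows through the stream, and an operator snapshots its local state once the marker has arrived on every input channel, buffering post-marker records in the meantime.

```python
# Toy sketch of barrier alignment; a simplification, not Flink's implementation.

BARRIER = object()  # marker injected at sources to delimit a snapshot epoch

class SummingOperator:
    """Hypothetical operator with multiple input channels that sums integers.

    When the barrier arrives on one channel, further records from that
    channel are buffered until the barrier has arrived on every channel;
    only then is the local state snapshotted. This yields a consistent
    snapshot without pausing the whole dataflow.
    """
    def __init__(self, num_channels=2):
        self.state = 0
        self.num_channels = num_channels
        self.blocked = set()    # channels already past the current barrier
        self.buffered = []      # post-barrier records held back during alignment
        self.snapshots = []     # completed snapshots of `state`

    def receive(self, channel, record):
        if record is BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_channels:
                # Barrier seen on all inputs: snapshot, then release buffers.
                self.snapshots.append(self.state)
                self.blocked.clear()
                for rec in self.buffered:
                    self.state += rec
                self.buffered.clear()
        elif channel in self.blocked:
            self.buffered.append(record)  # post-barrier record: hold back
        else:
            self.state += record          # pre-barrier record: apply now

# Example run: records 1 and 2 arrive before the barrier, record 5 after it
# on channel 0, so the snapshot captures 3 and the 5 is applied afterwards.
op = SummingOperator()
op.receive(0, 1)
op.receive(1, 2)
op.receive(0, BARRIER)
op.receive(0, 5)        # buffered: channel 0 is past the barrier
op.receive(1, BARRIER)  # alignment complete -> snapshot taken
```

The key point the sketch illustrates is that only the channels that have already seen the barrier are blocked; everything else keeps flowing, which is exactly why the system need not stop globally.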
This solution is one of many new approaches to building fast, highly scalable streaming systems with strong fault-tolerance guarantees. The field is growing rapidly and generating many new ideas, with Spark Streaming, Google Dataflow, and Flink each taking a slightly different approach.
As the paper discusses, each system's fault-tolerance mechanism imposes trade-offs that shape much of the rest of the design. The main trade-off here is the cost of a single snapshot: an application with a large amount of state makes each snapshot expensive, which pushes you toward taking snapshots less frequently. But less frequent snapshots mean longer recovery times after a failure, especially in a heavily loaded system, whereas a micro-batch approach suffers less from this particular problem.
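A back-of-the-envelope calculation (with made-up numbers, purely illustrative) shows why the snapshot interval drives recovery time: after a failure, the system must replay all input since the last snapshot, which on average is about half an interval's worth.

```python
# Hypothetical arithmetic sketch of the snapshot-frequency trade-off.

def expected_replay(records_per_sec, snapshot_interval_sec):
    # A failure lands, on average, halfway through a snapshot interval,
    # so roughly half an interval of input must be reprocessed on recovery.
    return records_per_sec * snapshot_interval_sec / 2

# At a (made-up) 100k records/sec, infrequent snapshots every 60s mean
# ~3M records to replay; frequent snapshots every 5s mean only ~250k,
# at the cost of paying the snapshot overhead twelve times as often.
sparse = expected_replay(100_000, 60)
frequent = expected_replay(100_000, 5)
```

This is the tension in one line: snapshot cost scales with state size and frequency, while recovery cost scales with the interval between snapshots.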
Apache Flink seems to be essentially a real-world implementation of the algorithm described in the Chandy-Lamport paper, so I have doubts that this paper specifically will be influential in 10 years. If this type of system catches on, it will more likely be the Chandy-Lamport paper (despite its age) that continues to be influential, though the Apache Flink system itself may well be influential in the future.