This paper definitely addresses a real problem in stream processing, though (as they admit) there are already tools that do most of the things Naiad aims to do; Naiad's contribution is to combine them in a more cohesive and extensible system. This is good, though it doesn't seem groundbreaking.
The solution's main idea is the "timely dataflow" concept, which uses stateful vertices with a simple interface for accepting and sending messages. Because each message carries a timestamp indicating which portion of the input it belongs to, the system can process data in parallel and can deliver a notification to a vertex once no more messages will arrive for a given data period, letting the vertex perform any finalizing action. This gives coordination where it is necessary (wait for the notification) and low latency where it isn't (a vertex can immediately transform and forward the messages it receives).
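To make the vertex model concrete, here is a minimal C# sketch of the interface the paper describes (OnRecv/OnNotify callbacks plus SendBy/NotifyAt calls into the runtime). The stub base class, the integer edge/epoch types, and the distinct-count example are my own simplifications for illustration, not Naiad's actual API:

    using System.Collections.Generic;

    // Stand-in for the runtime half of the vertex interface; in Naiad the
    // system supplies this, so these stubs exist only to keep the sketch
    // self-contained.
    abstract class Vertex<TRecord>
    {
        public abstract void OnRecv(int edge, TRecord record, int epoch);
        public abstract void OnNotify(int epoch);
        protected void SendBy(int edge, object message, int epoch) { /* runtime delivers the message downstream */ }
        protected void NotifyAt(int epoch) { /* runtime schedules OnNotify(epoch) once the epoch is complete */ }
    }

    // Example vertex: counts distinct records per epoch. It updates state as
    // soon as records arrive, but only emits the count once notified that the
    // epoch is complete.
    class DistinctCount : Vertex<string>
    {
        private readonly Dictionary<int, HashSet<string>> seen =
            new Dictionary<int, HashSet<string>>();

        public override void OnRecv(int edge, string record, int epoch)
        {
            // Low-latency path: no coordination needed to absorb a message.
            if (!seen.ContainsKey(epoch))
            {
                seen[epoch] = new HashSet<string>();
                NotifyAt(epoch);  // ask to be told when no more messages can arrive for this epoch
            }
            seen[epoch].Add(record);
        }

        public override void OnNotify(int epoch)
        {
            // Coordination path: the epoch is closed, so the count is final.
            SendBy(0, seen[epoch].Count, epoch);
            seen.Remove(epoch);
        }
    }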
The solution differs from previous work mainly in that, although what they are trying to do can already be done in pieces, many of the existing tools are not as mature as they would like, and none covers the breadth of workloads they want integrated in a single system.
The paper identifies some fundamental trade-offs. It notes that batch-processing systems benefit from the longer-running nature of their computations and from not having to support real-time updates (which add overhead). It also notes the trade-off between fault-tolerance schemes: logging every action may work better in some cases, and checkpointing at intervals may work better in others. They also discuss that micro-stragglers can be very detrimental to latency at such short timescales, as is apparent in Figure 6b, but note that because they use a stateful vertex model it would be nontrivial to use speculative execution to avoid this issue. The stateful vertices are themselves a trade-off: they favor performance, at the cost of making fault tolerance and straggler mitigation harder.
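To see why the stateful model complicates recovery, consider the per-vertex checkpointing hooks the paper describes: each stateful vertex must know how to serialize and restore its own state, and the system then chooses how often to invoke them. The sketch below uses plain BinaryWriter/BinaryReader as a stand-in for the paper's own reader/writer types, and the windowed-sum vertex is hypothetical:

    using System.Collections.Generic;
    using System.IO;

    // Hypothetical stateful vertex holding per-epoch running sums. The
    // Checkpoint/Restore pair mirrors the hooks the paper says every stateful
    // vertex implements; the serialization details are my own stand-in.
    class WindowedSum
    {
        private Dictionary<int, long> sums = new Dictionary<int, long>();

        // Persist all vertex state so a replacement worker can take over.
        public void Checkpoint(BinaryWriter writer)
        {
            writer.Write(sums.Count);
            foreach (var kv in sums)
            {
                writer.Write(kv.Key);
                writer.Write(kv.Value);
            }
        }

        // Rebuild the state on a fresh worker after a failure.
        public void Restore(BinaryReader reader)
        {
            sums = new Dictionary<int, long>();
            int n = reader.ReadInt32();
            for (int i = 0; i < n; i++)
                sums[reader.ReadInt32()] = reader.ReadInt64();
        }
    }

The trade-off is then how often Checkpoint runs: logging effectively checkpoints after every action, while periodic checkpoints are cheaper in the common case but lose more work on failure.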
I am not sure I see this being influential in 10 years. Being able to make live updates to iterative computations that are also interactively queryable is very useful, and I can see that many people would want that ability, especially as data collection keeps growing; the timely dataflow ideas, with their pointstamps and occurrence counts, may also prove useful. At the same time, I don't see anything so novel that it makes me feel it will still be in use in 10 years. Hard to say. Maybe I'm just biased because they're working in C#.
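For what it's worth, the pointstamp/occurrence-count idea is essentially reference counting of outstanding work: the scheduler counts unprocessed events per pointstamp, and a requested notification for time t can be delivered only once no outstanding event could still lead to work at or before t. A very rough sketch, collapsing the graph to a single location with totally ordered integer timestamps (the real scheme tracks (timestamp, location) pairs and handles loops):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Simplified progress tracker: one location, integer timestamps.
    class ProgressTracker
    {
        private readonly Dictionary<int, int> occurrenceCounts = new Dictionary<int, int>();
        private readonly SortedSet<int> pendingNotifications = new SortedSet<int>();

        // An unprocessed message with timestamp t has entered the system.
        public void MessageQueued(int t)
        {
            occurrenceCounts[t] = occurrenceCounts.TryGetValue(t, out var c) ? c + 1 : 1;
        }

        // A message with timestamp t has been fully processed.
        public void MessageProcessed(int t)
        {
            if (--occurrenceCounts[t] == 0) occurrenceCounts.Remove(t);
            DeliverReadyNotifications();
        }

        // A vertex asked to be notified when timestamp t is complete.
        public void RequestNotification(int t)
        {
            pendingNotifications.Add(t);
            DeliverReadyNotifications();
        }

        private void DeliverReadyNotifications()
        {
            // The frontier is the earliest timestamp that still has outstanding
            // work; anything strictly earlier is complete.
            int frontier = occurrenceCounts.Count == 0 ? int.MaxValue : occurrenceCounts.Keys.Min();
            foreach (var t in pendingNotifications.Where(t => t < frontier).ToList())
            {
                pendingNotifications.Remove(t);
                Console.WriteLine($"OnNotify({t}) can now be delivered");
            }
        }
    }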