Spark Streaming's D-Streams address an emerging problem: distributed streaming computation with strong fault tolerance and straggler mitigation. Numerous streaming systems already exist that do similar things, but Spark Streaming does seem to be addressing a real problem, with much better failure and straggler semantics.
The main idea, quite novel in the world of streaming, is to process records as a series of small deterministic batch computations (mini-batches) rather than handling each incoming record individually as it arrives. This lets the system leverage many techniques from batch processing. It does, however, mean that minimum latency is around 500-2000 ms, whereas many record-at-a-time streaming systems achieve lower latency.
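To make the mini-batch idea concrete, here is a toy sketch (plain Python, not Spark Streaming's actual API): instead of reacting to each record on arrival, the system buffers records for a fixed interval and then runs an ordinary deterministic batch job over each buffered slice. The `record_source` callable and the word-count job are my own illustrative choices, not anything from the paper.

```python
from collections import defaultdict

def run_microbatches(record_source, batch_interval=1.0, num_batches=3):
    """Toy illustration of the D-Stream idea: records arriving during each
    interval are collected into a batch, and a deterministic batch job
    (here, a word count) runs over each batch, producing one immutable
    result per interval. This is a conceptual sketch only."""
    results = []
    for _ in range(num_batches):
        # record_source returns everything that arrived in this interval
        batch = record_source(batch_interval)
        counts = defaultdict(int)
        for word in batch:
            counts[word] += 1
        results.append(dict(counts))  # one deterministic output per interval
    return results

# Example with a canned source standing in for a live stream:
batches = iter([["a", "b", "a"], ["b"], []])
out = run_microbatches(lambda _interval: next(batches), num_batches=3)
# Each interval's output is a self-contained batch result.
```

Because each mini-batch is a deterministic function of its input, a lost or slow partition can simply be recomputed elsewhere, which is where the fault-tolerance and straggler-mitigation benefits come from.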
Previous streaming systems have focused on keeping latency as low as possible. This paper takes a different approach: it assumes that in most real-world systems a latency of 1-2 seconds is perfectly acceptable, and leverages that assumption to enable a very different set of techniques from traditional streaming systems.
There is a fundamental trade-off here between latency and several other properties (throughput, fault tolerance, ease of integration with batch processing). By increasing latency (while keeping it at a level assumed to be acceptable), throughput is higher than that of the other streaming systems examined, streaming computation integrates more easily with batch computation, and the system becomes more fault tolerant (and recovers from failures more quickly).
I think the mini-batch idea is certainly promising, though the streaming field has a lot of competitors right now, and it is entirely possible that others will prove more successful or influential. It is hard to say whether this particular system will be the influential one.