Spark addresses a very real problem with the previously available tools for running large-scale, iterative distributed algorithms: the most commonly used, MapReduce, has no support for keeping data that will be used multiple times in memory, so the data must be reloaded on each iteration, which can significantly limit performance. Spark addresses this by introducing resilient distributed datasets (RDDs), which are cacheable: they can be loaded or computed a single time and then have numerous computations performed on them. Spark also introduces accumulators, which allow add-only communication and data aggregation between nodes, and broadcast values, which provide RDD-like persistence for large read-only data that every node needs to access; the RDDs, however, are the primary focus.
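To make these three abstractions concrete, here is a minimal sketch using Spark's Scala API (in its modern form, whose names differ slightly from the 2010 paper's, and with a hypothetical input path): a cached RDD reused across two actions, a broadcast stop-word set, and an add-only accumulator.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkConceptsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concepts").setMaster("local[*]"))

    // RDD: loaded/computed once, then cached in memory for reuse across actions.
    // "data/input.txt" is a placeholder path, not from the paper.
    val lines = sc.textFile("data/input.txt").cache()

    // Broadcast value: read-only data shipped to every node once and kept there.
    val stopWords = sc.broadcast(Set("the", "a", "of"))

    // Accumulator: add-only counter that workers update and the driver reads.
    val blankLines = sc.longAccumulator("blankLines")
    lines.foreach(line => if (line.trim.isEmpty) blankLines.add(1))

    // This second pass over `lines` hits the in-memory cache instead of re-reading the file.
    val words = lines.flatMap(_.split("\\s+")).filter(w => !stopWords.value.contains(w)).count()

    println(s"words=$words blankLines=${blankLines.value}")
    sc.stop()
  }
}
```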
This solution differs from previous work partly because it targets a relatively new problem, and partly because of new workloads. Even MapReduce is a relatively young paradigm and was not built with iterative algorithms in mind. It was extensible enough to handle the new use cases (e.g. machine learning, logistic regression), but as the paper's results show, it did so significantly less efficiently than possible.
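The canonical iterative workload in the paper is logistic regression, where the same dataset is scanned on every gradient step. A rough Scala sketch of how such a job might look on Spark (the input path, input format, and iteration count are my own placeholders, not the paper's):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp

object LogisticRegressionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr").setMaster("local[*]"))
    val iterations = 10  // assumed iteration count

    // Parse once and cache. Assumed input format: one point per line,
    // label followed by feature values, whitespace-separated.
    val points = sc.textFile("data/points.txt").map { line =>
      val nums = line.split("\\s+").map(_.toDouble)
      (nums.head, nums.tail)               // (label, features)
    }.cache()

    val d = points.first()._2.length       // feature dimension
    var w = Array.fill(d)(scala.util.Random.nextDouble())

    for (_ <- 1 to iterations) {
      // Every iteration after the first reuses the in-memory points rather than
      // re-reading them from disk, which is where the speedup over MapReduce comes from.
      val gradient = points.map { case (y, x) =>
        val margin = y * x.zip(w).map { case (xi, wi) => xi * wi }.sum
        val scale = (1.0 / (1.0 + exp(-margin)) - 1.0) * y
        x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }

    println(w.mkString("w = [", ", ", "]"))
    sc.stop()
  }
}
```

Only the first iteration pays the cost of loading and parsing the input; every later iteration runs over the cached points, which is exactly the gap between Spark and Hadoop that the paper measures.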
There are a few trade-offs here. Using Scala rather than pure Java (and likely also the overall increase in system complexity) means that Spark carries higher overhead than MapReduce for applications that do not exploit its caching, as seen in the significantly longer time to complete the first round of logistic regression (Section 5). This means Spark will not work as a complete MapReduce replacement; rather, each will be best suited to different applications. Another trade-off discussed in the paper is tuning the parameters of RDDs to balance storage cost/size, access speed, the probability of losing part of the dataset, and recompute cost. The authors note that part of their future work is making RDDs user-tunable, so that each application can optimize for whatever makes the most sense for it.
I can definitely see this paper being influential in 10 years. Machine learning has become a necessity in increasingly many areas, and most (all?) machine learning relies on the iterative algorithms that Spark is so well suited for. The amounts of data being processed are growing rapidly, so efficient distributed computation is becoming ever more vital, and Spark (despite its young age) is already being adopted by many major players in the technology industry.