Tuesday, September 29, 2015

Review of "Omega: flexible, scalable schedulers for large compute clusters"

As with the DRF paper, this is a real problem: heterogeneous scheduling in large compute clusters will only become more pressing as clusters grow larger and workloads more diverse. 

The main idea is to have multiple, completely independent schedulers running in parallel, each able to see and claim resources across the entire cluster. Transactions against shared state are used to claim resources, and schedulers are allowed to preempt each other. 
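
To make the shared-state idea concrete, here is a minimal sketch (my own simplified Python, nothing from the paper) of optimistic concurrency over shared cluster state: each scheduler plans against a private snapshot and then tries to commit, and conflicting claims cause the transaction to abort so it can be retried.

    class CellState:
        def __init__(self, free):
            self.free = dict(free)           # machine -> free resources (e.g. cores)
            self.version = 0

        def commit(self, claims, seen_version):
            # If another scheduler committed since our snapshot, abort the whole
            # transaction if any claimed machine no longer has enough capacity.
            if seen_version != self.version:
                if any(self.free[m] < amount for m, amount in claims.items()):
                    return False
            for m, amount in claims.items():
                self.free[m] -= amount
            self.version += 1
            return True

    def schedule(cell, job_demand):
        seen_version = cell.version
        snapshot = dict(cell.free)           # plan against a private copy
        claims, needed = {}, job_demand
        for machine, free in snapshot.items():   # any placement logic could go here
            take = min(free, needed)
            if take > 0:
                claims[machine] = take
                needed -= take
            if needed == 0:
                break
        if needed > 0:
            return False                     # not enough visible capacity
        return cell.commit(claims, seen_version)   # on abort, retry with a fresh snapshot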

This is different from previous work partly because clusters are reaching a scale at which the easiest design (a monolithic scheduler) is no longer viable. It also differs from newer systems like Mesos because it is tailored very specifically to Google's business needs and relies heavily on all schedulers in the system being written to play nicely with each other, which is not the usual assumption (most systems focus more on fairness between schedulers/users). 

The paper identifies some hard trade-offs. For one, allowing schedulers to claim resources in parallel means their transactions sometimes conflict and abort, which adds overhead. There is also an enormous trade-off here between faster, more distributed scheduling and less centralized control -- this works fine for Google, since they have written their schedulers to all play nicely together, but that will not always be the case elsewhere. 

I don't think I see this being influential in 10 years outside of Google -- while the concepts introduced are certainly interesting, I don't know that they will extend well beyond their walls. 

Review of "Dominant Resource Fairness: Fair Allocation of Heterogeneous Resources in Datacenters"

This is definitely a real problem - there are many situations where numerous users need to share a large pool of cluster resources, and the resources required by different tasks can differ dramatically. Scheduling them efficiently can provide large performance gains.

The main idea is to first consider a user's dominant resource: the resource for which the user's demand is the largest fraction of the cluster's total capacity (e.g. CPU is a user's dominant resource if their requested CPU is a larger fraction of the cluster's CPU than their requested RAM is of the cluster's RAM). To allocate by DRF, simply keep allocating one task at a time to the user with the minimum dominant share (that user's share of their dominant resource). 
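
The allocation loop is simple enough to sketch directly (roughly following the paper's pseudocode; the code below is my own, and the numbers are the paper's running example of a <9 CPU, 18 GB> cluster).

    def drf_allocate(capacity, demands):
        consumed = {r: 0.0 for r in capacity}
        used = {u: {r: 0.0 for r in capacity} for u in demands}
        allocations = {u: 0 for u in demands}

        def dominant_share(user):
            return max(used[user][r] / capacity[r] for r in capacity)

        while True:
            user = min(demands, key=dominant_share)   # lowest dominant share goes next
            task = demands[user]
            if any(consumed[r] + task[r] > capacity[r] for r in capacity):
                break                                 # that user's next task no longer fits
            for r in capacity:
                consumed[r] += task[r]
                used[user][r] += task[r]
            allocations[user] += 1
        return allocations

    # Paper's example: A needs <1 CPU, 4 GB> per task, B needs <3 CPU, 1 GB>.
    print(drf_allocate({"cpu": 9, "mem": 18},
                       {"A": {"cpu": 1, "mem": 4},
                        "B": {"cpu": 3, "mem": 1}}))   # -> {"A": 3, "B": 2}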

This solution is different from previous work largely because large-scale, multi-resource cluster scheduling is a relatively new problem that has only recently become pressing. 

The paper does identify a hard trade-off in DRF (and, they argue, in any scheduling algorithm) between resource monotonicity and the share guarantee / Pareto efficiency, giving an example of why both cannot be satisfied simultaneously. They make this trade-off very willingly, since adding new resources to a cluster is an infrequent event. 

I don't know if I really see this being influential in 10 years. It was published 5 years ago; has it actually been implemented anywhere? It seems like a good scheduling policy, but does it have a significant performance benefit over existing solutions?

Wednesday, September 23, 2015

Review of "Apache Hadoop YARN: Yet Another Resource Negotiator"

This is definitely a real problem - the paper addresses two issues simultaneously: the scalability limits of the original Hadoop/MapReduce architecture and the sharing of cluster resources between MapReduce and other frameworks. 

The solution's main idea is to have a central ResourceManager that controls global state, plus a per-application/job ApplicationMaster that tracks all of the state for that job. This allows different ApplicationMasters to run different frameworks, and it reduces the load on the ResourceManager, which only has to manage a small amount of global state while the ApplicationMasters handle the more complex scheduling, load balancing, fault tolerance, etc. 
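
A conceptual sketch of that split (simplified Python of my own, not the real YARN API): the RM only hands out containers, while the per-job AM decides what to run in them and keeps the job's state.

    class ResourceManager:
        def __init__(self, cluster):
            self.free = dict(cluster)                # node -> free container slots

        def allocate(self, wanted):
            granted = []
            for node in self.free:                   # global state stays small and simple
                while self.free[node] > 0 and len(granted) < wanted:
                    self.free[node] -= 1
                    granted.append(node)
            return granted

    class ApplicationMaster:
        def __init__(self, rm, tasks):
            self.rm = rm
            self.pending = list(tasks)               # per-job bookkeeping lives here

        def run(self):
            while self.pending:
                granted = self.rm.allocate(len(self.pending))
                if not granted:
                    break                            # a real AM would wait and retry via heartbeats
                for node in granted:
                    task = self.pending.pop()
                    # Framework-specific launch, locality, and fault-handling logic
                    # goes here -- which is what makes writing an AM non-trivial.
                    print(f"launching {task} on {node}")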

This solution is different from previous work for two reasons. First, workloads and clusters have become larger than originally envisioned when Hadoop/MapReduce was architected, and clusters had reached the limits of the original architecture. Second, more frameworks want to run on the same cluster as more applications move to cluster-based processing, so MapReduce needs to share the cluster more gracefully with these new frameworks, and Hadoop needs to provide a better execution substrate for them. 

There are definitely some fundamental trade-offs here. Introducing per-job ApplicationMasters adds a little overhead because there is more communication and latency between the RM and the AMs. However, this reduces load on the central RM, providing better overall scaling. Another trade-off is YARN's flexibility vs the features it provides; they note that creating applications to run on YARN is not an easy task, because you have to deal with all of the intricacies of load balancing, fault tolerance, etc., but this low-level interface allows very general models to run on YARN. The compromise is other solutions, such as REEF, that let you build applications on YARN more easily by taking care of some of these issues for you.

I think this will likely be influential in 10 years, in part just because I think the Hadoop ecosystem will still be around in 10 years and even though YARN may no longer be the resource manager of choice by then, it will certainly influence whatever succeeds it. 

Review of "Kubernetes: Open Source Container Cluster Orchestration"

Yes, this is a real problem. As discussed with regard to Mesos, having multiple applications/frameworks sharing the same cluster resources can be very important in today's world where nearly all applications at a company like Google need to run on a scale that requires large numbers of nodes.

The main idea is to run each application within containers, which are grouped together into pods (scheduled together on the same machine) to enable e.g. local resource sharing. Developers can also use labels to group pods/containers into subsets of an application, e.g. different builds, different jobs, etc. This model allows applications to be scheduled and run independently of each other (even if they land on the same machine) while also giving developers the flexibility to control colocation when it is desirable. 
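
A toy illustration of the label idea (hypothetical objects in Python, not the real Kubernetes API or pod spec format): labels are arbitrary key/value pairs attached to pods, and grouping is just selection over them.

    pods = [
        {"name": "web-1", "labels": {"app": "web", "tier": "frontend", "track": "prod"}},
        {"name": "web-2", "labels": {"app": "web", "tier": "frontend", "track": "canary"}},
        {"name": "db-1",  "labels": {"app": "web", "tier": "backend",  "track": "prod"}},
    ]

    def select(pods, selector):
        # A pod matches if every key/value in the selector appears in its labels.
        return [p for p in pods
                if all(p["labels"].get(k) == v for k, v in selector.items())]

    print(select(pods, {"tier": "frontend", "track": "prod"}))   # -> only web-1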

This isn't that different from previous work and seems mainly an evolution of earlier systems. Container systems like Docker provide some of the same ideas, and Google's previous system, Borg, seems to embody very similar ideas in a less configurable and flexible form. 

There will always be a trade-off between the complexity of the container system (e.g. how much isolation it provides) and how much overhead it imposes. Google's blog post does not make clear what kind of overhead Kubernetes imposes, so it is difficult to gauge where it falls on that spectrum. 

I am tempted to say that it might be influential in 10 years simply because it is a large project coming out of Google that is being open sourced, and Google has enough clout that large software projects it throws resources behind generally do well. It remains to be seen, however, whether Kubernetes will gain traction when Mesos and YARN already integrate nicely with the hugely popular Hadoop ecosystem and provide some similar features, though Kubernetes does seem to offer some higher-level constructs which may be useful. 

Review of "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center"

This is definitely a real issue - more and more applications are migrating to large distributed commodity clusters, and enabling them to share the same cluster despite being built on different frameworks is an important task that should improve data sharing and utilization. 

The main idea is to have Mesos operate as a thin resource management layer underneath the existing frameworks, providing fine-grained resource offers to frameworks, each of which encapsulates a certain amount of resources on the cluster. A framework can then decide which specific resources to accept, enabling the framework to enforce e.g. data locality. 
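
A rough sketch of the offer model from the framework's side (simplified interfaces of my own, not the real Mesos API): Mesos pushes an offer up, and the framework's scheduler accepts it only if it helps, e.g. if it preserves data locality.

    class FrameworkScheduler:
        def __init__(self, pending_tasks):
            # Each task records its demand and which nodes hold its input data.
            self.pending = list(pending_tasks)

        def resource_offer(self, offer):
            # offer example: {"node": "n3", "cpus": 4, "mem_gb": 8}
            for task in list(self.pending):
                fits = (task["cpus"] <= offer["cpus"]
                        and task["mem_gb"] <= offer["mem_gb"])
                local = offer["node"] in task["preferred_nodes"]
                if fits and local:
                    self.pending.remove(task)
                    return {"accept": True, "launch": task}
            # Decline and hope a better-placed offer arrives soon; this is where
            # a framework with picky requirements can end up waiting.
            return {"accept": False}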

One reason this is different is that the types of applications previously run on commodity clusters were relatively restricted, so the Hadoop/MapReduce model was sufficient as a framework. However, as more applications reach the scale of needing to run in a distributed manner, the limits of the MapReduce paradigm have been reached, other frameworks are emerging to compete, and running these together on the same cluster is a relatively new issue. 

One hard trade-off is the focus on keeping Mesos simple. This was done to make it very adaptable and scalable, but it comes with a few downsides. One example is having Mesos offer resources to the framework and allowing the framework to reject them, rather than having the framework request specific resources; this allows more flexibility in expressing resource requirements, but means that a framework may need to wait a long time for an acceptable offer. Also, Mesos still has the frameworks make many scheduling decisions rather than acting as a complete global scheduler; again, this means frameworks can do more complex scheduling without Mesos itself needing to become complex, but it means that utilization may be slightly lower than optimal. 

I am not sure if this paper will be particularly influential in 10 years. I think cluster resource sharing will become an increasingly more important problem, and Mesos definitely seems to be an excellent implementation of this, but the ideas presented seem relatively simple, with the main concepts being resource offers and a very thin client. 

Monday, September 21, 2015

Review of "The Case for RAMCloud"

This paper definitely discusses a real problem - applications increasingly want to access more complex data quickly, store a larger total amount of data, and keep all of it redundantly. 

The solution's main idea is to implement a storage system fully in DRAM (possibly with SSDs or disks for increased durability) to provide very low latency, high bandwidth storage. 

This is different from previous work because, at the time of publication, DRAM was just starting to reach the point where you could install large enough amounts on a server, at a cost effective price, to be able to store significant amounts of data purely in DRAM. 

There are definitely some fundamental trade-offs here. DRAM does not provide the same durability guarantees as disk/SSD; even with replication, non-volatile storage is necessary as a backup in case of e.g. a power outage at a data center. It also costs significantly more, and requires more energy per bit of storage, than disks/SSDs. But it runs at significantly lower latency and higher bandwidth. 

I do think this will be influential in 10 years -- it has already been 6 years since its publication and I have seen this paper discussed in some of the other modern papers we have read. Also, DRAM is significantly cheaper and available in much larger capacities than it was at the time of publication, making this type of system even more viable in today's world. While the volume of data collected has also increased hugely, meaning that I don't see RAMClouds fully replacing current disk-based clusters anytime soon, I think that the idea of using such a system for large online data stores will continue to become more prevalent. 

Review of "FaRM: Fast Remote Memory"

The problem is definitely real - many workloads need to access a huge amount of data quickly, and typically inter-machine communication has been an enormous bottleneck. FaRM seems to take significant steps in resolving this.

The main ideas are: use RDMA instead of TCP/IP to hugely reduce communication overhead, and reduce the number of distributed transactions that must be carried out by using lock-free reads and by shipping computation to the data so that transactions can run locally on a single machine. 
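
A seqlock-style sketch of the lock-free read idea (simplified Python of my own; FaRM actually checks per-cache-line versions on objects fetched with one-sided RDMA reads): read the version, read the data, and retry if the version changed underneath you.

    def lockfree_read(obj):
        # obj: {"version": int, "data": dict}; in this sketch, writers bump the
        # version to odd while writing and back to even when done.
        while True:
            v1 = obj["version"]
            if v1 % 2 == 1:
                continue                 # a write is in progress; try again
            data = dict(obj["data"])     # read the payload without taking any lock
            if obj["version"] == v1:
                return data              # nothing changed while we were reading
            # otherwise retry; readers never block writers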

This is different from previous work primarily because of new hardware; RDMA over Ethernet (the standard in commodity cluster deployments) only recently became possible, and TCP/IP was previously fast enough to keep up with disk, but now that RAM is large enough to store significant amounts of data, TCP/IP is falling behind. 

One trade-off is that distributed transactions are somewhat slow; however, FaRM gets around this by providing ways to avoid distributed transactions as much as possible.

While the system they have built is impressively fast, it seems that the main contribution here is really RDMA over Ethernet, with FaRM doing a very good job of tuning itself to this. While I think RDMA over Ethernet will have a significant impact in 10 years, and maybe some of the ideas from this paper will be relevant, I don't think this will be a particularly significant paper in 10 years. 

Review of "IOFlow: A Software-Defined Storage Architecture"

I am unsure how real of a problem they are solving here. While I can see how having flow control and speed guarantees may become useful, I haven't personally heard of this ever being an issue and am not sure how common this problem is, although I suppose it must be a real problem somewhere at Microsoft for them to have put so much effort into solving it.

The main idea is to use a centralized control application to manage policies at the various layers a distributed IO call must traverse (down through the core of the client, across the network, and through the server's NIC into its file system), informing each layer of how to treat different incoming requests (e.g. based on where the call originated, what data it is asking for, etc.). 
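
A toy sketch of the control plane (hypothetical names, not IOFlow's actual interfaces): the controller translates high-level tenant policies into per-stage rules, which each stage along the IO path then applies to matching requests.

    policies = [
        {"tenant": "A", "share": "X", "guaranteed_mbps": 500},
        {"tenant": "B", "share": "X", "guaranteed_mbps": 100},
    ]

    def build_stage_rules(policies):
        # One rule table, keyed by how a request is classified (who sent it and
        # what it touches); the controller would push this to the client driver,
        # the network, and the server-side stages alike.
        return {(p["tenant"], p["share"]): {"rate_mbps": p["guaranteed_mbps"]}
                for p in policies}

    def rate_for(request, rules):
        rule = rules.get((request["tenant"], request["share"]))
        return rule["rate_mbps"] if rule else None      # None -> best effort

    rules = build_stage_rules(policies)
    print(rate_for({"tenant": "A", "share": "X"}, rules))   # -> 500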

This solution is different for a few reasons; first, it assumes that the cluster is relatively small (hundreds of servers), which does not hold in the general case (e.g. it would not currently work on Amazon EC2). It is also different because it assumes a somewhat new workload: numerous VMs living on a single node, many such nodes comprising a cluster, and each VM owned by (potentially) a different user/application. This multi-tenancy presents a somewhat new take on prioritization. 

One fundamental trade-off is the use of a centralized control application - this limits the scalability of the system, since all of the servers need to contact it for instructions/policies. Another is overall speed vs flexibility - overall throughput does decrease as a result of the IOFlow system, but it gives you the power to guarantee a certain bandwidth for critical applications. 

I don't think I see this paper being influential in 10 years - I'm not familiar with this being an overly pressing problem, though I can see the utility of IOFlow. 

Sunday, September 20, 2015

Review of "Flat Datacenter Storage"

The problem is somewhat real - while GFS / HDFS currently do a pretty good job of supporting large-scale compute applications, they do have a single-point bottleneck (the master/namenode) and, as the paper mentions, they require that applications exploit data locality. 

The solution's main idea is to distribute every blob across many disks, striping reads and writes over all of them for very high throughput, while also removing any central bottleneck (reads and writes are done without contacting a central server). 

This solution is different from previous work primarily because of the assumption of full network bisection bandwidth. Previously, this was a very unrealistic assumption, but they discuss that newer network switches are making this feasible. This allows for less emphasis on data locality. 

There are some hard tradeoffs here; for one, the GUID identifier for blobs makes it possible to find the right node to contact without a central server, but doesn't as easily support arbitrary path strings or a directory hierarchy. Another tradeoff is the need for more expensive/complex network switches; while they discuss that full-bisection-bandwidth networks are now feasible, provisioning this bandwidth is still a non-negligible expense. 
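
My understanding of how reads and writes avoid the central server, as a rough sketch: a blob's GUID plus the tract number within the blob deterministically index a small tract locator table (TLT) that every client caches, so the responsible server can be computed locally.

    def tract_server(tlt, blob_guid, tract_number):
        # tlt: a cached list of servers (or replica groups); in the paper the
        # lookup is roughly (hash(GUID) + tract_number) mod table_length.
        index = (hash(blob_guid) + tract_number) % len(tlt)
        return tlt[index]

    tlt = ["server-0", "server-1", "server-2", "server-3"]
    print(tract_server(tlt, "blob-42", 7))   # consecutive tracts spread across servers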

I can see this paper being influential in 10 years; it seems that if it truly is becoming more feasible to have full bisection bandwidth networks, this will be even more true in the future, and some of the Hadoop clusters that are starting to be spun up are reaching the limits of what HDFS can handle with its single namenode. There seem to be some very interesting ideas here, plus the overall idea is relatively simple. 

Review of "The Google File System"

The problem here is certainly real: most (all?) existing networked file systems simply were not designed to handle data at the scale of Google and other big data use cases. A new way of storing this data, with fundamentally different considerations, was definitely necessary. 

The solution's main idea is to have a single master server which handles all metadata and coordination (but stores no data), plus numerous chunkservers (up to thousands) which actually store the data, replicated across multiple chunkservers. A slightly loosened consistency model and the lack of a full POSIX filesystem API allow for a more efficient file system under their assumptions about the workload. 
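
A rough sketch of the read path this implies (the interfaces here are my own simplification, not the paper's RPC definitions): the master is consulted only for metadata, and the bytes come straight from a chunkserver.

    CHUNK_SIZE = 64 * 1024 * 1024                   # GFS uses 64 MB chunks

    def gfs_read(master, chunkservers, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE
        # Metadata-only round trip (and cacheable by the client):
        chunk_handle, replica_locations = master.lookup(filename, chunk_index)
        # Data flows directly from a chunkserver, keeping the master off the
        # data path so a single master can coordinate thousands of servers.
        replica = chunkservers[replica_locations[0]]
        return replica.read(chunk_handle, offset % CHUNK_SIZE, length)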

This solution is different because it makes very different assumptions about both workload and system than previous filesystems. Generally, file systems attempt to implement the full suite of POSIX file operations, provide pretty strong consistency, assume that machines fail infrequently, attempt to optimize for a wide range of cases, etc. GFS assumes that machines fail often, and makes strong assumptions about the way in which the file system will be used - mostly append only, very large files, few random reads, etc. 

A fundamental trade-off here is flexibility vs. speed/efficiency/scalability. GFS gives up a lot of flexibility compared to a normal file system, optimizing for a very specific use case, which allows for a significant reduction in complexity and an increase in scalability and efficiency. There is also a trade-off of simplicity vs efficiency in having a single master node rather than some sort of distributed protocol (the master can become a bottleneck). 

I do think this paper will be influential in 10 years, and in fact it already has proved to be, as HDFS is still hugely common in industry and is based primarily on this paper. 

Tuesday, September 15, 2015

Review of "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing"

As with the other two papers we've read this week, this is addressing a real problem with the need for better streaming systems as stream processing becomes increasingly important and increasingly distributed. 

This paper focuses more on the programming model (the "Dataflow Model") than on a specific implementation, and the main idea is what this model is able to express: primarily windowing (including session windows) and triggering, which let the application decide when to emit results and how to handle late data. 
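
A toy sketch of windowing plus a watermark trigger (hypothetical Python, not the actual Dataflow/Beam API): elements are assigned to windows by event time, and a trigger decides when a window's aggregate may be emitted.

    WINDOW_SIZE = 60                 # fixed one-minute windows, in seconds
    windows = {}                     # (start, end) -> running aggregate (a count here)

    def assign_window(event_time):
        start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        return (start, start + WINDOW_SIZE)

    def process(value, event_time, watermark):
        w = assign_window(event_time)
        windows[w] = windows.get(w, 0) + 1
        # Trigger: once the watermark passes the end of a window, emit it.
        # What to do with data that arrives even later (re-trigger and refine
        # the result, or discard it) is exactly what the model leaves tunable.
        for (start, end) in list(windows):
            if watermark >= end:
                print(f"window [{start}, {end}): {windows.pop((start, end))}")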

This model was motivated by the desire to have clear semantics that are able to handle a very wide variety of use-cases, due to the authors encountering cases where current models were not well-suited to expressing certain desired pipelines, and especially that no single model was able to express all desired pipelines. 

The paper very clearly addresses fundamental trade-offs and, specifically, does not resolve them in any one direction in the model, leaving them tunable for different pipelines. They design the model to support as many points on the spectrum as possible - emit results greedily and fix them later, wait until all results arrive, or anywhere in between. One trade-off I see that they do not discuss is the abstraction level of the model. They specifically design it to ignore the distinction between batch, micro-batch, and streaming execution, which gives the programmer the simplicity of treating them all the same, but as with any higher-level abstraction this may come with penalties, e.g. in performance and optimizability. 

This paper does present what seems to be a very well thought out dataflow model informed by a lot of real-world experience, and most notably the ideas of windows and triggers are general enough to be included in any future system. I am tempted to say that this paper could well be influential in the future, especially since it comes out of Google, which has been known for producing very influential research papers in the past. 

Review of "High-throughput, low-latency, and exactly-once stream processing with Apache Flink"

Flink is addressing the real problem that all three of this week's papers address: the increasing need for fast stream processing. The solution's main idea is distributed snapshots, a method based on the Chandy-Lamport algorithm that creates a snapshot of the entire distributed system without having to pause everything.
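
A simplified sketch of how I understand the barrier mechanism (my own Python, not Flink's code): sources inject numbered barriers into the stream; an operator waits until a barrier has arrived on all of its input channels, snapshots its state, and forwards the barrier downstream, so a consistent global snapshot is assembled while normal processing continues.

    class Operator:
        def __init__(self, name, num_inputs, downstream=()):
            self.name = name
            self.num_inputs = num_inputs
            self.downstream = downstream
            self.state = {}
            self.barriers_seen = {}                  # snapshot id -> barriers received

        def on_record(self, record):
            # Records keep flowing between barriers; no global pause is needed.
            self.state[record] = self.state.get(record, 0) + 1

        def on_barrier(self, snapshot_id):
            seen = self.barriers_seen.get(snapshot_id, 0) + 1
            self.barriers_seen[snapshot_id] = seen
            if seen == self.num_inputs:              # aligned on every input channel
                # (A real implementation also buffers records from channels that
                # delivered the barrier early, to keep the snapshot consistent.)
                save_snapshot(snapshot_id, self.name, dict(self.state))
                for op in self.downstream:
                    op.on_barrier(snapshot_id)       # forward the barrier

    def save_snapshot(snapshot_id, operator_name, state):
        print(f"snapshot {snapshot_id}: {operator_name} -> {state}")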

The solution is one of many new methods being developed to make fast, hugely distributable streaming systems with strong fault tolerance guarantees. The field is growing rapidly, so many new ideas are being generated in this space, with Spark Streaming, Google Dataflow, and Flink all taking slightly different approaches. 

There will always be trade-offs in the fault tolerance mechanisms used by these systems, which, as this paper discusses, do a great deal to shape the rest of the system. The main trade-off here is the cost of a single snapshot: if your application carries a significant amount of state, each snapshot will be expensive, so you may schedule snapshots less frequently. But that leads to longer failure recovery times, especially on a heavily loaded system, whereas micro-batching suffers less from this problem. 

Apache Flink seems to basically just be a real-world implementation of the algorithm described in the Chandy-Lamport paper, so I have doubts that this paper specifically will be influential in 10 years; if this type of system catches on, it will likely be the Chandy-Lamport paper that continues to be influential (despite its age), though the Apache Flink application itself may well be influential in the future. 

Monday, September 14, 2015

Review of "Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing"

Spark Streaming's D-Streams address an emerging problem: distributed streaming computation with strong fault tolerance and straggler mitigation. Numerous streaming systems already exist that do similar things, but Spark Streaming does seem to be addressing a real problem with much better failure and straggler semantics. 

The main idea, quite novel in the world of streaming, is to process records as a series of mini-batch computations rather than handling each incoming record individually as it arrives. This allows Spark Streaming to leverage many techniques used in batch processing. It does, however, mean that the minimum latency is around 500-2000 ms, whereas many streaming systems achieve lower latency. 
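
A toy sketch of the mini-batch idea (my own code, not Spark Streaming's API): arriving records are grouped into short fixed intervals, and each interval is then handed to ordinary batch machinery, which is what makes lineage-based recovery and straggler mitigation possible.

    BATCH_INTERVAL = 1.0             # seconds; this is where the latency floor comes from

    def micro_batches(records):
        """records: (arrival_time, value) pairs in arrival order."""
        batch, batch_end = [], None
        for t, value in records:
            if batch_end is None:
                batch_end = t + BATCH_INTERVAL
            if t >= batch_end:
                yield batch          # this interval becomes one small batch job (an RDD)
                batch, batch_end = [], t + BATCH_INTERVAL
            batch.append(value)
        if batch:
            yield batch

    for batch in micro_batches([(0.1, "a"), (0.5, "b"), (1.2, "c"), (2.3, "d")]):
        print(batch)                 # -> ['a', 'b'], then ['c'], then ['d']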

Previous streaming systems have all focused on ensuring that latency is as low as possible. This paper takes a different approach, assuming that in most real world systems latency of 1-2 seconds is perfectly acceptable, and leverages this assumption to enable the use of a very different set of techniques than traditional streaming systems.

There is a fundamental trade-off here of latency vs. a number of other things (bandwidth, fault tolerance, ease of integration with batch processing). By increasing the latency (but keeping it at a point assumed to be acceptable), throughput is higher than that of the other streaming systems examined, streaming computation can more easily be integrated with batch computation, and the system becomes more fault tolerant (and is able to recover from failure more quickly). 

I think the idea of mini-batching is certainly promising, though the streaming field has a lot of competitors right now and it is certainly possible that others will prove much more successful / influential. It is hard to say whether I see this one being influential. 

Tuesday, September 8, 2015

Review of "Naiad: A Timely Dataflow System"

This paper definitely addresses a real problem in stream processing, though (as the authors admit) there are already tools to do most of the things Naiad aims to do; Naiad attempts to combine them in a more cohesive and extensible manner. This is good, though it doesn't seem groundbreaking. 

The solution's main idea is the "timely dataflow" concept, which uses stateful vertices with a simple interface that allows them to accept and output messages. By using timestamps which represent which portion of the data a message belongs to, they are able to allow for parallel processing and supply notifications to a vertex to alert it that there are no more messages for a given data period, allowing the vertex to do any necessary finalizing action. This allows for coordination where necessary (waiting for notification) but low-latency when it isn't necessary (can immediately transform and output received messages). 

The solution seems to differ from previous work mainly in that, although what they're trying to do can be done with existing tools, many of those tools are not as mature as the authors would like, do not cover as broad a set of use cases, and are not integrated together. 

The paper identifies some fundamental trade-offs, e.g. it notes that batch-processing systems get advantages from the longer-running nature of their computation and the lack of any requirement for real-time updating (which adds overhead). It also notes the trade-off between fault tolerance schemes: logging every action may work better in some cases, and checkpointing at intervals may work better in others. They also discuss that micro-stragglers can be very detrimental to latency when working on such rapid timescales, as is apparent in Figure 6b, but note that because they use a stateful vertex model it would be nontrivial to perform speculative execution to prevent this issue. Stateful vertices thus trade some fault and straggler tolerance for streamlined performance. 

I am not sure if I see this being influential in 10 years. Being able to make live updates on iterative computations that are also interactively query-able is very useful and I can see that many people would want that ability, especially as we move forward in terms of increasing data collection, and the ideas about timely dataflow with their pointstamp / occurrence count may prove to be useful. At the same time, I don't see anything so novel that it makes me feel it will still be useful in 10 years. Hard to say. Maybe I'm just biased because they're working in C#. 

Review of "Spark: Cluster Computing with Working Sets"

Spark addresses a very real problem with the previously available tools for running large-scale distributed algorithms of an iterative nature; the most commonly used, MapReduce, has no support for keeping data that will be used multiple times in memory, so it must be reloaded on each iteration, which can significantly limit performance. Spark addresses this by introducing resilient distributed datasets (RDDs), which are cacheable, meaning they can be loaded/computed a single time and then have numerous computations performed on them. Spark also introduces accumulators to allow add-only communication/data aggregation between nodes, as well as broadcast values to achieve RDD-like persistence for large amounts of data that all nodes will need to access, but RDDs are the primary focus. 
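
A rough PySpark-flavored sketch of why caching matters for iterative jobs (the paper's own examples are in Scala; the input path and the toy gradient step here are placeholders of mine): the input is parsed once, cached, and then scanned on every iteration without being reloaded.

    from pyspark import SparkContext

    sc = SparkContext(appName="iterative-sketch")

    def parse(line):                     # "x,y" per line in the (placeholder) input file
        x, y = line.split(",")
        return float(x), float(y)

    points = sc.textFile("hdfs://namenode/data/points.txt").map(parse).cache()

    # Toy gradient descent for a 1-D linear fit: every iteration is a scan over
    # the cached in-memory RDD instead of a reload from storage, which is the
    # cost MapReduce pays on each iteration.
    w = 0.0
    for i in range(10):
        grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]) \
                     .reduce(lambda a, b: a + b)
        w -= 0.001 * grad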

This solution is different from previous work in part because it is a relatively new problem, and in part because of new workloads. Even MapReduce is a relatively new paradigm, and was not built with iterative algorithms in mind. It was extensible enough to handle the new use cases (e.g. machine learning, logistic regression), but as we see in the paper's results it did so significantly less efficiently than possible. 

There are a few trade-offs here. Using Scala rather than pure Java (and likely also the overall increase in system complexity) means there is higher overhead than MapReduce for applications which do not make use of Spark's caching, as seen in the significantly longer time to complete the first round of logistic regression (Section 5). This means that Spark will not work as a complete MapReduce replacement; rather, each will be best suited to different applications. Another trade-off discussed in the paper is tuning the parameters of RDDs to optimize for various things: storage cost/size, access speed, probability of losing part of the dataset, and recompute cost. They discuss that part of their future work is to make RDDs user-tunable based on what makes the most sense for each specific application. 

I can definitely see this paper being influential in 10 years. Machine learning has become a necessity that is employed in increasingly many areas, and most (all?) machine learning requires the iterative algorithms that Spark is so well suited for. The amounts of data being processed are increasing very rapidly, so the need for efficient distributed computation is continuing to become more and more vital, and Spark (despite its young age) is already seeing adoption in many major players in the technology industry.