Wednesday, November 25, 2015

Review of "One Size Fits All"

Today, this paper seems to address something that is a nonissue - throughout my career in computer science I've been of the mindset that "one size fits all" has no place. It's strange to think how ubiquitous the monolithic DBMS was in times past - amazing that we were at a point when all of the systems I think of today as distinct - OLAP, OLTP, analytics, warehousing, etc. - were run through the same platform. I don't know if this paper in particular was influential in causing the shift to today's mindset, but in any case, the shift certainly occurred.

One interesting question that I think arises from this is - will there ever be a time when we have systems so efficient that we can return to a one-size-fits-all paradigm? It currently doesn't seem likely, but that would seem to be the "holy grail" of sorts. The Hadoop ecosystem today does at least provide a unified platform that many such systems can interoperate on, with a unified data layer (e.g. HDFS, Tachyon) and unified schedulers (Mesos, YARN).

Monday, November 16, 2015

Review of "Jellyfish: Networking Data Centers Randomly"

The main idea of Jellyfish is to connect servers / switches completely randomly, rather than trying to conform to a regular structure such as the traditional tree / fat-tree structure. The primary motivation was the ease of incrementally adding new servers, but they found that the random layout actually increased bandwidth capacity when using the same hardware due to the paths between servers being shorter on average.
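To convince myself that random wiring really does shorten paths, here's a tiny Python sketch (my own toy, not the authors' construction or parameters): it wires switch ports together at random and then computes the average hop count with BFS.

    import random
    from collections import deque

    def random_switch_graph(num_switches=20, ports_per_switch=4, seed=0):
        """Randomly wire switch ports together, Jellyfish-style (toy version)."""
        rng = random.Random(seed)
        free = {s: ports_per_switch for s in range(num_switches)}
        links = set()
        while True:
            candidates = [s for s in free if free[s] > 0]
            # Pairs of switches that still have free ports and aren't linked yet.
            pairs = [(a, b) for a in candidates for b in candidates
                     if a < b and (a, b) not in links]
            if not pairs:
                break
            a, b = rng.choice(pairs)
            links.add((a, b))
            free[a] -= 1
            free[b] -= 1
        return links

    def avg_shortest_path(links, n):
        """Average hop count between reachable switch pairs (BFS from each node)."""
        adj = {s: [] for s in range(n)}
        for a, b in links:
            adj[a].append(b)
            adj[b].append(a)
        total, count = 0, 0
        for src in range(n):
            dist = {src: 0}
            q = deque([src])
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
            for dst, d in dist.items():
                if dst != src:
                    total += d
                    count += 1
        return total / count

    links = random_switch_graph()
    print(len(links), "links, avg path length", avg_shortest_path(links, 20))

Swapping in a tree-shaped set of links makes for an easy comparison of average path lengths.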

The VL2 paper also involved an element of randomness, though at the routing level rather than the physical connection level, which makes me wonder if there may be anything fundamental about the use of randomness in networking to avoid congestion and increase connectivity.

There are two main tradeoffs I see here: complexity of routing, and length/complexity of cabling. Since servers are no longer connected to nearby neighbors in a tree fashion, and may instead be connected to servers anywhere else in the data center, the average cable length may increase significantly; the cabling also follows a less regular structure, so it may result in a "dread spaghetti monster" (authors' words). They discuss ways to mitigate this, including clustering the switches in the middle of the data center, which constrains most of the cabling to that area. They also discuss a number of ways to deal with the routing complexity issue.

This idea is very interesting, and I am curious to see if in the 3 years since publication anyone has tried this.

Review of "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric"

PortLand tries to solve problems extremely similar to those of VL2: making the network appear flat, allowing VMs to migrate gracefully, reducing the amount of configuration necessary, and enabling efficient communication between any pair of nodes. Again, this is definitely a real issue.

PortLand uses a "PMAC" (pseudo-MAC) address to abstract away physical location details; this is similar to the directory lookups used in VL2. These two systems are taking very similar approaches: add another layer of abstraction to make physical location transparent to the application developer. PortLand seems to place less emphasis on providing high bandwidth between any pair of servers, instead requiring no change in network topology and focusing on fault tolerance (whereas VL2 adds a denser linking structure to provide high bandwidth).

Since both of these papers are 7 years old, I am curious to know if either of these systems (or something similar) is in use anywhere. I'm sure some of the ideas have appeared by now, and it would be interesting to find out which approach was more successful. My intuition is that PortLand's approach, which requires fewer network links, probably appears more often, though I don't think I have ever heard of a PMAC-style approach to hiding the physical layer.


Sunday, November 15, 2015

Review of "VL2: A Flexible and Scalable Data Center Network"

The issue of getting good network bandwidth between arbitrary machines within a data center is very important; task placement is often based on data locality rather than on placing communicating tasks near each other, and having to juggle both of these concerns simultaneously is very difficult. Making the network more flat in terms of performance eases this issue.

They have a few central ideas: use Valiant Load Balancing to route flows (rather than packets, as in typical VLB) randomly across the network, use a lookup table to map IP addresses to physical locations such that machines can easily be relocated, and push work to the end nodes to reduce state and complexity of switches.
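As I understand it, the per-flow randomness can be as simple as hashing the flow's 5-tuple to pick an intermediate switch; here's a minimal sketch (the switch names and the choice of hash are my own stand-ins, not VL2's):

    import hashlib

    INTERMEDIATE_SWITCHES = ["int-1", "int-2", "int-3", "int-4"]  # hypothetical names

    def pick_intermediate(flow_tuple, switches=INTERMEDIATE_SWITCHES):
        """Pick a random-but-consistent intermediate switch for a flow.

        Hashing the flow's 5-tuple spreads different flows uniformly across
        the intermediate layer (the VLB idea) while keeping all packets of
        one flow on the same path, avoiding reordering.
        """
        digest = hashlib.md5(repr(flow_tuple).encode()).hexdigest()
        return switches[int(digest, 16) % len(switches)]

    flow = ("10.0.1.5", 49152, "10.0.9.7", 80, "tcp")
    print(pick_intermediate(flow))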

There are a number of tradeoffs; for example, the directory lookup structure for IP addresses helps create the flat network illusion they work to achieve, but introduces some overhead. Their network also requires considerably more links than standard topologies, e.g. a full bipartite graph between the intermediate and aggregation switches. In exchange for this cost, they get much improved performance and far easier programming and task placement.

I can see this being influential in the future; as data center computing becomes ubiquitous, we expect it to become easier, with more of its complexities hidden, and the flat network abstraction is very helpful in achieving this. The performance gains compared to a standard tree hierarchy are also substantial. I don't necessarily think this system in particular will take off, but the ideas seem important.


Monday, November 9, 2015

Review of "Tao"

As more data becomes represented in graph formats to capture the intrinsic connectivity of the data, the ability to store information in a way that is aware of this structure becomes more important. 

The main idea of Tao is to provide a caching layer on top of a MySQL database which is graph-aware, intelligently fetching data based on its structure. It is essentially a replacement for memcache which is tuned specifically to graph-structured data - nothing too revolutionary.
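To make "graph-aware cache" concrete, here is a toy sketch loosely modeled on TAO's published assoc_get/assoc_add calls, with a Python dict standing in for MySQL (the class and its behavior are my simplification, not Facebook's code):

    class ToyGraphCache:
        """A toy graph-aware cache in front of a key-value 'database' (a dict
        standing in for MySQL). Association lists are cached per (object, type),
        loosely modeled on TAO's published assoc_get/assoc_add calls."""
        def __init__(self, backing_store):
            self.db = backing_store          # {(obj_id, assoc_type): [neighbor ids]}
            self.cache = {}

        def assoc_get(self, obj_id, assoc_type):
            key = (obj_id, assoc_type)
            if key not in self.cache:                 # cache miss: one DB fetch
                self.cache[key] = list(self.db.get(key, []))
            return self.cache[key]

        def assoc_add(self, obj_id, assoc_type, neighbor_id):
            key = (obj_id, assoc_type)
            self.db.setdefault(key, []).append(neighbor_id)   # write through
            self.cache.pop(key, None)                          # invalidate

    db = {("alice", "friend"): ["bob", "carol"]}
    tao = ToyGraphCache(db)
    print(tao.assoc_get("alice", "friend"))
    tao.assoc_add("alice", "friend", "dave")
    print(tao.assoc_get("alice", "friend"))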

One trade-off they made that surprises me is the use of MySQL as the backing storage engine. It seems to me that this translation layer from graph-structured to relational must impose unnecessary overheads, but it also means that transitioning was likely easier and they get the reliability guarantees of MySQL automatically.

I don't see this being influential in 10 years - there's uncertainty over whether we will want to continue to use graphs as the primary representation of data moving forward, and even if we do, it doesn't seem that this work is particularly novel. 

Review of "Neo"

Neo provides a data storage solution built from the ground up to store graph-structured data, unlike Tao which simply acts as a layer on top of an inherently relational storage model. 

While this seems like a good thing, the details provided in the paper about how performant the system is and how they achieved this are very sparse (essentially nonexistent). I would be very curious to know how it performs compared to traditional databases and NoSQL stores, and if it does perform well, how they managed to achieve this. It reads more like a marketing document than an academic one.

The traverser / navigational model is interesting, and I would be curious to see how more complex queries are written in this manner. If the data really is naturally graph-structured, it seems that it would probably be intuitive to write queries in this way, but without any examples I am unsure if that intuition is correct. 

Wednesday, November 4, 2015

Review of "GraphX: Graph Processing in a Distributed Dataflow Framework"

I am not sure how often one needs to join graphs against other data when doing graph computations, but it does seem like a common enough situation, and GraphX can certainly help solve it while still being extremely performant.

The main nugget is essentially to be able to run graph algorithms on top of Spark, which gives you fault tolerance for free, as well as allowing you to seamlessly intermingle graph data and other data that you may want to join against the graph data. One important thing done to make this successful was to optimize the common join-map-group pattern.

One trade-off is the programming model; the vertex-program model of GraphLab and Pregel may be more intuitive for graph processing than fitting an extension of Spark's RDD API to the problem - I certainly had a somewhat harder time reasoning about a program's behavior. Performance may also be an issue in some cases, but it seems that in general GraphX performs very well.


Review of "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs"

I thought this paper was pretty interesting - intuitively the power law distribution of natural graphs makes sense, but I would not have originally thought about it when working with something like GraphLab. I'm not sure exactly how general the power law distribution is, but it's definitely true for social networks, which are extremely important for graph processing in today's world, so this is definitely a worthwhile problem to solve.

There are two main ideas: first, separate the program explicitly into Gather / Apply / Scatter phases, allowing PowerGraph to optimize your program by having a better understanding of what needs to occur where. Second, partition the graph based on edges, not vertices, since this is more representative of the work that needs to be done.
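Here is a single-machine toy of the GAS decomposition, using PageRank as the example (my sketch, not PowerGraph's API; the real system runs the gather phase in parallel across the machines that hold a high-degree vertex's edges):

    def gas_pagerank(edges, num_iters=20, damping=0.85):
        """PageRank written in a Gather / Apply / Scatter style (toy, single machine).

        edges: list of (src, dst) pairs.
        """
        vertices = {v for e in edges for v in e}
        out_degree = {v: 0 for v in vertices}
        in_edges = {v: [] for v in vertices}
        for s, d in edges:
            out_degree[s] += 1
            in_edges[d].append(s)
        rank = {v: 1.0 for v in vertices}

        for _ in range(num_iters):
            new_rank = {}
            for v in vertices:
                # Gather: sum contributions from in-neighbors (the step PowerGraph
                # parallelizes across machines for high-degree vertices).
                total = sum(rank[u] / out_degree[u] for u in in_edges[v])
                # Apply: update the vertex value.
                new_rank[v] = (1 - damping) + damping * total
            # Scatter: in PowerGraph this would activate neighbors whose inputs
            # changed; this toy simply runs a fixed number of rounds.
            rank = new_rank
        return rank

    print(gas_pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]))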

The first trade-off I see is simply usability / programmability; it seems a bit easier to reason about programs written in GraphLab and Pregel than in PowerGraph, since the developer has to be more aware of the distinct phases in PowerGraph. However, this increase in complexity is relatively minor, and certainly seems worth the performance benefits in applicable situations.

I can definitely see this being impactful in 10 years - graph processing is a very hot subject right now, and this paper seems to introduce some ideas that seem very promising.

Tuesday, October 27, 2015

Review of "Materialization Optimizations for Feature Selection Workloads"

In the current state of machine learning, using a machine learning algorithm on your data often essentially comes down to deciding on what features to use, which can sometimes be extremely nonintuitive, and requires a great deal of testing. With machine learning becoming more ubiquitous all the time, being able to do this quickly is important.

The main ideas of this paper to speed up this process are: intelligent subsampling (since this is just to determine good features and not to actually make business decisions, exact results aren't necessary), materializing partial results to be able to reuse them on similar computations, and maintaining a cache of computed values (since much of the computation will be the same on each iteration of trying new features, with only a few things changed).

Subsampling comes with a fundamental trade-off in terms of speed vs accuracy, but this is easily tunable, and due to the intelligent nature of their subsampling they show that in some cases they can achieve an 89x speedup with only a 1% error, which is pretty impressive.

I am pretty surprised that no one has already done work on this - aside from the transformation materialization, these optimizations seem fairly intuitive, and I am somewhat skeptical that they are the first to apply them... I think these techniques will be influential in 10 years, but whether or not this paper specifically will be, I am not sure.

Review of "Towards a Unified Architecture for in-RDBMS Analytics"

While I am not sure that running your data analytics in your RDBMS (which generally isn't designed for such a thing) is the best way to go about the problem, people are certainly doing it, so making this faster is definitely a valuable problem to tackle.

The main idea of this paper is to note that many of these analytics algorithms can be solved using IGD (incremental gradient descent). By leveraging this common solution mechanism, they can implement a framework which requires only small extensions to be able to run a wide variety of algorithms, making the development of new algorithms and applying them to new RDBMSes much easier. They also make clever use of data layout and parallelism.
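A bare-bones illustration of the IGD kernel they standardize on - logistic regression updated one tuple at a time, which is exactly the access pattern of a table scan (a pure-Python toy of the idea, not the paper's actual in-RDBMS interface):

    import math, random

    def igd_logistic_regression(rows, dim, epochs=5, step=0.1, seed=0):
        """Incremental gradient descent: the model is updated one tuple at a
        time, which maps naturally onto a single scan over a database table."""
        rng = random.Random(seed)
        rows = list(rows)
        w = [0.0] * dim
        for _ in range(epochs):
            rng.shuffle(rows)
            for x, y in rows:                      # y is +1 or -1
                margin = y * sum(wi * xi for wi, xi in zip(w, x))
                grad_scale = -y / (1.0 + math.exp(margin))
                for i in range(dim):
                    w[i] -= step * grad_scale * x[i]
        return w

    data = [([1.0, 2.0], 1), ([1.0, -1.5], -1), ([1.0, 3.0], 1), ([1.0, -2.0], -1)]
    print(igd_logistic_regression(data, dim=2))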

Intuitively, it would seem that there should be a trade-off between performance and generality, with more specific implementations being more performant. That turns out not to be the case in their analysis, with their more general solution outperforming specific implementations. This may be more a result of their other techniques; perhaps if they leveraged the techniques used to implement the general framework to fine-tune the individual algorithms, they could achieve even better performance at a loss of generality and ease of development.


Review of "Scaling Distributed Machine Learning with the Parameter Server"

Machine learning is becoming increasingly ubiquitous and complex, running over increasingly large datasets. Being able to run these algorithms (some of which don't lend themselves well to large-scale parallelism) in a distributed fashion quickly is increasingly important, so this is definitely a real problem.

The main idea is to use a centralized set of parameter servers which maintain the shared state of the system, coupled with a large number of worker nodes which actually perform computation. Worker nodes push state updates and pull down state. The system highly leverages the fact that machine learning algorithms are generally convergent and are not significantly disrupted by having stale state, so they don't require synchronous state updates.
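Here is a minimal sketch of the push/pull pattern described above (one process, one "server", and a squared-error objective of my choosing - none of this is the paper's actual API):

    class ToyParameterServer:
        """Minimal sketch of the push/pull interface: workers pull the current
        weights, compute gradients on their shard of data, and push updates back.
        Real systems shard keys across many servers and allow bounded staleness;
        here everything runs in one process for illustration."""
        def __init__(self, dim):
            self.weights = [0.0] * dim

        def pull(self):
            return list(self.weights)            # worker gets a (possibly stale) copy

        def push(self, gradient, step=0.1):
            for i, g in enumerate(gradient):
                self.weights[i] -= step * g      # apply the worker's update

    def worker_step(server, local_data):
        w = server.pull()
        # gradient of squared error 0.5*(w.x - y)^2 summed over the local shard
        grad = [0.0] * len(w)
        for x, y in local_data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(len(w)):
                grad[i] += err * x[i]
        server.push(grad)

    ps = ToyParameterServer(dim=2)
    shard = [([1.0, 2.0], 5.0), ([1.0, 0.0], 1.0)]
    for _ in range(100):
        worker_step(ps, shard)
    print(ps.weights)   # approaches [1, 2]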

One trade-off they discuss is convergence rate vs per-iteration performance. As state becomes more stale, the algorithm converges more slowly, but tolerating staler state allows more of the computation to proceed asynchronously.


Sunday, October 25, 2015

Review of "Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary"

This paper aims to present a solution to the issue of maintaining consistency across multiple datacenters. As services become increasingly global, this becomes important for decreasing latency to users across the globe while also ensuring consistency.

The main idea is to separate operations into two categories: Blue (commutative, eventual consistency) and Red (strong consistency). They provide guidelines for how to determine what operations fall into each category, plus a system for forcing more operations to fit into the blue category: decomposing operations into generator and shadow operations, which can turn some non-commuting operators into commuting ones.
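Here's a toy version of the generator/shadow split, using a bank-account style example: the generator decides on a concrete delta, and the shadow operation that replicas apply is a plain addition, which commutes (my sketch, with made-up method names):

    class Account:
        """Toy generator/shadow split. The generator runs once, at the site that
        receives the request, and only *decides* what the effect should be; the
        shadow operation carries that concrete effect and is what every replica
        applies. Turning "accrue 5% interest" into "add $5.00" makes the
        replicated operation a plain addition, which commutes with deposits."""
        def __init__(self, balance=0.0):
            self.balance = balance

        # --- generators (run at one replica) -------------------------------
        def gen_deposit(self, amount):
            return ("add", amount)

        def gen_accrue_interest(self, rate):
            return ("add", self.balance * rate)   # concrete delta, not a rate

        # --- shadow operation (applied at every replica, any order) --------
        def apply_shadow(self, op):
            kind, delta = op
            assert kind == "add"
            self.balance += delta

    # Two replicas applying the same shadow ops in different orders converge.
    acct = Account(balance=100.0)
    ops = [acct.gen_accrue_interest(0.05), acct.gen_deposit(10.0)]
    r1, r2 = Account(100.0), Account(100.0)
    for op in ops:
        r1.apply_shadow(op)
    for op in reversed(ops):
        r2.apply_shadow(op)
    print(r1.balance, r2.balance)   # both 115.0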

This work comes about most likely because the scale of geo-replication is significantly larger now than in the past, and being able to maintain low latency (small consistency overheads) is very important - higher latency has a significant correlation with decreased per-user profits.

There are a number of trade-offs here. Using blue operations only guarantees eventual consistency, which still may come with issues ("eventual consistency is no consistency"), though it provides much higher speed. Breaking down operations into generator/shadow operations may grant much lower latency, but also makes things more difficult to reason about.

I can see this being influential in the future - services are becoming increasingly more global and people expect increasingly lower latencies, but application programmers also want consistency. This seems to provide a good framework for maintaining both of these things simultaneously.

Review of "CRDTs: Consistency without Concurrency Control"

Attempting to balance strong consistency models with low concurrency control overhead is a very important line of research because as systems grow larger our current concurrency control schemes have overheads which are often prohibitively high, but some reduced-consistency schemes make things difficult to reason about for application programmers and result in complex handling logic. Any work to bridge this gap is certainly solving a meaningful problem.

The main idea is to use CRDTs, commutative replicated data types, on which all operations commute, to allow for the construction of data structures without communication: if all operations can be applied in any order, just apply each one locally and then disseminate it, since it will eventually reach all replicas and we don't mind if it is applied out of order. Note that this is still eventual consistency and makes no guarantees about how soon things will be consistent, just that they will eventually reach such a state.
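For a concrete feel, here's one of the simplest CRDTs, a grow-only counter - a different (and much simpler) type than the replicated sequence the paper builds, but it shows why commutativity lets replicas skip coordination (my toy sketch):

    class GCounter:
        """A grow-only counter CRDT: each replica only increments its own slot,
        and merge takes element-wise maxima, so replicas can exchange state in
        any order and still converge to the same value."""
        def __init__(self, replica_id, num_replicas):
            self.id = replica_id
            self.counts = [0] * num_replicas

        def increment(self, n=1):
            self.counts[self.id] += n

        def value(self):
            return sum(self.counts)

        def merge(self, other):
            self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    a, b = GCounter(0, 2), GCounter(1, 2)
    a.increment(3)
    b.increment(5)
    a.merge(b)     # order of merges doesn't matter
    b.merge(a)
    print(a.value(), b.value())   # 8 8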

I think this work was different from previous work partly because of the size of systems in today's world; concurrency control schemes which previously worked fine are reaching the limits of their scalability and it is important that we investigate new ways to achieve the same consistency guarantees in more scalable ways.

One trade-off is generality vs. speed; not everything can easily be represented as a CRDT. There is also the trade-off of ease of use; CRDTs are only eventually consistent, and this must be planned for as an application programmer.

I can definitely see this paper being influential moving forward--though I don't see CRDTs in this form being widely used, the ideas are important.

Saturday, October 24, 2015

Review of "Coordination Avoidance in Database Systems"

Scalability is hugely important in today's world, and communication/coordination is the bane of scaling. This paper works towards reducing the amount of coordination necessary for maintaining correctness, a very important problem.

The main idea is to analyze transaction sets against application invariants to determine when exactly coordination is necessary, in contrast to traditional database systems which serialize all transactions. By only coordinating when absolutely necessary, many transactions can run independently in parallel.

A trade-off here is complexity vs. speed. It is much easier to reason about transactions as completely serializable, and under serializability you don't need to write down all of your invariants, but the extra effort will often be worth it in a high-performance setting.

This work seems very interesting - I feel that most literature I have seen uses the same concurrency scheme for all transactions, and the idea of predetermining which transactions actually need coordination seems like it will have a lot of practical benefits as systems scale larger and larger. I can definitely see this being influential in 10 years.

Wednesday, October 21, 2015

Review of "Shielding Applications from an Untrusted Cloud with Haven"

They are solving a very interesting problem: shielding application-level code from the OS that is executing it. I am not convinced that this is a real problem in today's world...

The main idea of their solution is to use hardware-provided instructions (Intel SGX) to allocate protected memory regions (enclaves) within which the application's execution is protected, even from the OS that is running the application. Haven takes the viewpoint that the OS and the application do not trust each other, and provides interesting ways for them to provide services to each other despite this limitation.

This work is emerging because it is becoming increasingly common to run your application on hardware managed by others, e.g. Microsoft Azure, Amazon EC2, etc. Previously, you managed the hardware, but were concerned about application-level code possibly doing something harmful. Now, the hardware providers still need to be concerned with that same problem, but application-level code is also dealing with an outside entity and may want to be protected.

A big trade-off here is speed; the extra protection comes at a cost of higher latency because of things like system calls being more expensive, and generally using SGX extensions has a bit higher cost.

I don't really see this being overly influential in 10 years - like I said earlier, I am not convinced this is a real issue.

Tuesday, October 20, 2015

Review of "CryptDB: Protecting Confidentiality with Encrypted Query Processing"

This paper addresses an issue that has become increasingly important as of late, and one visible even to those outside the software community. Data breaches are becoming more common and more devastating, so encryption of sensitive data is very important. Having a DB that builds this directly into its storage seems very promising.

The main nugget is to store the data in an encrypted format inside of the DB, but store it such that SQL queries are still able to execute over the data. This is made possible because SQL has a specific subset of operations that it will run, and thus the encryption scheme can be aware of these operations. Though sometimes data will need to be decrypted for certain operations, CryptDB attempts to minimize the amount of data that is decrypted.

I cannot say exactly why this is new work - perhaps partially because security has become an increasingly larger concern as more data is stored online and data hacking has become increasingly widespread and damaging.

The primary trade-off here is resource cost vs security. A fully homomorphic encryption scheme would provide even better security, since it could execute all of the necessary operations without decrypting data, but would be prohibitively CPU-intensive. CryptDB attempts to provide security while still providing reasonable real-world performance, and seems to do a good job at this.

I can see this being influential in 10 years -- companies are being slammed harder and harder by data breaches and anything that can remedy this must pique the interest of many large companies.

Monday, October 19, 2015

Review of "A Sampling Algebra for Aggregate Estimation"

While a great deal of work is going into making databases fast enough to process the wealth of data collected in today's world, it still remains useful to sample in order to obtain a quick estimate of a result. But you often have no idea (except perhaps an intuition) how accurate that estimate is. People have figured it out in some specific cases, but not in a way that is at all general. This paper aims to solve that, with a main contribution of the GUS (generalized uniform sampling) operator, which can describe any sampling method, as well as the idea of Second Order Analytic (SOA) equivalence to denote the equivalence of two query plans in terms of expected value and variance. Combining these with the supported relational operators provides a way to reason about and calculate error bounds for nearly any type of sampling.

It seems to me that this is new because as data volumes grow, it becomes ever more important to be able to sample, and to sample confidently.

There is a trade-off here in that the generality of the GUS also makes it harder to reason about - I certainly had trouble really understanding what things meant. 

I can see this being influential in 10 years; they provide a general use tool for others to use in a space that can only become increasingly more important as we move forward. 

Friday, October 9, 2015

Review of "Succinct: Enabling Queries on Compressed Data"

Everyone needs to store enormous amounts of data these days, and everyone wants to be able to access it quickly. Succinct presents the problem that storing large quantities of data means you want to be as space-efficient as possible, but to access it quickly you (generally) build indices, which are not at all space-efficient. I am unsure how frequent of a problem this is in practice, but I imagine that there are many use cases for Succinct where it would greatly advance the current state of the art.

The main idea of Succinct is to essentially build an indexing system into the compression. This eliminates the need for space-costly secondary indexes, while also providing fast search capabilities.
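The flavor of "search built into the representation" can be illustrated with a plain suffix array. This toy keeps the text uncompressed and is nothing like Succinct's actual compressed structures, but it shows how substring search can work off a sorted index of the data itself rather than a secondary index:

    def build_suffix_array(text):
        """Naive suffix array: all suffix start positions, sorted lexicographically."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def search(text, sa, pattern):
        """Binary-search the suffix array for suffixes that start with `pattern`."""
        lo, hi = 0, len(sa)
        while lo < hi:                      # leftmost suffix >= pattern
            mid = (lo + hi) // 2
            if text[sa[mid]:] < pattern:
                lo = mid + 1
            else:
                hi = mid
        start = lo
        hi = len(sa)
        while lo < hi:                      # leftmost suffix whose prefix > pattern
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(pattern)] <= pattern:
                lo = mid + 1
            else:
                hi = mid
        return sorted(sa[start:lo])

    text = "abracadabra"
    sa = build_suffix_array(text)
    print(search(text, sa, "abra"))   # [0, 7]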

I think this is probably different from previous work because the quantities of data now being stored in e.g. NoSQL systems are ballooning, and in the past typical systems did not need to rely as heavily on compression and space awareness, but the increase in data volume has made it a very necessary feature. Yet, at the same time, the data still needs to be accessed quickly, leading to Succinct.

The trade-off here is, of course, speed vs. space. Succinct falls into a pretty happy middle ground between the two, though it still falls short in some areas, e.g. large sequential reads.

I can see this being influential in the future - this seems to be a very new way of thinking about compressed storage that should be very useful.

Review of "Mining Modern Repositories with Elasticsearch"

As companies gain increasingly larger amounts of data, they need scalable solutions to deal with that data. Most of what we have looked at in this class has been somewhat complex - how to process large amounts of data, etc. - but we can't forget about one of the simplest applications: simply finding things within this wealth of data, aka search.

This is different from previous offerings simply because of the volume of data - local searching and indexing is by no means a new concept, so the main contribution of Elasticsearch was intelligently distributing and sharding this data to be able to access it quickly in a scalable manner.

The trade-off is primarily complexity vs speed. Elasticsearch is pretty simple: you get textual searches on certain fields, even letting you define a range for e.g. numeric types, but that's about it. There's no joining, aggregation, etc. On the other hand, this enables it to run extremely quickly on very large datasets.

This paper in particular I don't see being influential in 10 years, but I do see Elasticsearch in general (or at least, some derivative) continuing to be very important moving forward.

Monday, October 5, 2015

Review of "Spark SQL: Relational Data Processing in Spark"

Spark SQL solves two interesting problems simultaneously: better support for declarative / SQL programming on nested / semistructured / big data, and better integration of procedural with declarative programming.

The main idea is to provide a DataFrames API which abstracts a data source (RDD, Hive table, CSV file, etc) and provides relational-style operations which can be optimized. This allows developers to seamlessly operate on many different data storages, including Java/Python objects. They also introduce the Catalyst optimizer, a highly extensible query optimizer. One important thing to notice, in my opinion, is how flexibly the entire system was built: it is easy to define new data sources, optimizations, UDTs, and UDFs, all of which play nicely together.
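A short PySpark-style example of what this buys you in practice - the file names and columns are invented, but the mixing of sources and relational operators is the point:

    # Requires a Spark installation; file and column names here are made up.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    # DataFrames abstract over very different sources behind one relational API.
    events = spark.read.json("events.json")          # semi-structured input
    users = spark.read.parquet("users.parquet")      # columnar input

    # Relational operations are built lazily and handed to the Catalyst optimizer,
    # and the result can be mixed freely with procedural Spark code.
    top_cities = (events.join(users, "user_id")
                        .filter(events["kind"] == "click")
                        .groupBy(users["city"])
                        .count()
                        .orderBy("count", ascending=False))

    top_cities.show(10)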

One thing about this paper that I think will continue to be influential is the idea of mixing declarative / SQL programming with procedural programming. While it has always been possible to some extent using UDFs, Spark SQL provides a much more integrated intermingling that seems both easier to work with and more flexible.

Review of "Dremel: Interactive Analysis of Web-Scale Datasets"

Dremel was one of the first (the first?) real solutions for interactive-latency SQL queries running side-by-side with MapReduce and other Hadoop-type applications, solving the very real problem of being able to run interactive queries on your data without having to export it away from where batch processing is occurring.

Two main ideas are to use columnar storage to represent nested data (which is handled natively), and to use a "serving tree" architecture that distributes queries among a great many "leaf" nodes, which scan data from disk, with a hierarchy of intermediate nodes to perform aggregation. This also differed from Hive and Pig because it did not build on top of MapReduce jobs.

This is different from previous work because previously, data stored e.g. on GFS was being used primarily for batch computation. However, as increasingly more data was moved onto GFS and it became a "one stop shop" of sorts for data, it became more desirable to be able to run interactive-latency queries over GFS without the need to export data into a more traditional DBMS. This is sort of a combination of new system requirements and new workload: data volume was overwhelming traditional DBMSes, and new workloads required lower latency than existing solutions.

One trade off is the use of columnar vs. record-oriented storage. As their experiments show, when numerous fields of a record are accessed in a query, it can be faster to use record-oriented storage. However, as the more common case is to access only a small subset of the fields in a record, columnar storage turns out to be much more suited to their use case.

This paper has certainly been very influential; ideas from Dremel have almost certainly been incorporated into a number of other systems for similarly performing SQL-like queries over data. The use of columnar data format, a somewhat new idea especially for nested data, has also gained a great deal of traction, possibly in part due to Dremel.

Review of "Impala: A Modern, Open-Source SQL Engine for Hadoop"

Impala aims to solve a very real problem - how do we get the most out of all of the data we have stored in Hadoop (i.e. HBase or HDFS)? It does so in the direction of attempting to optimize SQL-type queries in an interactive, low-latency manner (as opposed to the higher-latency, bulk processing semantics of Apache Hive). The hope is that you will no longer need to export data from Hadoop to traditional RDBMS data warehousing for analytics and business intelligence.

Impala, rather than owing its speed to one new core idea, seems to me to be built on top of a number of ideas that work together to provide overall efficiency gains. They use code generation instead of interpretation to provide more efficient queries at the instruction level (previously employed by e.g. Hekaton and others), remove centralized control from the query path (statestored and catalogd both push changes rather than the individual Impala daemons asking for information on each query), seem to provide better query optimization than previous SQL-on-Hadoop systems, and use short-circuit local reads to bypass the full HDFS DataNode protocol (a relatively new addition to Hadoop that other systems can easily take advantage of as well), allowing for much faster local disk reads.

I see Impala as a next step past Hive in the SQL-on-Hadoop world; Hive was an initial attempt to bring SQL to Hadoop, but was never really that great. Yet, despite that, it gained a lot of users, a testament to how badly end users want SQL on Hadoop. I think this means that it is natural that now many systems are addressing this problem (SparkSQL, Dremel, Drill, Presto, etc).

One trade-off I recognize: their metadata push model means that metadata may be stale, but they disregard this as a non-issue since the query plans are created on a single node using consistent metadata.

I don't see Impala itself being particularly influential in 10 years - while it is a good implementation, it doesn't seem to have a large number of new ideas.

Sunday, October 4, 2015

Review of "Cassandra - A Decentralized Structured Storage System"

As with Bigtable, Cassandra is solving a real issue of storing large amounts of structured data, in the middle ground between KV stores and relational databases, at very large scale. This is becoming increasingly important.

The main idea of Cassandra is to decentralize all of the processing, moving away from a typical master-slave architecture. Decision making, e.g. failure detection, is all done in a decentralized manner through the use of gossip. This removes reliance on a single point of failure or contention.

While this removal of reliance on a master is nice, it makes me wonder, is it really necessary? We saw in Bigtable that a master-server system can be made very efficient by making the operations that it carries out infrequent (e.g. in Bigtable most requests don't involve the master). Is the extra complexity of the decentralized system worth it? Cassandra's widespread adoption may point to yes, but I am not entirely convinced.

I don't particularly see this being highly influential in 10 years - while Cassandra is a well designed system incorporating many good ideas, it didn't seem that it had many very original ideas, just a well put-together set of existing ideas.

Review of "Bigtable: A Distributed Storage System for Structured Data"

Bigtable addresses a problem that, although it has been addressed many times in many different ways, is still real: how best to store data at an enormous scale. Part of why it is not a completely solved problem is that different workloads and use cases demand different requirements.

The main idea is to use a distributed and immutable storage system in which each server is responsible for a number of data tablets, each of which is comprised of a number of SSTable files that actually store the data. These are immutable, and merges across them (as well as with the running log) give the current view of the tablet. Interestingly, metadata is stored as a specific table but otherwise does not differ from data in the system.
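A toy of the memtable + immutable SSTable read path described above (my simplification; real SSTables are sorted, indexed files with timestamps and tombstones):

    class ToyTablet:
        """Writes go to an in-memory memtable, which is periodically frozen into
        immutable 'SSTables'; a read merges the memtable with the SSTables,
        newest first."""
        def __init__(self):
            self.memtable = {}
            self.sstables = []          # list of immutable dicts, oldest first

        def write(self, key, value):
            self.memtable[key] = value

        def flush(self):
            # Freeze the memtable into an immutable file-like object.
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

        def read(self, key):
            if key in self.memtable:                    # newest data wins
                return self.memtable[key]
            for table in reversed(self.sstables):       # then newest SSTable first
                if key in table:
                    return table[key]
            return None

    t = ToyTablet()
    t.write("row1", "v1")
    t.flush()
    t.write("row1", "v2")          # supersedes the flushed value
    print(t.read("row1"))          # v2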

This was different partially because of the "Google scale" problem, requiring more scalability than most systems appearing previously. Many systems attempted to be too expressive (full relational) or not expressive enough (simple key value store), and Google decided it needed a good middle ground.

As discussed above, there is a fundamental trade-off in terms of expressiveness: more expressive power means you can reason about and structure your data more fully, but is also more difficult to scale. Google chose a middle-ground approach with Bigtable. There is also a trade-off in their decision to use immutable data storage. This makes writes faster (since they go to a log and aren't in place) and simplifies the system architecture, but adds overhead on reads (the necessity to merge) and a background GC cost.

It has been nearly 10 years since this paper was written, and it is still influential in the sense that Bigtable is still in use. But, it does seem tailored towards Google's needs, and I don't think it has been extremely influential in other areas. 

Tuesday, September 29, 2015

Review of "Omega: flexible, scalable schedulers for large compute clusters"

As with the DRF paper, this is a real problem: heterogeneous scheduling in large compute clusters will only become more real as clusters become larger and workloads more diverse. 

The main idea is to have multiple, completely independent schedulers all running in parallel and all able to access the entire cluster. Transactions to shared state are used to fully claim a resource, but schedulers are allowed to preempt each other. 

This is different from previous work partly because the scale of clusters is reaching the point where the easiest design (a monolithic scheduler) is no longer viable. This is also different from newer things like Mesos because it is very specifically tailored to Google's business-needs driven approach and relies heavily on all schedulers in the system being written to play nicely with each other, which is not the view that is normally taken (usually, more focus on fairness between schedulers / users). 

The paper identifies some hard trade-offs. For one, allowing schedulers to all attempt to take resources in parallel sometimes results in them not being able to do so, causing their attempted transactions to abort, which adds some overhead. There is also an enormous trade-off here between faster, more distributed scheduling and less centralized control -- this works fine for Google since they have written their schedulers to all play nicely together, but that will not always be the case.

I don't think I see this being influential in 10 years outside of Google -- while the concepts introduced are certainly interesting, I don't know that they will extend well beyond their walls. 

Review of "Dominant Resource Fairness: Fair Allocation of Heterogeneous Resources in Datacenters"

This is definitely a real problem - there are many situations where many users need to share a large amount of cluster resources, and often the resources required by different tasks are extremely different. Scheduling them efficiently can provide large performance gains.

The main idea is to first consider a user's dominant resource: the one for which the user's demand is the largest fraction of the cluster's capacity (e.g. CPU is a user's dominant resource if the CPU it requested is a larger fraction of the cluster's CPU than the fraction of the cluster's RAM it requested). To allocate by DRF, simply keep allocating one task at a time to the user with the minimum dominant share (its share of its dominant resource).
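The allocation loop is simple enough to sketch in a few lines; the numbers in the demo are my recollection of the paper's running example, so treat them as illustrative:

    def drf_allocate(capacity, demands, num_tasks=100):
        """Toy DRF loop: repeatedly give one task to the user with the smallest
        dominant share, as long as its task still fits in the cluster.

        capacity: total resources, e.g. {"cpu": 9, "mem": 18}
        demands:  per-user per-task demand, e.g. {"A": {"cpu": 1, "mem": 4}, ...}
        """
        used = {r: 0.0 for r in capacity}
        shares = {u: {r: 0.0 for r in capacity} for u in demands}
        tasks = {u: 0 for u in demands}

        for _ in range(num_tasks):
            def dominant(u):
                # Dominant share of a user = max over resources of its share.
                return max(shares[u].values())
            feasible = [u for u in demands
                        if all(used[r] + demands[u][r] <= capacity[r] for r in capacity)]
            if not feasible:
                break
            u = min(feasible, key=dominant)
            for r in capacity:
                used[r] += demands[u][r]
                shares[u][r] += demands[u][r] / capacity[r]
            tasks[u] += 1
        return tasks

    # 9 CPUs and 18 GB, user A tasks of <1 CPU, 4 GB>, user B tasks of <3 CPU, 1 GB>
    # ends up with 3 tasks for A and 2 for B.
    print(drf_allocate({"cpu": 9, "mem": 18},
                       {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}))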

This solution is different from previous work because the problem of large-scale cluster scheduling is relatively new, and becoming more of an issue only in more recent times. 

The paper does identify a hard trade-off in DRF (and, they discuss, generally any scheduling algorithm) between resource monotonicity and the share guarantee / Pareto efficiency, giving an example of why both can't be satisfied simultaneously. But, they discuss that this is a trade-off they make very willingly, since new resource addition is a very infrequent event. 

I don't know if I really see this being influential in 10 years. It was published 5 years ago, and has this actually been implemented anywhere? It seems like it is a good scheduling policy, but does it have a significant performance benefit over existing solutions?

Wednesday, September 23, 2015

Review of "Apache Hadoop YARN: Yet Another Resource Negotiator"

This is definitely a real problem - addressing two issues simultaneously, the scalability of the current Hadoop/MapReduce architecture along with the sharing of resources between MapReduce and other frameworks. 

The solution's main idea is to have a central ResourceManager that controls global state, plus a per-application/job ApplicationMaster that tracks all of the state for that job. This allows different ApplicationMasters to run different frameworks, and reduces the load on the ResourceManager, since it only has to manage a small amount of global state while the ApplicationMasters deal with the more complex scheduling, load balancing, fault tolerance, etc.

This solution is different from previous work for two reasons. First, workloads and clusters have become larger than originally envisioned when Hadoop/MapReduce was architected, and clusters had reached the limitations of the original architecture. Second, more frameworks want to run on the same cluster as more applications move to cluster-based processing, so MapReduce needs to play more nicely with these new frameworks, and Hadoop needs to provide a better execution engine for them.

There are definitely some fundamental trade-offs here. Introducing per-job ApplicationMasters increases overhead a little because there is more communication and latency between the RM and the AMs. However, this reduces load on the central RM, providing better overall scaling. Another trade-off is the flexibility of YARN vs the features it provides; they note that creating applications to run on YARN is not really an easy task because you have to be concerned with all of the intricacies of load balancing, fault tolerance, etc., but this low-level interface allows very general models to run on YARN. The compromise is other solutions such as REEF that enable you to more easily build applications on YARN, taking care of some of these issues for you.

I think this will likely be influential in 10 years, in part just because I think the Hadoop ecosystem will still be around in 10 years and even though YARN may no longer be the resource manager of choice by then, it will certainly influence whatever succeeds it. 

Review of "Kubernetes: Open Source Container Cluster Orchestration"

Yes, this is a real problem. As discussed with regard to Mesos, having multiple applications/frameworks sharing the same cluster resources can be very important in today's world where nearly all applications at a company like Google need to run on a scale that requires large numbers of nodes.

The main idea is to run each application within containers, which are grouped together into pods (scheduled together on the same machine) to enable e.g. local resource sharing. Developers can also use labels to group pods/containers into subsets of an application, e.g. different builds, different jobs, etc. This model allows applications to be scheduled and run independently of each other (even if they may be on the same machine) while also giving developers the flexibility to control colocation when desirable.

This isn't that different from previous work, and seems mainly an evolution of previous systems. Containers like Docker provide some of the same ideas, and Google's previous system, Borg, seems to have very similar ideas in a less configurable and flexible manner. 

There will always be a trade-off between the complexity of the container system (e.g. how much protection it provides) and how much overhead it causes. Google's blog post does not make it clear what kind of overhead Kubernetes imposes, so it is difficult to gauge where it falls on the spectrum.

I am tempted to say that it might be influential in 10 years if only because it is a large project coming out of Google that is being open sourced, and Google has enough clout that large software projects it throws resources behind generally do well. It remains to be seen, however, whether Kubernetes will gain traction when Mesos and YARN already integrate nicely with the hugely popular Hadoop ecosystem and provide some similar features, though Kubernetes does seem to provide some higher-level constructs which may be useful.

Review of "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center"

This is definitely a real issue - increasingly more applications are migrating to being run over large distributed commodity clusters, and enabling them to share the same cluster despite operating on different frameworks is an important task that should benefit data sharing and utilization. 

The main idea is to have Mesos operate as a thin resource management layer underneath the existing frameworks, providing fine-grained resource offers to frameworks, which encapsulate a certain amount of resources on the cluster. A framework can then decide which resources specifically to accept, enabling the framework to enforce e.g. data locality. 

One reason this is different is that previously, the type of applications run on commodity clusters were relatively restricted, and so the Hadoop/MapReduce model was sufficient as a framework. However, as more applications are reaching the scale of needing to run in a distributed manner, the MapReduce paradigm's limits have been reached and other frameworks are necessary to compete, and running these together on the same cluster is a relatively new issue. 

One hard trade-off is the focus on the simplicity of Mesos. This was done to make it very adaptable and scalable, but comes with a few downsides. One example is having Mesos offer resources to the framework and allowing the framework to reject them, rather than having the framework request specific resource requirements; this allows more extensibility in terms of flexible resource requirements, but means that a framework may need to wait a long time. Also, Mesos still has the frameworks make many of the scheduling decisions rather than acting as a complete global scheduler; again, this means frameworks can do more complex scheduling without Mesos needing to become complex, but utilization may be slightly lower than optimal.

I am not sure if this paper will be particularly influential in 10 years. I think cluster resource sharing will become an increasingly more important problem, and Mesos definitely seems to be an excellent implementation of this, but the ideas presented seem relatively simple, with the main concepts being resource offers and a very thin client. 

Monday, September 21, 2015

Review of "The Case for RAMCloud"

This paper definitely discussed a real problem - applications increasingly want to load more complex data quickly, have a larger total amount of data, and want all of this in a redundant manner. 

The solution's main idea is to implement a storage system fully in DRAM (possibly with SSDs or disks for increased durability) to provide very low latency, high bandwidth storage. 

This is different from previous work because, at the time of publication, DRAM was just starting to reach the point where you could install large enough amounts on a server, at a cost effective price, to be able to store significant amounts of data purely in DRAM. 

There are definitely some fundamental trade-offs here. DRAM does not provide the same durability guarantees as disk/SSD; even with replication, non-volatile storage is necessary as a backup in case of e.g. a power outage at a data center. It also costs significantly more, and requires more energy per bit of storage, than disks/SSDs. But, they run at significantly lower latency and higher bandwidth. 

I do think this will be influential in 10 years -- it has already been 6 years since its publication and I have seen this paper discussed in some of the other modern papers we have read. Also, DRAM is significantly cheaper and more spacious than it was at the time of publication, making this type of system even more viable in today's world. While the volume of data collected has also increased hugely, meaning that I don't see RAMClouds being a full replacement system for current disk-based clusters anytime soon, I think that the idea of using such a system for large online data stores will continue to be increasingly prevalent. 

Review of "FaRM: Fast Remote Memory"

The problem is definitely real - many workloads need to access a huge amount of data quickly, and typically inter-machine communication has been an enormous bottleneck. FaRM seems to take significant steps in resolving this.

The main ideas are: use RDMA instead of TCP/IP to hugely reduce communication overhead, and reduce the amount of distributed transactions that must be carried out by using lock-free reads and local, single-machine transactions by shipping computation to the data. 

This is different from previous work primarily because of new hardware; RDMA only became possible over Ethernet (which is the standard in commodity hardware cluster deployments) recently, and TCP/IP was previously fast enough to keep up with disk, but now that RAM is becoming large enough to store significant amounts of data, TCP/IP is falling behind. 

One trade-off is that distributed transactions are somewhat slow; however, they manage to get around this by providing ways to avoid distributed transactions as much as possible.

While the system they have built is impressively fast, it seems that the main contribution here is really RDMA over Ethernet, with FaRM doing a very good job of tuning itself to this. While I think RDMA over Ethernet will have a significant impact in 10 years, and maybe some of the ideas from this paper will be relevant, I don't think this will be a particularly significant paper in 10 years. 

Review of "IOFlow: A Software-Defined Storage Architecture"

I am unsure how real of a problem they are solving here. While I can see how having flow control and speed guarantees may become useful, I haven't personally heard of this ever being an issue and am not sure how common this problem is, although I suppose it must be a real problem somewhere at Microsoft for them to have put so much effort into solving it.

The main idea is to use a centralized control application to manage policies at the various layers of the stack that a distributed IO call must traverse (down through the client's IO stack, across the network, and through the server's NIC into its file system), informing them of how they should treat different incoming requests (e.g. based on where the call originated, what data the call is asking for, etc.).

This solution is different for a few reasons; first, it makes the assumption that the cluster is relatively small (hundreds of servers), which does not apply to the entire general case (e.g. it would not currently work on Amazon EC2). This is also different because it is assuming a somewhat new workload, consisting of numerous VMs living on a single node, and many of these nodes comprising a network cluster, with each VM owned by (potentially) a different user/application. This presents a somewhat new take on prioritization because of its multi-tenancy. 

One fundamental trade-off is the use of a centralized control application - this limits the scalability of the system, since all of the servers will need to contact it for instructions/policies. Another is overall speed vs flexibility - overall throughput does decrease as a result of the IOFlow system, but it provides you with the power to guarantee a certain bandwidth for critical applications. 

I don't think I see this paper being influential in 10 years - I'm not familiar with this issue as an overly pressing problem, though I can see the utility of IOFlows. 

Sunday, September 20, 2015

Review of "Flat Datacenter Storage"

The problem is somewhat real - while GFS / HDFS currently do a pretty good job of supporting large-scale compute applications, they do have the issue of having a single-point bottleneck (master/namenode server) and, as the paper mentions, of requiring that applications exploit data locality. 

The solution's main idea is to distribute all files across many disks, utilizing reads/writes from all of them to have very high throughput, while also removing any central bottleneck (reads/writes are done without contacting any central server). 
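My rough sketch of how reads/writes can avoid a central server: a deterministic hash of (blob, tract) indexes into a locator table that clients cache, so consecutive tracts spread across servers (the server names and the exact formula here are my stand-ins, not necessarily FDS's):

    import hashlib

    # Hypothetical table of tractservers; the idea is that the metadata server
    # hands this table out once and clients then locate data entirely locally.
    TRACT_TABLE = ["ts-%02d" % i for i in range(16)]

    def locate_tract(blob_guid, tract_number, table=TRACT_TABLE):
        """Deterministically map (blob, tract) to a server with no per-read
        metadata lookup."""
        h = int(hashlib.md5(blob_guid.encode()).hexdigest(), 16)
        return table[(h + tract_number) % len(table)]

    # Consecutive tracts of one blob land on different servers, so a large
    # sequential read is spread across many disks at once.
    print([locate_tract("blob-1234", i) for i in range(5)])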

This solution is different from previous work primarily because of the assumption of full network bisection bandwidth. Previously, this was a very unrealistic assumption, but they discuss that newer network switches are making this feasible. This allows for less emphasis on data locality. 

There are some hard tradeoffs here; for one, the GUID identifier for blobs enables finding a node to contact without requiring a central server, but doesn't as easily enable the use of arbitrary path strings / directory hierarchy. Another tradeoff is the necessity for more expensive/complex network switches; while they discuss that full bisection bandwidth networks are now feasible, increasing this bandwidth is still a non-negligible expense. 

I can see this paper being influential in 10 years; it seems that if it truly is becoming more feasible to have full bisection bandwidth networks, this will be even more true in the future, and some of the Hadoop clusters that are starting to be spun up are reaching the limits of what HDFS can handle with its single namenode. There seem to be some very interesting ideas here, plus the overall idea is relatively simple. 

Review of "The Google File System"

The problem here is certainly real: most (all?) existing networked file systems simply were not designed to handle data at the scale of Google and other big data use cases. A new way of storing this data, with fundamentally different considerations, was definitely necessary. 

The solution's main idea is to have a single master server which deals with all metadata and coordination (but does not store any data), plus numerous (up to thousands) of chunkservers which actually store the data (replicated across multiple chunkservers). Slightly loosened consistency models and lack of a full POSIX filesystem API allow for a more efficient file system under their assumptions about workload. 

This solution is different because it makes very different assumptions about both workload and system than previous filesystems. Generally, file systems attempt to implement the full suite of POSIX file operations, provide pretty strong consistency, assume that machines fail infrequently, attempt to optimize for a wide range of cases, etc. GFS assumes that machines fail often, and makes strong assumptions about the way in which the file system will be used - mostly append only, very large files, few random reads, etc. 

A fundamental trade-off here is flexibility vs. speed/efficiency/scalability. GFS gives up a lot in terms of flexibility as compared to a normal file system, optimizing for a very specific use case, which allows for a significant reduction in complexity and increase in scalability and efficiency. There is also a trade-off of simplicity vs efficiency in having a single master node rather than some sort of distributed protocol (the master becomes a bottleneck).

I do think this paper will be influential in 10 years, and in fact it already has been, as HDFS is still hugely common in industry and is modeled primarily on this paper.

Tuesday, September 15, 2015

Review of "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing"

As with the other two papers we've read this week, this is addressing a real problem with the need for better streaming systems as stream processing becomes increasingly important and increasingly distributed. 

This paper focuses more on the programming model ("Dataflow Model") rather than a specific implementation, and the main idea is what this Dataflow Model is able to represent, primarily windowing (including session windowing), and triggering to allow for the application to decide how to handle late data / when to output results. 
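A toy of fixed windows plus a watermark-driven trigger with late-data refinements - the kinds of knobs the model exposes - written from scratch rather than against the actual Dataflow API:

    from collections import defaultdict

    WINDOW_SIZE = 60  # seconds; fixed windows for the sketch

    class ToyWindowedSum:
        """Elements are bucketed into fixed event-time windows, a window is
        emitted when the watermark passes its end, and a late element triggers
        a refined (updated) emission."""
        def __init__(self):
            self.sums = defaultdict(int)      # window start -> running sum
            self.watermark = 0

        def window_of(self, event_time):
            return (event_time // WINDOW_SIZE) * WINDOW_SIZE

        def add(self, event_time, value):
            w = self.window_of(event_time)
            self.sums[w] += value
            if w + WINDOW_SIZE <= self.watermark:
                # The window already fired; emit a refinement for the late data.
                print("late update: window", w, "sum is now", self.sums[w])

        def advance_watermark(self, new_watermark):
            for w, total in sorted(self.sums.items()):
                if self.watermark < w + WINDOW_SIZE <= new_watermark:
                    print("on time: window", w, "sum", total)
            self.watermark = new_watermark

    p = ToyWindowedSum()
    p.add(10, 1)
    p.add(70, 2)
    p.advance_watermark(120)   # fires windows [0,60) and [60,120)
    p.add(30, 5)               # late datum for window [0,60) -> refinement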

This model was motivated by the desire to have clear semantics that are able to handle a very wide variety of use-cases, due to the authors encountering cases where current models were not well-suited to expressing certain desired pipelines, and especially that no single model was able to express all desired pipelines. 

The paper very clearly addresses fundamental trade-offs and, notably, does not resolve them in any particular direction in the model, leaving them tunable for different pipelines. They create the model in such a way as to support as many different points on the spectrum as possible - emit results very greedily and don't worry about having to fix them later, wait until all results arrive, or anywhere in between. One trade-off I see that they do not discuss is the abstraction level of their model. They specifically design the model to ignore the distinction of batch vs. micro-batch vs. streaming, which gives the programmer the simplicity of considering them all to be the same, but as with any higher-level abstraction this may come with penalties, e.g. in terms of performance and optimizability.

This paper does present what seems to be a very well thought out dataflow model influenced by a lot of real-world experience, and most notably the ideas of windows and triggers are quite general and could be included in any future system. I am tempted to say that this paper could definitely be influential in the future, especially since it comes out of Google, which has been known for producing very influential research papers in the past as well.

Review of "High-throughput, low-latency, and exactly-once stream processing with Apache Flink"

Flink is addressing the real problem that all three of our papers are addressing this week: the increasing need for fast stream processing. The solution's main idea is distributed snapshots, a method based on the Chandy-Lamport algorithm for creating a snapshot of the entire distributed system without having to pause everything.
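A toy of the barrier alignment that makes such snapshots consistent without a global pause (my sketch of the general Chandy-Lamport-style idea, not Flink's implementation):

    class ToyBarrierOperator:
        """When a barrier arrives on one input, that input is buffered until the
        barrier has arrived on every input; then local state is snapshotted and
        processing resumes (a real operator would also forward the barrier)."""
        def __init__(self, num_inputs):
            self.num_inputs = num_inputs
            self.count = 0                     # running operator state
            self.blocked = set()               # inputs already past the barrier
            self.buffered = []                 # records held back during alignment
            self.snapshots = []

        def on_record(self, input_id, value):
            if input_id in self.blocked:
                self.buffered.append((input_id, value))   # hold until snapshot done
            else:
                self.count += value                        # normal processing

        def on_barrier(self, input_id):
            self.blocked.add(input_id)
            if len(self.blocked) == self.num_inputs:       # aligned on all inputs
                self.snapshots.append(self.count)          # checkpoint local state
                self.blocked.clear()
                held, self.buffered = self.buffered, []
                for i, v in held:                          # replay buffered records
                    self.on_record(i, v)

    op = ToyBarrierOperator(num_inputs=2)
    op.on_record(0, 1)
    op.on_barrier(0)        # input 0 reaches the barrier first
    op.on_record(0, 10)     # buffered; belongs after the snapshot
    op.on_record(1, 2)      # input 1 hasn't hit the barrier yet
    op.on_barrier(1)        # aligned: snapshot taken, buffered record replayed
    print(op.snapshots, op.count)   # [3] 13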

The solution is one of many new methods being developed to make fast, hugely distributable streaming systems with strong fault tolerance guarantees. The field is growing rapidly, so many new ideas are being generated in this space, with Spark Streaming, Google Dataflow, and Flink all taking slightly different approaches. 

There will always be trade-offs in terms of the different fault tolerance mechanisms used by these systems, which, as this paper discusses, do a great deal to shape the rest of the system. The main trade-off here is the cost of a single snapshot; if you are working with an application that requires a significant amount of state, each snapshot will be expensive, but as a trade-off you can then schedule less frequent snapshots. That, however, requires longer failure recovery times, especially in a heavily loaded system, whereas micro-batching suffers less from this problem.

Apache Flink's snapshotting seems to be essentially a real-world adaptation of the algorithm described in the Chandy-Lamport paper, so I have doubts that this paper specifically will be influential in 10 years; if this type of system catches on, it will likely be the Chandy-Lamport paper that remains influential (despite its age), though the Apache Flink system itself may well be influential in the future.

Monday, September 14, 2015

Review of "Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing"

Spark Streaming's D-Streams address an emerging problem: distributed streaming computation with strong fault tolerance and straggler mitigation. Numerous streaming systems already do similar things, but Spark Streaming does seem to be addressing a real problem by providing much better failure and straggler semantics.

The main idea, quite novel in the world of streaming, is to process records as a series of mini-batch computations rather than handling each incoming record individually as it arrives. This allows many techniques from batch processing to be leveraged. It does, however, mean that the minimum latency is around 500-2000 ms, whereas many streaming systems achieve lower latency.
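From the programmer's side, the micro-batch model looks roughly like the following word-count sketch (using the Spark Streaming Scala API as I remember it; the source and details are made up, and specifics vary by version):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount")
    // The batch interval is the knob behind the ~0.5-2s minimum latency:
    // all records arriving within each 1-second window become one small RDD.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)               // an ordinary batch operation, run per micro-batch
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```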

Previous streaming systems have all focused on keeping latency as low as possible. This paper takes a different approach, assuming that in most real-world systems a latency of 1-2 seconds is perfectly acceptable, and leverages this assumption to enable a very different set of techniques than traditional streaming systems.

There is a fundamental trade-off here between latency and a number of other things (throughput, fault tolerance, ease of integration with batch processing). By increasing latency (but keeping it at a point assumed to be acceptable), throughput is higher than that of the other streaming systems examined, streaming computation can be more easily integrated with batch computation, and the system becomes more fault tolerant (and is able to recover from failure more quickly).

I think the mini-batch idea is certainly promising, though the streaming field has a lot of competitors right now and it is certainly possible that others will prove much more successful or influential. It is hard to say whether I see this being influential.

Tuesday, September 8, 2015

Review of "Naiad: A Timely Dataflow System"

This paper definitely addresses a real problem in stream processing, though (as the authors admit) there are already tools for most of the things Naiad aims to do; Naiad's contribution is to combine them in a more cohesive and extensible manner. This is good, though it doesn't seem groundbreaking.

The solution's main idea is the "timely dataflow" concept, which uses stateful vertices with a simple interface for accepting and emitting messages. By attaching timestamps that indicate which portion of the data a message belongs to, the system can process in parallel while delivering notifications to a vertex once no more messages will arrive for a given period, at which point the vertex can perform any finalizing action. This allows coordination where it is needed (waiting for a notification) and low latency where it isn't (messages can be transformed and emitted immediately).
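Naiad itself is written in C#, but a rough Scala paraphrase of the vertex interface (illustrative names, not Naiad's actual signatures) captures the two paths: immediate processing on receipt, and finalization on notification.

```scala
// Rough paraphrase of a timely-dataflow vertex (Naiad is C#; names are illustrative).
case class Timestamp(epoch: Int)

trait Vertex[A, B] {
  def onRecv(msg: A, time: Timestamp): Unit // may immediately send output downstream
  def onNotify(time: Timestamp): Unit       // no more messages will arrive for `time`
}

// Example: count records per epoch, emitting only once the epoch is known complete.
class CountPerEpoch(send: (Int, Timestamp) => Unit) extends Vertex[String, Int] {
  private val counts = scala.collection.mutable.Map.empty[Timestamp, Int]

  def onRecv(msg: String, time: Timestamp): Unit =
    counts(time) = counts.getOrElse(time, 0) + 1 // low-latency path: just update state

  def onNotify(time: Timestamp): Unit = {        // coordination path: finalize the epoch
    send(counts.getOrElse(time, 0), time)
    counts -= time
  }
}
```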

The solution differs from previous work mainly in that, although what Naiad does can largely be done with existing tools, those tools are not as mature as the authors would like, and none covers as broad a range of workloads in one integrated system.

The paper identifies some fundamental trade-offs. It notes that batch-processing systems benefit from the longer-running nature of their computations and from not having to provide real-time updates (which add overhead). It also notes the trade-off between fault tolerance schemes: logging every action may work better in some cases, while checkpointing at intervals may work better in others. They also discuss that micro-stragglers can be very detrimental to latency at such rapid timescales, as is apparent in Figure 6b, but note that because they use a stateful vertex model it would be nontrivial to use speculative execution to mitigate this. The stateful vertices are themselves a trade-off: they streamline performance but reduce fault and straggler tolerance.

I am not sure if I see this being influential in 10 years. Being able to make live updates on iterative computations that are also interactively query-able is very useful and I can see that many people would want that ability, especially as we move forward in terms of increasing data collection, and the ideas about timely dataflow with their pointstamp / occurrence count may prove to be useful. At the same time, I don't see anything so novel that it makes me feel it will still be useful in 10 years. Hard to say. Maybe I'm just biased because they're working in C#. 

Review of "Spark: Cluster Computing with Working Sets"

Spark addresses a very real problem with the previously available tools for running large-scale distributed algorithms of an iterative nature: the most commonly used, MapReduce, has no support for keeping data that will be used multiple times in memory, so it must be reloaded on each iteration, which can significantly limit performance. Spark addresses this by introducing resilient distributed datasets (RDDs), which are cacheable: they can be loaded or computed a single time and then have numerous computations performed on them. Spark also introduces accumulators, which allow add-only aggregation of data across nodes, and broadcast values, which ship read-only data needed by all nodes once rather than with every task; but RDDs are the primary focus.
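A small sketch of what those primitives look like in use (Spark's Scala API as I understand it; the paths and data are made up, and the accumulator API has varied across versions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkPrimitives {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("primitives"))

    // RDD: loaded once, cached in memory, then reused by multiple actions.
    val errors = sc.textFile("hdfs://.../logs") // hypothetical path
                   .filter(_.contains("ERROR"))
                   .cache()
    val total = errors.count()                  // first action materializes and caches
    val byHost = errors.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _).collect()

    // Broadcast value: read-only data shipped to every node once.
    val severityLookup = sc.broadcast(Map("disk" -> 3, "net" -> 2))
    val weighted = errors.map(line => severityLookup.value.getOrElse(line.split(" ")(1), 1)).count()

    // Accumulator: add-only aggregation back to the driver.
    val longLines = sc.longAccumulator("longLines")
    errors.foreach(line => if (line.length > 1000) longLines.add(1))

    println(s"total=$total, hosts=${byHost.length}, weighted=$weighted, long=${longLines.value}")
  }
}
```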

This solution differs from previous work partly because it targets a relatively new problem and new workloads. Even MapReduce is a relatively new paradigm, and it was not built with iterative algorithms in mind. It is extensible enough to handle the new use cases (e.g. machine learning, logistic regression), but as the paper's results show, it does so significantly less efficiently than is possible.

There are a few trade-offs here. Using Scala rather than pure Java (along with the overall increase in system complexity) means there is higher overhead than MapReduce for applications that do not make use of Spark's caching, as seen in the significantly longer time to complete the first round of logistic regression (Section 5). This means that Spark is not a complete MapReduce replacement; rather, each will be best suited to different applications. Another trade-off discussed in the paper is tuning the parameters of the RDDs to optimize for various things: storage cost/size, access speed, probability of losing part of the dataset, and recompute cost. They mention that part of their future work is to make RDDs user-tunable so each application can optimize for what matters most to it.
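To make the caching trade-off concrete, the iterative logistic regression pattern looks roughly like this (a sketch modeled on the paper's example, not its exact code; the input path and parsing are made up). The first pass over the points pays the load-and-parse cost, and every later iteration reads from the in-memory cache.

```scala
// Sketch of iterative logistic regression over a cached RDD (illustrative only).
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr"))
    val dims = 10

    val points = sc.textFile("hdfs://.../points") // hypothetical input: features then label
      .map { line =>
        val parts = line.split(" ").map(_.toDouble)
        Point(parts.init, parts.last)
      }
      .cache()                                    // loaded and parsed only once

    var w = Array.fill(dims)(Random.nextDouble())
    for (_ <- 1 to 10) {                          // iteration 1 is slow; iterations 2-10 hit the cache
      val gradient = points.map { p =>
        val dot = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    println(w.mkString(" "))
  }
}
```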

I can definitely see this paper being influential in 10 years. Machine learning is employed in increasingly many areas, and most (all?) machine learning relies on the kind of iterative algorithms Spark is so well suited for. The amounts of data being processed are growing rapidly, so the need for efficient distributed computation keeps becoming more vital, and Spark (despite its young age) is already seeing adoption by many major players in the technology industry.