In the current state of machine learning, applying an algorithm to your data often comes down to deciding which features to use, a choice that can be extremely nonintuitive and requires a great deal of testing. With machine learning becoming more ubiquitous all the time, being able to do this quickly is important.
The main ideas of this paper to speed up this process are: intelligent subsampling (since this is just to determine good features and not to actually make business decisions, exact results aren't necessary), materializing partial results to be able to reuse them on similar computations, and maintaining a cache of computed values (since much of the computation will be the same on each iteration of trying new features, with only a few things changed).
Subsampling comes with a fundamental trade-off in terms of speed vs accuracy, but this is easily tunable, and due to the intelligent nature of their subsampling they show that in some cases they can achieve an 89x speedup with only a 1% error, which is pretty impressive.
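To make the subsampling and caching ideas concrete, here is a minimal sketch I put together (not the paper's actual system): it scores candidate features on a random subsample and memoizes per-feature statistics, so repeatedly evaluating overlapping feature sets reuses earlier work. The scoring function, the crude uniform subsample, and the cache structure are all my own simplifications.

```python
import random
from functools import lru_cache

# Toy dataset: rows of (features dict, label). In the real system the data
# lives elsewhere and the subsample is chosen more intelligently than this.
random.seed(0)
DATA = [({"f%d" % j: random.random() for j in range(5)}, random.random())
        for _ in range(10_000)]

SAMPLE = random.sample(range(len(DATA)), 500)  # crude uniform subsample

@lru_cache(maxsize=None)
def feature_stats(name):
    """Cache per-feature statistics over the subsample so they are computed
    once and reused by every candidate feature set that includes `name`."""
    xs = [DATA[i][0][name] for i in SAMPLE]
    ys = [DATA[i][1] for i in SAMPLE]
    n = len(SAMPLE)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    return cov, var

def score(feature_set):
    """Very rough proxy for feature usefulness: summed squared correlation."""
    total = 0.0
    for name in feature_set:
        cov, var = feature_stats(name)
        if var > 0:
            total += (cov * cov) / var
    return total

# Iterating over overlapping feature sets reuses the cached statistics.
print(score(("f0", "f1")))
print(score(("f0", "f2")))  # "f0" statistics come straight from the cache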
I am pretty surprised that no one has already done work on this; aside from the transformation materialization, these optimizations seem fairly intuitive, and I am somewhat skeptical that they are the first to apply them... I think these techniques will be influential in 10 years, but whether this paper specifically will be, I am not sure.
Tuesday, October 27, 2015
Review of "Towards a Unified Architecture for in-RDBMS Analytics"
While I am not sure that running your data analytics in your RDBMS (which generally isn't designed for such a thing) is the best way to go about the problem, people are certainly doing it, so making this faster is definitely a valuable problem to tackle.
The main idea of this paper is to note that many of these analytics algorithms can be solved using IGD (incremental gradient descent). By leveraging this common solution mechanism, they can implement a framework which requires only small extensions to be able to run a wide variety of algorithms, making the development of new algorithms and applying them to new RDBMSes much easier. They also make clever use of data layout and parallelism.
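Here is a minimal sketch of the kind of unification I take them to be describing: a single incremental gradient descent loop owned by the framework, where each analytics task only supplies a per-example gradient function. The logistic-regression and least-squares plug-ins below are my own illustrative choices, not code from the paper.

```python
import math
import random

def igd(data, dim, grad, epochs=10, lr=0.1):
    """Generic incremental (stochastic) gradient descent: the framework owns
    the loop; each analytics task only supplies `grad(weights, example)`."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(data)
        for example in data:
            g = grad(w, example)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# One "small extension": logistic regression's per-example gradient.
def logistic_grad(w, example):
    x, y = example                      # y in {0, 1}
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return [(p - y) * xi for xi in x]

# Another: least-squares regression reuses the same loop unchanged.
def least_squares_grad(w, example):
    x, y = example
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2.0 * err * xi for xi in x]

random.seed(1)
data = [([1.0, random.random()], random.randint(0, 1)) for _ in range(200)]
print(igd(data, dim=2, grad=logistic_grad))
```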
Intuitively, it would seem that there should be a trade-off between performance and generality, with more specific implementations being more performant. That doesn't turn out to be the case in their analysis, with their more general solution outperforming specific implementations. This may be more a result of their other techniques; perhaps if they leveraged the techniques used to implement the general framework to fine-tune the individual algorithms, they could achieve even better performance at the cost of generality and ease of development.
Review of "Scaling Distributed Machine Learning with the Parameter Server"
Machine learning is becoming increasingly ubiquitous and complex, running over increasingly large datasets. Being able to run these algorithms (some of which don't lend themselves well to large-scale parallelism) in a distributed fashion quickly is increasingly important, so this is definitely a real problem.
The main idea is to use a centralized set of parameter servers which maintain the shared state of the system, coupled with a large number of worker nodes which actually perform computation. Worker nodes push state updates and pull down state. The system highly leverages the fact that machine learning algorithms are generally convergent and are not significantly disrupted by having stale state, so they don't require synchronous state updates.
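As a toy, single-process sketch of that push/pull pattern (the real system is distributed and uses bounded-staleness protocols; the class names and the total lack of a staleness bound here are my own), workers below pull a possibly stale snapshot, compute a gradient against it, and push a delta back without any synchronization barrier.

```python
import threading
import random

class ParamServer:
    """Holds shared weights; workers pull a (possibly stale) copy and push deltas."""
    def __init__(self, dim):
        self.w = [0.0] * dim
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return list(self.w)          # snapshot; may be stale by the time it is used

    def push(self, delta):
        with self.lock:
            self.w = [wi + di for wi, di in zip(self.w, delta)]

def worker(server, data, lr=0.05, steps=100):
    for _ in range(steps):
        w = server.pull()                # no barrier: other workers may push meanwhile
        x, y = random.choice(data)
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        grad = [err * xi for xi in x]
        server.push([-lr * g for g in grad])   # asynchronous update

random.seed(2)
data = [([1.0, random.random()], 3.0) for _ in range(100)]
server = ParamServer(dim=2)
threads = [threading.Thread(target=worker, args=(server, data)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(server.pull())
```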
One trade-off they discuss is convergence rate vs. per-iteration performance. As state becomes more stale, convergence slows, but tolerating staler state allows more of the computation to run asynchronously.
Sunday, October 25, 2015
Review of "Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary"
This paper aims to present a solution to the issue of maintaining consistency across multiple datacenters. As services become increasingly global, this becomes important for decreasing latency to users across the globe while also ensuring consistency.
The main idea is to separate operations into two categories: Blue (commutative, eventual consistency) and Red (strong consistency). They provide guidelines for how to determine what operations fall into each category, plus a system for forcing more operations to fit into the blue category: decomposing operations into generator and shadow operations, which can turn some non-commuting operators into commuting ones.
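A sketch of the generator/shadow decomposition, in the spirit of the usual bank-account example (my own simplified rendering, not the paper's code): the generator runs once at the origin replica, makes the decision against local state, and emits a shadow operation that is a plain commutative delta, so it can be applied at every replica in any order.

```python
class Account:
    def __init__(self):
        self.balance = 0

    # Generator: runs once, at the replica that received the request.
    # It reads local state, makes the decision, and emits a shadow op.
    def generate_withdraw(self, amount):
        if self.balance >= amount:
            return ("apply_delta", -amount)   # shadow op: commutes with other deltas
        return ("noop",)

    # Shadow: applied at every replica (including the origin), in any order.
    def apply_shadow(self, op):
        if op[0] == "apply_delta":
            self.balance += op[1]

# Two replicas of the same account.
r1, r2 = Account(), Account()
for r in (r1, r2):
    r.apply_shadow(("apply_delta", 100))      # initial deposit, everywhere

op = r1.generate_withdraw(30)                 # decision made once, at r1
for r in (r2, r1):                            # shadows delivered in different orders
    r.apply_shadow(op)

print(r1.balance, r2.balance)                 # both converge to 70
```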
This work comes about most likely because the scale of geo-replication that exists now is significantly larger than in the past, and being able to maintain low latency (small consistency overheads) is very important - higher latency has a significant correlation to decreased per-user profits.
There are a number of trade-offs here. Using blue operations only guarantees eventual consistency, which still may come with issues ("eventual consistency is no consistency"), though it provides much higher speed. Breaking down operations into generator/shadow operations may grant much lower latency, but also makes things more difficult to reason about.
I can see this being influential in the future - services are becoming increasingly more global and people expect increasingly lower latencies, but application programmers also want consistency. This seems to provide a good framework for maintaining both of these things simultaneously.
Review of "CRDTs: Consistency without Concurrency Control"
Attempting to balance strong consistency models with low concurrency control overhead is a very important line of research because as systems grow larger our current concurrency control schemes have overheads which are often prohibitively high, but some reduced-consistency schemes make things difficult to reason about for application programmers and result in complex handling logic. Any work to bridge this gap is certainly solving a meaningful problem.
The main idea is to use CRDTs, commutative replicated data types, on which all operations commute, to allow for the construction of data structures without communication: if all operations can be applied in any order, just apply each one locally and then disseminate it, since eventually it will reach all replicas and we don't mind if it was applied out of order. Note that this is still eventual consistency and makes no guarantees about how soon things will be consistent, just that they will eventually reach such a state.
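As a minimal sketch of the idea (a state-based variant, my own illustration rather than code from the paper), here is the classic grow-only counter: each replica only increments its own slot, and merge is an element-wise max, so merges commute, associate, and are idempotent, and every replica eventually converges.

```python
class GCounter:
    """Grow-only counter CRDT: state is a per-replica count; merge = pointwise max."""
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1        # only touch our own slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Commutative, associative, idempotent: order of merges does not matter.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
# Exchange state in either order; both replicas converge to the same value.
a.merge(b); b.merge(a)
print(a.value(), b.value())              # 3 3
```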
I think this work was different from previous work partly because of the size of systems in today's world; concurrency control schemes which previously worked fine are reaching the limits of their scalability and it is important that we investigate new ways to achieve the same consistency guarantees in more scalable ways.
One trade-off is generality vs. speed; not everything can easily be represented as a CRDT. There is also the trade-off of ease of use; CRDTs are only eventually consistent, and this must be planned for as an application programmer.
I can definitely see this paper being influential moving forward--though I don't see CRDTs in this form being widely used, the ideas are important.
Saturday, October 24, 2015
Review of "Coordination Avoidance in Database Systems"
Scalability is hugely important in today's world, and communication/coordination is the bane of scaling. This paper works towards reducing the amount of coordination necessary for maintaining correctness, a very important problem.
The main idea is to analyze transaction sets against application invariants to determine when exactly coordination is necessary, in contrast to traditional database systems which serialize all transactions. By only coordinating when absolutely necessary, many transactions can run independently in parallel.
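A toy sketch of the kind of reasoning involved (my own crude numeric stand-in for the paper's symbolic invariant-confluence analysis): given an invariant like "balance >= 0", concurrent increments can always merge safely, while concurrent decrements can combine into a violation, so only those need coordination.

```python
def invariant(balance):
    return balance >= 0

def needs_coordination(start_balance, deltas):
    """Crude confluence-style check: can these operations, applied concurrently
    from the same start state, merge into a state violating the invariant?
    (The real analysis is symbolic over transaction sets, not a numeric probe.)"""
    # Each delta alone must keep the invariant, and so must their combination.
    if any(not invariant(start_balance + d) for d in deltas):
        return True
    return not invariant(start_balance + sum(deltas))

# Two concurrent deposits never violate "balance >= 0" -> no coordination needed.
print(needs_coordination(50, [+20, +30]))    # False

# Two concurrent withdrawals that are individually fine but not together.
print(needs_coordination(50, [-40, -40]))    # True: coordinate these
```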
A trade-off here is complexity vs. speed. It is much easier to reason about transactions as completely serializable, without having to write down all of your invariants, but the extra effort will often be worth it in a high-performance setting.
This work seems very interesting - I feel that most literature I have seen uses the same concurrency scheme for all transactions, and the idea of predetermining which transactions actually need coordination seems like it will have a lot of practical benefits as systems scale larger and larger. I can definitely see this being influential in 10 years.
Wednesday, October 21, 2015
Review of "Shielding Applications from an Untrusted Cloud with Haven"
They are solving a very interesting problem: shielding application-level code from the OS that is executing it. I am not convinced that this is a real problem in today's world...
The main idea of their solution is to use hardware-provided instructions (Intel SGX) to allocate protected memory regions (enclaves) within which the application's execution is protected, even from the OS running that code. Haven takes the viewpoint that neither the OS nor the application trusts the other, and provides interesting ways for them to offer services to each other despite this limitation.
This work is emerging because it is becoming increasingly common to run your application on hardware that is managed by others, e.g. Microsoft Azure, Amazon EC2, etc. Previously, you managed the hardware, but were concerned about application-level code possibly doing something harmful. Now, the hardware providers still need to be concerned with that same problem, but application-level code is also dealing with an outside entity and may want to be protected.
A big trade-off here is speed; the extra protection comes at a cost of higher latency because of things like system calls being more expensive, and generally using SGX extensions has a bit higher cost.
I don't really see this being overly influential in 10 years - like I said earlier, I am not convinced this is a real issue.
Tuesday, October 20, 2015
Review of "CryptDB: Protecting Confidentiality with Encrypted Query Processing"
This paper addresses an issue that has become increasingly important as of late, visible even to those outside the software community. Data breaches are becoming more common and more devastating, so encryption of sensitive data is very important. Having a DB that builds this directly into its storage seems very promising.
The main nugget is to store the data in an encrypted format inside of the DB, but store it such that SQL queries are still able to execute over the data. This is made possible because SQL has a specific subset of operations that it will run, and thus the encryption scheme can be aware of these operations. Though sometimes data will need to be decrypted for certain operations, CryptDB attempts to minimize the amount of data that is decrypted.
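CryptDB layers several encryption schemes ("onions") per column; to illustrate just one flavor of the idea, the sketch below (standard-library only, entirely my own toy, not CryptDB's construction) stores an opaque payload alongside a deterministic keyed tag, so the server can answer equality predicates by comparing tags without ever seeing plaintext.

```python
import hmac, hashlib, os

TAG_KEY = os.urandom(32)     # held by the trusted proxy, never by the DB server

def eq_tag(value: str) -> bytes:
    """Deterministic keyed tag: equal plaintexts -> equal tags, so the server
    can evaluate WHERE col = ? without learning the plaintext itself."""
    return hmac.new(TAG_KEY, value.encode(), hashlib.sha256).digest()

def toy_encrypt(value: str) -> bytes:
    """Stand-in for a proper randomized cipher (NOT secure, short values only):
    here just to show that the stored payload is opaque to the server."""
    nonce = os.urandom(16)
    stream = hashlib.sha256(TAG_KEY + nonce).digest()
    ct = bytes(b ^ k for b, k in zip(value.encode(), stream))
    return nonce + ct

# "Server side": only sees tags and opaque payloads.
table = [{"name_tag": eq_tag(n), "name_ct": toy_encrypt(n)}
         for n in ["alice", "bob", "alice"]]

# The proxy rewrites  SELECT * FROM t WHERE name = 'alice'  into a tag comparison.
query_tag = eq_tag("alice")
matches = [row for row in table if hmac.compare_digest(row["name_tag"], query_tag)]
print(len(matches))          # 2, computed without decrypting anything
```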
I cannot say exactly why this is new work - perhaps partially because security has become an increasingly larger concern as more data is stored online and data hacking has become increasingly widespread and damaging.
The primary trade-off here is resource cost vs. security. A fully homomorphic cryptography scheme would provide even better security, since it could execute all of the necessary operations without decrypting data, but it would be prohibitively CPU-intensive. CryptDB attempts to provide security while still providing reasonable real-world performance, and seems to do a good job at this.
I can see this being influential in 10 years -- companies are being slammed harder and harder by data breaches and anything that can remedy this must pique the interest of many large companies.
Monday, October 19, 2015
Review of "A Sampling Algebra for Aggregate Estimation"
While a great deal of work is going into making databases fast enough to process the wealth of data collected in today's world, it remains useful to sample in order to obtain a quick estimate of a result. But today you often have no idea (except perhaps an intuition) how accurate that estimate is. People have worked out error bounds in some cases, but not in any general way. This paper aims to solve that, with the main contribution being the GUS (generalized uniform sampling) operator, which can describe any sampling method, along with the idea of Second Order Analytic (SOA) equivalence to denote the equivalence of two query plans in terms of expected value and variance. Combining these with supported relational operators provides a way to reason about and calculate error bounds for nearly any type of sampling.
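To ground the estimation side, here is a minimal sketch of the simplest case the algebra generalizes: estimating a SUM from a uniform sample by scaling up, with a standard-error estimate. These are textbook scale-up formulas, not the paper's GUS machinery, and the data is synthetic.

```python
import random
import math

random.seed(3)
population = [random.expovariate(1 / 100) for _ in range(100_000)]
true_sum = sum(population)

n = 2_000
sample = random.sample(population, n)
N = len(population)

# Scale-up estimator for SUM under uniform sampling without replacement.
est_sum = N * sum(sample) / n

# Sample variance -> standard error of the estimate (ignoring the small
# finite-population correction for brevity).
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
std_err = N * math.sqrt(var / n)

print(f"true:      {true_sum:,.0f}")
print(f"estimate:  {est_sum:,.0f}  (+/- {1.96 * std_err:,.0f} at ~95%)")
```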
It seems that this is new because as data volumes grow, it becomes ever more important to be able to sample, and to sample confidently.
There is a trade-off here in that the generality of the GUS also makes it harder to reason about - I certainly had trouble really understanding what things meant.
I can see this being influential in 10 years; they provide a general use tool for others to use in a space that can only become increasingly more important as we move forward.
Friday, October 9, 2015
Review of "Succinct: Enabling Queries on Compressed Data"
Everyone needs to store enormous amounts of data these days, and everyone wants to be able to access it quickly. Succinct presents the problem that storing large quantities of data means you want to be as space-efficient as possible, but to access it quickly you (generally) build indices, which are not at all space-efficient. I am unsure how frequent of a problem this is in practice, but I imagine that there are many use cases for Succinct where it would greatly advance the current state of the art.
The main idea of Succinct is to essentially build an indexing system into the compression. This eliminates the need for space-costly secondary indexes, while also providing fast search capabilities.
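Succinct's actual machinery (compressed suffix-array-style structures) is far more involved, but the flavor can be sketched with a plain suffix array: the sorted-suffix representation itself answers substring searches by binary search, with no separate inverted index. This is my own miniature illustration, not the system's data structure.

```python
def build_suffix_array(text):
    """Naive O(n^2 log n) construction; real systems use compressed variants."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, query):
    """All occurrences of `query`, via binary search over the sorted suffixes."""
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:                                   # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                   # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana bandana"
sa = build_suffix_array(text)
print(search(text, sa, "ban"))    # [0, 7]
print(search(text, sa, "ana"))    # [1, 3, 11]
```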
I think this is probably different from previous work because the quantities of data now being stored in e.g. NoSQL systems are growing enormously; in the past, typical systems did not need to rely as heavily on compression and space awareness, but the increase in data volume has made it a very necessary feature. Yet, at the same time, the data still needs to be accessed quickly, leading to Succinct.
The trade-off here is, of course, speed vs. space. Succinct falls into a pretty happy middle ground between the two, though it still falls short in some areas, e.g. large sequential reads.
I can see this being influential in the future - this seems to be a very new way of thinking about compressed storage that should be very useful.
Review of "Mining Modern Repositories with Elasticsearch"
As companies gain increasingly larger amounts of data, they need scalable solutions to deal with that data. Most of what we have looked at in this class has been somewhat complex - how to process large amounts of data, etc. - but we can't forget about one of the simplest applications: simply finding things within this wealth of data, aka search.
This is different from previous offerings simply because of the volume of data - local searching and indexing is by no means a new concept, so the main contribution of Elasticsearch was intelligently distributing and sharding this data to be able to access it quickly in a scalable manner.
The trade-off is primarily complexity vs speed. Elasticsearch is pretty simple: you get textual searches on certain fields, even letting you define a range for e.g. numeric types, but that's about it. There's no joining, aggregation, etc. On the other hand, this enables it to run extremely quickly on very large datasets.
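For a sense of what those queries look like, here is a hedged example of the kind of request one might send to an Elasticsearch node's _search endpoint: a full-text match combined with a numeric range filter. The index name, field names, and local node address are made up, and the exact query DSL accepted depends on the Elasticsearch version.

```python
import requests  # assumes a local Elasticsearch node on the default port

query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "connection timeout"}}],
            "filter": [{"range": {"status_code": {"gte": 500, "lt": 600}}}],
        }
    },
    "size": 10,
}

resp = requests.post("http://localhost:9200/app-logs/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```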
This paper in particular I don't see being influential in 10 years, but I do see Elasticsearch in general (or at least, some derivative) continuing to be very important moving forward.
Monday, October 5, 2015
Review of "Spark SQL: Relational Data Processing in Spark"
Spark SQL solves two interesting problems simultaneously: better support for declarative / SQL programming on nested / semistructured / big data, and better integration of procedural with declarative programming.
The main idea is to provide a DataFrames API which abstracts a data source (RDD, Hive table, CSV file, etc) and provides relational-style operations which can be optimized. This allows developers to seamlessly operate on many different data storages, including Java/Python objects. They also introduce the Catalyst optimizer, a highly extensible query optimizer. One important thing to notice, in my opinion, is how flexibly the entire system was built: it is easy to define new data sources, optimizations, UDTs, and UDFs, all of which play nicely together.
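Here is a small PySpark sketch of that mixing of declarative and procedural code: a DataFrame built from in-process objects, a procedural UDF, and relational operators that Catalyst can optimize around. The table, column names, and bucketing logic are my own; API details vary somewhat across Spark versions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

# Build a DataFrame straight from in-process Python objects.
df = spark.createDataFrame(
    [("alice", 34, "NYC"), ("bob", 19, "SF"), ("carol", 45, "NYC")],
    ["name", "age", "city"],
)

# Procedural logic wrapped as a UDF, intermixed with relational operators.
bucket = F.udf(lambda age: "adult" if age >= 21 else "minor", StringType())

result = (df.withColumn("bucket", bucket(df.age))
            .filter(F.col("bucket") == "adult")
            .groupBy("city")
            .count())

result.show()
spark.stop()
```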
One thing about this paper that I think will continue to be influential is the idea of mixing declarative / SQL programming with procedural programming. While it has always been possible to some extent using UDFs, Spark SQL provides a much more integrated intermingling that seems both easier to work with and more flexible.
Review of "Dremel: Interactive Analysis of Web-Scale Datasets"
Dremel was (one of?) the first real solutions for interactive-latency SQL queries running side-by-side with MapReduce and other Hadoop-type applications, solving the very problem of being able to run interactive queries on your data without having to export it away from where batch processing is occurring.
Two main ideas are the use of columnar storage for representing nested data (which is handled natively), and a "serving tree" architecture that distributes queries among a large number of "leaf" nodes, which scan data from disk, with a hierarchy of intermediate nodes performing aggregation. This also differed from Hive and Pig because it did not build on top of MapReduce jobs.
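A toy sketch of the serving-tree idea (entirely my own simplification): leaves compute partial aggregates over their local column chunks, and intermediate nodes and the root only merge partials, never touching raw rows.

```python
# Columnar chunks: each "leaf" server owns a slice of one column.
leaf_chunks = [
    [3, 5, 8, 1],        # leaf 0
    [2, 2, 9],           # leaf 1
    [7, 4, 4, 6, 1],     # leaf 2
    [5, 5],              # leaf 3
]

def leaf_aggregate(chunk):
    """Leaves scan their local columnar data and emit a partial aggregate."""
    return {"count": len(chunk), "sum": sum(chunk)}

def merge(partials):
    """Intermediate / root nodes only merge partials; they never see raw rows."""
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
    }

# Two-level serving tree: leaves -> two intermediate nodes -> root.
partials = [leaf_aggregate(c) for c in leaf_chunks]
intermediates = [merge(partials[:2]), merge(partials[2:])]
root = merge(intermediates)
print(root, "avg =", root["sum"] / root["count"])
```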
This is different from previous work because previously, data stored e.g. on GFS was being used primarily for batch computation. However, as increasingly more data was moved onto GFS and it became a "one stop shop" of sorts for data, it became more desirable to be able to run interactive-latency queries over GFS without the need to export data into a more traditional DBMS. This is sort of a combination of new system requirements and new workload: data volume was overwhelming traditional DBMSes, and new workloads required lower latency than existing solutions.
One trade off is the use of columnar vs. record-oriented storage. As their experiments show, when numerous fields of a record are accessed in a query, it can be faster to use record-oriented storage. However, as the more common case is to access only a small subset of the fields in a record, columnar storage turns out to be much more suited to their use case.
This paper has certainly been very influential; ideas from Dremel have almost certainly been incorporated into a number of other systems for similarly performing SQL-like queries over data. The use of columnar data format, a somewhat new idea especially for nested data, has also gained a great deal of traction, possibly in part due to Dremel.
Review of "Impala: A Modern, Open-Source SQL Engine for Hadoop"
Impala aims to solve a very real problem - how do we get the most out of all of the data we have stored in Hadoop (i.e. HBase or HDFS)? It does so in the direction of attempting to optimize SQL-type queries in an interactive, low-latency manner (as opposed to the higher-latency, bulk processing semantics of Apache Hive). The hope is that you will no longer need to export data from Hadoop to traditional RDBMS data warehousing for analytics and business intelligence.
Impala, rather than owing its speed to one new core idea, seems to me to be built on a number of ideas that work together to provide overall efficiency gains. They use code generation instead of interpretation to provide more efficient queries at the instruction level (previously employed by e.g. Hekaton and others), remove centralized control from the query path (statestored and catalogd both push changes rather than the individual Impala daemons asking for information on each query), seem to provide better query optimization than previous SQL-on-Hadoop systems, and use short-circuit local reads to bypass the full HDFS DataNode protocol (a relatively new addition to Hadoop that other systems can easily take advantage of as well), allowing for much faster local disk reads.
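To illustrate the code-generation point in miniature: Impala actually emits LLVM IR, so the Python sketch below is only an analogy for "generate a specialized predicate function once instead of walking an expression tree per row"; the expression format and column names are my own.

```python
# Interpreted form: walk a small expression tree for every row.
expr = ("and", (">", "price", 100), ("==", "region", "EU"))

def interpret(node, row):
    op = node[0]
    if op == "and":
        return interpret(node[1], row) and interpret(node[2], row)
    if op == ">":
        return row[node[1]] > node[2]
    if op == "==":
        return row[node[1]] == node[2]
    raise ValueError(op)

# "Generated" form: compile the predicate once into a specialized function,
# so per-row evaluation pays no tree-walking or dispatch overhead.
def codegen(node):
    def emit(n):
        op = n[0]
        if op == "and":
            return f"({emit(n[1])} and {emit(n[2])})"
        if op == ">":
            return f"(row[{n[1]!r}] > {n[2]!r})"
        if op == "==":
            return f"(row[{n[1]!r}] == {n[2]!r})"
        raise ValueError(op)
    return eval(f"lambda row: {emit(node)}")  # eval is fine for a sketch only

rows = [{"price": 150, "region": "EU"}, {"price": 90, "region": "EU"}]
pred = codegen(expr)
print([interpret(expr, r) for r in rows])   # [True, False]
print([pred(r) for r in rows])              # same result, specialized code path
```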
I see Impala as a next step past Hive in the SQL-on-Hadoop world; Hive was an initial attempt to bring SQL to Hadoop, but was never really that great. Yet, despite that, it gained a lot of users, a testament to how badly end users want SQL on Hadoop. I think this means that it is natural that now many systems are addressing this problem (SparkSQL, Dremel, Drill, Presto, etc).
One trade-off I recognize: their metadata push model means that metadata may be stale, but they disregard this as a non-issue since the query plans are created on a single node using consistent metadata.
I don't see Impala itself being particularly influential in 10 years - while it is a good implementation, it doesn't seem to have a large number of new ideas.
Sunday, October 4, 2015
Review of "Cassandra - A Decentralized Structured Storage System"
As with Bigtable, Cassandra is solving a real issue: storing large amounts of structured data, at very large scale, in the middle ground between key-value stores and relational databases. This is becoming increasingly important.
The main idea of Cassandra is to decentralize all of the processing, moving away from a typical master-slave architecture. Decision making, e.g. failure detection, is all done in a decentralized manner through the use of gossip. This removes reliance on a single point of failure or contention.
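A toy round-based sketch of how gossip spreads per-node state with no master (my own simplification; Cassandra's real protocol exchanges versioned heartbeat digests rather than full maps):

```python
import random

random.seed(4)
NODES = ["n0", "n1", "n2", "n3", "n4"]

# Each node keeps its view of everyone's heartbeat; the higher value is fresher.
views = {n: {m: 0 for m in NODES} for n in NODES}
views["n0"]["n0"] = 7        # n0 bumps its own heartbeat locally

def gossip_round(views):
    for node in NODES:
        peer = random.choice([m for m in NODES if m != node])
        # Exchange views; both sides keep the freshest value per node.
        for m in NODES:
            newest = max(views[node][m], views[peer][m])
            views[node][m] = views[peer][m] = newest

for rnd in range(4):
    gossip_round(views)
    aware = sum(1 for n in NODES if views[n]["n0"] == 7)
    print(f"after round {rnd + 1}: {aware}/{len(NODES)} nodes see n0's update")
```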
While this removal of reliance on a master is nice, it makes me wonder, is it really necessary? We saw in Bigtable that a master-server system can be made very efficient by making the operations that it carries out infrequent (e.g. in Bigtable most requests don't involve the master). Is the extra complexity of the decentralized system worth it? Cassandra's widespread adoption may point to yes, but I am not entirely convinced.
I don't particularly see this being highly influential in 10 years - while Cassandra is a well designed system incorporating many good ideas, it didn't seem that it had many very original ideas, just a well put-together set of existing ideas.
Review of "Bigtable: A Distributed Storage System for Structured Data"
Bigtable addresses a problem that, although it has been addressed many times in many different ways, is still real: how best to store data at an enormous scale. Part of why it is not a completely solved problem is that different workloads and use cases demand different requirements.
The main idea is to use a distributed, immutable storage system in which each server is responsible for a number of data tablets, each of which is comprised of a number of SSTable files that actually store the data. These are immutable, and merging across them (as well as with the running log) gives the current view of the tablet. Interestingly, metadata is stored as a specific table but otherwise does not differ from data in the system.
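A miniature sketch of the read path implied by that design (my own simplification, omitting the commit log, compactions, and column families): a lookup consults the in-memory buffer first and then the immutable SSTables from newest to oldest; writes only ever append, and existing SSTables are never modified in place.

```python
class Tablet:
    def __init__(self):
        self.memtable = {}        # recent writes (backed by a commit log in reality)
        self.sstables = []        # immutable, newest first

    def write(self, key, value):
        self.memtable[key] = value            # never touches existing SSTables

    def flush(self):
        """Freeze the memtable into a new immutable SSTable."""
        if self.memtable:
            self.sstables.insert(0, dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        """Merged view: memtable first, then SSTables newest-to-oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sst in self.sstables:
            if key in sst:
                return sst[key]
        return None

t = Tablet()
t.write("row1", "v1")
t.flush()                 # "row1" -> "v1" now lives in an immutable SSTable
t.write("row1", "v2")     # the newer value shadows the old one at read time
print(t.read("row1"))     # v2
t.flush()
print(t.read("row1"))     # still v2: the merge prefers newer SSTables
```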
This was different partially because of the "Google scale" problem, requiring more scalability than most systems appearing previously. Many systems attempted to be too expressive (full relational) or not expressive enough (simple key value store), and Google decided it needed a good middle ground.
As discussed above, there is a fundamental trade-off in terms of expressiveness: more expressive power means you can reason about and structure your data more fully, but it is also more difficult to scale. Google chose a middle-ground approach with Bigtable. There is also a trade-off in their decision to use immutable data storage. This makes writes faster (since they go to a log and aren't in place) and simplifies the system architecture, but adds overhead on reads (the necessity to merge) and a background GC overhead.
It has been nearly 10 years since this paper was written, and it is still influential in the sense that Bigtable is still in use. But, it does seem tailored towards Google's needs, and I don't think it has been extremely influential in other areas.