Monday, October 5, 2015

Review of "Dremel: Interactive Analysis of Web-Scale Datasets"

Dremel was one of the first (possibly the first) real solutions for interactive-latency SQL queries running side by side with MapReduce and other Hadoop-type applications. It solved the problem of running interactive queries on your data without having to export it away from where batch processing occurs.

The two main ideas are columnar storage for nested data (which is handled natively) and a "serving tree" architecture: a query is distributed across a large number of "leaf" nodes, which scan data from disk, and a hierarchy of intermediate nodes aggregates their results. Dremel also differed from Hive and Pig in that it did not build on top of MapReduce jobs.
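To make the serving-tree idea concrete, here is a minimal sketch in Python. It is not Dremel's implementation: the function names (leaf_scan, merge_partials, run_query), the in-memory "tablets", and the fanout parameter are all made up for illustration, and the only query it handles is a COUNT(*) with a filter and a GROUP BY.

```python
from collections import Counter

def leaf_scan(tablet, group_key, predicate):
    """Leaf node: scan one tablet and return a partial COUNT(*) per group."""
    partial = Counter()
    for row in tablet:
        if predicate(row):
            partial[row[group_key]] += 1
    return partial

def merge_partials(partials):
    """Intermediate node: sum the partial counts received from its children."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged

def run_query(tablets, group_key, predicate, fanout=2):
    """Root: fan out to the leaves, then merge level by level up the tree."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_scan(t, group_key, predicate) for t in tablets]
    while len(level) > 1:
        # Each intermediate node merges up to `fanout` children.
        level = [merge_partials(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()

# Hypothetical data: three tablets of already-parsed rows.
tablets = [
    [{"country": "us", "ok": True}, {"country": "de", "ok": True}],
    [{"country": "us", "ok": False}, {"country": "us", "ok": True}],
    [{"country": "de", "ok": True}],
]
result = run_query(tablets, "country", lambda row: row["ok"])
```

The point of the sketch is that each leaf only returns a small partial aggregate, so the intermediate levels do very little work compared to the scans, and the scans can all run in parallel.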

This is different from previous work because data stored on, e.g., GFS had previously been used primarily for batch computation. However, as more and more data moved onto GFS and it became a "one-stop shop" of sorts for data, it became more desirable to run interactive-latency queries over GFS without exporting data into a more traditional DBMS. This is a combination of new system requirements and a new workload: data volume was overwhelming traditional DBMSes, and new workloads required lower latency than existing solutions.

One trade-off is the use of columnar vs. record-oriented storage. As their experiments show, when many fields of a record are accessed in a query, record-oriented storage can be faster. However, since the more common case is to access only a small subset of a record's fields, columnar storage turns out to be much better suited to their use case.
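A toy cost model shows the shape of this trade-off. Everything in it is an assumption for illustration, not a number from the paper: cost is counted in "values touched", a record-oriented scan is assumed to read every field of every record, and a columnar scan is assumed to pay a made-up per-value overhead (assembly_overhead) for stitching the requested columns back into records.

```python
def row_store_cost(num_records, total_fields):
    """Record-oriented scan: every field of every record is read."""
    return num_records * total_fields

def column_store_cost(num_records, fields_read, assembly_overhead=0.5):
    """Columnar scan: only the requested columns are read, plus a
    hypothetical per-value overhead for reassembling records."""
    return num_records * fields_read * (1 + assembly_overhead)

# Hypothetical table: 1000 records with 10 fields each.
few_fields_columnar = column_store_cost(1000, fields_read=2)
all_fields_columnar = column_store_cost(1000, fields_read=10)
full_row_scan = row_store_cost(1000, total_fields=10)
```

Under these assumptions the columnar scan is much cheaper when only 2 of the 10 fields are read and more expensive than the row scan when all 10 are read, which matches the direction of the result described above. Where the crossover actually falls depends on the real overheads, which this sketch does not try to estimate.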

This paper has certainly been very influential; ideas from Dremel have almost certainly been incorporated into a number of other systems that run SQL-like queries over data in a similar way. The columnar data format, a somewhat new idea, especially for nested data, has also gained a great deal of traction, possibly in part due to Dremel.
