Monday, October 5, 2015

Review of "Spark SQL: Relational Data Processing in Spark"

Spark SQL solves two interesting problems simultaneously: better support for declarative / SQL programming on nested / semistructured / big data, and better integration of procedural with declarative programming.

The main idea is to provide a DataFrame API that abstracts over a data source (RDD, Hive table, CSV file, etc.) and exposes relational-style operations that can be optimized. This lets developers operate seamlessly over many different data sources, including collections of Java/Python objects. They also introduce Catalyst, a highly extensible query optimizer. One important thing to notice, in my opinion, is how flexible the entire design is: it is easy to define new data sources, optimizations, UDTs, and UDFs, all of which play nicely together.
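To make this concrete, here is a minimal sketch of the DataFrame API in Scala. It assumes Spark 2.x's SparkSession entry point (the paper itself uses the older SQLContext) and a hypothetical users.json file; the same relational operators would work unchanged over a Hive table, Parquet, or an RDD of objects.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()

    // Load a DataFrame from one of many possible sources; "users.json"
    // is a hypothetical file with at least `age` and `dept` columns.
    val users = spark.read.json("users.json")

    // Relational operations build a logical plan lazily; Catalyst
    // optimizes it (e.g., pushing the filter toward the source)
    // before any execution happens.
    val youngByDept = users.filter(users("age") < 21)
                           .groupBy("dept")
                           .count()

    youngByDept.show()
    spark.stop()
  }
}
```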

One thing about this paper that I think will continue to be influential is the idea of mixing declarative / SQL programming with procedural programming. While this has always been possible to some extent via UDFs, Spark SQL integrates the two far more tightly, in a way that seems both easier to work with and more flexible.
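As a hedged sketch of what that intermingling looks like, the snippet below (reusing the hypothetical `spark` and `users` from the previous example) wraps ordinary procedural Scala code in a UDF and drops it into the middle of a relational pipeline:

```scala
import org.apache.spark.sql.functions.udf

// Arbitrary procedural logic, written as a plain Scala function...
val bucket = udf((age: Int) => if (age < 21) "minor" else "adult")

// ...used inline in a declarative pipeline. Catalyst treats the UDF as
// opaque but still optimizes the relational operators around it.
val byBucket = users.withColumn("bucket", bucket(users("age")))
                    .groupBy("bucket")
                    .count()

byBucket.show()
```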
