CS294 Paper Review Blog: Review of "Impala: A Modern, Open-Source SQL Engine for Hadoop"

Impala aims to solve a very real problem: how do we get the most out of all of the data we have stored in Hadoop (i.e. HBase or HDFS)? Its approach is to run SQL-type queries in an interactive, low-latency manner (as opposed to the higher-latency, bulk-processing semantics of Apache Hive). The hope is that you will no longer need to export data from Hadoop to a traditional RDBMS data warehouse for analytics and business intelligence.

Impala, rather than owing its speed to one new core idea, seems to me to be built on a number of ideas that work together to provide overall efficiency gains. The authors use code generation instead of interpretation to make query execution more efficient at the instruction level (a technique previously employed by e.g. Hekaton and others); remove centralized control from the query path (statestored and catalogd both push changes, rather than the individual Impala daemons asking for information on each query); seem to provide better query optimization than previous SQL-on-Hadoop systems; and use short-circuit local reads to bypass the full HDFS DataNode protocol, allowing for much faster local disk reads (a relatively new addition to Hadoop that other systems can easily take advantage of as well).
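To make the code-generation point concrete, here is a minimal sketch of the general idea in Python. It is an analogy only: Impala generates native code at runtime via LLVM, whereas this toy compiles a predicate into a Python lambda. The names `Col`, `Const`, `BinOp`, `interpret`, `to_source`, and `compile_expr` are all hypothetical and not part of Impala.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "*", ">"
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Const, BinOp]
ALLOWED_OPS = ("+", "*", ">")


def interpret(expr, row):
    """Interpreted path: re-walk the expression tree for every row."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Const):
        return expr.value
    left = interpret(expr.left, row)
    right = interpret(expr.right, row)
    if expr.op == "+":
        return left + right
    if expr.op == "*":
        return left * right
    if expr.op == ">":
        return left > right
    raise ValueError(f"unknown operator {expr.op!r}")


def to_source(expr):
    """Turn the tree into a flat Python expression string."""
    if isinstance(expr, Col):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Const):
        return repr(expr.value)
    if expr.op not in ALLOWED_OPS:
        raise ValueError(f"unknown operator {expr.op!r}")
    return f"({to_source(expr.left)} {expr.op} {to_source(expr.right)})"


def compile_expr(expr):
    """Generated path: pay the tree walk once, then reuse the compiled function."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# Hypothetical predicate: price * qty > 100
predicate = BinOp(">", BinOp("*", Col("price"), Col("qty")), Const(100))
rows = [{"price": 30, "qty": 2}, {"price": 60, "qty": 2}]

compiled = compile_expr(predicate)
interpreted_result = [interpret(predicate, r) for r in rows]
compiled_result = [compiled(r) for r in rows]
```

Both paths return the same answers; the difference is that the compiled function no longer does type dispatch or operator lookup per row, which is the kind of per-tuple overhead that code generation is meant to remove.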

I see Impala as a next step past Hive in the SQL-on-Hadoop world; Hive was an initial attempt to bring SQL to Hadoop, but was never really that great. Despite that, it gained a lot of users, a testament to how badly end users want SQL on Hadoop. Given that demand, it seems natural that many systems are now addressing this problem (SparkSQL, Dremel, Drill, Presto, etc.).

One trade-off I recognize: their metadata push model means that the metadata a daemon holds may be stale, but the authors dismiss this as a non-issue since each query plan is created on a single node using consistent metadata.
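The staleness trade-off can be illustrated with a small sketch, under the assumption that metadata is pushed as whole versioned snapshots. This is not Impala's actual protocol; `Catalog`, `Daemon`, `add_file`, and the `deliver_to` parameter (used here to simulate a delayed push) are hypothetical names invented for illustration.

```python
from types import MappingProxyType


class Daemon:
    """Holds the last metadata snapshot pushed to it; never asks the catalog."""

    def __init__(self, name):
        self.name = name
        self.version = 0
        self.tables = MappingProxyType({})

    def on_update(self, version, tables):
        # Swap in the whole snapshot at once, so a plan never mixes versions.
        self.version = version
        self.tables = tables

    def plan(self, table):
        # Planning reads one local snapshot: possibly stale, but consistent.
        snapshot, version = self.tables, self.version
        return {"files": list(snapshot[table]), "metadata_version": version}


class Catalog:
    """Pushes a fresh read-only snapshot to subscribers on every change."""

    def __init__(self):
        self.version = 0
        self.tables = {}
        self.subscribers = []

    def subscribe(self, daemon):
        self.subscribers.append(daemon)

    def add_file(self, table, path, deliver_to=None):
        self.version += 1
        files = self.tables.get(table, ()) + (path,)
        self.tables = {**self.tables, table: files}
        snapshot = MappingProxyType(dict(self.tables))
        # deliver_to simulates a push that has not yet reached every daemon.
        targets = self.subscribers if deliver_to is None else deliver_to
        for daemon in targets:
            daemon.on_update(self.version, snapshot)


catalog = Catalog()
a, b = Daemon("a"), Daemon("b")
catalog.subscribe(a)
catalog.subscribe(b)

catalog.add_file("t", "f1")                  # reaches both daemons
catalog.add_file("t", "f2", deliver_to=[a])  # push to b is still in flight

plan_a = a.plan("t")
plan_b = b.plan("t")
```

Here daemon `b` plans against version 1 and simply does not see `f2` yet. Its plan is out of date but internally consistent, which is the situation the authors consider acceptable.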

I don't see Impala itself being particularly influential in 10 years - while it is a good implementation, it doesn't seem to have a large number of new ideas.

Monday, October 5, 2015
