Apache Kudu, Apache Phoenix + Tephra, and Apache Drill

January 27, 2016 6:00 PM

Simplifying big data analytics with Apache Kudu

Mike Percy, Cloudera

The Hadoop ecosystem has been making great strides in recent years. With systems such as Apache HBase and Apache Cassandra, applications can achieve millisecond-scale random access to arbitrarily sized datasets. However, these systems are not optimized for the fast sequential scans that analytic workloads require.

On the other end of the spectrum, columnar file formats such as Apache Parquet and Apache ORC are designed for very fast scan rates, offering great performance benefits to many SQL and analytics applications. Unfortunately, these formats offer little or no support for real-time modification or row-by-row indexed access.

Kudu was designed from the ground up to address this gap. It offers real-time random read/write access to individual records while also storing data in a columnar format, providing both exceptional scan performance and competitive random-access performance, and combining many of the benefits of the systems and formats above. This talk will discuss how Kudu can be used as a single storage system to greatly simplify analytical big data applications.
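As a rough illustration of the programming model described above, here is a minimal sketch using the Kudu Java client: a single-row random write followed by a columnar scan that reads back only the projected columns. The master address, table name, and schema are hypothetical.

```java
import java.util.Arrays;

import org.apache.kudu.client.*;

public class KuduExample {
    public static void main(String[] args) throws KuduException {
        // Connect to a (hypothetical) Kudu master.
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("metrics");

            // Random write: insert a single row.
            KuduSession session = client.newSession();
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            row.addString("host", "web01");
            row.addLong("ts", System.currentTimeMillis());
            row.addDouble("value", 0.42);
            session.apply(insert);
            session.close();

            // Columnar scan: only the projected columns are read.
            KuduScanner scanner = client.newScannerBuilder(table)
                .setProjectedColumnNames(Arrays.asList("host", "value"))
                .build();
            while (scanner.hasMoreRows()) {
                for (RowResult result : scanner.nextRows()) {
                    System.out.println(result.getString("host") + " = "
                        + result.getDouble("value"));
                }
            }
        } finally {
            client.shutdown();
        }
    }
}
```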

Apache Phoenix: OLTP in Hadoop

James Taylor, Salesforce.com

This talk will examine how Apache Phoenix, a top-level Apache project, differentiates itself from other SQL solutions in the Hadoop ecosystem. It will start by exploring some of the fundamental concepts in Phoenix that lead to dramatically better performance and explain how these enable features such as secondary indexing, joins, and multi-tenancy. Next, it will give an overview of ACID transactions, a new feature available in our 4.7.0 release, along with an outline of the integration with Apache Tephra that enables this capability, including a demo showing how Phoenix can be used seamlessly in CDAP. The talk will conclude with a discussion of in-flight work to move Phoenix on top of Apache Calcite to improve query optimization, broaden our SQL support, and provide better interoperability with other projects such as Drill, Hive, Kylin, and Samza.
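For readers unfamiliar with Phoenix, the sketch below shows how it exposes HBase through standard JDBC, including a transactional table as enabled by the Tephra integration. This is a minimal illustration: the ZooKeeper quorum, table name, and schema are hypothetical, and the TRANSACTIONAL=true property assumes Phoenix 4.7.0 or later.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        // Connect through the Phoenix JDBC driver; "zk-host" is a
        // hypothetical ZooKeeper quorum for the HBase cluster.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:zk-host");
             Statement stmt = conn.createStatement()) {

            // TRANSACTIONAL=true opts the table into ACID semantics
            // via the Tephra integration (Phoenix 4.7.0+).
            stmt.execute("CREATE TABLE IF NOT EXISTS accounts ("
                + "id BIGINT PRIMARY KEY, balance DECIMAL) "
                + "TRANSACTIONAL=true");

            // Phoenix uses UPSERT for both inserts and updates.
            stmt.executeUpdate("UPSERT INTO accounts VALUES (1, 100.0)");
            stmt.executeUpdate("UPSERT INTO accounts VALUES (2, 250.0)");
            // Autocommit is off by default; committing makes both
            // upserts visible atomically.
            conn.commit();

            try (ResultSet rs =
                     stmt.executeQuery("SELECT id, balance FROM accounts")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " -> "
                        + rs.getBigDecimal("balance"));
                }
            }
        }
    }
}
```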

SQL-on-Everything with Apache Drill

Julien Le Dem, Dremio

In recent years, the rise of modern, non-relational datastores such as NoSQL databases, Hadoop, and cloud storage has made it easier for developers to build and scale applications. However, these datastores make it harder for business users and analysts to analyze the data. In many cases, data engineers must develop complex ETL pipelines that load the data into a centralized relational database or SQL-on-Hadoop environment.

Apache Drill is an open source, in-memory, columnar SQL execution engine. It enables users and BI tools to execute large-scale, interactive SQL queries against one or more datastores. Drill supports NoSQL databases (e.g., MongoDB, HBase, Kudu), search engines (e.g., Elasticsearch, Solr), file systems (e.g., HDFS, NAS), cloud storage (e.g., Amazon S3, Azure Blob Storage), and relational databases (e.g., MySQL, Oracle). Users can run queries on a single system or join data across multiple systems. For example, a user can join log files in Elasticsearch with user profiles in MySQL, or even with an Excel spreadsheet.
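As a rough sketch of what such a cross-datastore join looks like in practice, the following uses Drill's JDBC driver to join a JSON log file with a MySQL table in a single query. The storage plugin names (dfs, mysql), file path, and column names are hypothetical and depend on how the Drill storage plugins are configured.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) local Drillbit via JDBC.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement()) {

            // Join raw JSON log files (queried in place through the
            // "dfs" plugin) with a users table in MySQL. Plugin names,
            // paths, and columns are illustrative.
            String sql =
                "SELECT u.name, COUNT(*) AS hits "
                + "FROM dfs.`/data/logs/access.json` l "
                + "JOIN mysql.app.users u ON l.user_id = u.id "
                + "GROUP BY u.name";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + ": "
                        + rs.getLong("hits"));
                }
            }
        }
    }
}
```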

In this talk, we will provide an overview of Apache Drill and explain how to use it to query data in one or more datastores, with a particular emphasis on modern, non-relational datastores.