Building data lineage; Running Dataproc with Alluxio

November 21, 2019 6:00 PM

Fine-grained root cause and impact analysis with CDAP Lineage

Yuki Jung, Google

Lineage is a critical aspect of data governance in large enterprises: it provides traceability for data as it flows through a data system. It unlocks use cases such as root cause analysis (finding the cause of a bad data event) and impact analysis (gauging the impact of a change before making it). In this talk, the speaker will demonstrate how CDAP’s granular data lineage capabilities can address these use cases for enterprises.

Accelerating workloads and bursting remote data with Google Dataproc using Alluxio

Dipti Borkar, Alluxio & Roderick Yao, Google Cloud

Google Cloud Dataproc is a popular managed, on-demand service for running Spark, Presto, and many other compute workloads. Alluxio, an open source data orchestration technology, helps speed up Dataproc workloads by providing a distributed caching layer within the Dataproc cluster. In addition, Alluxio enables “zero-copy” bursting, which lets users run compute workloads on remote data, whether it resides on-premises or in another cloud. In this session, Dipti from Alluxio and Roderick from Google Cloud will give an overview of Alluxio and Google Dataproc and the benefits of using the two together. The session will include a demo of initializing a Dataproc cluster with Alluxio to run workloads on remote data.