Building Data Pipelines with Cask Hydrator
Gokul Gunasekaran Cask
Cask Hydrator is an extension to the open source Cask Data Application Platform (CDAP) that simplifies the process of developing and operating real-time and batch data pipelines on Hadoop. Hydrator’s web-based drag-and-drop UI allows users to quickly build Hadoop-scalable, distro-agnostic data pipelines without writing any code.
Powered by CDAP (http://cdap.io), Hydrator provides ease of operability by collecting metadata, lineage, metrics, and logs in a single location. In this talk, we will build data pipelines with real-life applications that pull in data from multiple sources, train and use a machine learning model to classify data using Spark MLlib, and write data to different sinks. We will also delve under the covers to see how these data pipelines are transformed into a series of MapReduce/Spark jobs, and touch upon some interesting challenges we had to tackle while developing Hydrator.
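As a rough illustration of the pipeline structure described above, here is a minimal sketch of what a Hydrator batch pipeline might look like when expressed as a JSON configuration: a source, a parsing transform, a Spark ML stage, and a sink, wired together by connections. The stage and plugin names here are illustrative assumptions, not the exact CDAP/Hydrator schema.

```python
import json

# Hypothetical pipeline spec: each stage names a plugin; connections form the DAG
# that Hydrator's planner would compile into MapReduce/Spark jobs.
pipeline = {
    "name": "LogClassifier",
    "stages": [
        {"name": "fileSource", "plugin": {"type": "batchsource", "name": "File"}},
        {"name": "parser", "plugin": {"type": "transform", "name": "CSVParser"}},
        {"name": "classifier", "plugin": {"type": "sparkcompute", "name": "NaiveBayesClassifier"}},
        {"name": "tableSink", "plugin": {"type": "batchsink", "name": "Table"}},
    ],
    "connections": [
        {"from": "fileSource", "to": "parser"},
        {"from": "parser", "to": "classifier"},
        {"from": "classifier", "to": "tableSink"},
    ],
}

print(json.dumps(pipeline, indent=2))
```

The point of the linear DAG above is that the same declarative spec can be planned onto either MapReduce or Spark without the user writing engine-specific code.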
PXF: A Unified Access Framework for Distributed Data Systems on HDFS
Shivram Mani Pivotal
We are in a world with multiple storage systems optimized for different data models. This poses the challenge of running aggregate analysis across the varied storage engines on HDFS.
PXF provides a unified, extensible framework for solving this precise problem. The pluggable framework makes it convenient to add plugins supporting custom data sources. Existing plugins support loading and querying data stored in HDFS, HBase, and Hive, across a wide range of data formats including Text, Avro, SequenceFile, Hive RCFile, ORC, and Parquet.
Example use cases include applying statistical and analytical functions, with filter pushdown, from Postgres or Apache HAWQ to HDFS, HBase, and Hive data; joining in-database dimensions with HBase facts; leveraging analytical capabilities on Hadoop data files; and fast ingest of data into HAWQ for in-database processing and analytics.
PXF is an open source project that is currently used by Apache HAWQ and is in the process of being integrated with other SQL engines.
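PXF plugins divide external data access into three roles: one component splits the source into fragments, one reads raw records from a fragment, and one resolves each record into typed fields. The toy Python analogue below illustrates that division of labor; the class names and the plain-text "source" are hypothetical stand-ins, not PXF's actual Java API.

```python
class LineFragmenter:
    """Splits a text source into fragments (analogous to HDFS block splits)."""
    def fragments(self, text, size):
        lines = text.splitlines()
        return [lines[i:i + size] for i in range(0, len(lines), size)]

class LineAccessor:
    """Reads raw records (here, lines) out of a single fragment."""
    def records(self, fragment):
        for line in fragment:
            yield line

class CsvResolver:
    """Resolves a raw record into typed fields for the SQL engine."""
    def fields(self, record):
        name, value = record.split(",")
        return {"name": name, "value": int(value)}

# Drive the three roles end to end over a tiny in-memory "data source".
data = "a,1\nb,2\nc,3"
rows = []
for frag in LineFragmenter().fragments(data, size=2):
    for rec in LineAccessor().records(frag):
        rows.append(CsvResolver().fields(rec))
print(rows)  # → [{'name': 'a', 'value': 1}, {'name': 'b', 'value': 2}, {'name': 'c', 'value': 3}]
```

Because each role is a separate pluggable component, supporting a new storage system or format means implementing only the piece that differs, which is what makes the framework convenient to extend.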
Illia Polosukhin Google
Deep Learning has unlocked new types of interactions, products, and understanding, especially in the last 5 years. The field is moving very quickly, and getting the latest innovations into released products is a challenge. TensorFlow is growing into a platform that lets researchers and industry collaborate, as it provides tools for experimentation and deep learning. This talk will cover some recent developments in Deep Learning and the kinds of experiences entrepreneurs and hackers can build using TensorFlow.