Self-Service Data Integration using Apache Spark
Edwin Elia, Cask
Enterprises face an increasing need to ingest high volumes of data from a wide variety of structured and unstructured sources. Ingestion from these sources often includes steps to cleanse, transform, and prepare the data before landing it in a data lake. To support this, organizations are increasingly embracing self-service, allowing data engineers, data scientists, and citizen integrators to prepare and ingest data from many different sources. In this talk, we will cover how Cask approaches data preparation and ingestion by providing self-service integration tools that do not compromise strict enterprise guidelines around security and governance. We will also demonstrate the journey of a data engineer or data scientist in preparing and transforming data and building a production-grade, end-to-end data pipeline on Apache Spark with just a few clicks.
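As a rough illustration of the cleanse-and-transform steps the abstract describes, the sketch below uses plain Python with hypothetical field names; in a real pipeline these functions would run as Spark transformations over a distributed dataset rather than over in-memory lists.

```python
# Minimal sketch of cleanse/transform steps before landing data in a lake.
# Field names ("id", "amount") are illustrative assumptions, not a real schema.

def cleanse(records):
    """Drop malformed records: missing id or empty amount."""
    return [r for r in records
            if r.get("id") and r.get("amount") not in (None, "")]

def transform(records):
    """Normalize types and trim whitespace so downstream jobs see clean rows."""
    out = []
    for r in records:
        out.append({
            "id": str(r["id"]).strip(),
            "amount": float(r["amount"]),  # cast string amounts to numbers
        })
    return out

raw = [
    {"id": " 1 ", "amount": "9.99"},
    {"id": None, "amount": "3.50"},   # malformed: dropped by cleanse()
    {"id": "2", "amount": "12.00"},
]
prepared = transform(cleanse(raw))
```

The same cleanse-then-transform shape is what a self-service tool would generate behind the scenes before handing the pipeline to Spark for execution at scale.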
Making Big Data Go Faster
Morgan Littlewood, Kodiak Data
When developing and deploying complex, time-sensitive analytics applications, cluster resources must be configured and balanced. Multiple complex stacks may be needed, each with its own CPU, RAM, and disk capacity requirements. Storage and network performance are also critical factors in provisioning development and production clusters. How do you configure and size cluster nodes, especially when usage and data may be growing rapidly? Today, many data teams over-engineer their production clusters on ‘bare metal’ servers; however, Big Data infrastructure can be shared across many clusters. Kodiak Data software enables a ‘virtual cluster infrastructure’ (VCI) in which ‘virtual clusters’ are isolated from each other and abstracted from the physical infrastructure. Cluster virtualization simplifies operations and significantly improves asset utilization. Applications and all popular Big Data stacks run unchanged, and orchestration software such as Kubernetes and Mesos can still be used.
Data Pipelines in Kubernetes
Sean Suchter, Pepperdata
Kubernetes is a fast-growing open-source platform that provides container-centric infrastructure. Conceived by Google in 2014 and leveraging over a decade of experience running containers at scale internally, it is one of the fastest-moving projects on GitHub, with 1,000+ contributors and 40,000+ commits. Kubernetes has first-class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Kubernetes is already used extensively to run stateless applications, both on-premises and in the cloud, and it is increasingly being used for stateful applications such as databases and message queues as well. An emerging use case is data processing workloads, and much of the open-source community's recent effort has gone into supporting them by enabling workloads like Spark and HDFS to run well on Kubernetes. In this talk, I cover the various parts of a containerized data processing pipeline in Kubernetes using an example, and talk briefly about trade-offs and performance considerations.
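To give a concrete sense of what running Spark on Kubernetes looks like, the command below sketches a `spark-submit` invocation against a Kubernetes API server, in the style of the Spark-on-Kubernetes work; the API server address and container image are placeholders, and this assumes a Spark build with Kubernetes scheduler support.

```shell
# Sketch: submit a Spark job to a Kubernetes cluster in cluster mode.
# The apiserver address and image name are placeholders, not real endpoints.
spark-submit \
  --master k8s://https://example-apiserver:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=registry.example.com/spark:latest \
  local:///opt/spark/examples/jars/spark-examples.jar
```

Here the driver and executors run as pods scheduled by Kubernetes, so the same cluster that runs the rest of the containerized pipeline can also host the Spark stages.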