
Self Service Data Lakes, Apache Spark & Apache Ignite, and Scalable Clusters on Demand

January 31, 2018 6:00 PM

Building a Self-Service Data Lake on Google Cloud Platform

Ali Anwar, Cask

With the latest technology options for big data processing, storage, and resource management easily accessible in the cloud, more and more organizations are ready to build their data lakes there. But, as in the on-premises world, challenges remain around integrating data; operationalizing, securing, and governing the data lake; and enabling self-service access to data with “IT guardrails”.

In this talk, Ali Anwar will demonstrate how the Cask Data Application Platform (CDAP) helps architects, developers, and data scientists avoid the complexity and inefficiency that come with the messy, diverse nature of big data, and how to use its platform capabilities, frameworks, and self-service tools to go from data prep to a fully operational data lake on the Google Cloud Platform (GCP). Ali will highlight GCP-specific integrations in CDAP and describe popular use cases such as Change Data Capture, cloud migration, and machine learning/AI.
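
To make the “data lake on GCP” idea concrete, here is a minimal Python sketch, independent of CDAP, of the kind of landing step such a pipeline performs: dropping a prepared extract into Google Cloud Storage and loading it into BigQuery. The bucket, table, and file names are hypothetical, and in practice a platform like CDAP would drive steps like this through configured pipelines rather than hand-written code.

```python
# Illustrative sketch only (not CDAP code): land a prepared CSV extract in
# Google Cloud Storage, then load it into BigQuery for SQL access.
# Bucket, table, and file names below are hypothetical placeholders.
from google.cloud import bigquery, storage

BUCKET = "example-data-lake-raw"          # hypothetical GCS bucket
TABLE_ID = "example-project.lake.events"  # hypothetical project.dataset.table

# 1. Land the raw extract in Google Cloud Storage.
gcs = storage.Client()
blob = gcs.bucket(BUCKET).blob("events/2018-01-31.csv")
blob.upload_from_filename("events.csv")

# 2. Load the file from GCS into BigQuery.
bq = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = bq.load_table_from_uri(
    f"gs://{BUCKET}/events/2018-01-31.csv", TABLE_ID, job_config=job_config
)
load_job.result()  # block until the load job completes
print(f"Loaded {bq.get_table(TABLE_ID).num_rows} rows into {TABLE_ID}")
```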

Apache Spark and Apache Ignite: Where Fast Data Meets the IoT

Denis Magda, GridGain

Building a mesh of sensors or embedded devices is not enough to gain more insight into the surrounding environment and optimize your production systems. Usually, your IoT solution also needs to transfer enormous amounts of data to storage or the cloud, where the data has to be processed further. Quite often, these endless streams of data have to be processed in real time so that you can react to the state of the IoT subsystem accordingly.
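
As a rough illustration of the “transfer to storage” half of that picture, the sketch below uses Apache Ignite’s Python thin client (pyignite) to push simulated sensor readings into a distributed Ignite cache. The host, port, and reading generator are placeholders, not code from the talk.

```python
# Minimal sketch: push simulated IoT readings into an Apache Ignite cache
# using the pyignite thin client. The connection details and the reading
# generator are hypothetical stand-ins for a real ingestion layer.
import random
import time

from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)  # default Ignite thin-client port

# An Ignite cache is a key-value store distributed across the cluster,
# so writes like these scale out with the number of server nodes.
readings = client.get_or_create_cache("sensor_readings")

try:
    for seq in range(1000):  # stand-in for an endless device stream
        device_id = f"device-{seq % 10}"
        temperature = 20.0 + random.random() * 5.0
        readings.put(f"{device_id}:{seq}", temperature)
        time.sleep(0.01)
finally:
    client.close()
```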

This session will show attendees how to build a Fast Data solution that receives endless streams from the IoT side and processes them in real time using Apache Ignite’s cluster resources.
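
For a flavor of the real-time side, the following sketch uses Spark Structured Streaming to read comma-separated sensor readings from a socket (again a stand-in for a real IoT ingestion layer) and maintain windowed per-device averages. It is an illustration under those assumptions, not the speaker’s code, and the step of serving the results from Ignite caches is omitted.

```python
# Minimal sketch: windowed aggregation over a stream of sensor readings with
# Spark Structured Streaming. The socket source and the "<device_id>,<temp>"
# line format are assumptions made for the example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("iot-fast-data-sketch")
         .getOrCreate())

# Each incoming line is assumed to look like: "device-3,23.7"
raw = (spark.readStream
       .format("socket")
       .option("host", "localhost")
       .option("port", 9999)
       .load())

readings = raw.select(
    F.split(raw.value, ",").getItem(0).alias("device_id"),
    F.split(raw.value, ",").getItem(1).cast("double").alias("temperature"),
    F.current_timestamp().alias("event_time"),
)

# Average temperature per device over one-minute windows, updated continuously.
averages = (readings
            .groupBy(F.window("event_time", "1 minute"), "device_id")
            .agg(F.avg("temperature").alias("avg_temp")))

query = (averages.writeStream
         .outputMode("update")
         .format("console")   # a real deployment would write to a fast store
         .start())

query.awaitTermination()
```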

Scalable Clusters on Demand

Bogdan Kyryliuk & Gustavo Torres, Opendoor

At Opendoor, we do a lot of big data processing and use Spark and Dask clusters for the computations. Our machine learning platform is written in Dask, and we are actively moving data ingestion pipelines and geo computations to PySpark. The biggest challenge is that jobs vary in memory and CPU needs, and the load is not evenly distributed over time, which causes our workers and clusters to be over-provisioned. In addition, we need to let data scientists and engineers run their code without having to upgrade the cluster for every request or deal with dependency hell.

To solve all of these problems, we introduce a lightweight integration across popular tools like Kubernetes, Docker, Airflow, and Spark. Using a combination of these tools, we are able to spin up on-demand Spark and Dask clusters for our computing jobs, bring down cost with autoscaling and spot pricing, and unify DAGs from many teams with different stacks on a single Airflow instance, all at minimal cost.
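
A rough sketch of what that integration can look like, assuming Airflow 1.10’s contrib KubernetesPodOperator, a hypothetical Docker image with the job’s dependencies baked in, and a Kubernetes namespace for these jobs: the DAG below launches a containerized PySpark job as a short-lived pod, so compute is requested per job instead of kept on a long-lived, over-provisioned cluster.

```python
# Hedged sketch: an Airflow DAG that runs a containerized PySpark job as an
# on-demand Kubernetes pod. Image name, namespace, script path, and resource
# requests are hypothetical; dependencies live in the Docker image, so no
# shared cluster has to be upgraded per request.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import (
    KubernetesPodOperator,
)

dag = DAG(
    dag_id="on_demand_geo_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

run_geo_job = KubernetesPodOperator(
    task_id="run_geo_job",
    name="geo-job",
    namespace="spark-jobs",                     # hypothetical namespace
    image="example.registry/geo-pipeline:1.0",  # hypothetical image with deps
    cmds=["spark-submit"],
    # Local-mode Spark inside one pod keeps the sketch simple; a full setup
    # would spin up a multi-node Spark or Dask cluster instead.
    arguments=["--master", "local[*]", "/app/geo_pipeline.py"],
    resources={"request_memory": "4Gi", "request_cpu": "2"},
    get_logs=True,
    is_delete_operator_pod=True,  # tear the pod down when the job finishes
    dag=dag,
)
```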