Who Moved my Data? - Why tracking changes and sources of data is critical to your data lake success
Russ Savage, Cask
As data lakes grow and more users begin exploring that data and including it in their everyday analysis, keeping track of where data comes from becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis relies on the most accurate and up-to-date information. In this talk, we will explore the different techniques available for keeping track of the data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.
One size doesn’t fit all: making a case for Federated Data Science using Ampool
Nitin Lamba & Suhas Gogate, Ampool
Anomaly detection is a very common pattern, used not only in financial transactions but also to find abnormal behavior in health monitoring and IoT. Even more common is the use of multiple analytical tools in data science (Python, R, and Apache Spark, to name a few), especially in large multi-tenant environments. Enterprises spend a lot of time moving and copying data to cater to these needs. Instead of having disparate back-end systems feed these tools, a simpler approach is to separate the concerns of compute and fast data serving.
In this talk, we will walk through one such anomaly detection use case, in which an in-memory data service layer serves hot, high-value data to different tools from a single, scalable cluster. This not only reduces data copies but also mitigates operational complexity (fewer moving parts). We illustrate how a single data flow can use these multiple engines, making timely, actionable insights a reality, and run concurrent analytics workloads at in-memory speeds.
Analyze Ad impressions at the speed of thought using Spark 2.0 and SnappyData
Jags Ramnarayan, SnappyData
In Ad Analytics you have to consolidate ad impression streams from many sites, cleanse them, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in an in-memory store along with history in a data lake, and permit interactive analytic queries on this constantly growing data.
Rather than stitching together multiple clusters as proposed in the Lambda architecture, we walk through a design where everything is achieved in a single, horizontally scalable Spark 2.0 cluster: stream ingestion (parallel ingest, continuous stream analytics), storing into an in-memory store, overflowing to Hadoop, and interactive analytic queries that combine history with streams. The result is a design that is simpler and a lot more efficient.
We cover how the new Spark 2.0 enhancements make continuous analytics very simple, and also discuss how deeply integrating a transactional and analytical in-memory database, fully collocated with Spark executors, offers significant benefits: Spark becomes capable of managing mutable, transactionally consistent data and indexes, and can run concurrent analytics queries at in-memory speeds.
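As a rough illustration of the per-minute pre-aggregation step mentioned in this abstract, the following is a minimal sketch in plain Python. It does not use Spark or SnappyData, and the event fields (`ts`, `publisher`, `clicked`) and function names are hypothetical, chosen only for illustration; the talk's actual pipeline is not described here.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_bucket(ts):
    """Truncate a Unix timestamp (seconds) to the start of its UTC minute."""
    start = int(ts) // 60 * 60
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


def preaggregate(impressions):
    """Roll raw impression events up into per-minute, per-publisher counters.

    Each event is assumed to be a dict with hypothetical keys:
    'ts' (Unix seconds), 'publisher' (str) and optional 'clicked' (0 or 1).
    """
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for event in impressions:
        key = (minute_bucket(event["ts"]), event["publisher"])
        totals[key]["impressions"] += 1
        totals[key]["clicks"] += event.get("clicked", 0)
    return dict(totals)
```

Storing only these rolled-up counters for recent minutes, instead of every raw event, is one way to keep the in-memory portion small while the raw history goes to the data lake.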