Building Large Scale Applications on Apache Hadoop YARN with Apache Twill
Poorna Chandra Cask
Twill is an Apache incubator project that provides a higher-level abstraction for building distributed applications on YARN. Developing distributed applications directly on YARN is challenging because YARN does not provide higher-level APIs, and a lot of boilerplate code must be duplicated to deploy each application. As a result, YARN applications are typically written by framework developers, such as those working on Apache Flink or Apache Spark, who need to deploy their framework in a distributed way.
With Twill, application developers need only be familiar with the basics of the Java programming model to use the Twill APIs, so they can focus on solving business problems. In this talk I present how Twill can be leveraged, using as an example the Cask Data Application Platform (CDAP), which relies heavily on Twill for resource management.
Introduction to large-scale Machine Learning with Apache Flink
Theodore Vasiloudis SICS
Apache Flink is an open-source platform for distributed stream and batch data processing. In this talk we will show how Flink’s streaming engine and support for native iterations make it an excellent candidate for developing large-scale machine learning algorithms.
This talk will focus on FlinkML, a new effort to bring scalable machine learning tools to the Flink community. We will provide an introduction to the library, illustrate how we employ some state-of-the-art algorithms to make FlinkML truly scalable, and provide a view into the challenges and decisions one has to make when designing a robust and scalable machine learning library.
Finally, if time permits, we will demonstrate how one can perform some interactive analysis using FlinkML and the notebook environment of Apache Zeppelin.
Ambry: LinkedIn's Scalable Geo-distributed Object Store
Sivabalan Narayanan LinkedIn
Ambry is an open-source, geo-distributed, highly available, and horizontally scalable object store built at LinkedIn. It is an active-active, immutable, eventually consistent handle store that can be configured to provide different levels of consistency. At LinkedIn, Ambry runs on hundreds of nodes spanning multiple data centers and is the source of truth for media and other immutable content.
The talk starts by discussing the need for a scalable, geo-distributed, and highly available object store in a media-centric world, and how Ambry acts as the single source of truth for all immutable content at LinkedIn. We will go over some of the design decisions that helped Ambry scale for both large and small objects, and how they helped solve some of the main existing pain points. The talk also covers the use cases for which Ambry is suited. The second part of the talk goes over Ambry's architecture, and the talk ends with our roadmap.