Apache Kafka: Leveraging Real-time Data at Scale
Since it was open sourced, Apache Kafka has been widely adopted, by web companies like Uber, Netflix, and LinkedIn as well as more traditional enterprises like Cerner, Goldman Sachs, and Cisco. At these companies, Kafka is used in a variety of ways: as a pipeline for collecting high-volume log data for load into Hadoop, as a means of collecting operational metrics to feed monitoring and alerting applications, for low-latency messaging use cases, and to power near-real-time stream processing.
Kafka’s unique architecture allows it to be used both for real-time processing and as a bus for feeding batch systems like Hadoop. Kafka is fundamentally changing the way data flows through an organization and presents opportunities for processing data in real time that were not possible before. The biggest change this has led to is a shift in the way data is integrated across a variety of data sources and systems.
In this talk, Neha will discuss how companies are using Apache Kafka and where it fits in the Big Data ecosystem.
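To make the pipeline-plus-real-time dual use concrete, here is a toy sketch (not Kafka's actual API) of Kafka's core abstraction: each topic partition is an append-only log, and every consumer tracks its own offset, so a real-time monitoring consumer and a batch Hadoop loader can read the same data independently.

```python
from collections import defaultdict

# Toy illustration of Kafka's partitioned-log model (not the Kafka client API).
class PartitionedLog:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]
        # Each (consumer, partition) pair advances its own offset.
        self.offsets = defaultdict(int)

    def produce(self, key, value):
        # Keyed messages hash to a fixed partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def consume(self, consumer, partition):
        # Reading never mutates the log; only the consumer's offset moves.
        off = self.offsets[(consumer, partition)]
        if off >= len(self.partitions[partition]):
            return None
        self.offsets[(consumer, partition)] += 1
        return self.partitions[partition][off]

log = PartitionedLog(num_partitions=1)
log.produce("host1", "cpu=90")
log.produce("host1", "cpu=95")
# A monitoring consumer and a Hadoop loader read the same log independently.
print(log.consume("monitor", 0))  # ('host1', 'cpu=90')
print(log.consume("hadoop", 0))   # ('host1', 'cpu=90')
```

Because consumption is just moving an offset over an immutable log, adding a new downstream system never disturbs existing ones, which is what makes Kafka attractive as a company-wide integration point.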
NRT Event Processing with Guaranteed Delivery of HTTP
Poorna Chandra, Cask
At Salesforce, we are building a new service, code-named Webhooks, that enables our customers’ own systems to respond in near real-time to system events and customer behavioral actions from the Salesforce Marketing Cloud. The service must process millions of events per day to address current needs and scale up to billions of events per day in the future, so horizontal scalability is a primary concern. In this talk, we will discuss how Webhooks is built using HBase for data storage and the Cask Data Application Platform (CDAP), an open source framework for building applications on Hadoop.
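The guaranteed-delivery requirement in the title is usually met with an at-least-once pattern: persist each event before attempting delivery, and remove it only after the endpoint acknowledges. A minimal sketch, with illustrative names (this is not the actual Webhooks implementation; the talk uses HBase where this toy uses an in-memory list):

```python
# Toy at-least-once dispatcher: an event survives failed delivery attempts
# and is retried until it succeeds or exhausts its attempt budget.
class ReliableDispatcher:
    def __init__(self, max_attempts=3):
        self.pending = []               # stand-in for durable storage
        self.max_attempts = max_attempts

    def enqueue(self, event):
        self.pending.append({"event": event, "attempts": 0})

    def dispatch(self, send):
        # 'send' stands in for an HTTP POST; True means a 2xx response.
        still_pending, delivered = [], []
        for entry in self.pending:
            entry["attempts"] += 1
            if send(entry["event"]):
                delivered.append(entry["event"])   # acknowledged: drop it
            elif entry["attempts"] < self.max_attempts:
                still_pending.append(entry)        # keep for a later retry
        self.pending = still_pending
        return delivered

d = ReliableDispatcher()
d.enqueue("user.click")
flaky = iter([False, True])               # first POST fails, second succeeds
print(d.dispatch(lambda e: next(flaky)))  # [] -- delivery failed, event kept
print(d.dispatch(lambda e: next(flaky)))  # ['user.click'] -- delivered
```

The trade-off is that endpoints may see an event more than once (e.g. if a crash occurs between a successful POST and the acknowledgment being recorded), so receivers of such systems are typically expected to be idempotent.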
Kite: Helping Hadoop Projects Work Together
Ryan Blue, Cloudera
Here’s a great idea for databases: let’s provide SQL like normal, but let’s also allow users to muck around with their data directly! A few short years ago, that would have been a joke. Today, MPP databases like Impala face exactly that problem.
Big data applications on Hadoop commonly require several projects from the ecosystem to work together on the same data. Interoperability between those projects remains a big challenge for developers because each project interacts with datasets differently. When Spark writes files with an OutputFormat, how does Impala know how to read them?
In this talk, Ryan will introduce Kite, a data-focused API for Hadoop, and talk about how we are using it to address the interoperability problem.
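The idea behind a data-focused API can be sketched in a few lines (illustrative names only, not Kite's actual classes): instead of each engine hard-coding how files were written, every tool consults a shared dataset descriptor (schema plus storage format) and delegates decoding to a matching reader, so data written by one tool is readable by another.

```python
# Toy dataset abstraction: format-specific decoding lives behind a shared
# descriptor, so any "engine" that honors the descriptor reads the data
# identically, regardless of which tool wrote it.
READERS = {
    "csv": lambda raw: [line.split(",") for line in raw.splitlines()],
    "tsv": lambda raw: [line.split("\t") for line in raw.splitlines()],
}

class DatasetDescriptor:
    def __init__(self, schema, fmt):
        self.schema = schema   # ordered field names
        self.format = fmt      # storage format identifier

def load(descriptor, raw):
    # Decoding is driven entirely by the descriptor, not by the caller.
    rows = READERS[descriptor.format](raw)
    return [dict(zip(descriptor.schema, row)) for row in rows]

desc = DatasetDescriptor(schema=["id", "name"], fmt="csv")
raw = "1,kafka\n2,kite"   # written by one tool...
print(load(desc, raw))    # ...read the same way by another
```

This is the shape of the interoperability fix: agree on dataset-level metadata once, and the Spark-writes/Impala-reads question reduces to both tools honoring the same descriptor.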