Session abstract:
Stream data processing is increasingly required to meet business needs for faster actionable insight from growing volumes of information arriving from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput, and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production and is used in production by large companies for real-time and batch processing at scale.
This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture that processes billions of events per day to deliver real-time analytical reports. The example is representative of many similar extract-transform-load (ETL) use cases with other data sets that can draw on a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves custom components with business-specific logic.
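To make the pipeline shape concrete, here is a minimal sketch of how such an ETL topology is assembled with the Apex DAG API, assuming the Kafka input and console output operators from the Apex Malhar library. The application name, stream names, topic, and the EventParser transform are illustrative placeholders, not part of the actual use case.

```java
import java.nio.charset.StandardCharsets;

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;
import com.datatorrent.lib.io.ConsoleOutputOperator;

// Sketch of a streaming ETL topology: Kafka source -> transform -> sink.
@ApplicationAnnotation(name = "StreamingETL")
public class StreamingETLApp implements StreamingApplication
{
  // Hypothetical placeholder for the business-specific transform component.
  public static class EventParser extends BaseOperator
  {
    public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();
    public final transient DefaultInputPort<byte[]> input = new DefaultInputPort<byte[]>()
    {
      @Override
      public void process(byte[] message)
      {
        // Decode/enrich the event; the real logic is use-case specific.
        output.emit(new String(message, StandardCharsets.UTF_8));
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Source: consume events from a Kafka topic (cluster/topic are examples).
    KafkaSinglePortInputOperator in =
        dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
    in.setClusters("localhost:9092");
    in.setTopics("events");

    // Transform: custom operator holding the business-specific logic.
    EventParser parser = dag.addOperator("parser", new EventParser());

    // Sink: console output keeps the sketch short; a production pipeline
    // would use a JDBC, Cassandra, or HBase output operator from the library.
    ConsoleOutputOperator out = dag.addOperator("console", new ConsoleOutputOperator());

    // Wire the operators into a DAG with named streams.
    dag.addStream("raw", in.outputPort, parser.input);
    dag.addStream("parsed", parser.output, out.input);
  }
}
```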
Topics include:
- Pipeline functionality from the event source through queryable state for real-time insights.
- API for application development and the development process.
- Library of building blocks, including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, and JDBC, and how they enable end-to-end exactly-once results.
- Stateful processing with event time windowing.
- Fault tolerance with exactly-once result semantics, checkpointing, and incremental recovery (a minimal stateful operator sketch follows this list).
- Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, and compute locality.
- Who is using Apex in production, and the project roadmap.
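As a taste of the stateful processing and fault tolerance topics above, the sketch below shows a stateful counting operator; in Apex, non-transient operator fields are checkpointed by the engine and restored on recovery. The class, port, and field names here are illustrative, not from the production use case.

```java
import java.util.HashMap;
import java.util.Map;

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Illustrative stateful operator: maintains running counts per key.
public class KeyCountOperator extends BaseOperator
{
  // Non-transient field: checkpointed by the engine and restored after a
  // failure, so the counts survive operator recovery.
  private Map<String, Long> counts = new HashMap<>();

  // Ports are transient: wiring is re-established by the engine, not checkpointed.
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String key)
    {
      long updated = counts.merge(key, 1L, Long::sum);
      // Emitting on every event keeps the example short; a real operator
      // might emit aggregates in endWindow() instead.
      output.emit(key + "=" + updated);
    }
  };
}
```

Combined with replayable sources and idempotent or transactional sinks such as the connectors listed above, this checkpointed state is what allows a pipeline to deliver end-to-end exactly-once results.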
Following the session, attendees will have a high-level understanding of Apex and how it can be applied to use cases at their own organizations.