Apache Flink deep-dive

Scale
06/02/2015 - 12:20 to 13:00
Stage 3
long talk (40 min)
Intermediate

Session abstract: 

Apache Flink is a distributed engine for batch and streaming data analysis. Flink offers familiar programming APIs based on parallel collections that represent data streams and transformations and window definitions on these collections.

Flink supports these APIs with a robust execution backend. Both batch and streaming APIs are backed by the same execution engine that has true streaming capabilities, resulting in true real-time stream processing and latency reduction in many batch programs. Flink implements its own memory manager and custom data processing algorithms inside the JVM, which makes the system behave very robustly both in-memory and under memory pressure. Flink has iterative processing built-in, implementing native iteration operators that create dataflows with feedback. Finally, Flink contains its own cost-based optimizer, type extraction, and data serialization stack.

The end result is a platform that is fast, easy to program against, unifies batch and stream processing without compromising on latency or throughput, requires very little tuning to sustain data-intensive workloads, and solves many of the problems of heavy data processing inside the JVM.

This talk gives an overview of Flink’s APIs, the most important features of Flink’s runtime and the benefits for users, as well as a roadmap of the project.

Video: 

Slide: