Session abstract:
The essence of near-real-time stream processing is to process huge volumes of data as they arrive. This talk will focus on building a pipeline that collects huge volumes of data and processes them in near real time using Storm.
Storm is a high-volume, continuous, reliable stream processing system developed at BackType and open-sourced by Twitter. Storm is widely used in many organizations and has a variety of use cases like:
* Realtime analytics
* Distributed RPC
* ETL etc.
Over the course of 40 minutes, using a real-time Wikipedia edit feed as an example, we will try to understand:
* Basic concepts of stream processing.
* A high-level view of the components involved in Storm.
* Writing a producer in Python that pushes the real-time edit feed from Wikipedia into a queue.
* Writing Storm topologies in Python to consume the feed and compute real-time metrics like:
  * Number of articles edited.
  * Category-wise count of articles being edited.
  * Distinct people editing the articles.
  * Geolocation counters, etc.
* Technological challenges around near-real-time stream processing systems:
  * Achieving low latency compared to batch processing.
  * State management in workers to maintain aggregated counts, such as the number of edits per article category.
  * Handling failures and crashes.
  * Deployment strategies.
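
As a rough illustration of the metrics and worker state listed above, here is a minimal plain-Python sketch of the kind of aggregation a worker might hold in memory. It does not use Storm's API. The `EditMetrics` class, the event fields (`title`, `category`, `user`) and the sample events are assumptions made up for this example, not the actual Wikipedia feed format or the code shown in the talk.

```python
from collections import Counter


class EditMetrics:
    """Hypothetical in-memory state for aggregating a stream of edit events."""

    def __init__(self):
        self.edit_count = 0                 # total edits seen so far
        self.edits_by_category = Counter()  # category -> number of edits
        self.editors = set()                # distinct users who made edits

    def process(self, event):
        # `event` is an assumed dict with "title", "category" and "user" keys.
        self.edit_count += 1
        self.edits_by_category[event["category"]] += 1
        self.editors.add(event["user"])


# Made-up sample events standing in for the real-time feed.
sample_events = [
    {"title": "Python", "category": "Programming", "user": "alice"},
    {"title": "Storm", "category": "Weather", "user": "bob"},
    {"title": "Java", "category": "Programming", "user": "alice"},
]

metrics = EditMetrics()
for event in sample_events:
    metrics.process(event)

print(metrics.edit_count)
print(dict(metrics.edits_by_category))
print(len(metrics.editors))
```

Keeping this state only in a worker's memory means it is lost when the worker crashes, which is one reason state management and failure handling appear among the challenges above.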