Session abstract:
We're coming together for the 7th edition of Berlin Buzzwords, and over the years a lot has changed in the Big Data technology ecosystem. Once-hot buzzwords have vanished and new ones have arisen.
Where you would once have written a MapReduce job in Java to crawl the web and analyze it on a massive scale, this has become much simpler with tools like Spark and Flink at hand.
I want to do a live coding session in which I show that it is now possible to write a scalable web crawler and analytics tool that scrapes the Berlin Buzzwords websites of the past 6 years and reveals some interesting insights into the Big Data trends of that period. While I will run the tool on the very limited data set of the historical Berlin Buzzwords websites, I want to highlight that it would, in principle, scale to crawling millions of websites and analyzing petabytes of data.
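
To give a rough idea of the shape of such an analysis, here is a minimal single-machine sketch in plain Python. It is not the tool from the session: the buzzword list, the helper names (`page_text`, `map_page`, `count_buzzwords`) and the inline HTML snippets are made up for illustration, and no real crawling is done. Each page is mapped to per-year buzzword counts and the counts are then merged. On Spark, the same map-then-merge structure would roughly correspond to a `flatMap` followed by `reduceByKey`.

```python
from collections import Counter
from functools import reduce
from html.parser import HTMLParser
import re

# Hypothetical buzzword list; a real analysis would use a much larger vocabulary.
BUZZWORDS = {"hadoop", "mapreduce", "spark", "flink"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(html):
    """Return the visible text of an HTML string (tags stripped)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def map_page(year, html):
    """Map step: one page -> Counter keyed by (year, buzzword)."""
    words = re.findall(r"[a-z]+", page_text(html).lower())
    return Counter((year, w) for w in words if w in BUZZWORDS)


def count_buzzwords(pages):
    """Reduce step: merge the per-page counters into one."""
    per_page = (map_page(year, html) for year, html in pages)
    return reduce(lambda a, b: a + b, per_page, Counter())


# Made-up sample pages standing in for crawled conference websites.
pages = [
    (2010, "<html><body><h1>Hadoop</h1><p>MapReduce jobs on Hadoop</p></body></html>"),
    (2016, "<html><body><p>Streaming with Flink and Spark</p></body></html>"),
]

counts = count_buzzwords(pages)
for (year, word), n in sorted(counts.items()):
    print(year, word, n)
```

Because the map step looks at one page at a time and the merge step is associative, the work can in principle be split across many machines, which is the property the scaling claim above relies on.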