Apache Spark? If only it worked.

06/12/2017 - 11:00 to 11:40
long talk (40 min)

Session abstract: 

Do you have plans to start working with Apache Spark? Are you already working with Spark, but haven’t gotten the performance and stability you expected and aren’t sure where to look for a fix?

Spark has a very nice API and it promises high performance for crunching large datasets. It’s really easy to write an app in Spark. Unfortunately, it’s also easy to write one that doesn’t perform the way you would expect or just fails for no obvious reason.

This talk walks through several common problems you might face when running Spark at full scale and, of course, ways to solve them. Each problem comes with background and examples, so it should be understandable to people with no Spark experience. However, people who already work with Spark are the main audience. The ultimate objective is to give the audience a practical framework for diagnosing and fixing the most common problems in Spark applications.

Classes of problems in the presentation:

  • Dealing with skewed data
  • Spark on YARN and its memory model
  • Caching
  • Sizing executors
  • Locality