Go beyond "hello world" in part one of this Spark Internals workshop, led by David Drummond. You will learn how to configure executor and driver memory, use broadcast variables, and optimize code based on Spark internals, before jumping into practical, hands-on tutorials in part two.

In this workshop, you will learn:

  • Configuring and tuning Spark jobs
    • Configure executor and driver memory (see the first sketch after this list)
    • Configure the number of cores on the Master and Workers
    • Explore the Spark Jobs UI and monitor running jobs
    • Understand the importance of persistence and caching
    • Understand why shuffles are expensive (see the caching sketch below)
    • Use broadcast variables (see the broadcast sketch below)
  • Lightning talk(s) on best practices
  • Apache Spark hands-on tutorials
    • Work with existing large datasets
    • Optimize code based on Spark internals
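As a taste of the configuration topics above, here is a minimal PySpark sketch. The memory and core values are illustrative placeholders, not recommendations, and the app name is made up for the example.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only; tune these to your own cluster.
conf = (
    SparkConf()
    .set("spark.executor.memory", "8g")   # heap size per executor
    .set("spark.executor.cores", "4")     # cores per executor
    .set("spark.cores.max", "16")         # total cores for this app (standalone mode)
)

# Note: driver memory generally must be set before the driver JVM starts,
# e.g. `spark-submit --driver-memory 4g my_job.py`, so setting it here
# programmatically would have no effect in client mode.
spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config(conf=conf)
    .getOrCreate()
)
```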
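To see why persistence and caching matter around shuffles, consider this sketch; the S3 path and column names are hypothetical.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path, purely for illustration.
events = spark.read.parquet("s3://example-bucket/events/")

# groupBy forces a shuffle: rows are hashed by key and moved across the
# network between executors, which is what makes shuffles expensive.
counts = events.groupBy("user_id").count()

# Persist the post-shuffle result so later actions reuse it instead of
# recomputing (and re-shuffling) from the source data.
counts.persist(StorageLevel.MEMORY_AND_DISK)

counts.count()                                      # materializes the cache
counts.orderBy("count", ascending=False).show(10)   # served from the cache

counts.unpersist()
```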
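Finally, a minimal sketch of a broadcast variable: a small lookup table is shipped once to each executor rather than serialized with every task. The table contents are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# A small lookup table, broadcast once to every executor.
country_names = {"US": "United States", "DE": "Germany", "IN": "India"}
bc_names = sc.broadcast(country_names)

purchases = sc.parallelize([("US", 30.0), ("DE", 12.5), ("IN", 8.0)])

# Each task reads the broadcast value locally instead of receiving its own
# serialized copy of the dictionary.
labeled = purchases.map(lambda kv: (bc_names.value.get(kv[0], "unknown"), kv[1]))
print(labeled.collect())
```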

Level:

Intermediate - Advanced

Prerequisites:

  • A machine with a Unix-based OS, or a virtual environment running one
  • Familiarity with the command line
  • Some experience programming in Python
  • All hands-on examples and projects will run on distributed Spark clusters on AWS; the environment will be pre-configured for everyone

Meet Your Instructor:

David Drummond | Director of Engineering | Insight Data Science

The workshop is led by David Drummond, Director of Engineering at Insight Data Science. At Insight, he enjoys breaking down difficult concepts and helping others learn distributed technologies in a concise way. His current focus is on database internals, fault tolerance, and understanding how distributed systems fail. Before working with data, he earned a PhD in physics, researching fault-tolerant systems in quantum computing.
