Go beyond "hello world" in part one of this Spark Internals workshop, led by David Drummond. You will learn how to configure executor and driver memory, use broadcast variables, and optimize code based on Spark internals, before jumping into practical, hands-on tutorials in part two.

In this workshop, you will learn:

  • Configuring and tuning Spark jobs
    • Configure executor and driver memory (see the first sketch after this list)
    • Configure the number of cores on the Master and Workers
    • Explore the Spark Jobs UI and monitor running jobs
    • Understand the importance of persistence and caching
    • Understand why shuffles are expensive (see the caching sketch below)
    • Use broadcast variables (see the broadcast sketch below)
  • Lightning talk(s) on best practices
  • Apache Spark hands-on tutorials
    • Work with existing large datasets
    • Optimize code based on Spark internals
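As a taste of the configuration topics above, here is a minimal PySpark sketch. The memory and core values are illustrative placeholders, not recommendations, and the app name is made up for the example.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only; tune these to your own cluster.
conf = (
    SparkConf()
    .set("spark.executor.memory", "8g")   # heap size per executor
    .set("spark.executor.cores", "4")     # cores per executor
    .set("spark.cores.max", "16")         # total cores for this app (standalone mode)
)

# Note: driver memory generally must be set before the driver JVM starts,
# e.g. `spark-submit --driver-memory 4g my_job.py`, so setting it here
# programmatically would have no effect in client mode.
spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config(conf=conf)
    .getOrCreate()
)
```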
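To see why persistence and caching matter around shuffles, consider this sketch; the S3 path and column names are hypothetical.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path, purely for illustration.
events = spark.read.parquet("s3://example-bucket/events/")

# groupBy forces a shuffle: rows are hashed by key and moved across the
# network between executors, which is what makes shuffles expensive.
counts = events.groupBy("user_id").count()

# Persist the post-shuffle result so later actions reuse it instead of
# recomputing (and re-shuffling) from the source data.
counts.persist(StorageLevel.MEMORY_AND_DISK)

counts.count()                                      # materializes the cache
counts.orderBy("count", ascending=False).show(10)   # served from the cache

counts.unpersist()
```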
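Finally, a minimal sketch of a broadcast variable: a small lookup table is shipped once to each executor rather than serialized with every task. The table contents are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# A small lookup table, broadcast once to every executor.
country_names = {"US": "United States", "DE": "Germany", "IN": "India"}
bc_names = sc.broadcast(country_names)

purchases = sc.parallelize([("US", 30.0), ("DE", 12.5), ("IN", 8.0)])

# Each task reads the broadcast value locally instead of receiving its own
# serialized copy of the dictionary.
labeled = purchases.map(lambda kv: (bc_names.value.get(kv[0], "unknown"), kv[1]))
print(labeled.collect())
```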

Level:

Intermediate - Advanced

Prerequisites:

  • A machine with a Unix-based OS, or a virtual environment running one
  • Familiarity with the command line
  • Some experience programming in Python
  • All hands-on examples and projects will run on distributed Spark clusters on AWS; the environment will be pre-configured for everyone

Meet Your Instructor:

David Drummond | Director of Engineering | Insight Data Science

The workshop is led by David Drummond, Director of Engineering at Insight Data Science. At Insight, he enjoys breaking down difficult concepts and helping others learn distributed technologies in a concise way. His current focus is on database internals, fault tolerance, and understanding how distributed systems fail. Before working with data, he earned a PhD in physics, researching fault-tolerant systems in quantum computing.
