ABOUT THE TALK

Before any analysis can begin, a data scientist needs to discover the right data sources to analyze, understand them and gain trust in them. Perhaps the same or a similar analysis has already been done and can be reused.

Unfortunately, data discovery is very inefficient today. Countless hours are lost trying to find the right data to use; the most common way is still to ask a coworker. Gaining trust in data requires running a bunch of queries (max timestamp, counts per day, count distincts, etc.) that waste time and add unnecessary load on the databases. There's no clear way to find the people who can answer questions about a table. And, worst of all, analysis is often redone and models are rebuilt because previous work is not discoverable.

In this talk, we discuss what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. Lyft has seen time spent on data discovery fall 10-fold because of its data portal, Amundsen.

Amundsen is built on 3 key pillars:

1. Augmented Data Graph

Amundsen uses a graph database under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that we treat people as a first-class data asset; in other words, there's a graph node for each person in the organization that connects to other nodes (like tables and dashboards).
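As a rough illustration of the idea, and not Amundsen's actual schema, the sketch below models people, tables and dashboards as nodes in a tiny in-memory graph. The `DataGraph` class, the node names and the relation labels (`owns`, `uses`, `reads`) are all hypothetical; a real deployment would use a graph database rather than Python dictionaries.

```python
from collections import defaultdict


class DataGraph:
    """Tiny in-memory stand-in for a graph database (illustrative only)."""

    def __init__(self):
        self.nodes = {}                 # node id -> node type
        self.edges = defaultdict(set)   # node id -> {(relation, other node id)}

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target):
        # Store both directions so a table can be reached from a person
        # and a person can be reached from a table.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation, source))

    def neighbors(self, node_id, node_type=None):
        # Return connected node ids, optionally filtered by node type.
        return sorted(
            other
            for _, other in self.edges[node_id]
            if node_type is None or self.nodes[other] == node_type
        )


graph = DataGraph()
# People are nodes just like tables and dashboards (hypothetical names).
graph.add_node("alice", "person")
graph.add_node("bob", "person")
graph.add_node("rides", "table")
graph.add_node("rides_dashboard", "dashboard")

graph.add_edge("alice", "owns", "rides")
graph.add_edge("bob", "uses", "rides")
graph.add_edge("rides_dashboard", "reads", "rides")

# Who could answer questions about the "rides" table?
people_for_rides = graph.neighbors("rides", node_type="person")
print(people_for_rides)
```

Because people are ordinary nodes, the question "who can I ask about this table?" becomes a one-hop neighbor lookup instead of a search through chat history.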

2. Intuitive User Experience

Amundsen runs PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
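To make the ranking idea concrete, here is a minimal sketch of PageRank computed over a made-up access log. The talk does not describe how Amundsen builds its graph from access logs, so the log format (one `(user, table)` pair per query), the user-to-table edges, the damping factor and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of node -> list of targets."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical access log: each entry means "this user queried this table".
access_log = [
    ("alice", "rides"),
    ("bob", "rides"),
    ("carol", "rides"),
    ("alice", "drivers"),
    ("bob", "tmp_scratch"),
]

# Assumed graph construction: one edge from the user to each table they queried.
links = {}
for user, table in access_log:
    links.setdefault(user, []).append(table)

rank = pagerank(links)
tables = {table for _, table in access_log}
ranked_tables = sorted(tables, key=lambda t: (-rank[t], t))
print(ranked_tables)
```

In this toy log the table queried by the most users ends up first, which is the behavior a search ranking based on usage would want.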

3. Centralized Metadata

Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. Where best to store all this metadata is still a work in progress.

We will give a demo of Amundsen, cover its goals, deep dive into its architecture, and discuss how it achieves the 3 design pillars above. We will close with the project's future roadmap, the problems that remain unsolved, and how we can work together to solve them by open sourcing the project.

Tao Feng

Engineer | Lyft

Tao Feng is a software engineer on Lyft's data platform team, working on various data products. Tao is a committer and PMC member on Apache Airflow. Previously, Tao worked at LinkedIn and Oracle on data infrastructure, tooling and performance.
