Building a Self-Service Platform for Continuous, Real-Time Feature Generation for Machine Learning

At Lyft, all our systems, including client applications, generate many millions of events per second. These events are ingested by the event ingestion pipeline, streamed through Kinesis and Kafka, and also made available in persistent stores such as Hive for offline consumption.

This data can be used to generate features for ML models as well as for other kinds of real-time decision making. Our Research Scientists and Data Scientists have come up with algorithms to derive features from this data. However, the challenge lies in doing this quickly, correctly, effectively, and reliably. To address this, we have built a self-service platform using Flink, Beam, and Kubernetes that can be used to write, prototype, and deploy stateful computations on high-throughput streaming data.

With this platform, we have tried to abstract away the challenges of provisioning, data discovery, bootstrapping, skew, late-arriving and out-of-order events, downtime, and so on, so that our experts can focus on what they do best without having to worry about what goes on behind the scenes.

In this talk, I will discuss the architecture, key takeaways, lessons learned, and wins!

Software Engineer

Sherin Thomas

Lyft

Sherin Thomas is a Software Engineer at Lyft. She is currently building a self-serve, real-time feature generation platform for machine learning use cases using Apache Flink, Beam, and Kafka.

The AI Conference for Humans Who Ship