Scaling CDC to Trillions of Rows: What Broke, What We Rebuilt, and What AI Demands Next

Technical Talks

Most CDC pipelines work fine when you're building an MVP. Ours did too - until they didn't. Artie is a real-time data replication platform that processes 20-30 billion events per day across thousands of pipelines, with sub-minute latency on 90% of them. Three years ago, we were running a forked version of Debezium with Kafka and processing millions of rows. Along the way, many of the assumptions we started with broke.

This talk is a post-mortem of what failed, what we rebuilt, and the decisions that matter at scale:

Why we replaced Debezium - single-threaded capture, limited extensibility, and no built-in recovery forced us to build a proprietary Reader from scratch to increase fault tolerance
Parallel backfills without data loss - running historical loads alongside live CDC using primary-key range chunking and exactly-once merge semantics, following Netflix's DBLog pattern
Fan-in from thousands of single-tenant databases - consolidating sharded or single-tenant sources into unified destination schemas without bespoke ETL per tenant
Edge cases at scale - five-digit-year timestamps, negative years, non-JSON in JSONB, non-UTF8 encodings, and why we chose to fail hard rather than silently skip data (and the recovery mechanisms that make that practical in production)
Schema evolution - automatic column adds, type changes, drops, and notifications so teams know what changed
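
To make the primary-key range chunking idea from the backfill item concrete, here is a minimal sketch in Python. It is not Artie's implementation and is only loosely inspired by the DBLog approach: the function names (`pk_chunks`, `merge_chunk`), the integer primary keys, and the in-memory dictionaries standing in for tables are all assumptions made for illustration. The first function splits a key range into fixed-size chunks that can be read in parallel; the second applies a backfilled chunk while skipping any key that a live CDC event has already touched, so a stale snapshot row never overwrites a newer change.

```python
from typing import Dict, Iterator, Set, Tuple


def pk_chunks(min_pk: int, max_pk: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) primary-key ranges covering [min_pk, max_pk]."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    low = min_pk
    while low <= max_pk:
        high = min(low + chunk_size - 1, max_pk)
        yield (low, high)
        low = high + 1


def merge_chunk(dest: Dict[int, str], snapshot_rows: Dict[int, str], live_keys_seen: Set[int]) -> None:
    """Apply backfilled rows to dest, skipping keys already changed by live CDC.

    Hypothetical simplification: live_keys_seen stands in for the set of keys
    that appeared in the change log while this chunk was being read.
    """
    for pk, row in snapshot_rows.items():
        if pk not in live_keys_seen:
            dest[pk] = row


if __name__ == "__main__":
    print(list(pk_chunks(1, 10, 4)))

    dest = {2: "live"}                       # key 2 was already updated by live CDC
    merge_chunk(dest, {1: "a", 2: "stale"}, live_keys_seen={2})
    print(dest)
```

In a real system the chunk boundaries would come from the source table's actual key distribution, and the set of keys to skip would be derived from the change log between two watermarks rather than passed in by hand.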

Finally: AI workloads have the same freshness problem databases have always had, but the sources are no longer just databases - they are filesystems, object stores, git repos, and documents. We will share how Artie is extending its core primitives beyond databases to become the sync layer for any data AI systems depend on.

Attendees will leave with concrete architectural patterns for building CDC systems that survive at scale, a checklist of failure modes, and a framework for thinking about real-time data as AI infrastructure.

Robin Tang

Co-founder & CTO | Artie

Robin Tang is the Co-Founder and CTO of Artie, a real-time data replication platform that moves data across databases, warehouses, and lakes. Before founding Artie, Robin built data infrastructure at scale and saw firsthand how brittle existing CDC tooling became under production load. At Artie, he leads the engineering team that replaced Debezium with a fully custom streaming architecture now processing trillions of rows for customers including Substack, ClickUp, and Alloy. Robin writes and speaks about the practical challenges of data replication - schema evolution, transactional integrity, and the edge cases that only surface at scale.
