Building Real-Time Data Pipelines at Scale
How we architected DataForge's streaming engine to handle 2 million events per second with sub-100ms latency, and the lessons we learned along the way.
Sarah Chen
Principal Engineer • December 18, 2024
When we first started building DataForge, our streaming engine could handle about 10,000 events per second. Fast enough for a demo, nowhere near enough for production workloads. Today, we process over 2 million events per second for customers like Stripe, Shopify, and Airbnb. This is the story of how we got there.
The Architecture Challenge
Real-time data pipelines have fundamentally different requirements from batch processing. You need guaranteed ordering, exactly-once semantics, and the ability to handle backpressure gracefully. Most importantly, you need to do all of this with sub-100ms end-to-end latency.
"The biggest mistake teams make with streaming is treating it like batch processing with shorter intervals. It's a fundamentally different paradigm that requires different thinking."
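To make the exactly-once requirement concrete, here is a minimal sketch of one common approach (not DataForge's actual implementation): combine at-least-once delivery with idempotent writes, deduplicated on a stable event ID. The `IdempotentSink` class and the field names are illustrative assumptions.

```python
class IdempotentSink:
    """Applies each event at most once, keyed by its unique event_id."""

    def __init__(self):
        self._seen: set[str] = set()
        self.applied: list[dict] = []

    def write(self, event: dict) -> bool:
        # Redelivered events (retries, consumer rebalances) are dropped here,
        # so downstream state reflects each event exactly once.
        if event["event_id"] in self._seen:
            return False
        self._seen.add(event["event_id"])
        self.applied.append(event)
        return True

sink = IdempotentSink()
sink.write({"event_id": "e1", "type": "click"})
sink.write({"event_id": "e1", "type": "click"})  # duplicate redelivery
assert len(sink.applied) == 1
```

The key design point is that the dedup key travels with the event, so any replica of the sink can enforce the invariant without coordinating with the source.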
Our Three-Layer Approach
We settled on a three-layer architecture that separates concerns while maintaining low latency:
Ingestion Layer
Handles initial event capture with automatic schema detection and validation at wire speed.
Processing Layer
Transforms, enriches, and routes events using a DAG-based execution engine.
Delivery Layer
Guarantees exactly-once delivery to 200+ destinations with automatic retries and dead-letter queues (DLQs).
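The processing layer's DAG-based execution can be sketched in a few lines. This is an illustrative model, not DataForge's API: each stage is a transform that either passes an event along (possibly modified) or drops it, and a linear chain is the simplest DAG.

```python
from typing import Callable, Optional

Event = dict
Transform = Callable[[Event], Optional[Event]]  # returning None drops the event

def run_dag(stages: list[Transform], event: Event) -> Optional[Event]:
    """Pushes one event through a chain of transforms in order."""
    for stage in stages:
        event = stage(event)
        if event is None:  # filtered out; stop routing downstream
            return None
    return event

# Hypothetical stages: keep only clicks, then tag each event with a region.
keep_clicks: Transform = lambda e: e if e.get("type") == "click" else None
add_region: Transform = lambda e: {**e, "region": "us-east"}

out = run_dag([keep_clicks, add_region], {"type": "click", "user_id": 1})
```

Separating stages this way is what lets each one be scaled, monitored, and retried independently, which is the point of the layered design above.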
Code Example: Defining a Pipeline
Here's how you define a real-time pipeline in DataForge using our declarative YAML config:
pipeline:
  name: user-events-streaming
  mode: real-time
  source:
    type: kafka
    config:
      brokers: ${KAFKA_BROKERS}
      topic: user-events
      group: dataforge-pipeline
  transforms:
    - type: filter
      condition: event.type IN ('click', 'purchase')
    - type: enrich
      lookup: user_profiles
      key: event.user_id
  sink:
    type: snowflake
    config:
      warehouse: ANALYTICS_WH
      database: EVENTS
      schema: STREAMING
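To make the semantics of the two transforms concrete, here is a rough Python equivalent of the filter and enrich steps from the YAML above. The `user_profiles` dictionary is a hypothetical stand-in for whatever lookup store backs the enrichment.

```python
from typing import Optional

# Hypothetical enrichment data, keyed by user_id.
user_profiles = {42: {"plan": "pro", "country": "DE"}}

def transform(event: dict) -> Optional[dict]:
    # filter: event.type IN ('click', 'purchase')
    if event.get("type") not in ("click", "purchase"):
        return None  # event is dropped from the stream
    # enrich: join user_profiles on event.user_id
    profile = user_profiles.get(event.get("user_id"), {})
    return {**event, **profile}

enriched = transform({"type": "purchase", "user_id": 42})
dropped = transform({"type": "pageview", "user_id": 42})
```

In the real pipeline this logic runs inside the processing layer's execution engine; the YAML only declares what should happen, not how.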
Performance Results
After implementing this architecture, we saw dramatic improvements across all key metrics: throughput grew from roughly 10,000 to over 2 million events per second, while end-to-end latency stayed under 100ms.
Key Takeaways
- Separate concerns early. The three-layer approach let us scale each component independently.
- Backpressure is your friend. Proper backpressure handling prevented cascading failures during traffic spikes.
- Test with production-scale data. Synthetic benchmarks don't reveal the bugs that real-world data distributions expose.
- Monitor everything. Every stage of the pipeline emits metrics. You can't optimize what you can't measure.
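The backpressure takeaway above can be shown in its most minimal form: a bounded queue between producer and consumer, so a slow consumer pushes back on the producer instead of letting memory grow without bound. This is a sketch under simplified assumptions; a real pipeline propagates the signal across process and network boundaries (for example, via consumer lag).

```python
import queue
import threading

buf = queue.Queue(maxsize=100)  # the bound is the backpressure point

def producer(n: int) -> None:
    for i in range(n):
        buf.put(i)   # blocks when the queue is full -> producer slows down
    buf.put(-1)      # sentinel marking end of stream

consumed = []

def consumer() -> None:
    while True:
        item = buf.get()
        if item == -1:
            break
        consumed.append(item)

t1 = threading.Thread(target=producer, args=(1000,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Because `put` blocks at the bound, a traffic spike stalls the producer instead of crashing the consumer, which is exactly the cascading-failure protection described above.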
If you're building real-time data pipelines and want to skip the months of infrastructure work, try DataForge free. We've done the hard engineering so you don't have to.
Sarah Chen
Principal Engineer at DataForge
Sarah leads the streaming infrastructure team at DataForge. Previously at Google Cloud Dataflow and Apache Beam. She's passionate about making distributed systems accessible to every engineering team.