
Building Real-Time Data Pipelines at Scale

How we architected DataForge's streaming engine to handle 2 million events per second with sub-100ms latency, and the lessons we learned along the way.

Sarah Chen

Principal Engineer • December 18, 2024

[Image: real-time data streaming visualization at scale]

When we first started building DataForge, our streaming engine could handle about 10,000 events per second. Fast enough for a demo, nowhere near enough for production workloads. Today, we process over 2 million events per second for customers like Stripe, Shopify, and Airbnb. This is the story of how we got there.

The Architecture Challenge

Real-time data pipelines have fundamentally different requirements from batch processing. You need guaranteed ordering, exactly-once semantics, and the ability to handle backpressure gracefully. Most importantly, you need to do all of this with sub-100ms end-to-end latency.
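To make one of these requirements concrete: "exactly-once" in practice usually means at-least-once delivery combined with idempotent writes at the sink, so that redelivered events have no effect. A minimal sketch of that idea (the class and names here are illustrative, not DataForge's actual implementation):

```python
class IdempotentSink:
    """Exactly-once *effect* from at-least-once delivery: dedup on event IDs."""

    def __init__(self):
        self.seen = set()     # IDs of events already applied
        self.written = []     # the sink's committed output

    def write(self, event_id, payload):
        if event_id in self.seen:
            return False      # duplicate redelivery: skip, no double-write
        self.seen.add(event_id)
        self.written.append(payload)
        return True
```

The key design point is that deduplication happens at the point of effect, so upstream retries (which at-least-once delivery requires) stay safe.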

"The biggest mistake teams make with streaming is treating it like batch processing with shorter intervals. It's a fundamentally different paradigm that requires different thinking."

Our Three-Layer Approach

We settled on a three-layer architecture that separates concerns while maintaining low latency:

Ingestion Layer

Handles initial event capture with automatic schema detection and validation at wire speed.

Processing Layer

Transforms, enriches, and routes events using a DAG-based execution engine.

Delivery Layer

Guarantees exactly-once delivery to 200+ destinations with automatic retry and DLQ.
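The separation of the three layers can be sketched in a few lines of Python. This is a toy model to show the boundaries between the layers, not DataForge's engine; every name below is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Event:
    type: str
    payload: dict

class Pipeline:
    """Toy three-layer pipeline: ingest -> process -> deliver."""

    def __init__(self,
                 validate: Callable[[Event], bool],
                 transforms: list[Callable[[Event], Event]],
                 deliver: Callable[[Event], None]):
        self.validate = validate      # ingestion layer: capture + schema validation
        self.transforms = transforms  # processing layer: transform / enrich / route
        self.deliver = deliver        # delivery layer: sink (retries, DLQ live here)

    def handle(self, event: Event) -> bool:
        if not self.validate(event):  # reject bad events at the edge
            return False
        for transform in self.transforms:
            event = transform(event)
        self.deliver(event)
        return True
```

Because each layer is just a pluggable function here, each can be scaled or swapped independently, which is the property the real architecture is after.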

Code Example: Defining a Pipeline

Here's how you define a real-time pipeline in DataForge using our declarative YAML config:

pipeline:
  name: user-events-streaming
  mode: real-time

source:
  type: kafka
  config:
    brokers: ${KAFKA_BROKERS}
    topic: user-events
    group: dataforge-pipeline

transforms:
  - type: filter
    condition: event.type IN ('click', 'purchase')
  - type: enrich
    lookup: user_profiles
    key: event.user_id

sink:
  type: snowflake
  config:
    warehouse: ANALYTICS_WH
    database: EVENTS
    schema: STREAMING
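The filter and enrich steps in the config above map to two simple operations: drop events whose type isn't allowed, then join each surviving event against a profile lookup keyed by user ID. A plain-Python sketch of that behavior (illustrative only, not DataForge's transform engine):

```python
def filter_events(events, allowed=("click", "purchase")):
    """Keep only events whose type is in the allowed set."""
    return [e for e in events if e["type"] in allowed]

def enrich_events(events, user_profiles):
    """Attach each event's user profile, looked up by user_id."""
    return [
        {**e, "profile": user_profiles.get(e["user_id"], {})}
        for e in events
    ]
```

Chaining them mirrors the order of the `transforms` list: filtering first means the enrichment lookup only runs on events you actually keep.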

Performance Results

After implementing this architecture, we saw dramatic improvements across all key metrics:

  - 2M+ events/sec throughput
  - <50ms P99 latency
  - 99.99% uptime
  - 0 data loss events

Key Takeaways

  1. Separate concerns early. The three-layer approach let us scale each component independently.
  2. Backpressure is your friend. Proper backpressure handling prevented cascading failures during traffic spikes.
  3. Test with production-scale data. Synthetic benchmarks don't reveal the bugs that real-world data distributions expose.
  4. Monitor everything. Every stage of the pipeline emits metrics. You can't optimize what you can't measure.
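The backpressure takeaway is worth a concrete illustration. The simplest form is a bounded buffer between stages: when consumers fall behind, producers block or fail fast instead of piling up unbounded work. A minimal sketch using Python's standard library (the function name is hypothetical):

```python
import queue

def produce(q: queue.Queue, item, timeout=0.01):
    """Bounded queue as backpressure: refuse new work when consumers lag."""
    try:
        q.put(item, timeout=timeout)  # blocks until space frees up, or times out
        return True
    except queue.Full:
        return False                  # signal upstream to slow down or retry later
```

Returning `False` instead of raising lets the caller decide how to shed or delay load, which is what prevents a slow sink from cascading into a full outage.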

If you're building real-time data pipelines and want to skip the months of infrastructure work, try DataForge free. We've done the hard engineering so you don't have to.

Sarah Chen

Principal Engineer at DataForge

Sarah leads the streaming infrastructure team at DataForge. Previously at Google Cloud Dataflow and Apache Beam. She's passionate about making distributed systems accessible to every engineering team.
