Building a Modern Payment Stack: Lessons from Scaling to $10B
How we rebuilt our payment infrastructure to handle 10x growth while maintaining 99.99% uptime. A deep dive into distributed systems, database sharding, and real-time fraud detection.
Marcus Chen
CTO at PayStream • January 10, 2025
When I joined PayStream in 2019, we were processing about $50 million in transactions per month. Today, we handle over $10 billion. This 200x growth didn't happen by accident—it required a complete rethinking of our technical architecture.
In this post, I'll share the key lessons we learned while scaling our payment infrastructure, the mistakes we made along the way, and the patterns that have proven most valuable.
The Breaking Point
In early 2021, we hit a wall. Our monolithic architecture, which had served us well in the early days, was showing serious cracks. We were experiencing:
- Database connection pool exhaustion during peak hours
- Cascading failures when a single service became overloaded
- 6+ hour deployment cycles with significant downtime risk
- P99 latencies exceeding 2 seconds during high-traffic periods
Something had to change. We made the decision to rebuild our core payment processing infrastructure from the ground up.
Architecture Principles
Before writing any code, we established three non-negotiable principles:
1. Fail Gracefully, Not Catastrophically
Payment systems must handle failure as a first-class concern. When a downstream service fails, the system should degrade gracefully rather than cascade into complete failure.
"In a distributed system, failure is not a possibility—it's a certainty. Design for it."
2. Every Transaction is Sacred
We implemented what we call "exactly-once semantics" for payment processing. Using idempotency keys and distributed locks, we ensure that no matter what happens, a payment is processed exactly once.
async function processPayment(request: PaymentRequest) {
  const lockKey = `payment:${request.idempotencyKey}`;
  // Acquire distributed lock
  const lock = await redis.acquireLock(lockKey, {
    ttl: 30000,
    retries: 3
  });
  try {
    // Check if already processed
    const existing = await db.findPayment(request.idempotencyKey);
    if (existing) return existing;
    // Process the payment
    const result = await paymentProcessor.charge(request);
    // Persist with transactional guarantee
    await db.savePayment(result);
    return result;
  } finally {
    await lock.release();
  }
}
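Two details in this snippet do most of the work: the lookup by idempotency key happens only after the lock is held, so a concurrent retry waits for the first attempt instead of double-charging, and the 30-second TTL on the lock means a crashed worker releases the key automatically rather than blocking it forever.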
3. Observability is Non-Negotiable
You can't fix what you can't see. We instrumented everything—every API call, database query, and external service interaction. Our observability stack includes:
- Distributed tracing across all services using OpenTelemetry
- Real-time metrics with sub-second granularity
- Structured logging with correlation IDs for debugging (a minimal sketch follows this list)
- Anomaly detection using ML models trained on historical patterns
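To show what the correlation-ID piece can look like in practice, here is a minimal sketch built on Node's AsyncLocalStorage; the middleware shape, the x-correlation-id header, and the logEvent helper are illustrative assumptions rather than our actual logging stack.

import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Holds the correlation ID for the current request across async boundaries
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// Express-style middleware: reuse an upstream ID if present, otherwise mint one
function correlationMiddleware(req: any, res: any, next: () => void) {
  const correlationId = (req.headers['x-correlation-id'] as string) ?? randomUUID();
  requestContext.run({ correlationId }, next);
}

// Structured log line that always carries the correlation ID
function logEvent(event: string, fields: Record<string, unknown> = {}) {
  const correlationId = requestContext.getStore()?.correlationId ?? 'unknown';
  console.log(JSON.stringify({ ts: new Date().toISOString(), correlationId, event, ...fields }));
}

// Every log line emitted while handling one request shares the same ID,
// which is what makes cross-service debugging tractable:
// logEvent('payment.charge.started', { merchantId: 'm_123', amountCents: 4200 });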
Database Sharding Strategy
Our biggest technical challenge was the database. A single PostgreSQL instance simply couldn't handle our write throughput. We implemented horizontal sharding using merchant ID as the partition key.
The key insight was choosing a sharding key that would distribute load evenly while keeping related data together. Merchant ID was perfect because:
- Transactions for a single merchant are always on the same shard
- Our hashing algorithm distributes merchants evenly across shards, which keeps load balanced as large merchants grow (see the routing sketch after this list)
- Cross-shard queries are rare in our use case
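To illustrate the routing itself, a hash of the merchant ID can be reduced to a stable shard index. The shard count and pool lookup below are hypothetical, not our actual topology; a plain modulo like this is also exactly why changing the shard count later is painful.

import { createHash } from 'node:crypto';

const SHARD_COUNT = 64; // hypothetical number of physical shards

// Map a merchant ID to a stable shard index; a cryptographic hash keeps the
// distribution even regardless of how merchant IDs are assigned
function shardFor(merchantId: string): number {
  const digest = createHash('sha256').update(merchantId).digest();
  return digest.readUInt32BE(0) % SHARD_COUNT;
}

// All queries for one merchant hit the same shard, so transaction lookups
// for a single merchant never fan out across databases:
// const pool = shardPools[shardFor('merchant_8f3a')];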
Real-Time Fraud Detection
At 50,000 transactions per second, with fraud checks sitting on the critical path of every authorization, we have less than 20 milliseconds to decide whether a transaction is fraudulent. Our ML-based fraud detection system uses a combination of:
- Rule-based filters for known fraud patterns
- Gradient boosted models for complex pattern recognition
- Graph neural networks for detecting fraud rings
- Real-time feature computation using stream processing
The system runs in under 5ms for 99% of transactions while catching over 98% of fraudulent activity.
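In simplified form, that layering looks like the sketch below: cheap, deterministic rules run first, and only transactions that no rule resolves pay for a model call. The feature names, thresholds, and model interface are illustrative assumptions, not our actual feature set.

// Hypothetical shape of the features computed by the streaming layer
interface TxnFeatures {
  amountCents: number;
  velocityLastHour: number; // transactions from this card in the past hour
  countryMismatch: boolean; // billing country vs. IP geolocation
}

type Decision = 'approve' | 'decline' | 'review';

// Cheap, deterministic rules run first; most transactions never reach the model
function applyRules(f: TxnFeatures): Decision | null {
  if (f.amountCents > 1_000_000 && f.countryMismatch) return 'decline';
  if (f.velocityLastHour > 50) return 'review';
  return null; // no rule fired, fall through to the model
}

async function scoreTransaction(
  f: TxnFeatures,
  model: { predict(f: TxnFeatures): Promise<number> }
): Promise<Decision> {
  const ruled = applyRules(f);
  if (ruled) return ruled;
  const risk = await model.predict(f); // gradient-boosted model's fraud probability
  if (risk > 0.9) return 'decline';
  if (risk > 0.6) return 'review';
  return 'approve';
}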
Results
After 18 months of work, the results speak for themselves:
- 99.99% uptime over the past year
- P99 latency under 200ms (down from 2+ seconds)
- Zero-downtime deployments multiple times per day
- 50,000 TPS capacity with room to grow
Key Takeaways
If you're building or scaling a payment system, here's what I'd recommend:
- Design for failure from day one. It's much harder to retrofit resilience into an existing system.
- Invest heavily in observability. The cost is nothing compared to the debugging time you'll save.
- Shard early, but shard wisely. Choose your partition key carefully—changing it later is extremely painful.
- Build incrementally. We didn't rebuild everything at once. We migrated service by service over 18 months.
Building payment infrastructure is hard, but it's also incredibly rewarding. If you're interested in tackling these challenges with us, we're hiring.
Marcus Chen
CTO at PayStream
Marcus leads engineering at PayStream. Previously, he built payment infrastructure at Stripe and Square. He's passionate about distributed systems and making complex technology accessible.