Building a Modern Payment Stack: Lessons from Scaling to $10B
How we rebuilt our payment infrastructure to handle 10x growth while maintaining 99.99% uptime. A deep dive into distributed systems, database sharding, and real-time fraud detection.
Marcus Chen
CTO at PayStream • January 10, 2025
When I joined PayStream in 2019, we were processing about $50 million in transactions per month. Today, we handle over $10 billion. This 200x growth didn't happen by accident—it required a complete rethinking of our technical architecture.
In this post, I'll share the key lessons we learned while scaling our payment infrastructure, the mistakes we made along the way, and the patterns that have proven most valuable.
The Breaking Point
In early 2021, we hit a wall. Our monolithic architecture, which had served us well in the early days, was showing serious cracks. We were experiencing:
- Database connection pool exhaustion during peak hours
- Cascading failures when a single service became overloaded
- 6+ hour deployment cycles with significant downtime risk
- P99 latencies exceeding 2 seconds during high-traffic periods
Something had to change. We made the decision to rebuild our core payment processing infrastructure from the ground up.
Architecture Principles
Before writing any code, we established three non-negotiable principles:
1. Fail Gracefully, Not Catastrophically
Payment systems must handle failure as a first-class concern. When a downstream service fails, the system should degrade gracefully rather than cascade into complete failure.
"In a distributed system, failure is not a possibility—it's a certainty. Design for it."
2. Every Transaction is Sacred
We implemented what we call "exactly-once semantics" for payment processing. Using idempotency keys and distributed locks, we ensure that no matter what happens, a payment is processed exactly once.
async function processPayment(request: PaymentRequest) {
  const lockKey = `payment:${request.idempotencyKey}`;
  // Acquire distributed lock
  const lock = await redis.acquireLock(lockKey, {
    ttl: 30000,
    retries: 3
  });
  try {
    // Check if already processed
    const existing = await db.findPayment(request.idempotencyKey);
    if (existing) return existing;
    // Process the payment
    const result = await paymentProcessor.charge(request);
    // Persist with transactional guarantee
    await db.savePayment(result);
    return result;
  } finally {
    await lock.release();
  }
}
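Two details in this snippet do most of the work: the lookup by idempotency key happens only after the lock is held, so a concurrent retry waits for the first attempt instead of double-charging, and the 30-second TTL on the lock means a crashed worker releases the key automatically rather than blocking it forever.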
3. Observability is Non-Negotiable
You can't fix what you can't see. We instrumented everything—every API call, database query, and external service interaction. Our observability stack includes:
- Distributed tracing across all services using OpenTelemetry
- Real-time metrics with sub-second granularity
- Structured logging with correlation IDs for debugging (a minimal sketch follows this list)
- Anomaly detection using ML models trained on historical patterns
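To show what the correlation-ID piece can look like in practice, here is a minimal sketch built on Node's AsyncLocalStorage; the middleware shape, the x-correlation-id header, and the logEvent helper are illustrative assumptions rather than our actual logging stack.

import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Holds the correlation ID for the current request across async boundaries
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// Express-style middleware: reuse an upstream ID if present, otherwise mint one
function correlationMiddleware(req: any, res: any, next: () => void) {
  const correlationId = (req.headers['x-correlation-id'] as string) ?? randomUUID();
  requestContext.run({ correlationId }, next);
}

// Structured log line that always carries the correlation ID
function logEvent(event: string, fields: Record<string, unknown> = {}) {
  const correlationId = requestContext.getStore()?.correlationId ?? 'unknown';
  console.log(JSON.stringify({ ts: new Date().toISOString(), correlationId, event, ...fields }));
}

// Every log line emitted while handling one request shares the same ID,
// which is what makes cross-service debugging tractable:
// logEvent('payment.charge.started', { merchantId: 'm_123', amountCents: 4200 });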
Database Sharding Strategy
Our biggest technical challenge was the database. A single PostgreSQL instance simply couldn't handle our write throughput. We implemented horizontal sharding using merchant ID as the partition key.
The key insight was choosing a sharding key that would distribute load evenly while keeping related data together. Merchant ID was perfect because:
- Transactions for a single merchant are always on the same shard
- Our hashing algorithm distributes merchants evenly across shards, which keeps load balanced as large merchants grow (see the routing sketch after this list)
- Cross-shard queries are rare in our use case
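To illustrate the routing itself, a hash of the merchant ID can be reduced to a stable shard index. The shard count and pool lookup below are hypothetical, not our actual topology; a plain modulo like this is also exactly why changing the shard count later is painful.

import { createHash } from 'node:crypto';

const SHARD_COUNT = 64; // hypothetical number of physical shards

// Map a merchant ID to a stable shard index; a cryptographic hash keeps the
// distribution even regardless of how merchant IDs are assigned
function shardFor(merchantId: string): number {
  const digest = createHash('sha256').update(merchantId).digest();
  return digest.readUInt32BE(0) % SHARD_COUNT;
}

// All queries for one merchant hit the same shard, so transaction lookups
// for a single merchant never fan out across databases:
// const pool = shardPools[shardFor('merchant_8f3a')];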
Real-Time Fraud Detection
At 50,000 transactions per second, with fraud checks sitting on the critical path of every authorization, we have less than 20 milliseconds to decide whether a transaction is fraudulent. Our ML-based fraud detection system uses a combination of:
- Rule-based filters for known fraud patterns
- Gradient boosted models for complex pattern recognition
- Graph neural networks for detecting fraud rings
- Real-time feature computation using stream processing
The system runs in under 5ms for 99% of transactions while catching over 98% of fraudulent activity.
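In simplified form, that layering looks like the sketch below: cheap, deterministic rules run first, and only transactions that no rule resolves pay for a model call. The feature names, thresholds, and model interface are illustrative assumptions, not our actual feature set.

// Hypothetical shape of the features computed by the streaming layer
interface TxnFeatures {
  amountCents: number;
  velocityLastHour: number; // transactions from this card in the past hour
  countryMismatch: boolean; // billing country vs. IP geolocation
}

type Decision = 'approve' | 'decline' | 'review';

// Cheap, deterministic rules run first; most transactions never reach the model
function applyRules(f: TxnFeatures): Decision | null {
  if (f.amountCents > 1_000_000 && f.countryMismatch) return 'decline';
  if (f.velocityLastHour > 50) return 'review';
  return null; // no rule fired, fall through to the model
}

async function scoreTransaction(
  f: TxnFeatures,
  model: { predict(f: TxnFeatures): Promise<number> }
): Promise<Decision> {
  const ruled = applyRules(f);
  if (ruled) return ruled;
  const risk = await model.predict(f); // gradient-boosted model's fraud probability
  if (risk > 0.9) return 'decline';
  if (risk > 0.6) return 'review';
  return 'approve';
}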
Results
After 18 months of work, the results speak for themselves:
- 99.99% uptime over the past year
- P99 latency under 200ms (down from 2+ seconds)
- Zero-downtime deployments multiple times per day
- 50,000 TPS capacity with room to grow
Key Takeaways
If you're building or scaling a payment system, here's what I'd recommend:
- Design for failure from day one. It's much harder to retrofit resilience into an existing system.
- Invest heavily in observability. The cost is nothing compared to the debugging time you'll save.
- Shard early, but shard wisely. Choose your partition key carefully—changing it later is extremely painful.
- Build incrementally. We didn't rebuild everything at once. We migrated service by service over 18 months.
Building payment infrastructure is hard, but it's also incredibly rewarding. If you're interested in tackling these challenges with us, we're hiring.
Marcus Chen
CTO at PayStream
Marcus leads engineering at PayStream. Previously, he built payment infrastructure at Stripe and Square. He's passionate about distributed systems and making complex technology accessible.