The Growth Curve
A B2B payments API processing 800K requests/day in January hit 2.4M by April and 6M by July. Traditional vertical scaling (bigger instances) bought us weeks, not months. The monolith's shared-memory architecture made horizontal scaling impossible.
The Zero-Downtime Constraint
Payment processing cannot stop. No maintenance windows. No "deploy and pray." Every architectural change had to happen transparently to customers while handling live transaction volume.
Service Extraction
We decomposed the monolith into payments, accounts, and fraud services. Each domain now owns its data store and scaling policies.
Three bounded contexts · independent deploys
Stateless Front Door
Session state moved to JWT tokens with Redis-backed refresh, so we could scale application nodes horizontally without coordination.
JWT auth · Redis session fabric
Feature-Flagged Strangler Fig
API gateway routing stayed under LaunchDarkly control. Dual writes validated each new service before its traffic ramped to 10%, then 50%, then 100%.
Zero downtime · instant rollback
How We Decomposed the Monolith
Domain-Driven Service Boundaries
We identified three bounded contexts in the monolith: payment processing, account lifecycle, and fraud detection. Each became an independent service with its own database. The monolith's shared PostgreSQL instance was the bottleneck—connection pool exhaustion at 250 connections meant we couldn't scale horizontally.
After: 3 services, 3 DBs, 750 total connections
Payment Service: RDS PostgreSQL (100 conn)
Account Service: Aurora PostgreSQL (200 conn)
Fraud Service: DynamoDB (unlimited)
Result: 3x connection capacity, independent scaling
Event-Driven Communication
Services communicate over an SNS/SQS event bus instead of synchronous HTTP calls. Payment events trigger account updates asynchronously, and fraud checks run in parallel. With the services decoupled, one slow service couldn't cascade failures to the others.
1. Payment service validates & writes to DB
2. Publishes PaymentCreated event to SNS
3. Returns 202 Accepted (18ms)
4. Account service consumes event (async)
5. Fraud service consumes event (async)
No blocking calls, no cascading timeouts
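The flow above can be sketched in a few lines. This is a minimal illustration, not the production code: `InMemoryBus` is a hypothetical stand-in for the SNS topic and its per-service SQS queues, and the handler names, event fields, and in-memory "databases" are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable
import uuid


@dataclass
class InMemoryBus:
    """Hypothetical stand-in for SNS/SQS: one topic, one queue per subscriber."""
    queues: dict = field(default_factory=lambda: defaultdict(list))
    handlers: dict = field(default_factory=dict)

    def subscribe(self, name: str, handler: Callable[[dict], None]) -> None:
        self.handlers[name] = handler

    def publish(self, event: dict) -> None:
        # Fan out a copy to every subscriber's queue; no handler runs here.
        for name in self.handlers:
            self.queues[name].append(dict(event))

    def drain(self, name: str) -> int:
        # Simulates a consumer polling its queue later, off the request path.
        delivered = 0
        while self.queues[name]:
            self.handlers[name](self.queues[name].pop(0))
            delivered += 1
        return delivered


payments_db: dict = {}                    # payment service's own store
account_balances = defaultdict(int)       # account service's own store
fraud_checked: list = []                  # fraud service's own store


def create_payment(bus: InMemoryBus, account_id: str, amount_cents: int) -> tuple:
    # Step 1: validate and write to the payment service's database.
    if amount_cents <= 0:
        return 400, {"error": "amount must be positive"}
    payment_id = str(uuid.uuid4())
    payments_db[payment_id] = {"account_id": account_id, "amount_cents": amount_cents}
    # Step 2: publish PaymentCreated; downstream work is not awaited.
    bus.publish({"type": "PaymentCreated", "payment_id": payment_id,
                 "account_id": account_id, "amount_cents": amount_cents})
    # Step 3: acknowledge immediately.
    return 202, {"payment_id": payment_id}


def on_payment_for_accounts(event: dict) -> None:
    # Step 4: account service consumes the event asynchronously.
    account_balances[event["account_id"]] += event["amount_cents"]


def on_payment_for_fraud(event: dict) -> None:
    # Step 5: fraud service consumes the same event independently.
    fraud_checked.append(event["payment_id"])


bus = InMemoryBus()
bus.subscribe("accounts", on_payment_for_accounts)
bus.subscribe("fraud", on_payment_for_fraud)

status, body = create_payment(bus, "acct-1", 2500)
```

At this point the caller already has its 202, while the account and fraud stores are untouched until each consumer drains its own queue (`bus.drain("accounts")`, `bus.drain("fraud")`). A slow or stopped consumer delays its own processing only.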
Strangler Fig Pattern with Feature Flags
We built new services behind feature flags in the API gateway. Traffic routing was controlled via LaunchDarkly: start at 0% to the new service, validate correctness with dual writes, then ramp to 10%, 50%, and 100%. Rollback was instant. The monolith stayed live until each service proved itself at full load.
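import random

# Illustrative only: feature_flag() is assumed to return the rollout
# fraction (0.0 to 1.0); the route_* helpers are hypothetical.
if random.random() < feature_flag('payment_service_v2'):
    route_to_payment_service()  # new Payment Service
else:
    route_to_monolith()         # existing monolith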
Dual-write validation for 2 weeks at 10%
Full cutover after 6 weeks of gradual ramp
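One variation worth sketching is sticky bucketing: instead of drawing a fresh random number per request, hash a stable key so a given customer stays on the same backend as the percentage ramps. The code below is an assumption-laden illustration of that idea. It is not LaunchDarkly's API or its bucketing algorithm, and `bucket`, `route`, and the backend names are hypothetical.

```python
import hashlib


def bucket(flag_key: str, customer_id: str) -> int:
    """Map (flag, customer) to a stable bucket in [0, 10000), i.e. basis points."""
    digest = hashlib.sha256(f"{flag_key}:{customer_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def route(flag_key: str, customer_id: str, rollout_percent: float) -> str:
    """Return the backend that should serve this customer at the given rollout."""
    # rollout_percent is 0-100; buckets are in basis points, hence the * 100.
    if bucket(flag_key, customer_id) < rollout_percent * 100:
        return "payment-service"
    return "monolith"


# Example: the same customer gets the same answer on every request.
first = route("payment_service_v2", "cust-42", 10)
second = route("payment_service_v2", "cust-42", 10)
```

Because the bucket is fixed per customer, raising the percentage only ever moves customers from the monolith to the new service, never back, and setting it to 0 acts as the rollback switch.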
Six Months of Architectural Evolution
Mapped bounded contexts, sketched API contracts, and traced cross-service calls to see where coupling lived. Defined event schemas before writing code.
Payment logic moved to its own service and database, shipped behind a feature flag at 0%. Dual-write validation compared monolith vs service responses before we ramped to 10% traffic.
Accounts service came online with SNS/SQS fan-out. Payments now publishes events instead of calling accounts synchronously, letting each service scale separately.
Fraud detection moved to a DynamoDB-backed service. After two weeks of shadow mode we drained the monolith entirely and handled 10M+ requests on the new mesh.
What Changed
Monolithic Architecture
Distributed Services
What We Actually Learned
Strangler Fig Beats Big Bang Every Time
We considered a full rewrite. It would have taken 18 months and risked everything. Extracting services gradually kept the monolith alive until each new service proved itself.
Shared Databases Are the Real Coupling
As long as services shared the monolith's database, we couldn't scale independently. The breakthrough was giving each service its own persistence layer.
Event-Driven Architecture Eliminates Cascading Failures
Synchronous service-to-service calls created tight coupling. SNS/SQS events decoupled everything—payments publish once, downstream services act asynchronously.
Dual-Write Validation Builds Confidence
Shadow traffic surfaced discrepancies we would have missed. By the time we ramped to 100%, we trusted the new services because the data said so.
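A shadow comparison of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the validation harness used in the migration: the function names, the `trace_id` field, and the fee-rounding discrepancy are all invented for the example. The customer always receives the monolith's response, and the new service's answer is only compared and logged.

```python
def diff_responses(primary: dict, shadow: dict, ignore: frozenset = frozenset()) -> dict:
    """Return {field: (primary_value, shadow_value)} for each field that differs."""
    keys = (primary.keys() | shadow.keys()) - ignore
    return {k: (primary.get(k), shadow.get(k))
            for k in sorted(keys) if primary.get(k) != shadow.get(k)}


def handle_with_shadow(request, primary_fn, shadow_fn, mismatches, ignore=frozenset()):
    """Serve from primary_fn; call shadow_fn on the side and record any difference."""
    response = primary_fn(request)
    try:
        delta = diff_responses(response, shadow_fn(request), ignore)
    except Exception as exc:
        # A failing shadow must never affect the customer-facing response.
        delta = {"_shadow_error": (None, repr(exc))}
    if delta:
        mismatches.append({"request": request, "diff": delta})
    return response


# Hypothetical backends: the monolith truncates the fee, the service rounds it.
def monolith_charge(req: dict) -> dict:
    return {"fee_cents": req["amount_cents"] * 29 // 1000, "trace_id": "m-1"}


def service_charge(req: dict) -> dict:
    return {"fee_cents": (req["amount_cents"] * 29 + 500) // 1000, "trace_id": "s-1"}


def failing_service(req: dict) -> dict:
    raise RuntimeError("service unavailable")


mismatches: list = []
skip = frozenset({"trace_id"})  # fields expected to differ between backends
ok = handle_with_shadow({"amount_cents": 1000}, monolith_charge, service_charge, mismatches, skip)
off = handle_with_shadow({"amount_cents": 1090}, monolith_charge, service_charge, mismatches, skip)
```

In this toy run the first request agrees on both paths, while the second surfaces a one-cent rounding difference that lands in `mismatches` without changing what the caller sees.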