TechAni

Case study

Managed care platform scaling during open enrollment

A large managed care organization supporting millions of members across Medicaid and marketplace plans. One annual open enrollment window. Six weeks of traffic the platform was never sized for. This case study documents the failure patterns we encountered, the decisions we made, and what the infrastructure looked like on the other side.

Published Mar 16, 2026

The situation

Most teams build for 2x. The architecture works, the database holds, the pods come up. Then a product launch, a seasonal spike, or a viral event hits - and suddenly you are staring at 8x, 12x, 15x baseline. That is when the design assumptions that nobody wrote down come back to haunt you.

The challenge is not the 15x itself. A well-designed system should handle it. The challenge is that every component in your stack - compute, networking, data, API, cache, messaging - has a different capacity cliff. They do not all fall over at the same load. They fall over in sequence, which turns a predictable scaling problem into a rolling incident chain.

This article breaks down each layer - what breaks first, why, and what the remediation looks like in practice. The framing is AWS and GCP because those are where most engineering teams actually operate, and the service primitives are different enough to matter.

Scope: This is not about infinite scale or greenfield rewrites. It is about taking an existing system - one that was designed sensibly at 1x - and hardening it to survive a genuine 15x traffic event with acceptable degradation and recovery time.

How load scaled during enrollment

Open enrollment does not spike all at once. It ramps. The first week is manageable - returning members, early filers. By week two the employer group deadlines hit. By week three the ACA deadline anxiety sets in and load becomes genuinely unpredictable. Here is how it actually played out in 2024.

Enrollment period | Load vs baseline | Risk profile | What we saw
Baseline | 1x | Stable | Individual component bugs, not capacity
Growth | 2x - 4x | Manageable | Cache misses, connection pool exhaustion, under-provisioned DB
Strain | 5x - 8x | Elevated | Queue buildup, hot partitions, API throttling, disk I/O on DBs
Stress | 9x - 12x | Critical | Cascading failures across services, cold start storms, memory pressure
Survival | 13x - 15x | Incident | External dependency failures, DNS degradation, network saturation

The ACA deadline week was the critical window. We knew it was coming. We had two weeks of enrollment data to model what was approaching. We still had a P1 incident. The issue was not that we did not scale - we did. The issue was the sequence of failures that emerged as each layer hit its limit one after another, faster than we could respond reactively.

What broke and why: failure patterns across the stack

Six distinct failure categories surfaced during enrollment. Some we caught in pre-enrollment load testing. Others we found live. Here is how they mapped to what we actually experienced.

Resource contention
CPU, memory, disk, or network bandwidth gets fully consumed. Requests queue or fail. Classic at the compute and DB layers.
Connection pool exhaustion
Finite connection limits - database connections, thread pools, socket limits - fill up. New requests hang waiting for a slot. Hard to detect without the right metrics.
Hot partitions
Uneven key distribution routes disproportionate load to a single shard or partition. Happens in Kafka, DynamoDB, Redis, and any hash-routed system.
Fan-out amplification
One inbound request generates many downstream calls. At 1x this is invisible. At 15x it turns into a multiplier that overwhelms dependencies.
Cascading failure
A slow dependency causes caller timeouts, which back up threads, which cause memory pressure, which triggers OOM kills. One failure becomes systemic.
Cold start penalty
Scale-out events bring new pods or functions online, but they are not warm. Under heavy load the startup latency makes them useless during the spike they were meant to absorb.
The compounding problem: During the November 15th P1, three of these fired simultaneously. The plan eligibility event queue backed up (hot partition), which caused the enrollment API connection pool to fill, which triggered a cold start storm as the eligibility pods restarted under load. By the time our Dynatrace alert fired, we were already 8 minutes into user-facing errors.

Compute failures: the cold start storm on Nov 15

The eligibility service is a Spring Boot application. It loads a member eligibility cache from DynamoDB on startup - around 12 seconds of initialization before it can serve traffic. Under normal conditions this is invisible. During the November 15th traffic surge, HPA fired and spun up 40 new eligibility pods. All 40 were in the 12-second boot window simultaneously while the existing 18 pods were fully saturated. The result was 45 seconds of degraded eligibility checks across the member portal.

Container-based workloads (Kubernetes)

The Kubernetes scaling story has two components that need to work together: the Horizontal Pod Autoscaler (HPA) for pod count and the Cluster Autoscaler (or Karpenter on AWS) for node capacity. Most teams configure HPA correctly and then discover that their cluster autoscaler is too slow when the nodes are not there.

  • Set HPA targets based on a combination of CPU and custom application metrics - pure CPU scaling lags too much under bursty traffic.
  • Set minReplicas high enough to absorb the first 2x of traffic without triggering a scale event. Cold starts during the opening wave are the worst time to be provisioning capacity.
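  • Use pod disruption budgets (PDBs) to keep node scale-down and consolidation from evicting too many replicas at once after the spike. PDBs limit voluntary evictions such as node drains; they do not limit HPA scale-down itself.
  • Set stabilizationWindowSeconds on scale-down conservatively (the default is 300 seconds; consider a longer window around a known event). Scale up fast, scale down slow. Thrashing after a spike is its own incident.
  • On AWS, Karpenter generally provisions nodes faster than the legacy Cluster Autoscaler because it launches instances directly rather than resizing node groups. On GCP, GKE Autopilot handles this natively.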
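To make the "pure CPU scaling lags" point concrete, here is a minimal sketch of the replica calculation the HPA documents for a single metric: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The function name and the sample numbers are illustrative assumptions, not values from this platform.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Approximate the HPA scaling rule for one metric."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical: 20 pods averaging 95% CPU against a 60% target.
print(desired_replicas(20, 95, 60))
```

The calculation only reacts to what the metric already shows, so a metric that trails the traffic (such as averaged CPU) produces a late answer. That is the argument for adding a request-rate or queue-depth metric alongside CPU.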

The cold start storm

A cold start storm is what happens when your autoscaler fires correctly but your new pods are useless for the first 30-90 seconds of their life - and those 30-90 seconds are exactly when you need them. It is common in JVM-based services (Spring Boot, Scala) and anything that pre-warms a local cache on startup.

The failure shape: traffic spikes, HPA fires, 40 new pods begin starting, the existing 20 pods are saturated, latency climbs, error rate climbs. Forty-five seconds later the new pods come online. The autoscaler worked perfectly. The system still had an incident.

Cold start storm - Spring Boot service, 45s startup time
Figure: timeline from T+0s to T+75s. Traffic: spike to 8x baseline at T+0s. Capacity: the existing 20 pods are saturated, handling 8x alone, until the 40 new pods go live. Errors: 5xx rate of 12-28% during the gap, dropping below 0.1% once the new pods are live. New pods: JVM init (10s), Spring context (12s), cache warmup (8s), then ready. There is a 45s gap between the spike and new capacity, and users see errors the entire time. Pre-warming eliminates this entirely.
Before: single readiness probe, no startup or liveness
# One probe doing everything wrong.
# No startup probe - traffic routes in at 5s.
# No liveness probe - stuck pods never restart.
# /health checks a dependency, not app state.

readinessProbe:
  httpGet:
    path: /health   # checks DB conn - too broad
    port: 8080
  initialDelaySeconds: 5   # Spring still loading
  periodSeconds: 5

# At 5s the app is mid-boot. It accepts the
# request, tries to hit a bean that isn't
# initialized yet, and throws 500s.
# No liveness probe means a deadlocked pod
# stays in rotation indefinitely.
After: startup + readiness + liveness, all doing different jobs
# Three probes. Each does a different job.
# Startup: gate traffic until warmup is done.
# Readiness: remove from LB if temporarily unhealthy.
# Liveness: restart if stuck / deadlocked.

startupProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  failureThreshold: 30
  periodSeconds: 3   # up to 90s for warmup

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3   # restart after 30s unresponsive

# Liveness and readiness are separate endpoints in Spring.
# Never point liveness at a dependency check -
# one slow DB response should not restart your pod.
Pre-warming eliminates the gap: Set minReplicas to 2x your baseline pod count. The idle cost of carrying extra warm pods is almost always less than the cost of one cold-start-driven incident per quarter.

Serverless and function-based compute

Two lower-tier services - tax credit calculation and address validation - were running as Lambda functions behind API Gateway. These hit the default regional concurrency limit (1,000) on November 15th within 6 minutes of the peak onset. Tax credit calculation started returning 429s to the enrollment API, which propagated into failed plan submissions.

AWS Lambda
  • Default regional concurrency limit is 1,000. Raise it proactively. At 15x you will hit it.
  • Use provisioned concurrency for latency-sensitive paths. Eliminates cold starts at the cost of always-on spend.
  • Lambda SnapStart (Java) removes most of the JVM init cost from cold starts by restoring a snapshot of an already-initialized execution environment - matters if your functions are JVM-based.
  • SQS + Lambda with reserved concurrency lets you absorb queue depth without hammering downstream dependencies.
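A rough way to tell in advance whether a function will hit the concurrency limit is the standard estimate: concurrent executions ≈ requests per second × average duration in seconds. The sketch below applies it; the request rates and the 500 ms duration are hypothetical, since the case study does not give per-function figures.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_s: float) -> int:
    """Estimate concurrent executions needed to serve a steady request rate."""
    return math.ceil(requests_per_second * avg_duration_s)


def exceeds_limit(requests_per_second: float, avg_duration_s: float,
                  limit: int = 1000) -> bool:
    """True if the estimate is above the account's regional concurrency limit."""
    return required_concurrency(requests_per_second, avg_duration_s) > limit


# Hypothetical: 400 req/s at baseline, 500 ms average duration.
print(required_concurrency(400, 0.5))   # baseline estimate
print(exceeds_limit(400 * 15, 0.5))     # same function at 15x
```

Note that the limit is shared by every function in the region, so the estimate has to be summed across functions before comparing it to the limit.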
GCP Cloud Run
  • Set a minimum instance count to keep warm instances alive at baseline. Going from zero to one under load adds 2-4 seconds of cold start latency at exactly the worst moment.
  • Cloud Run only allocates CPU during request handling by default. Enable always-allocated CPU for background tasks that need it.
  • Concurrency per instance defaults to 80. Tune this based on your workload's memory footprint vs throughput.
  • Cloud Run paired with Pub/Sub push subscriptions handles async event-driven scale cleanly.
Compute scaling rule of thumb: Pre-warm to 3x baseline before a known traffic event. At 15x, the time it takes to provision new capacity is time your existing capacity is absorbing overload. Do not rely on reactive autoscaling alone for planned events.

Streaming failures: the enrollment event queue

Enrollment events - plan selections, dependent additions, coverage confirmations - flow through a Kafka topic called enrollment-events. Every enrollment action a member takes produces one or more events. Downstream consumers handle eligibility verification, document generation, and CRM sync. During peak enrollment, the event volume is the first thing that backs up.

Kafka

At 1x, Kafka partition count is a footnote. At 15x, it determines your maximum consumer parallelism. Partition count is effectively fixed at topic creation - you can add partitions later but never remove them, and adding them changes which partition each key maps to, so per-key ordering breaks for existing keys and consumers have to rebalance. Size it generously from the start.

The most common failure mode is a consumer lag storm. Traffic spikes, producers publish faster than consumers can process, lag accumulates, and consumers fall behind by hours. If you have alerts on lag, this surfaces quickly. If you do not, you find out when downstream systems start serving stale data or timing out.

  • Partition count should be at least 2x your expected peak consumer parallelism. This gives you room to scale consumer group size without repartitioning.
  • Watch for hot partitions when your partition key has low cardinality - all messages for a given user or tenant routing to the same partition is a classic trap.
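  • Tune max.poll.records together with max.poll.interval.ms and session.timeout.ms. Under high load, a consumer that processes too slowly exceeds max.poll.interval.ms between polls (or, if it is starved badly enough to miss heartbeats, its session timeout), gets removed from the group, triggers a full group rebalance, and makes the lag problem significantly worse before it gets better.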
  • Use a dead-letter topic for poison messages. A single malformed message in a Kafka partition will stall that partition's consumers indefinitely if you do not handle it.
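The lag storm described above is simple arithmetic, and it is worth running before the event rather than during it. This sketch assumes constant produce and consume rates; the 300 events/sec consumer throughput is a hypothetical figure for illustration.

```python
import math


def lag_after(seconds: float, produce_rate: float, consume_rate: float,
              start_lag: float = 0) -> float:
    """Messages of lag after `seconds` at constant rates (never below zero)."""
    return max(0, start_lag + (produce_rate - consume_rate) * seconds)


def drain_seconds(lag: float, produce_rate: float, consume_rate: float) -> float:
    """Time to clear `lag`; infinite if consumers are not faster than producers."""
    if consume_rate <= produce_rate:
        return math.inf
    return lag / (consume_rate - produce_rate)


# Hypothetical: one partition receives 800 events/sec, its consumer handles 300.
print(lag_after(60, 800, 300))        # lag built up in the first minute
print(drain_seconds(30000, 50, 300))  # time to drain once traffic is back at 50/sec
```

The useful output is the drain time: it tells you how long downstream systems stay stale after the spike ends, which is usually much longer than the spike itself.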

Edge case: the hot partition trap

Our enrollment-events topic had 12 partitions, keyed on plan_id. The assumption was that enrollment events would distribute roughly evenly across plans. They do not. During the employer group deadline week, three large employer groups - all enrolled in the same PPO plan - submitted simultaneously. Every event from those groups routed to a single partition. That partition's consumer hit 100% CPU. Lag climbed to 390,000 messages. Downstream eligibility verification stalled. Members who completed enrollment did not receive their confirmation documents for over 90 minutes.

The fix was a composite partition key: member_id + event_type. Applied after the November 8th incident. No hot partitions during the November 15th peak despite higher total volume.
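The effect of the key change can be checked offline before touching the producer. The sketch below compares the share of events landing on the busiest partition for a plan_id key and for a member_id + event_type key. It is an illustration under assumptions: the event data is synthetic, and MD5 stands in for the partitioner's hash (Kafka's default partitioner uses murmur2), so exact partition numbers will differ from production.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 12  # matches the enrollment-events topic described above


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (stand-in for murmur2)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def max_partition_share(keys) -> float:
    """Fraction of all events that land on the busiest partition."""
    keys = list(keys)
    counts = Counter(partition_for(k) for k in keys)
    return max(counts.values()) / len(keys)


# Synthetic burst: 1,000 members on the same plan, three events each.
events = [
    {"plan_id": "ppo-gold", "member_id": f"m{i}", "event_type": t}
    for i in range(1000)
    for t in ("PLAN_SELECTION", "DEPENDENT_ADD", "COVERAGE_CONFIRMATION")
]

by_plan = max_partition_share(e["plan_id"] for e in events)
by_member = max_partition_share(f'{e["member_id"]}:{e["event_type"]}' for e in events)
print(f"plan_id key: {by_plan:.0%} of events on the busiest partition")
print(f"member_id + event_type key: {by_member:.0%}")
```

One trade-off to be aware of: with event_type in the key, events of different types for the same member can land on different partitions, so ordering is only preserved per member and event type, not per member.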

Incident timeline - enrollment-events topic, plan_id hot partition, Nov 8 employer deadline
Time | Stage | Detail
T+0:00 | Baseline | 50 events/sec, lag ~0, 12 consumers healthy
T+0:02 | Spike begins | Tenant X batch starts. 800 events/sec, all to partition 7
T+0:05 | Consumer pegged | P-7 consumer at 100% CPU. Lag: 12,000 msgs
T+0:08 | Alert fires | Consumer lag > 10k. Lag: 41,000. Downstream stale.
T+0:14 | Session timeout | Slow consumer hits session.timeout.ms. Group rebalance.
T+0:17 | Rebalance storm | All 12 consumers paused. Lag: 390,000 msgs
T+0:31 | Recovery begins | Tenant batch ends. Consumers drain. Full clear: T+1:40
390K peak lag (msgs)
17 min rebalance onset
100 min full recovery
11/12 idle consumers at peak
Before vs after: enrollment event partition key redesign
Figure: partition load before and after the key change. Bad, tenant_id as key: the producer sends key=tenant_X, all 800 e/s land on P-7 while P-0, P-3, and P-11 sit idle, and the P-7 consumer runs at 100% with lag of 390,000 msgs. One consumer absorbs all load; timeout leads to rebalance, and all consumers pause. Good, hash(tenant_id + event_type) as key: P-0, P-3, P-7, and P-11 each take ~200 e/s and lag stays under 200 msgs. Load is distributed, and any consumer handles a 4x spike without falling over.
Bad - low cardinality key
# All events for one tenant → one partition

def get_partition_key(event):
    return event["tenant_id"]

# acme-corp hits 800 events/sec at month-end.
# All route to P-7. 11 consumers idle.
# Session timeout → rebalance → 3 min outage.
Good - composite hash key
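import hashlib

def get_partition_key(event, num_partitions=12):
    # Despite the name, this returns a partition number,
    # to be passed to the producer as an explicit partition.
    raw = f"{event['tenant_id']}:{event['type']}"
    h = int(hashlib.md5(raw.encode()).hexdigest(), 16)
    return h % num_partitions

# acme-corp:PLAN_SELECTION, acme-corp:DEPENDENT_ADD,
# acme-corp:DOC_UPLOAD → usually different partitions.
# A tenant's 800 enrollment events/sec now spread across
# up to one partition per event type. For a finer spread,
# add a high-cardinality field such as member_id to the key.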

AWS Kinesis

Kinesis scales through shards, and each shard gives you 1 MB/s write and 2 MB/s read. At 15x, you need to have pre-calculated how many shards that translates to for your payload sizes, and you need to verify your application does not have write-hotspotting on a subset of shards.
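The shard pre-calculation can be sketched as follows. It uses the 1 MB/s per-shard write limit quoted above plus the per-shard limit of 1,000 records per second on writes, and takes whichever constraint needs more shards. The traffic figures and the headroom factor are hypothetical, and the result assumes writes are spread evenly across shards.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # per-shard write throughput limit
WRITE_RECORDS_PER_SHARD = 1000  # per-shard write record limit


def shards_needed(records_per_second: float, avg_record_kb: float,
                  headroom: float = 1.0) -> int:
    """Shards required for a write load, with an optional headroom multiplier."""
    mb_per_second = records_per_second * avg_record_kb / 1024
    by_bytes = math.ceil(mb_per_second * headroom / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_per_second * headroom / WRITE_RECORDS_PER_SHARD)
    return max(1, by_bytes, by_records)


# Hypothetical 15x load: 12,000 records/sec at 2 KB each, 25% headroom.
print(shards_needed(12_000, 2, headroom=1.25))
```

With small records the record-count limit dominates, and with large records the byte limit does, which is why both are checked.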

Kinesis enhanced fan-out removes the shared 2 MB/s per-shard read limit for critical consumers by giving each registered consumer its own dedicated 2 MB/s of read throughput per shard. Use it for latency-sensitive downstream consumers.

GCP Pub/Sub

Pub/Sub handles scale more transparently than Kafka or Kinesis - Google manages the partition layer. The scaling concern shifts to your subscription configuration and consumer fleet. Unacknowledged messages are held only for the subscription's message retention duration (7 days by default). Under extreme spike conditions, messages that sit in the backlog longer than the configured retention are deleted without ever being delivered.

Streaming back-pressure: The common failure pattern at 10x+ is producers outpacing consumers by enough that backlog exceeds retention. Messages start expiring before they are processed. This is a data loss event, not just a latency event. Size your consumer fleet to handle 15x with headroom, and set retention higher than you think you need.
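How long a spike can last before messages start expiring can be estimated with a simple model. Assume the backlog starts empty, rates are constant, and messages are processed in order. After t seconds the consumers have worked through everything produced in the first t × c/p seconds, so the oldest unprocessed message is t × (1 - c/p) seconds old. Data loss begins when that age reaches the retention period. The rates below are hypothetical.

```python
import math


def seconds_until_expiry(produce_rate: float, consume_rate: float,
                         retention_seconds: float) -> float:
    """Spike duration after which the oldest unprocessed message hits retention."""
    if consume_rate >= produce_rate:
        return math.inf  # consumers keep up; nothing ages out
    return retention_seconds / (1 - consume_rate / produce_rate)


# Hypothetical: producers at 1,000 msg/s, consumers at 500 msg/s, 1 hour retention.
print(seconds_until_expiry(1000, 500, 3600))
```

The result scales linearly with retention, which is why raising retention is the cheapest insurance available when the consumer fleet cannot be scaled quickly.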

Database failures: the connection wall and the eligibility bottleneck

The database is almost always the hardest layer to scale horizontally. Compute is stateless and elastic. Databases are stateful, and statefulness fights elasticity at every step. At 15x, your database scaling strategy either works or it does not, and there is very little you can do reactively once load is on the system.

Relational databases

Read replicas are the first line of defense. Route read traffic to replicas, protect the primary for writes. The failure mode here is replication lag - under write-heavy load, replicas fall behind the primary, and reads from the replica serve stale data. Monitor replica lag as a primary SLO component, not an afterthought.

Connection pooling is essential at scale. At 15x traffic with a naive connection-per-request pattern, you saturate the database's max connection limit before CPU becomes a concern. PgBouncer for Postgres, ProxySQL for MySQL, and RDS Proxy on AWS sit in front of your database and multiplex application connections onto a smaller pool of database connections. On GCP, note that the Cloud SQL Auth Proxy secures and authorizes connections but does not pool them, so it needs a pooler such as PgBouncer alongside it.

AWS RDS / Aurora
  • Aurora auto-scaling read replicas can add capacity in minutes. Wire this up before a spike event, not during.
  • Aurora Serverless v2 scales Aurora capacity units (ACUs) up and down in fine-grained increments. Useful for unpredictable workloads.
  • RDS Proxy pools connections and reduces the failover impact during multi-AZ events. At high scale, its connection brokering is worth the overhead.
  • Use Performance Insights when diagnosing DB slowness. Wait events tell you more than CPU or IOPS in isolation.
GCP Cloud SQL / AlloyDB
  • Cloud SQL read replicas add horizontal read capacity. Cross-region replicas give you disaster recovery plus geographic read distribution.
  • AlloyDB for Postgres is built for analytical and hybrid workloads at scale. Its columnar engine handles mixed OLTP/OLAP patterns that Cloud SQL struggles with.
  • AlloyDB scales to 128 vCPUs and 864 GB RAM. For write-heavy OLTP at extreme scale, this is where Cloud SQL runs out of runway.
  • Cloud SQL Auth Proxy handles IAM-based authorization and encrypted connectivity without managing certificates or IP allowlists separately. It does not pool connections, so pair it with PgBouncer or an application-side pool.

NoSQL and wide-column stores

DynamoDB (AWS) and Bigtable (GCP) handle horizontal scale differently from relational systems, but they introduce their own failure modes at 15x. DynamoDB's most common scaling problem is provisioned capacity mode under a spike - on-demand mode avoids this but costs significantly more at sustained high load. The trade-off is worth modeling before a traffic event.

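Bigtable's performance is sensitive to row key design. A monotonically increasing row key - like a timestamp prefix - creates a write hotspot on the most recent tablet. Distribute writes across the key space with a hash prefix, or lead the key with a high-cardinality field such as a member ID and put the timestamp after it. A reversed timestamp on its own is still sequential, so it helps with reading newest-first but not with write hotspots.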
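A hash prefix can be as small as a two-digit bucket derived from a stable field. The sketch below builds such a key; the key layout, the "#" separator, and the bucket count of 16 are illustrative assumptions rather than the platform's schema.

```python
import hashlib


def salted_row_key(entity_id: str, timestamp_ms: int, buckets: int = 16) -> str:
    """Row key of the form <bucket>#<entity_id>#<timestamp_ms>."""
    digest = hashlib.sha1(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    # The bucket depends only on entity_id, so one entity's rows stay contiguous
    # while different entities spread across the key space.
    return f"{bucket:02d}#{entity_id}#{timestamp_ms}"


print(salted_row_key("member-1001", 1700000000000))
print(salted_row_key("member-1002", 1700000000001))
```

Because the bucket is derived from the entity ID rather than chosen at random, a read for one entity still targets a single key range instead of fanning out across every bucket.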

Caching

At 15x, your cache hit rate determines whether your database survives. A cache hit rate drop from 95% to 85% under spike conditions triples the miss rate (5% to 15%), which means 3x the database reads for the same traffic, on top of the traffic increase itself. Cache stampede - where many requests simultaneously miss on the same key - is a known failure mode during traffic spikes. Use probabilistic early expiry or locking patterns to prevent simultaneous cache rebuilds.
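The locking pattern can be sketched in a few lines for an asyncio service: concurrent misses on the same key wait on a per-key lock, and only the first one runs the expensive rebuild. This is an in-process illustration with hypothetical names; it omits TTLs and eviction, and a multi-pod deployment would need the lock in the shared cache (for example Redis) rather than in local memory.

```python
import asyncio


class SingleFlightCache:
    """In-process cache where concurrent misses on a key share one rebuild."""

    def __init__(self):
        self._values = {}
        self._locks = {}

    async def get(self, key, loader):
        if key in self._values:
            return self._values[key]
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            # Re-check: another task may have filled the key while we waited.
            if key not in self._values:
                self._values[key] = await loader()
            return self._values[key]


calls = []


async def slow_loader():
    calls.append(1)            # count how many rebuilds actually run
    await asyncio.sleep(0.01)  # stand-in for a database read
    return "eligible"


async def main():
    cache = SingleFlightCache()
    return await asyncio.gather(
        *(cache.get("member:42", slow_loader) for _ in range(50))
    )


results = asyncio.run(main())
print(len(results), "requests served by", len(calls), "database read(s)")
```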

Edge case: the connection wall

The enrollment API service was configured with a SQLAlchemy connection pool of 5 per pod, max overflow 10, so each pod could hold between 5 and 15 connections. At baseline with 20 pods this consumed 100-300 connections against a db.r6g.2xlarge instance with a max_connections of 500. When enrollment traffic drove HPA to scale the enrollment API to 95 pods, the base pools alone accounted for 475 connections (95 × 5), and overflow allowed up to 1,425 (95 × 15), against a 500-connection limit. As the incident timeline below shows, the first FATAL errors appeared when pod 101 started and requested connection #501.

The failure mode was a cliff, not a slope. Connection #501 failed with FATAL. Within 90 seconds the enrollment API logs filled with "remaining connection slots are reserved" errors. The ALB health checks still passed because the pods themselves were running - just unable to reach the database. Our error rate alert was set at a 5% threshold; by the time it fired we were at 34% 5xx on enrollment submissions.
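The connection wall is predictable from three numbers: pods, pool settings, and max_connections. A minimal sketch of that budget check follows, using the pool_size=5, max_overflow=10, and 500-connection figures from this incident; the function names are illustrative.

```python
def connection_demand(pods: int, pool_size: int, max_overflow: int):
    """Return (steady, burst) connection counts for a given pod count."""
    steady = pods * pool_size
    burst = pods * (pool_size + max_overflow)
    return steady, burst


def max_safe_pods(max_connections: int, pool_size: int, max_overflow: int,
                  reserved: int = 0) -> int:
    """Largest pod count whose worst-case demand still fits the DB limit."""
    return (max_connections - reserved) // (pool_size + max_overflow)


print(connection_demand(20, 5, 10))  # baseline
print(connection_demand(95, 5, 10))  # after scale-out
print(max_safe_pods(500, 5, 10))     # HPA maxReplicas that the DB can back
```

Running this against the HPA's maxReplicas before the event would have shown that the autoscaler was allowed to scale well past what the database could accept.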

Incident timeline - Aurora connection exhaustion, Nov 15 peak, 8x baseline traffic
Time | Stage | Detail
T+0:00 | Baseline | 20 pods × 5 pool = 100 conns. DB limit: 500. 20% used.
T+0:03 | Scale-out | HPA fires. Pods 20 → 80. 400 connections.
T+0:09 | Approaching limit | 95 pods running. 475 conns. 95% of limit.
T+0:11 | Wall hit | Pod 101 starts. Connection #501 fails. DB errors.
T+0:12 | Error cascade | 5xx rate spikes to 34%. LB health checks still pass.
T+0:15 | Alert fires | 3 minutes of 5xx before alert threshold crossed.
500 DB connection limit
101 pods at failure point
34% 5xx rate at peak
20 real DB conns with pooler
Before vs after: enrollment API database connection architecture
Figure: connection architecture before and after. Without a pooler: each pod holds 5 connections directly to Postgres, 100 pods total 500 connections (500/500, at the limit), and when pod 101 connects it gets a FATAL error with no graceful degradation. With PgBouncer / RDS Proxy: pods connect to the proxy with 5 connections each (500 pods OK), and 500 app connections are multiplexed into 20 real DB connections (Postgres at 20/500).
Bad - pool grows with pod count
# Each pod opens pool directly to DB.
# 100 pods x 5 = 500 = hard limit.

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://db:5432/app",
    pool_size=5,
    max_overflow=10,  # 15 per pod at burst!
)
# 100 pods x 15 = 1,500 attempted.
# DB rejects at 500. FATAL at scale.
Good - small pool + RDS Proxy / PgBouncer
# App connects to proxy, not DB.
# Proxy multiplexes into small real pool.

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://rds-proxy:5432/app",
    pool_size=2,
    max_overflow=3,
    pool_pre_ping=True,
)
# 500 pods x 5 = 2,500 app conns
# Proxy holds 20 real DB connections.
# DB sees 20 conns regardless of pod count.

API failures: the eligibility service cascade and the tax credit 429 flood

The enrollment API had two significant failure modes at the API layer. The first was a fan-out ratio problem on the plan detail endpoint - each request fanned out to eight downstream services (nine calls in total), one of which was an external tax credit calculation API with a 500 req/s rate limit. The second was a cascading failure triggered by eligibility service slowdown that propagated all the way to the member-facing portal within 4 minutes.

Rate limiting and throttling

Rate limiting at the API layer protects your backend, and it also helps your callers. A 429 with a proper Retry-After header is far better than a 503 or a timeout. Clients that can handle 429 will back off and retry. Clients that get timeouts often do not, which generates more load at the worst moment.
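On the client side, honoring a 429 means reading Retry-After, which may be either a number of seconds or an HTTP date, and falling back to jittered exponential backoff when the header is absent. A minimal sketch of that logic, with hypothetical function and parameter names:

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(retry_after, attempt, base=0.5, cap=30.0, now=None):
    """Seconds to wait before retrying a request that received a 429."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():
            return float(value)  # delta-seconds form
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return max(0.0, (when - now).total_seconds())
    # No usable header: capped exponential backoff with full jitter.
    return random.uniform(0, min(cap, base * 2 ** attempt))


print(retry_delay("7", attempt=0))
print(retry_delay(None, attempt=3))
```

The jitter matters as much as the backoff: without it, every client that was throttled at the same moment retries at the same moment.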

Implement rate limiting at the gateway layer (API Gateway on AWS, Cloud Endpoints or Apigee on GCP) rather than inside your application code. Gateway-level rate limiting happens before your compute receives the request, which means it actually protects your backend instead of letting the traffic in first.

Circuit breakers

A circuit breaker sits between your API and its dependencies. When a downstream service starts failing or slowing, the circuit opens and requests fail fast without waiting for the timeout. This prevents thread exhaustion and connection pool depletion from slow dependencies.

The half-open state is where most implementations get tricky. After the circuit opens, you need to periodically probe whether the downstream service has recovered. Sending a single probe request and reopening on success is common. A better pattern is to slowly ramp the probe rate rather than flipping from zero to full traffic immediately.
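The ramped half-open behavior can be sketched as a small state machine. This is an illustration of the idea, not how any particular library implements it: the class name, the "admit one request in N" ramp, and the injectable clock are all assumptions made for the example.

```python
import time


class RampingBreaker:
    """Circuit breaker whose half-open state re-admits traffic in steps."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 admit_one_in=(10, 4, 2), clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.admit_one_in = admit_one_in  # ramp: 1 in 10, then 1 in 4, then 1 in 2
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.step = 0  # position in the ramp while half-open
        self.seen = 0  # requests seen at the current ramp step

    def allow(self):
        """Return True if the caller may send this request downstream."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return False  # fail fast, no downstream call
            self.state, self.step, self.seen = "half_open", 0, 0
        if self.state == "half_open":
            self.seen += 1
            # Admit the first request of every N; reject the rest.
            return (self.seen - 1) % self.admit_one_in[self.step] == 0
        return True

    def record_success(self):
        if self.state == "half_open":
            self.step += 1
            self.seen = 0
            if self.step >= len(self.admit_one_in):
                self.state, self.failures = "closed", 0
        else:
            self.failures = 0

    def record_failure(self):
        if self.state == "half_open":
            self._trip()  # any failed probe reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.failures = 0
        self.opened_at = self.clock()
```

Callers check allow() before each downstream call and report the outcome with record_success() or record_failure(). A production version would step the ramp on a success ratio over a window rather than on a single success, but the shape is the same: recovery is a ramp, not a switch.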

Load shedding

Load shedding is the intentional rejection of lower-priority requests when the system is at capacity. This is different from rate limiting - rate limiting is per-client, load shedding is system-wide. The implementation requires you to have a priority model for your traffic. Health checks and critical paths should never be shed. Background jobs and low-priority reads are the first to go.

  • Use API Gateway's usage plans on AWS or Apigee rate limit policies on GCP for per-client throttling.
  • Implement circuit breakers via a sidecar (Envoy, Linkerd) or a library (Resilience4j, Polly) rather than hand-rolling them.
  • Set aggressive timeouts on all downstream calls. A 30-second timeout does not protect you - it just delays the failure cascade. Most intra-service calls should time out in under 2 seconds.
  • Instrument your circuit breaker state as a metric. Knowing when circuits are open and how long they stay open is essential observability data during an incident.
  • Test your load shedding logic before you need it. Fire drill a traffic spike against a staging environment and verify that priority traffic is preserved.
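A priority model for load shedding does not need to be elaborate. The sketch below assigns each request a tier and rejects lower tiers first as utilization rises. The tier names, the thresholds, and the utilization signal are assumptions chosen for illustration; in practice the signal might be thread pool occupancy or in-flight request count.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0    # health checks, enrollment submission: never shed
    NORMAL = 1      # interactive reads
    BACKGROUND = 2  # batch jobs, low-priority reads: first to go


# Utilization at or above which each tier starts being rejected (assumed values).
SHED_AT = {Priority.BACKGROUND: 0.70, Priority.NORMAL: 0.90}


def should_shed(priority: Priority, utilization: float) -> bool:
    """True if a request of this priority should be rejected right now."""
    threshold = SHED_AT.get(priority)
    return threshold is not None and utilization >= threshold


for level in (0.5, 0.75, 0.95):
    shed = [p.name for p in Priority if should_shed(p, level)]
    print(f"utilization {level:.0%}: shedding {shed or 'nothing'}")
```

Keeping the decision in one small pure function makes the fire drill in the last bullet practical: the thresholds can be exercised in a unit test long before a staging load test.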

Edge case: fan-out amplification at 15x

The GET /member/plan-options endpoint - the core plan comparison screen - called eight downstream services in sequence: member profile, plan inventory, network directory, formulary, cost estimator, subsidy calculator, plan rules, and the external tax credit API (two calls). At 1x this generated around 200 tax credit API calls per second. At 15x it generated 3,000. The external API's rate limit was 500 req/s. We hit it 4 minutes into the December 15th final deadline peak.
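The fan-out arithmetic is worth writing down explicitly, because it shows where the limit is actually crossed. The sketch below uses the per-request call counts described above (seven internal services called once, the tax credit API called twice) and the 100 req/s baseline; the service labels are shorthand for this example.

```python
CALLS_PER_REQUEST = {
    "member-profile": 1, "plan-inventory": 1, "network-directory": 1,
    "formulary": 1, "cost-estimator": 1, "subsidy-calc": 1,
    "plan-rules": 1, "tax-credit-api": 2,
}


def downstream_load(inbound_rps: float) -> dict:
    """Requests per second each dependency receives for a given inbound rate."""
    return {svc: n * inbound_rps for svc, n in CALLS_PER_REQUEST.items()}


def max_inbound_rps(service: str, limit_rps: float) -> float:
    """Inbound rate at which a dependency's rate limit is reached."""
    return limit_rps / CALLS_PER_REQUEST[service]


baseline = 100
peak = downstream_load(15 * baseline)
print(peak["tax-credit-api"])                  # load on the external API at 15x
print(sum(peak.values()))                      # total downstream calls at 15x
print(max_inbound_rps("tax-credit-api", 500))  # inbound rate that hits the limit
```

The last number is the important one: with two calls per request and a 500 req/s limit, the external API saturates at 250 inbound requests per second, which is 2.5x baseline, long before 15x.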

Plan options endpoint fan-out - 8:1 ratio, external tax credit API rate limit breach
Figure: GET /member/plan-options fan-out. Inbound: 100 req/s at 1x, 1,500 req/s at 15x.
Service | Calls per request | Load at 1x | Load at 15x
member-profile svc | 1 | 100/s | 1,500/s
plan-inventory svc | 1 | 100/s | 1,500/s
network directory | 1 | 100/s | 1,500/s
formulary svc | 1 | 100/s | 1,500/s
cost estimator | 1 | 100/s | 1,500/s
subsidy calc | 1 | 100/s | 1,500/s
plan rules svc | 1 | 100/s | 1,500/s
tax credit API ⚠ | 2 | 200/s | 3,000/s
Total downstream at 15x: 13,500/s (7 × 1,500 + 3,000). Tax API limit 500 req/s. At 15x the fan-out sends 3,000/s. Circuit breakers + caching required.
Bad - sequential fan-out, no cache on external
async def get_plan_options(member_id):
    # Sequential. At 15x: tax_api gets 3,000/s.
    profile    = await member_profile.get(member_id)
    inventory  = await plan_inventory.list(member_id)
    directory  = await network_directory.lookup(member_id)
    formulary  = await formulary_svc.get(member_id)
    cost       = await cost_estimator.calculate(member_id)
    subsidy    = await subsidy_calc.run(member_id)
    rules      = await plan_rules.evaluate(member_id)
    tax        = await tax_api.calculate(member_id)
    return assemble_plan_options(profile, inventory, directory, formulary, cost, subsidy, rules, tax)
Good - parallel gather + cache external calls
import asyncio

async def get_plan_options(member_id):
    # Parallel for independent calls.
    # Cache external API - 60s TTL cuts calls 90%+
    profile, inventory, directory, formulary, cost, subsidy, rules = \
        await asyncio.gather(
            member_profile.get(member_id),
            plan_inventory.list(member_id),
            network_directory.lookup(member_id),
            formulary_svc.get(member_id),
            cost_estimator.calculate(member_id),
            subsidy_calc.run(member_id),
            plan_rules.evaluate(member_id),
        )
    # Keyed by region: assumes the tax credit result is the same
    # for every member in a region. If it is member-specific,
    # key on member_id instead (and expect a lower hit rate).
    tax = await cache.get_or_fetch(
        key=f"tax:{profile.region}",
        fetch=lambda: tax_api.calculate(member_id),
        ttl=60
    )
    return assemble_plan_options(profile, inventory, directory, formulary, cost, subsidy, rules, tax)

Edge case: cascade anatomy

On November 15th at 2:14 PM EST, the eligibility service's Aurora read replica started returning queries at 800ms instead of the normal 20ms. The enrollment API had a 30-second timeout on eligibility calls. It did not fail fast. It queued threads. Thread pool filled. Within 4 minutes the cascade had propagated through the API gateway to the member portal. Members saw blank plan comparison screens and timeout errors on plan submission.

The eligibility service had no circuit breaker. The enrollment API had no circuit breaker protecting calls to eligibility. The 30-second default timeout was the only protection - and it was not protection at all. It was just a delayed failure. The fix post-incident: 1.5-second timeout, circuit breaker at 5 consecutive failures, graceful degradation to cached eligibility data when the circuit is open.

Nov 15 cascade: eligibility DB slowdown to member portal outage in 4 minutes
Time | Member portal | API gateway | Enrollment API | Eligibility svc | Eligibility DB
T+0 | 200ms p99 ✓ | healthy ✓ | threads 12/200 ✓ | 20ms ✓ | 800ms query 🔥
T+1m | still ok | latency +80ms | threads 140/200 ⚠ | 800ms ⚠ | still 800ms 🔥
T+4m | timeouts begin ⚠ | conn queue full 🔴 | threads 200/200 🔴 | timeouts 🔴 | still 800ms 🔥
1 slow DB query propagates left. Root cause: 30s timeout + no circuit breaker.
Bad - 30s timeout, no circuit breaker
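# 30s timeout. Threads fill up.
# 200 threads x 30s = pool gone in 40s.
import httpx

async def check_eligibility(member):
    async with httpx.AsyncClient(
        timeout=30.0  # too long - doesn't protect you
    ) as client:
        # POST, because httpx's get() does not accept a json body
        # (assumes eligibility-svc accepts POST on /check)
        return await client.post(
            "http://eligibility-svc/check", json=member
        )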
Good - tight timeout + circuit breaker
import httpx
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=30)
async def check_eligibility(member):
    async with httpx.AsyncClient(
        timeout=1.5  # tight. fail fast.
    ) as client:
        # POST, because httpx's get() does not accept a json body
        # (assumes eligibility-svc accepts POST on /check)
        return await client.post(
            "http://eligibility-svc/check", json=member
        )

# After 5 timeouts: circuit opens.
# Requests fail immediately. No thread hold.
# Downstream gets breathing room.
The 30-second timeout trap: A 30-second timeout does not protect you - it extends how long a cascade takes to play out. If eligibility-svc returns 800ms responses and your thread pool has 200 threads, you exhaust it in 40 seconds of sustained load. A 1.5-second timeout lets the circuit breaker open at failure 5 and stops the bleed. Tight timeouts are what let the circuit breaker do its job.

Ephemeral environments during enrollment

The development team was actively shipping enrollment bug fixes throughout the six-week period - sometimes multiple deploys per day. They had 12-15 open PRs at any given time during the first two weeks of enrollment, each with its own preview environment. This created a background infrastructure load that had not been accounted for in the enrollment capacity plan.

The core tension is that ephemeral environments need to be cheap enough to run at volume but realistic enough to be useful for testing. Getting this balance wrong in either direction causes problems: too cheap and they mask real scaling issues, too expensive and teams stop using them.

PR opened: Trigger detected via GitHub/GitLab webhook
Provision: Namespace or project spun up with templated resources
Deploy: App deployed with reduced replicas, mocked dependencies
Test & review: Preview URL, automated smoke tests, reviewer access
Teardown: Destroyed on PR close or merge, all resources released

Edge case: cost explosion at team scale

The team had been running isolated preview environments for months - each with its own RDS instance and Redis cluster. At 15 concurrent PRs during peak enrollment, those environments were consuming capacity on the same AWS account. The enrollment period AWS bill for preview environments alone was $41,000 in November against a typical $8,000/month. Nobody had connected the bill increase to the enrollment timeline until week three. Worse, three of the preview environments had idle RDS instances that were sharing the same VPC subnet ranges as the production Aurora cluster - creating unnecessary network congestion during peak.

Enrollment period preview environment cost - before and after guardrails were enforced
Figure: monthly preview environment cost, y-axis from $0 to $47k, x-axis ticks at weeks 1, 4, 8, 12, 16, and 18. Naive series (isolated DB per env): labels of $4k, $6k, $11k, $27k, and $44k, reaching $47k/mo. With guardrails (shared DB + TTL): $9k/mo.
Bad - full isolated stack per PR
# terraform/preview/main.tf
# Each PR: 1 RDS + 1 Redis + 8 pods.
# No TTL. Lives until manually closed.

resource "aws_db_instance" "preview" {
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  # ~$50/month idle per environment
  # 100 open PRs = $5,000/month just RDS
}
Good - namespace only, shared DB, hard TTL
# k8s namespace per PR. Shared RDS + Redis.
# TTL controller auto-destroys at 48h idle.

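- name: Create preview namespace
  run: |
    NS="pr-${{ github.event.number }}"
    # Idempotent: this step re-runs on every push to the PR,
    # so create-or-update instead of failing on "already exists".
    kubectl create namespace "$NS" --dry-run=client -o yaml | kubectl apply -f -
    kubectl create secret generic db-creds \
      --namespace "$NS" \
      --from-literal=url="$SHARED_PREVIEW_DB_URL" \
      --dry-run=client -o yaml | kubectl apply -f -
    kubectl label namespace "$NS" preview-ttl-hours="48" --overwrite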

# Per env: ~$0.40/day vs $16/day.
# 100 envs: $40/day vs $1,600/day.

What we changed for ephemeral environments

We introduced three changes mid-enrollment. First, a 48-hour TTL enforced by a namespace controller - any preview environment older than 48 hours with no traffic got automatically torn down. Second, all preview environments moved to shared RDS and Redis - isolated only at the application layer via namespaced table prefixes and Redis key prefixes. Third, preview environments were rate-limited at the ingress layer to 10 req/s per environment to prevent them from generating meaningful load on shared services.
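The TTL rule in the first change can be sketched as a small decision function. This is a minimal illustration, not the controller we ran: the `PreviewEnv` type, its field names, and the idea that the last-request timestamp comes from ingress metrics are all assumptions, and "48 hours with no traffic" is interpreted here as a full 48-hour window since the last request (or since creation, if the environment never received one).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

TTL = timedelta(hours=48)  # mirrors the preview-ttl-hours="48" namespace label


@dataclass
class PreviewEnv:
    # Hypothetical record; a real controller would read this from the
    # namespace metadata and from ingress traffic metrics.
    namespace: str
    created_at: datetime
    last_request_at: Optional[datetime]  # None if no traffic was ever seen


def is_expired(env: PreviewEnv, now: datetime, ttl: timedelta = TTL) -> bool:
    """True when the environment has gone a full TTL window without traffic."""
    last_activity = env.last_request_at or env.created_at
    return now - last_activity >= ttl


def namespaces_to_delete(envs: List[PreviewEnv], now: datetime) -> List[str]:
    """Names of namespaces the sweep would tear down on this pass."""
    return [env.namespace for env in envs if is_expired(env, now)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    envs = [
        PreviewEnv("pr-101", now - timedelta(hours=72), None),
        PreviewEnv("pr-102", now - timedelta(hours=72), now - timedelta(hours=2)),
    ]
    print(namespaces_to_delete(envs, now))
```

Keeping the decision separate from the deletion call makes the rule easy to unit test and lets the sweep run in a dry-run mode before it is allowed to delete anything.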

Shared vs isolated: what stayed isolated

We kept the enrollment-events Kafka topic isolated per preview environment - a separate topic per PR - because preview environments were used to test enrollment flow end-to-end, and a shared topic would have created cross-environment event contamination. Everything else moved to shared infrastructure. The cost dropped from $41,000 to $9,000 in December with no loss of testing fidelity.
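One way to keep the shared and isolated resources straight is to derive every per-PR name from the PR number in a single place. The sketch below is illustrative only: the function name and the exact prefix and topic formats are assumptions, not the naming scheme used on the engagement.

```python
from typing import Dict


def preview_resource_names(pr_number: int) -> Dict[str, str]:
    """Derive all per-PR identifiers from the PR number (hypothetical scheme)."""
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    slug = f"pr{pr_number}"
    return {
        # Kubernetes namespace, same shape as the workflow step above.
        "namespace": f"pr-{pr_number}",
        # Shared RDS: isolation by table prefix.
        "table_prefix": f"{slug}_",
        # Shared Redis: isolation by key prefix.
        "redis_key_prefix": f"{slug}:",
        # Kafka stays isolated: one enrollment-events topic per PR.
        "kafka_topic": f"enrollment-events.{slug}",
    }


if __name__ == "__main__":
    for kind, name in preview_resource_names(1234).items():
        print(f"{kind}: {name}")
```

Deriving the names in one function also gives the teardown job a deterministic list of what to remove when the PR closes.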

AWS patterns
  • EKS namespaces with namespace-scoped RBAC isolate ephemeral environments within a shared cluster cheaply.
  • Use AWS CodeBuild with environment-specific parameter overrides in SSM Parameter Store to inject config per environment.
  • Lambda + API Gateway can stand up a fully isolated API surface per PR with near-zero idle cost.
  • Aurora cloning (copy-on-write) gives you a database copy within minutes for environments that need real data. RDS snapshot restore also works, but restore time grows with database size.
GCP patterns
  • GCP Projects as the isolation boundary give the strongest resource separation but the most provisioning overhead. Good for security-sensitive workloads.
  • Cloud Run revisions per PR with traffic splitting let you test without a separate environment at all for simple service changes.
  • Cloud SQL cloning creates a full copy of a database in minutes, suitable for environments that need isolated data.
  • Firebase Hosting preview channels give you per-PR URLs for frontend assets without any infrastructure management.
Ephemeral environments under 15x thinking If you run 30 ephemeral environments simultaneously and each one generates meaningful load on shared services, you are effectively running a background load test at all times. Isolate shared dependencies with Istio VirtualService routing or mock services - do not let ephemeral traffic contaminate your staging or production database replicas.

Platform decisions: what we used and what we wish we had used

This was an AWS engagement - the client had been on AWS for seven years and was not moving. But several of the failures we encountered during enrollment had cleaner solutions in the AWS service catalog that we were not yet using. Here is the breakdown of what we had, what we added, and where we had gaps.

Layer | What we had | What we changed to | Why
Container orchestration | EKS + Cluster Autoscaler | EKS + Karpenter | Cluster Autoscaler added nodes in 90s. Karpenter targets under 60s. Added post-Nov 15.
HPA signal | CPU utilization only | CPU + MSK queue depth metric | CPU lagged actual load. Queue depth gave 90s earlier warning to scale.
Serverless | Lambda (default concurrency 1,000) | Lambda raised to 5,000 + SQS buffer | Hit default limit on tax credit Lambda. Raised + added SQS to absorb bursts.
Kafka partition key | MSK, plan_id key | MSK, member_id + event_type key | plan_id caused hot partition on PPO plan. Composite key eliminated it post-Nov 8.
DB connection management | Direct pod-to-Aurora, pool_size=5 | RDS Proxy, pool_size=2 | Hit 500-conn limit at 101 pods. RDS Proxy multiplexes 2,500 app to 20 DB conns.
Aurora read scaling | 1 read replica (manual) | Aurora auto-scaling replicas (2-5) | Replication lag hit 8s under write pressure. Auto-scaling replicas added before Dec 15.
External API caching | Direct calls to tax credit API, no cache | Redis cache, 60s TTL, in front of tax API | Rate-limited at 500 req/s; we sent 3,000/s. Cache cut calls by 91%.
Circuit breakers | None on inter-service calls | Resilience4j, 1.5s timeout | 30s timeout caused cascade on Nov 15. 1.5s + circuit breaker stops bleed at 5 failures.
Observability | CloudWatch + Dynatrace | CloudWatch + Dynatrace + MSK lag alerts | No lag alerts before Nov 8. Added lag-based alerts as primary early warning after incident.
Ephemeral environments | Full isolated stack per PR | Shared RDS + Redis, namespace isolation, 48h TTL | Preview env cost: $41k Nov. After changes: $9k Dec. No regression in testing coverage.
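The Kafka partition key row is easier to see with a small simulation. The sketch below is a rough illustration under stated assumptions: the event mix is synthetic (70% of events on one plan), the partition count of 12 is hypothetical, and MD5 stands in for Kafka's default murmur2 partitioner, which is enough to show the skew but will not reproduce real partition assignments.

```python
import hashlib
from collections import Counter
from typing import Iterable

NUM_PARTITIONS = 12  # hypothetical; not the real topic's partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (MD5 as a stand-in for murmur2)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def busiest_partition_share(keys: Iterable[str]) -> float:
    """Fraction of all messages that land on the single busiest partition."""
    counts = Counter(partition_for(k) for k in keys)
    return max(counts.values()) / sum(counts.values())


# Synthetic events: (plan_id, member_id, event_type). One plan dominates.
events = [
    (
        "PPO-1" if i % 10 < 7 else f"PLAN-{i % 5}",
        f"member-{i}",
        "enroll" if i % 2 else "update",
    )
    for i in range(10_000)
]

share_by_plan = busiest_partition_share(plan for plan, _, _ in events)
share_by_member = busiest_partition_share(f"{m}:{e}" for _, m, e in events)

if __name__ == "__main__":
    print(f"plan_id key:               busiest partition gets {share_by_plan:.0%}")
    print(f"member_id+event_type key:  busiest partition gets {share_by_member:.0%}")
```

The trade-off is ordering: a composite key spreads load evenly but only preserves order per member and event type, not per plan, so consumers that relied on plan-level ordering would need to be checked before making the same change.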

What we changed: the post-enrollment remediation plan

This is the sequence of work to harden a system for 15x. Not every step applies to every architecture, but the order matters - fix the slow path first, then add capacity, then add resilience mechanisms.

Step 1: Establish baseline load profile

Before anything else, you need to know what 1x actually means for your system - requests per second, database connections used, Kafka consumer lag at baseline, cache hit rate, memory utilization per pod. Without this, you are guessing at headroom.

Step 2: Identify the weakest link

Run a load test at 3x and watch which layer degrades first. This is your constraint, and it is almost always a single bottleneck rather than a distributed failure. Address this before adding capacity everywhere.
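Before running the load test, the baseline from Step 1 can give a rough first guess at which layer will hit its limit first. The sketch below assumes usage grows linearly with load, which is rarely exactly true, and the baseline figures are purely illustrative (only the limits echo numbers from the table above). It is a way to form a hypothesis for the test, not a substitute for it.

```python
from typing import Dict, Tuple


def first_cliff(
    baseline: Dict[str, float], limits: Dict[str, float]
) -> Tuple[str, float]:
    """Return the layer with the least headroom and the load multiple it tolerates.

    Assumes each metric scales linearly with traffic.
    """
    multiples = {layer: limits[layer] / usage for layer, usage in baseline.items()}
    weakest = min(multiples, key=multiples.get)
    return weakest, multiples[weakest]


# Illustrative 1x baseline values (hypothetical, not measured figures).
baseline_1x = {
    "db_connections": 40,
    "external_api_rps": 200,
    "lambda_concurrency": 100,
}
# Hard limits per layer.
hard_limits = {
    "db_connections": 500,
    "external_api_rps": 500,
    "lambda_concurrency": 1000,
}

if __name__ == "__main__":
    layer, multiple = first_cliff(baseline_1x, hard_limits)
    print(f"First expected cliff: {layer} at roughly {multiple:.1f}x baseline")
```

If the load test shows a different layer degrading first, that gap between the estimate and reality is itself useful: it usually points at an unwritten assumption in the baseline.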

Step 3: Fix connection management

RDS Proxy was the highest-priority infrastructure change. We provisioned it in front of the enrollment Aurora cluster and updated the enrollment API connection string. The proxy immediately reduced active DB connections from 450 (at peak) to 22. We also added Resilience4j circuit breakers to the enrollment API's calls to the eligibility service, with a 1.5-second timeout and a threshold of 5 consecutive failures.
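The connection wall is simple arithmetic, and it is worth writing down because it tells you the exact pod count at which direct connections fail. The sketch below uses the figures from the table above (pool_size=5, a 500-connection limit) and assumes every pod opens its full pool; the function names are illustrative.

```python
def direct_connections(pods: int, pool_size: int) -> int:
    """DB connections held when every pod opens its full pool directly."""
    return pods * pool_size


def max_pods_before_limit(max_connections: int, pool_size: int) -> int:
    """Largest pod count that still fits under the database connection limit."""
    return max_connections // pool_size


if __name__ == "__main__":
    limit, pool = 500, 5
    safe_pods = max_pods_before_limit(limit, pool)
    print(f"Pods that fit under the limit: {safe_pods}")
    print(f"Connections at {safe_pods + 1} pods: {direct_connections(safe_pods + 1, pool)}")
```

With direct connections the database-side count grows with every pod the autoscaler adds. A proxy breaks that coupling: pods connect to the proxy, and the proxy holds a much smaller, roughly constant set of connections to the database.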

Step 4: Tune autoscaling

Lower the HPA scale-up threshold. Set a meaningful minReplicas. Configure Karpenter or the Cluster Autoscaler with appropriate node templates for your workload. Verify that scale-up to 2x capacity completes in under 3 minutes.
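The effect of lowering the threshold follows from the replica formula in the Kubernetes HPA documentation: desired = ceil(current replicas x current metric / target metric). The sketch below reproduces that formula only; it ignores the HPA tolerance band and stabilization windows, and the replica counts and utilization figures in the example are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 1000,
) -> int:
    """Core HPA calculation, clamped to minReplicas/maxReplicas."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Same observed CPU (90%) on 10 pods, three different targets.
    for target in (80, 60, 50):
        print(f"target {target}% -> {desired_replicas(10, 90, target)} replicas")
```

A lower target buys headroom in both directions: the HPA asks for more replicas at the same observed load, and it starts asking earlier in the ramp, which matters when node provisioning takes a minute or more.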

Step 5: Add caching where the data allows it

Identify the top-10 most expensive database queries by frequency times latency. If any of those results are cacheable for even 5 seconds, caching them has a multiplicative effect on your database headroom.
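To make the "even 5 seconds" point concrete, here is a minimal in-process TTL cache. It is a sketch under assumptions, not what ran in production (that was Redis): the class and method names are hypothetical, it is not thread-safe, and it has no protection against many callers reloading the same expired key at once.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache values for a fixed number of seconds, then reload on next access."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the expensive call
        value = loader()  # expired or missing: hit the database once
        self._entries[key] = (now, value)
        return value


if __name__ == "__main__":
    db_calls = 0

    def expensive_query() -> str:
        global db_calls
        db_calls += 1
        return "plan catalog rows"

    cache = TTLCache(ttl_seconds=5.0)
    for _ in range(1000):
        cache.get_or_load("plan-catalog", expensive_query)
    print(f"1000 requests -> {db_calls} database call(s)")
```

The multiplicative effect comes from request rate: a query requested hundreds of times per second collapses to roughly one database call per TTL window per key, regardless of how far traffic climbs.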

Step 6: Implement rate limiting and circuit breakers

Add rate limiting at the API gateway level. Add circuit breakers on all calls to external or slower-tier dependencies. These are your last-resort protection mechanisms when traffic exceeds capacity - they make degradation controlled rather than chaotic.
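The circuit breaker behaviour described here can be sketched in a few lines. This is a language-neutral illustration of the pattern, not the Resilience4j API: the class names are hypothetical, the 30-second reset window is an assumption, and the timeout itself is expected to be enforced by the wrapped call (for example an HTTP client timeout that raises).

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_seconds: float = 30.0,  # assumed value, tune per dependency
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._consecutive_failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure streak.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

Failing fast is what makes degradation controlled: once the breaker opens, callers get an immediate error they can turn into a fallback response, instead of holding a thread and a connection while a slow dependency times out.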

Step 7: Validate with a fire drill

Two days before the December 15th final deadline, we ran a 4-hour load test at 12x baseline in staging. The circuit breakers opened correctly when we artificially slowed the eligibility service. The Kafka hot partition did not reappear. RDS Proxy held at 22 active connections across 120 simulated pods. On the evening of December 14th we manually pre-scaled all enrollment services to 3x. December 15th had zero P1 or P2 incidents. Peak load hit 14.3x baseline. Error rate stayed below 0.2%.

What enrollment taught us Every failure during open enrollment was predictable in retrospect. The hot Kafka partition, the connection wall, the cascade from the eligibility service, the external API rate limit - none of these were obscure edge cases. They were well-documented failure modes that we had not explicitly checked against our enrollment capacity assumptions. The platform was not badly built. It was built for a load profile that open enrollment does not match. The work for 2025 is to run this load test before November 1st.