Case study
Managed care platform scaling during open enrollment
A large managed care organization supporting millions of members across Medicaid and marketplace plans. One annual open enrollment window. Six weeks of traffic the platform was never sized for. This case study documents the failure patterns we encountered, the decisions we made, and what the infrastructure looked like on the other side.
Published Mar 16, 2026
The situation
Most teams build for 2x. The architecture works, the database holds, the pods come up. Then a product launch, a seasonal spike, or a viral event hits - and suddenly you are staring at 8x, 12x, 15x baseline. That is when the design assumptions that nobody wrote down come back to haunt you.
The challenge is not the 15x itself. A well-designed system should handle it. The challenge is that every component in your stack - compute, networking, data, API, cache, messaging - has a different capacity cliff. They do not all fall over at the same load. They fall over in sequence, which turns a predictable scaling problem into a rolling incident chain.
This article breaks down each layer - what breaks first, why, and what the remediation looks like in practice. The framing is AWS and GCP because those are where most engineering teams actually operate, and the service primitives are different enough to matter.
How load scaled during enrollment
Open enrollment does not spike all at once. It ramps. The first week is manageable - returning members, early filers. By week two the employer group deadlines hit. By week three the ACA deadline anxiety sets in and load becomes genuinely unpredictable. Here is how it actually played out in 2024.
| Enrollment period | Load vs baseline | Risk profile | What we saw |
|---|---|---|---|
| Baseline | 1x | Stable | Individual component bugs, not capacity |
| Growth | 2x - 4x | Manageable | Cache misses, connection pool exhaustion, under-provisioned DB |
| Strain | 5x - 8x | Elevated | Queue buildup, hot partitions, API throttling, disk I/O on DBs |
| Stress | 9x - 12x | Critical | Cascading failures across services, cold start storms, memory pressure |
| Survival | 13x - 15x | Incident | External dependency failures, DNS degradation, network saturation |
The ACA deadline week was the critical window. We knew it was coming. We had two weeks of enrollment data to model what was approaching. We still had a P1 incident. The issue was not that we did not scale - we did. The issue was the sequence of failures that emerged as each layer hit its limit one after another, faster than we could respond reactively.
What broke and why: failure patterns across the stack
Six distinct failure categories surfaced during enrollment. Some we caught in pre-enrollment load testing. Others we found live. Here is how they mapped to what we actually experienced.
Compute failures: the cold start storm on Nov 15
The eligibility service is a Spring Boot application. It loads a member eligibility cache from DynamoDB on startup - around 12 seconds of initialization before it can serve traffic. Under normal conditions this is invisible. During the November 15th traffic surge, HPA fired and spun up 40 new eligibility pods. All 40 were in the 12-second boot window simultaneously while the existing 18 pods were fully saturated. The result was 45 seconds of degraded eligibility checks across the member portal.
Container-based workloads (Kubernetes)
The Kubernetes scaling story has two components that need to work together: the Horizontal Pod Autoscaler (HPA) for pod count and the Cluster Autoscaler (or Karpenter on AWS) for node capacity. Most teams configure HPA correctly and then discover that their cluster autoscaler is too slow when the nodes are not there.
- Set HPA targets based on a combination of CPU and custom application metrics - pure CPU scaling lags too much under bursty traffic.
- Set minReplicas high enough to absorb the first 2x of traffic without triggering a scale event. Cold starts during the opening wave are the worst time to be provisioning capacity.
- Use pod disruption budgets (PDBs) to prevent autoscaling from evicting too many replicas during a scale-down after the spike.
- Set a long stabilizationWindowSeconds on scale-down. Scale up fast, scale down slow. Thrashing after a spike is its own incident.
- On AWS, Karpenter with spot instance consolidation handles node provisioning faster than the legacy Cluster Autoscaler. On GCP, GKE Autopilot handles this natively.
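To see why a CPU-only signal lags, it helps to look at the arithmetic the autoscaler applies. The following is a minimal sketch of the replica calculation documented for the Horizontal Pod Autoscaler (desired = ceil(current x currentMetric / targetMetric), with a default 10% tolerance). The function name, the replica bounds, and the sample numbers are illustrative assumptions, not values from the enrollment cluster.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, wanted))

# Hypothetical: 18 pods averaging 90% CPU against a 60% target.
print(desired_replicas(18, 90, 60))
```

The calculation only reacts once the metric has already moved, which is why a leading signal such as queue depth tends to trigger the scale-up earlier than CPU does.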
The cold start storm
A cold start storm is what happens when your autoscaler fires correctly but your new pods are useless for the first 30-90 seconds of their life - and those 30-90 seconds are exactly when you need them. It is common in JVM-based services (Spring Boot, Scala) and anything that pre-warms a local cache on startup.
The failure shape: traffic spikes, HPA fires, 40 new pods begin starting, the existing 18 pods are saturated, latency climbs, error rate climbs. Forty-five seconds later the new pods come online. The autoscaler worked perfectly. The system still had an incident.
# One probe doing everything wrong.
# No startup probe - traffic routes in at 5s.
# No liveness probe - stuck pods never restart.
# /health checks a dependency, not app state.
readinessProbe:
  httpGet:
    path: /health            # checks DB conn - too broad
    port: 8080
  initialDelaySeconds: 5     # Spring still loading
  periodSeconds: 5
# At 5s the app is mid-boot. It accepts the
# request, tries to hit a bean that isn't
# initialized yet, and throws 500s.
# No liveness probe means a deadlocked pod
# stays in rotation indefinitely.
# Three probes. Each does a different job.
# Startup: gate traffic until warmup is done.
# Readiness: remove from LB if temporarily unhealthy.
# Liveness: restart if stuck / deadlocked.
startupProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  failureThreshold: 30
  periodSeconds: 3           # up to 90s for warmup
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3        # restart after 30s unresponsive
# Liveness and readiness are separate endpoints in Spring.
# Never point liveness at a dependency check -
# one slow DB response should not restart your pod.
Set minReplicas to 2x your baseline pod count. The idle cost of carrying extra warm pods is almost always less than the cost of one cold-start-driven incident per quarter.
Serverless and function-based compute
Two lower-tier services - tax credit calculation and address validation - were running as Lambda functions behind API Gateway. These hit the default regional concurrency limit (1,000) on November 15th within 6 minutes of the peak onset. Tax credit calculation started returning 429s to the enrollment API, which propagated into failed plan submissions.
- Lambda's default regional concurrency limit is 1,000. Raise it proactively. At 15x you will hit it.
- Use provisioned concurrency for latency-sensitive paths. Eliminates cold starts at the cost of always-on spend.
- Lambda SnapStart (Java) eliminates JVM init cold start - matters if your functions are JVM-based.
- SQS + Lambda with reserved concurrency lets you absorb queue depth without hammering downstream dependencies.
- On Cloud Run, set a minimum instance count to keep warm instances alive at baseline. Going from zero to one under load adds 2-4 seconds of cold start latency at exactly the worst moment.
- Cloud Run's CPU is only allocated during request handling by default. Enable CPU always-on for background tasks that need it.
- Concurrency per instance defaults to 80. Tune this based on your workload's memory footprint vs throughput.
- Cloud Run paired with Pub/Sub push subscriptions handles async event-driven scale cleanly.
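Whether a concurrency limit will be hit can be estimated before the spike rather than discovered during it. The sketch below applies the usual relationship for function platforms: concurrent executions are roughly the request rate multiplied by the average duration. The request rate and duration used here are hypothetical; only the 1,000 default limit comes from the text above.

```python
import math

def required_concurrency(requests_per_second, avg_duration_seconds):
    """Concurrent executions needed: request rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)

def headroom(limit, requests_per_second, avg_duration_seconds):
    """Concurrency left before throttling; a negative value means 429s."""
    return limit - required_concurrency(requests_per_second, avg_duration_seconds)

# Hypothetical function: 400 req/s at baseline, 250 ms average duration.
DEFAULT_LIMIT = 1000
baseline = required_concurrency(400, 0.25)
at_15x = required_concurrency(400 * 15, 0.25)
print(baseline, at_15x, headroom(DEFAULT_LIMIT, 400 * 15, 0.25))
```

Note that duration usually grows under load as downstream calls slow down, so the real requirement at peak tends to be higher than this estimate.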
Streaming failures: the enrollment event queue
Enrollment events - plan selections, dependent additions, coverage confirmations - flow through a Kafka topic called enrollment-events. Every enrollment action a member takes produces one or more events. Downstream consumers handle eligibility verification, document generation, and CRM sync. During peak enrollment, the event volume is the first thing that backs up.
Kafka
At 1x, Kafka partition count is a footnote. At 15x, it determines your maximum consumer parallelism. Treat partition count as effectively fixed at topic creation - you can add partitions later, but you cannot remove them, existing keys can start mapping to different partitions (which breaks per-key ordering), and the consumer group has to rebalance. Size it generously from the start.
The most common failure mode is a consumer lag storm. Traffic spikes, producers publish faster than consumers can process, lag accumulates, and consumers fall behind by hours. If you have alerts on lag, this surfaces quickly. If you do not, you find out when downstream systems start serving stale data or timing out.
- Partition count should be at least 2x your expected peak consumer parallelism. This gives you room to scale consumer group size without repartitioning.
- Watch for hot partitions when your partition key has low cardinality - all messages for a given user or tenant routing to the same partition is a classic trap.
- Tune max.poll.records, max.poll.interval.ms, and session.timeout.ms together. Under high load, a consumer that takes too long between polls or misses heartbeats will be dropped from the group, trigger a full group rebalance, and make the lag problem significantly worse before it gets better.
- Use a dead-letter topic for poison messages. A single malformed message in a Kafka partition will stall that partition's consumers indefinitely if you do not handle it.
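The sizing rule and the lag storm described above reduce to simple arithmetic. This is a rough sketch with hypothetical rates; the function names are illustrative, and it assumes steady produce and consume rates, which real traffic will not have.

```python
import math

def partitions_needed(peak_consumers, headroom_factor=2):
    """Partition count leaving room to grow the consumer group."""
    return peak_consumers * headroom_factor

def lag_after(produce_rate, consume_rate, seconds, starting_lag=0):
    """Messages of lag after `seconds` at steady rates, never below zero."""
    return max(0, starting_lag + (produce_rate - consume_rate) * seconds)

def drain_seconds(lag, produce_rate, consume_rate):
    """Seconds to clear `lag`, or None if consumers are not keeping up."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return None
    return math.ceil(lag / surplus)

# Hypothetical spike: 3,000 msg/s produced, 2,000 msg/s consumed for 10 minutes.
lag = lag_after(3000, 2000, 600)
# After the spike, producers fall back to 1,500 msg/s.
print(lag, drain_seconds(lag, 1500, 2000), partitions_needed(12))
```

The drain time is the useful number for alerting: it shows how long downstream systems keep seeing stale data even after the spike itself has ended.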
Edge case: the hot partition trap
Our enrollment-events topic had 12 partitions, keyed on plan_id. The assumption was that enrollment events would distribute roughly evenly across plans. They do not. During the employer group deadline week, three large employer groups - all enrolled in the same PPO plan - submitted simultaneously. Every event from those groups routed to a single partition. That partition's consumer hit 100% CPU. Lag climbed to 390,000 messages. Downstream eligibility verification stalled. Members who completed enrollment did not receive their confirmation documents for over 90 minutes.
The fix was a composite partition key: member_id + event_type. Applied after the November 8th incident. No hot partitions during the November 15th peak despite higher total volume.
# All events for one tenant → one partition
def get_partition_key(event):
    return event["tenant_id"]
# acme-care hits 800 events/sec at month-end.
# All route to P-7. 11 consumers idle.
# Session timeout → rebalance → 3 min outage.

import hashlib
def get_partition_key(event, num_partitions=12):
    # Composite key (tenant + event type), hashed to a partition index.
    raw = f"{event['tenant_id']}:{event['type']}"
    h = int(hashlib.md5(raw.encode()).hexdigest(), 16)
    return h % num_partitions
# acme-care:PLAN_SELECTION, acme-care:DEPENDENT_ADD,
# acme-care:DOC_UPLOAD → different partitions.
# 800 enrollment events/sec spread across 4+ partitions.
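The difference between the plan_id key and the member_id + event_type key can be checked offline before changing a producer. The sketch below measures the share of events landing on the busiest of 12 partitions for each keying scheme. The member IDs, plan name, and event mix are made up, and md5 stands in for the hash step only for illustration - Kafka's default partitioner uses a different hash, so absolute partition numbers will not match a real cluster.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 12  # matches the topic described above

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stable hash of a key onto a partition index (illustrative hash only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def hottest_share(keys):
    """Fraction of events that land on the busiest partition."""
    counts = Counter(partition_for(k) for k in keys)
    return max(counts.values()) / len(keys)

# Hypothetical burst: 1,000 members on one plan, three event types each.
events = [(f"member-{m}", "ppo-gold", etype)
          for m in range(1000)
          for etype in ("PLAN_SELECTION", "DEPENDENT_ADD", "DOC_UPLOAD")]

by_plan = hottest_share([plan for _, plan, _ in events])
by_member_event = hottest_share([f"{member}:{etype}" for member, _, etype in events])
print(f"plan_id key: {by_plan:.0%} of events on the hottest partition")
print(f"member_id:event_type key: {by_member_event:.0%} on the hottest partition")
```

One trade-off worth stating: with a composite key, events for the same member but different event types may land on different partitions, so ordering is only preserved per member and event type.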
AWS Kinesis
Kinesis scales through shards, and each shard gives you 1 MB/s write and 2 MB/s read. At 15x, you need to have pre-calculated how many shards that translates to for your payload sizes, and you need to verify your application does not have write-hotspotting on a subset of shards.
Kinesis enhanced fan-out removes contention on the shared 2 MB/s per-shard read limit by giving each registered consumer its own dedicated 2 MB/s of read throughput per shard. Use it for latency-sensitive downstream consumers.
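The shard pre-calculation mentioned above can be sketched as follows. It uses the per-shard limits quoted in the text (1 MB/s write, 2 MB/s shared read) plus the 1,000 records/s per-shard write limit. The record rate, record size, and consumer count are hypothetical, and the sketch assumes an even key distribution and treats 1 MB as 1,024 KB.

```python
import math

WRITE_MB_PER_SHARD = 1.0         # MB/s ingest per shard
WRITE_RECORDS_PER_SHARD = 1000   # records/s ingest per shard
READ_MB_PER_SHARD = 2.0          # MB/s read per shard, shared by standard consumers

def shards_needed(records_per_second, avg_record_kb, shared_consumers=1):
    """Shards required to satisfy the write and shared-read limits."""
    write_mb = records_per_second * avg_record_kb / 1024
    by_bytes = math.ceil(write_mb / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_per_second / WRITE_RECORDS_PER_SHARD)
    # Every standard consumer re-reads the full stream from the shared budget.
    by_reads = math.ceil(write_mb * shared_consumers / READ_MB_PER_SHARD)
    return max(1, by_bytes, by_records, by_reads)

# Hypothetical stream: 800 records/s of 2 KB at baseline, three standard consumers.
print(shards_needed(800, 2, shared_consumers=3))
print(shards_needed(800 * 15, 2, shared_consumers=3))
```

When the read term dominates, as it does here at 15x, moving the critical consumers to enhanced fan-out is usually cheaper than adding shards purely for read capacity.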
GCP Pub/Sub
Pub/Sub handles scale more transparently than Kafka or Kinesis - Google manages the partition layer. The scaling concern shifts to your subscription configuration and consumer fleet. Undelivered message backlog caps at 10GB per subscription by default. Under extreme spike conditions, messages start being dropped if backlog grows beyond retention settings.
Database failures: the connection wall and the eligibility bottleneck
The database is almost always the hardest layer to scale horizontally. Compute is stateless and elastic. Databases are stateful, and statefulness fights elasticity at every step. At 15x, your database scaling strategy either works or it does not, and there is very little you can do reactively once load is on the system.
Relational databases
Read replicas are the first line of defense. Route read traffic to replicas, protect the primary for writes. The failure mode here is replication lag - under write-heavy load, replicas fall behind the primary, and reads from the replica serve stale data. Monitor replica lag as a primary SLO component, not an afterthought.
Connection pooling is essential at scale. At 15x traffic with a naive connection-per-request pattern, you saturate the database's max connection limit before CPU becomes a concern. PgBouncer for Postgres, ProxySQL for MySQL, and RDS Proxy on AWS sit in front of your database and multiplex application connections onto a smaller pool of database connections. On GCP, the Cloud SQL Auth Proxy secures and authorizes connections but does not pool them itself, so pair it with PgBouncer or an application-side pool.
- Aurora auto-scaling read replicas can add capacity in minutes. Wire this up before a spike event, not during.
- Aurora Serverless v2 scales Aurora capacity units (ACUs) up and down in fine-grained increments. Useful for unpredictable workloads.
- RDS Proxy pools connections and reduces the failover impact during multi-AZ events. At high scale, its connection brokering is worth the overhead.
- Use Performance Insights when diagnosing DB slowness. Wait events tell you more than CPU or IOPS in isolation.
- Cloud SQL read replicas add horizontal read capacity. Cross-region replicas give you disaster recovery plus geographic read distribution.
- AlloyDB for Postgres is built for analytical and hybrid workloads at scale. Its columnar engine handles mixed OLTP/OLAP patterns that Cloud SQL struggles with.
- AlloyDB scales to 128 vCPUs and 864 GB RAM. For write-heavy OLTP at extreme scale, this is where Cloud SQL runs out of runway.
- Cloud SQL Auth Proxy handles IAM-based authentication and connection management without managing connection pool credentials separately.
NoSQL and wide-column stores
DynamoDB (AWS) and Bigtable (GCP) handle horizontal scale differently from relational systems, but they introduce their own failure modes at 15x. DynamoDB's most common scaling problem is provisioned capacity mode under a spike - on-demand mode avoids this but costs significantly more at sustained high load. The trade-off is worth modeling before a traffic event.
Bigtable's performance is sensitive to row key design. A monotonically increasing row key - like a timestamp prefix - creates a write hotspot on the most recent tablet. Distribute writes across the key space with a hash prefix or a reversed timestamp.
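Both row key techniques can be shown without a Bigtable client, since they are just string construction. This is a sketch under assumptions: the "#" separator, the bucket count of 16, the 13-digit millisecond timestamp width, and the function names are illustrative choices, not a recommended schema.

```python
import hashlib

MAX_TS_MILLIS = 10**13  # assumed upper bound for millisecond timestamps

def salted_key(entity_id, ts_millis, buckets=16):
    """Prefix with a stable hash bucket so writes spread across the key space."""
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket:02d}#{entity_id}#{ts_millis:013d}"

def reversed_ts_key(entity_id, ts_millis):
    """Lead with the entity and reverse the timestamp so newest rows sort first."""
    return f"{entity_id}#{MAX_TS_MILLIS - ts_millis:013d}"

print(salted_key("member-42", 1731686400000))
print(reversed_ts_key("member-42", 1731686400000))
```

The salted form spreads writes but makes range scans across entities more expensive, because a reader has to fan out over every bucket prefix.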
Caching
At 15x, your cache hit rate determines whether your database survives. A cache hit rate drop from 95% to 85% under spike conditions triples the miss rate (from 5% to 15%), which means 3x the database reads for the same traffic. Cache stampede - where many requests simultaneously miss on the same key - is a known failure mode during traffic spikes. Use probabilistic early expiry or locking patterns to prevent simultaneous cache rebuilds.
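The locking pattern can be sketched in a few lines. This is a minimal in-process example, assuming an asyncio service. The class name and the demo lookup are hypothetical, and a multi-pod deployment would need a distributed lock or request coalescing in the cache tier rather than a local asyncio.Lock.

```python
import asyncio
import time

class SingleFlightCache:
    """TTL cache where concurrent misses on one key share a single rebuild."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._values = {}  # key -> (expires_at, value)
        self._locks = {}   # key -> asyncio.Lock guarding the rebuild

    def _fresh(self, key):
        hit = self._values.get(key)
        return hit if hit and hit[0] > time.monotonic() else None

    async def get_or_fetch(self, key, fetch):
        hit = self._fresh(key)
        if hit:
            return hit[1]
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            # Re-check: another waiter may have rebuilt the entry already.
            hit = self._fresh(key)
            if hit:
                return hit[1]
            value = await fetch()
            self._values[key] = (time.monotonic() + self.ttl, value)
            return value

async def demo():
    calls = 0

    async def slow_lookup():  # stands in for an expensive database read
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)
        return {"plan": "ppo"}

    cache = SingleFlightCache(ttl_seconds=5)
    results = await asyncio.gather(
        *(cache.get_or_fetch("plan:1", slow_lookup) for _ in range(100))
    )
    return calls, len(results)

calls, served = asyncio.run(demo())
print(f"{served} requests served by {calls} backend call(s)")
```

The second check inside the lock is the part that matters: without it, every waiter would rebuild the entry in turn once the lock is released.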
Edge case: the connection wall
The enrollment API service was configured with a SQLAlchemy connection pool of 5 per pod, max overflow 10, so each pod could open up to 15 connections. At baseline with 20 pods this consumed 100-300 connections against a db.r6g.2xlarge instance with a max_connections of 500. When enrollment traffic drove HPA to scale the enrollment API to 95 pods, the fleet could attempt up to 1,425 connections (95 x 15) against that 500-connection limit. The first FATAL errors appeared at pod 51.
The failure mode was a cliff, not a slope. Connection #501 failed with FATAL. Within 90 seconds the enrollment API logs filled with "remaining connection slots are reserved" errors. The ALB health checks still passed because the pods themselves were running - just unable to reach the database. Our error rate alert was set at a 5% threshold; by the time it fired we were at 34% 5xx on enrollment submissions.
# Each pod opens pool directly to DB.
# 100 pods x 5 = 500 = hard limit.
from sqlalchemy import create_engine
engine = create_engine(
    "postgresql://db:5432/app",
    pool_size=5,
    max_overflow=10,  # 15 per pod at burst!
)
# 100 pods x 15 = 1,500 attempted.
# DB rejects at 500. FATAL at scale.

# App connects to proxy, not DB.
# Proxy multiplexes into small real pool.
from sqlalchemy import create_engine
engine = create_engine(
    "postgresql://rds-proxy:5432/app",
    pool_size=2,
    max_overflow=3,
    pool_pre_ping=True,  # validate a connection before handing it out
)
# 500 pods x 5 = 2,500 app conns
# Proxy holds 20 real DB connections.
# DB sees 20 conns regardless of pod count.
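The connection wall is predictable from the pool settings alone, so it can be checked whenever the HPA maximum or the pool configuration changes. A minimal sketch using the numbers from this incident (pool_size 5, max_overflow 10, max_connections 500, 95 pods). The function names and the reserved parameter are illustrative; reserved is an assumption for connections held back for superusers or other clients.

```python
def connection_demand(pods, pool_size, max_overflow):
    """Return (steady, burst) connections a fleet of pods can open."""
    return pods * pool_size, pods * (pool_size + max_overflow)

def max_safe_pods(max_connections, pool_size, max_overflow, reserved=0):
    """Largest pod count whose burst demand still fits under the limit."""
    return (max_connections - reserved) // (pool_size + max_overflow)

steady, burst = connection_demand(pods=95, pool_size=5, max_overflow=10)
safe = max_safe_pods(max_connections=500, pool_size=5, max_overflow=10)
print(f"95 pods: {steady} steady, {burst} burst; safe up to {safe} pods")
```

A check like this fits naturally in CI next to the HPA manifest, failing the build when maxReplicas exceeds the safe pod count for a direct-to-database pool.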
API failures: the eligibility service cascade and the tax credit 429 flood
The enrollment API had two significant failure modes at the API layer. The first was a fan-out ratio problem on the plan detail endpoint - each request triggered 8 downstream service calls, one of which hit an external tax credit calculation API with a 500 req/s rate limit. The second was a cascading failure triggered by eligibility service slowdown that propagated all the way to the member-facing portal within 4 minutes.
Rate limiting and throttling
Rate limiting at the API layer protects your backend, and it also helps your callers. A 429 with a proper Retry-After header is far better than a 503 or a timeout. Clients that can handle 429 will back off and retry. Clients that get timeouts often do not, which generates more load at the worst moment.
Implement rate limiting at the gateway layer (API Gateway on AWS, Cloud Endpoints or Apigee on GCP) rather than inside your application code. Gateway-level rate limiting happens before your compute receives the request, which means it actually protects your backend instead of letting the traffic in first.
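Managed gateways commonly implement throttling as a token bucket, and seeing the logic helps when choosing rate and burst settings. The sketch below is an illustration of that algorithm, not the code any particular gateway runs. The class name is hypothetical, and time is passed in explicitly so the behavior is easy to test.

```python
import math

class TokenBucket:
    """Token bucket: refills `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def check(self, now):
        """Return (allowed, retry_after_seconds) for one request at time `now`."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        # Whole seconds until one token is available, for a Retry-After header.
        return False, math.ceil((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=2, capacity=2)
for t in (0.0, 0.0, 0.0, 1.0):
    allowed, retry_after = bucket.check(t)
    print(t, "200" if allowed else f"429 Retry-After: {retry_after}")
```

The retry_after value is what turns a rejection into something a well-behaved client can act on, which is the difference between a 429 and a timeout described above.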
Circuit breakers
A circuit breaker sits between your API and its dependencies. When a downstream service starts failing or slowing, the circuit opens and requests fail fast without waiting for the timeout. This prevents thread exhaustion and connection pool depletion from slow dependencies.
The half-open state is where most implementations get tricky. After the circuit opens, you need to periodically probe whether the downstream service has recovered. Sending a single probe request and closing the circuit on success is common. A better pattern is to ramp traffic back up gradually rather than flipping from zero to full traffic immediately.
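The three states are easier to reason about as a small state machine. This is a minimal sketch of the common single-probe variant, with hypothetical names and explicit timestamps. It does not implement the gradual ramp, and in production a library such as Resilience4j or a sidecar is the better choice.

```python
class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        """Should a request be attempted at time `now`?"""
        if self.state == self.OPEN and now - self.opened_at >= self.recovery_timeout:
            self.state = self.HALF_OPEN  # let exactly one probe through
            return True
        # While half-open, further requests wait for the probe's result.
        return self.state == self.CLOSED

    def record_success(self):
        """Any success, including the half-open probe, closes the circuit."""
        self.state = self.CLOSED
        self.failures = 0

    def record_failure(self, now):
        """Open after enough consecutive failures, or on a failed probe."""
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = now

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30.0)
for t in range(5):
    if breaker.allow(t):
        breaker.record_failure(t)  # simulated downstream timeout
print(breaker.state, breaker.allow(10), breaker.allow(34))
```

A ramped variant would replace the single probe with a growing percentage of admitted requests while the breaker is half-open.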
Load shedding
Load shedding is the intentional rejection of lower-priority requests when the system is at capacity. This is different from rate limiting - rate limiting is per-client, load shedding is system-wide. The implementation requires you to have a priority model for your traffic. Health checks and critical paths should never be shed. Background jobs and low-priority reads are the first to go.
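A priority model does not need to be elaborate to be useful. The sketch below shows one possible shape: three priority classes and a utilization cutoff for each. The class names, the cutoff values, and the idea of a single utilization number between 0 and 1 are assumptions for illustration; a real system would derive utilization from queue depth, in-flight requests, or latency.

```python
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0    # health checks and critical paths: never shed
    NORMAL = 1      # interactive reads
    BACKGROUND = 2  # background jobs and low-priority reads: first to go

def should_shed(priority, utilization, normal_cutoff=0.95, background_cutoff=0.80):
    """Decide whether to reject a request given system-wide utilization (0-1)."""
    if priority == Priority.CRITICAL:
        return False
    if priority == Priority.BACKGROUND:
        return utilization >= background_cutoff
    return utilization >= normal_cutoff

for load in (0.70, 0.85, 0.97):
    shed = [p.name for p in Priority if should_shed(p, load)]
    print(f"utilization {load:.0%}: shedding {shed or 'nothing'}")
```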
- Use API Gateway's usage plans on AWS or Apigee rate limit policies on GCP for per-client throttling.
- Implement circuit breakers via a sidecar (Envoy, Linkerd) or a library (Resilience4j, Polly) rather than hand-rolling them.
- Set aggressive timeouts on all downstream calls. A 30-second timeout does not protect you - it just delays the failure cascade. Most intra-service calls should time out in under 2 seconds.
- Instrument your circuit breaker state as a metric. Knowing when circuits are open and how long they stay open is essential observability data during an incident.
- Test your load shedding logic before you need it. Fire drill a traffic spike against a staging environment and verify that priority traffic is preserved.
Edge case: fan-out amplification at 15x
The GET /member/plan-options endpoint - the core plan comparison screen - called eight downstream services in sequence: member profile, plan inventory, network directory, formulary, cost estimator, subsidy calculator, plan rules, and the external tax credit API (two calls). At 1x this generated around 200 tax credit API calls per second. At 15x it generated 3,000. The external API's rate limit was 500 req/s. We hit it 4 minutes into the November 15th peak.
async def get_plan_options(member_id):
    # Sequential. At 15x: tax_api gets 3,000/s.
    profile = await member_profile.get(member_id)
    inventory = await plan_inventory.list(member_id)
    directory = await network_directory.lookup(member_id)
    formulary = await formulary_svc.get(member_id)
    cost = await cost_estimator.calculate(member_id)
    subsidy = await subsidy_calc.run(member_id)
    rules = await plan_rules.evaluate(member_id)
    tax = await tax_api.calculate(member_id)
    return assemble_plan_options(profile, inventory, directory, formulary, cost, subsidy, rules, tax)

import asyncio
async def get_plan_options(member_id):
    # Parallel for independent calls.
    # Cache external API - 60s TTL cuts calls 90%+
    profile, inventory, directory, formulary, cost, subsidy, rules = \
        await asyncio.gather(
            member_profile.get(member_id),
            plan_inventory.list(member_id),
            network_directory.lookup(member_id),
            formulary_svc.get(member_id),
            cost_estimator.calculate(member_id),
            subsidy_calc.run(member_id),
            plan_rules.evaluate(member_id),
        )
    # Keyed by region: only valid if the result is the same for every member in a region.
    tax = await cache.get_or_fetch(
        key=f"tax:{profile.region}",
        fetch=lambda: tax_api.calculate(member_id),
        ttl=60,
    )
    return assemble_plan_options(profile, inventory, directory, formulary, cost, subsidy, rules, tax)
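The amplification is worth computing for every endpoint that touches a rate-limited dependency. A small sketch of that arithmetic follows, using the figures from this incident: two tax credit calls per request, a 500 req/s limit, and a 15x multiplier. The 100 req/s baseline is inferred from the 200 calls/s figure, and the 91% hit rate is the reduction reported after the cache went in. The function name is illustrative.

```python
def upstream_calls_per_second(request_rps, calls_per_request, cache_hit_rate=0.0):
    """External calls per second generated by an endpoint, after caching."""
    return request_rps * calls_per_request * (1.0 - cache_hit_rate)

RATE_LIMIT = 500            # external tax credit API limit, req/s
BASELINE_REQUEST_RPS = 100  # inferred: 200 tax calls/s at 1x, two calls per request

uncached = upstream_calls_per_second(BASELINE_REQUEST_RPS * 15, 2)
cached = upstream_calls_per_second(BASELINE_REQUEST_RPS * 15, 2, cache_hit_rate=0.91)
print(f"uncached: {uncached:.0f}/s, cached: {cached:.0f}/s, limit: {RATE_LIMIT}/s")
```

At these numbers, any hit rate below roughly 83% would still exceed the limit at 15x, which is a useful floor to alert on.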
Edge case: cascade anatomy
On November 15th at 2:14 PM EST, the eligibility service's Aurora read replica started returning queries at 800ms instead of the normal 20ms. The enrollment API had a 30-second timeout on eligibility calls. It did not fail fast. It queued threads. Thread pool filled. Within 4 minutes the cascade had propagated through the API gateway to the member portal. Members saw blank plan comparison screens and timeout errors on plan submission.
The eligibility service had no circuit breaker. The enrollment API had no circuit breaker protecting calls to eligibility. The 30-second default timeout was the only protection - and it was not protection at all. It was just a delayed failure. The fix post-incident: 1.5-second timeout, circuit breaker at 5 consecutive failures, graceful degradation to cached eligibility data when the circuit is open.
# 30s timeout. Threads fill up.
# 200 threads x 30s = pool gone in 40s.
import httpx
async with httpx.AsyncClient(
    timeout=30.0  # too long - doesn't protect you
) as client:
    # POST, assuming the endpoint accepts it: httpx's get() takes no json body.
    resp = await client.post(
        "http://eligibility-svc/check", json=member
    )

import httpx
from circuitbreaker import circuit
@circuit(failure_threshold=5, recovery_timeout=30)
async def check_eligibility(member):
    async with httpx.AsyncClient(
        timeout=1.5  # tight. fail fast.
    ) as client:
        # POST, assuming the endpoint accepts it: httpx's get() takes no json body.
        return await client.post(
            "http://eligibility-svc/check", json=member
        )
# After 5 timeouts: circuit opens.
# Requests fail immediately. No thread hold.
# Downstream gets breathing room.
Ephemeral environments during enrollment
The development team was actively shipping enrollment bug fixes throughout the six-week period - sometimes multiple deploys per day. They had 12-15 open PRs at any given time during the first two weeks of enrollment, each with its own preview environment. This created a background infrastructure load that had not been accounted for in the enrollment capacity plan.
The core tension is that ephemeral environments need to be cheap enough to run at volume but realistic enough to be useful for testing. Getting this balance wrong in either direction causes problems: too cheap and they mask real scaling issues, too expensive and teams stop using them.
Edge case: cost explosion at team scale
The team had been running isolated preview environments for months - each with its own RDS instance and Redis cluster. At 15 concurrent PRs during peak enrollment, those environments were consuming capacity on the same AWS account. The enrollment period AWS bill for preview environments alone was $41,000 in November against a typical $8,000/month. Nobody had connected the bill increase to the enrollment timeline until week three. Worse, three of the preview environments had idle RDS instances that were sharing the same VPC subnet ranges as the production Aurora cluster - creating unnecessary network congestion during peak.
# terraform/preview/main.tf
# Each PR: 1 RDS + 1 Redis + 8 pods.
# No TTL. Lives until manually closed.
resource "aws_db_instance" "preview" {
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  # ~$50/month idle per environment
  # 100 open PRs = $5,000/month just RDS
}

# k8s namespace per PR. Shared RDS + Redis.
# TTL controller auto-destroys at 48h idle.
- name: Create preview namespace
  run: |
    kubectl create namespace pr-${{ github.event.number }}
    kubectl create secret generic db-creds \
      --namespace pr-${{ github.event.number }} \
      --from-literal=url="$SHARED_PREVIEW_DB_URL"
    kubectl label namespace pr-${{ github.event.number }} \
      preview-ttl-hours="48"
# Per env: ~$0.40/day vs $16/day.
# 100 envs: $40/day vs $1,600/day.
What we changed for ephemeral environments
We introduced three changes mid-enrollment. First, a 48-hour TTL enforced by a namespace controller - any preview environment older than 48 hours with no traffic got automatically torn down. Second, all preview environments moved to shared RDS and Redis - isolated only at the application layer via namespaced table prefixes and Redis key prefixes. Third, preview environments were rate-limited at the ingress layer to 10 req/s per environment to prevent them from generating meaningful load on shared services.
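The decision the TTL controller makes can be sketched as a pure function, separate from any Kubernetes client. This is an illustration under assumptions: namespaces are represented as plain dictionaries, last_request_at is a hypothetical field that a real controller would derive from ingress metrics, and "older than 48 hours with no traffic" is interpreted here as 48 hours since the last request.

```python
from datetime import datetime, timedelta, timezone

TTL_LABEL = "preview-ttl-hours"

def expired_previews(namespaces, now):
    """Return names of preview namespaces idle for longer than their TTL label."""
    doomed = []
    for ns in namespaces:
        labels = ns.get("labels", {})
        if TTL_LABEL not in labels:
            continue  # not a preview namespace: never touch it
        ttl = timedelta(hours=int(labels[TTL_LABEL]))
        if now - ns["last_request_at"] > ttl:
            doomed.append(ns["name"])
    return doomed

now = datetime(2024, 11, 15, 12, 0, tzinfo=timezone.utc)
namespaces = [
    {"name": "pr-101", "labels": {TTL_LABEL: "48"}, "last_request_at": now - timedelta(hours=60)},
    {"name": "pr-102", "labels": {TTL_LABEL: "48"}, "last_request_at": now - timedelta(hours=3)},
    {"name": "enrollment-prod", "labels": {}, "last_request_at": now - timedelta(days=30)},
]
print(expired_previews(namespaces, now))
```

Requiring the label to be present, rather than matching on a name prefix, is a deliberate guard: a namespace without the label is never a deletion candidate.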
Shared vs isolated: what stayed isolated
We kept the enrollment-events Kafka topic isolated per preview environment - a separate topic per PR - because preview environments were used to test enrollment flow end-to-end, and a shared topic would have created cross-environment event contamination. Everything else moved to shared infrastructure. The cost dropped from $41,000 to $9,000 in December with no loss of testing fidelity.
- EKS namespaces with namespace-scoped RBAC isolate ephemeral environments within a shared cluster cheaply.
- Use AWS CodeBuild with environment-specific parameter overrides in SSM Parameter Store to inject config per environment.
- Lambda + API Gateway can stand up a fully isolated API surface per PR with near-zero idle cost.
- RDS snapshot restore or Aurora cloning gives you a database copy in under 5 minutes for environments that need real data.
- GCP Projects as the isolation boundary give the strongest resource separation but the most provisioning overhead. Good for security-sensitive workloads.
- Cloud Run revisions per PR with traffic splitting let you test without a separate environment at all for simple service changes.
- Cloud SQL cloning creates a full copy of a database in minutes, suitable for environments that need isolated data.
- Firebase Hosting preview channels give you per-PR URLs for frontend assets without any infrastructure management.
Platform decisions: what we used and what we wish we had used
This was an AWS engagement - the client had been on AWS for seven years and was not moving. But several of the failures we encountered during enrollment had cleaner solutions in the AWS service catalog that we were not yet using. Here is the breakdown of what we had, what we added, and where we had gaps.
| Layer | What we had | What we changed to | Why |
|---|---|---|---|
| Container orchestration | EKS + Cluster Autoscaler | EKS + Karpenter | Cluster Autoscaler added nodes in 90s. Karpenter targets under 60s. Added post-Nov 15. |
| HPA signal | CPU utilization only | CPU + MSK queue depth metric | CPU lagged actual load. Queue depth gave 90s earlier warning to scale. |
| Serverless | Lambda (default concurrency 1,000) | Lambda raised to 5,000 + SQS buffer | Hit default limit on tax credit Lambda. Raised + added SQS to absorb bursts. |
| Kafka partition key | MSK, plan_id key | MSK, member_id + event_type key | plan_id caused hot partition on PPO plan. Composite key eliminated it post-Nov 8. |
| DB connection management | Direct pod-to-Aurora, pool_size=5 | RDS Proxy, pool_size=2 | Hit 500-conn limit while scaling to 95 pods (first FATAL at pod 51). RDS Proxy multiplexes 2,500 app to 20 DB conns. |
| Aurora read scaling | 1 read replica (manual) | Aurora auto-scaling replicas (2-5) | Replication lag hit 8s under write pressure. Auto-scaling replicas added before Dec 15. |
| External API caching | Direct calls to tax credit API, no cache | Redis cache, 60s TTL, in front of tax API | Rate-limited at 500 req/s; we sent 3,000/s. Cache cut calls by 91%. |
| Circuit breakers | None on inter-service calls | Resilience4j, 1.5s timeout | 30s timeout caused cascade on Nov 15. 1.5s + circuit breaker stops bleed at 5 failures. |
| Observability | CloudWatch + Dynatrace | CloudWatch + Dynatrace + MSK lag alerts | No lag alerts before Nov 8. Added lag-based alerts as primary early warning after incident. |
| Ephemeral environments | Full isolated stack per PR | Shared RDS + Redis, namespace isolation, 48h TTL | Preview env cost: $41k Nov. After changes: $9k Dec. No regression in testing coverage. |
What we changed: the post-enrollment remediation plan
This is the sequence of work to harden a system for 15x. Not every step applies to every architecture, but the order matters - fix the slow path first, then add capacity, then add resilience mechanisms.
Step 1: Establish baseline load profile
Before anything else, you need to know what 1x actually means for your system - requests per second, database connections used, Kafka consumer lag at baseline, cache hit rate, memory utilization per pod. Without this, you are guessing at headroom.
Step 2: Identify the weakest link
Run a load test at 3x and watch which layer degrades first. This is your constraint, and it is almost always a single bottleneck rather than a distributed failure. Address this before adding capacity everywhere.
Step 3: Fix connection management
RDS Proxy was the highest-priority infrastructure change. We provisioned it in front of the enrollment Aurora cluster and updated the enrollment API connection string. The proxy immediately reduced active DB connections from 450 (at peak) to 22. We also added Resilience4j circuit breakers to the enrollment API's calls to the eligibility service, with a 1.5-second timeout and a threshold of 5 consecutive failures.
Step 4: Tune autoscaling
Lower the HPA scale-up threshold. Set a meaningful minReplicas. Configure Karpenter or the Cluster Autoscaler with appropriate node templates for your workload. Verify that scale-up to 2x capacity completes in under 3 minutes.
Step 5: Add caching where the data allows it
Identify the top-10 most expensive database queries by frequency times latency. If any of those results are cacheable for even 5 seconds, caching them has a multiplicative effect on your database headroom.
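Ranking by frequency times latency is a one-line sort once the statistics are exported. A minimal sketch follows, assuming per-query call counts and mean latencies are available from a source such as pg_stat_statements or Performance Insights. The field names and sample rows are hypothetical.

```python
def top_cache_candidates(query_stats, limit=10):
    """Rank queries by total time spent: calls x mean latency (ms)."""
    ranked = sorted(query_stats, key=lambda q: q["calls"] * q["mean_ms"], reverse=True)
    return [(q["query"], q["calls"] * q["mean_ms"]) for q in ranked[:limit]]

# Hypothetical export of per-query statistics.
stats = [
    {"query": "plan_inventory_by_region", "calls": 900_000, "mean_ms": 12},
    {"query": "member_profile_by_id", "calls": 2_000_000, "mean_ms": 2},
    {"query": "formulary_full_scan", "calls": 4_000, "mean_ms": 850},
]
for name, total_ms in top_cache_candidates(stats, limit=3):
    print(f"{name}: {total_ms / 1000:.0f}s total")
```

The slowest individual query is rarely the best caching target; the sample shows a fast but frequent query outranking a slow but rare one.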
Step 6: Implement rate limiting and circuit breakers
Add rate limiting at the API gateway level. Add circuit breakers on all calls to external or slower-tier dependencies. These are your last-resort protection mechanisms when traffic exceeds capacity - they make degradation controlled rather than chaotic.
Step 7: Validate with a fire drill
Two days before the December 15th final deadline, we ran a 4-hour load test at 12x baseline in staging. The circuit breakers opened correctly when we artificially slowed the eligibility service. The Kafka hot partition did not reappear. RDS Proxy held at 22 active connections across 120 simulated pods. On the evening of December 14th we manually pre-scaled all enrollment services to 3x. December 15th had zero P1 or P2 incidents. Peak load hit 14.3x baseline. Error rate stayed below 0.2%.