SLOs vs SLIs vs SLAs: A Comprehensive Guide
Stop confusing these three things. This confusion is burning out your engineers and shipping unreliable software.
The problem
Most teams use SLI, SLO, and SLA interchangeably. They pick availability targets out of thin air, chase impossible reliability numbers, and then wonder why their engineers are burning out. Nothing actually gets better.
The confusion is understandable. All three terms use the word "service level." All three are about reliability. But they operate at completely different layers, they're owned by different people, and getting them wrong in different directions creates completely different problems.
Understanding the difference is what separates a team that ships at speed from a team that is always on fire.
Let's use ride sharing as an example
You open the app. You request a ride. Three things matter to different people in that company:
The SLI
What you actually measure.
The SLI is "time from ride request to driver match." You pick the metric. You track it. That's it. It's a raw measurement, not a target.
Good SLIs are specific, user-facing, and measurable in production. Bad SLIs are internal metrics that don't reflect what the user actually experiences.
The SLO
The internal target you set for that measurement.
"95% of ride requests get matched within 2 minutes." Internal. Engineering-owned. No legal weight. It drives your on-call thresholds, your error budgets, and your sprint priorities.
If you breach your SLO, you fix it. No lawyers involved.
The SLA
The promise to customers, with consequences attached.
"If fewer than X% of rides are matched within 2 minutes, we give you a refund." Legal document. Financial penalties. Sales and legal own this, not engineering.
If you breach your SLA, you owe someone money.
To recap:
An SLI is what you actually measure.
An SLO is your internal target for that measurement.
An SLA is the external promise with money on the line if you miss it.
Another way to think about it: SLIs are inputs, SLOs are thresholds, SLAs are contracts. They stack. Your SLA should be set based on your SLO, and your SLO should be set based on your SLI data. If you set them independently, you will either over-promise to customers or over-burden your engineers.
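To make the SLI and SLO layers concrete, here is a minimal Python sketch of the ride-sharing example. The function names and the sample data are hypothetical; only the 2-minute deadline and the 95% target come from the example above. The SLA layer is left out because its threshold (X%) is not specified.

```python
from typing import Optional, Sequence

MATCH_DEADLINE_SECONDS = 120  # "within 2 minutes", from the example SLO
SLO_TARGET = 0.95             # "95% of ride requests", from the example SLO


def match_sli(seconds_to_match: Sequence[Optional[float]]) -> float:
    """SLI: fraction of ride requests matched within the deadline.

    None means the request never got a driver and counts as a miss.
    """
    if not seconds_to_match:
        raise ValueError("no ride requests in the window")
    good = sum(
        1
        for s in seconds_to_match
        if s is not None and s <= MATCH_DEADLINE_SECONDS
    )
    return good / len(seconds_to_match)


def meets_slo(sli: float, target: float = SLO_TARGET) -> bool:
    """SLO check: is the measured SLI at or above the internal target?"""
    return sli >= target


# Hypothetical sample: ten requests, one slow match and one never matched.
sample = [35.0, 80.0, 119.0, 121.0, None, 60.0, 45.0, 90.0, 110.0, 30.0]
sli = match_sli(sample)
print(f"SLI = {sli:.0%}, SLO met: {meets_slo(sli)}")
```

The SLI function only measures. The target lives in a separate function, which mirrors the split between a raw measurement and the threshold applied to it.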
Why the distinction actually matters
If you conflate SLO and SLA, you destroy your team
Say your SLA promises 99.95% availability. Your team sets the internal SLO to match: 99.95%. What happens?
There is no buffer. Every degradation, every planned maintenance window, every brief spike in errors immediately puts you in breach of your customer contract. Engineers work weekends. Releases get blocked. Every on-call shift is high-stakes. The team stops taking risks because risks have legal consequences.
The right relationship: your SLO should always be stricter than your SLA. If you promise customers 99.5%, set your internal SLO to 99.9%. The gap between them is your buffer - time to detect, respond, and fix before it becomes a contract violation.
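The size of that buffer is easy to put in minutes. The sketch below assumes a 30-day window and time-based availability (every minute counts equally); `allowed_downtime_minutes` is a hypothetical helper, not part of any library.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day window = 43,200 minutes


def allowed_downtime_minutes(target_percent: float,
                             window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of downtime a given availability target tolerates per window."""
    return window_minutes * (100 - target_percent) / 100


sla_minutes = allowed_downtime_minutes(99.5)  # external promise
slo_minutes = allowed_downtime_minutes(99.9)  # internal target
print(f"SLA 99.5% allows {sla_minutes:.1f} min per 30 days")
print(f"SLO 99.9% allows {slo_minutes:.1f} min per 30 days")
print(f"Buffer before a contract breach: {sla_minutes - slo_minutes:.1f} min")
```

Under these assumptions the internal target trips after roughly 43 minutes of downtime, while the contract is not breached until 216 minutes. The difference is the time you have to detect, respond, and fix.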
If you set SLOs without SLI data, you're guessing
The most common mistake is setting a 99.9% SLO because it "sounds right" or because a competitor claims 99.9%. If your system has historically run at 99.6%, a 99.9% SLO means you'll be in breach constantly from day one. Your error budget will be gone in the first week of every month.
Start with your actual SLI data. If you've been hitting 99.7% for six months, set your initial SLO at 99.7%. Then improve the system and tighten the SLO over time. Aspirational SLOs set without data create noise and burn trust in the metric.
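One quick sanity check before committing to a target is to replay it against history. The following sketch uses hypothetical monthly availability figures and a hypothetical helper name; it simply counts how many past months would have missed a candidate SLO.

```python
def months_in_breach(monthly_sli_percent, candidate_slo_percent):
    """Count the historical months that would have missed the candidate SLO."""
    return sum(1 for sli in monthly_sli_percent if sli < candidate_slo_percent)


# Hypothetical six months of measured availability, in percent.
history = [99.71, 99.74, 99.70, 99.78, 99.72, 99.75]

for candidate in (99.7, 99.9):
    missed = months_in_breach(history, candidate)
    print(f"SLO {candidate}%: {missed} of {len(history)} months in breach")
```

With this history, a 99.7% target would have held every month, while 99.9% would have been missed in all six. That is the signal that the second target is aspirational rather than grounded in data.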
The cost of each extra nine
Moving from 99% to 99.9% cuts your allowed failure by 10x, and it is harder than it sounds. Moving from 99.9% to 99.99% cuts it by another 10x. Each additional nine requires redundancy, faster detection, faster recovery, and significantly more operational investment.
99.9% allows about 8.76 hours of downtime per year. 99.99% allows about 52 minutes. The question is: does your business need 52 minutes instead of almost 9 hours? What does that difference actually cost your customers?
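The downtime figures follow directly from the target. A small sketch, assuming a 365-day year and time-based availability (the helper name is made up for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year that an availability target tolerates."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each extra nine divides the allowance by ten, which is why the engineering cost climbs so steeply.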
SLOs create permission, not just constraints
This is the part most teams miss. An SLO isn't just a floor on how reliable you have to be. It's also a ceiling on how much reliability work you have to do at any given time.
If your error budget is 0.1% per month and you've only used 0.02% so far, you have 0.08% of budget left. That's permission to take risks. Ship the refactor. Deploy the new architecture. Run that experiment you've been putting off. Your SLO tells you when to slow down and when you can move fast.
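As a rough illustration, the remaining budget can be turned into a simple ship-or-freeze signal. This is a sketch, not a full release policy: the function names are hypothetical, and real teams usually add intermediate states between "ship" and "freeze".

```python
def budget_remaining(budget_percent: float, used_percent: float) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    if budget_percent <= 0:
        raise ValueError("budget_percent must be positive")
    remaining = (budget_percent - used_percent) / budget_percent
    return min(1.0, max(0.0, remaining))


def release_posture(remaining_fraction: float) -> str:
    """Budget left: keep shipping. Budget gone: stop and fix the system."""
    return "ship" if remaining_fraction > 0 else "freeze"


# The numbers from the paragraph above: 0.1% budget, 0.02% used.
remaining = budget_remaining(budget_percent=0.1, used_percent=0.02)
print(f"{remaining:.0%} of the budget left -> {release_posture(remaining)}")
```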
How SLIs, SLOs, and SLAs relate in practice
| Dimension | SLI | SLO | SLA |
|---|---|---|---|
| What it is | A measured signal from your system | A target ratio over time for that signal | A legal commitment to a customer |
| Who owns it | Engineering (observability team) | Engineering + product | Sales, legal, customer success |
| If you miss it | Nothing - it's a measurement, not a target | Internal response: fix the system | External consequence: credits, penalties |
| How strict | N/A - it's just data | Stricter than your SLA | Looser than your SLO |
| Example | API P99 latency measured at 180ms | 99% of requests under 200ms (internal) | 99% under 300ms or customer gets credit |
Choosing the right SLIs
The most important design decision is which signals to measure. The temptation is to measure everything. Don't. Pick one or two SLIs per service - the ones that, if degraded, would actually affect a user's experience.
The Google SRE book defines four golden signals: latency, traffic, errors, and saturation. Most services need SLIs from two of these four. More than that and you'll spend all your time managing dashboards instead of improving reliability.
The SLIs that typically matter depend on the type of service:
Request-driven services (APIs, web apps)
These services have a user on one end making a synchronous request. They care about two things: did they get a response, and was it fast enough.
- Availability: "99.9% of requests return 2xx or 3xx status codes." Measure success rate, not uptime. A server can be "up" and still returning 500s. Count the responses users actually receive.
- Latency: "95% of requests complete in under 200ms." Always use percentiles, not averages. An average of 150ms can hide a P99 of 4 seconds. That P99 is a real user who waited 4 seconds and probably left. Track P95 and P99 separately.
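Both request-driven SLIs can be computed from a plain request log. The sketch below uses made-up data and a simple nearest-rank percentile; the helper names are illustrative, and a production system would normally get these numbers from its metrics backend rather than from raw lists.

```python
import math


def success_rate(status_codes):
    """Availability SLI: share of responses with a 2xx or 3xx status."""
    good = sum(1 for code in status_codes if 200 <= code < 400)
    return good / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies: 98 fast requests and 2 very slow ones.
latencies_ms = [70] * 98 + [4000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms:.1f} ms")
print(f"P95  = {percentile(latencies_ms, 95)} ms")
print(f"P99  = {percentile(latencies_ms, 99)} ms")
print(f"availability = {success_rate([200, 200, 301, 500]):.0%}")
```

In this sample the mean sits just under 150ms and P95 looks healthy, yet P99 is 4 seconds. That is the slow tail the average hides.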
Pipeline services (data ingestion, batch jobs, ETL)
These services don't have an interactive user. The "user" is a downstream system or a business analyst who needs the data. What they care about is whether the data arrived and whether it's correct.
- Freshness: "95% of dashboard data is less than 5 minutes old." Stale data causes bad decisions. A pipeline can be "running" while writing data that's 3 hours behind. Freshness catches this where availability metrics don't.
- Correctness: "99.99% of records match the source schema with no null violations." One corrupted batch can invalidate weeks of downstream analysis. If you're running financial or operational pipelines, correctness is often more important than freshness.
- Completeness: "99.9% of expected records arrive within the processing window." Distinct from correctness - completeness measures whether all records showed up, not whether they're right. Silent data loss is common in pipelines and hard to detect without an explicit completeness SLI.
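Freshness and completeness are both simple ratios once you have the raw observations. A minimal sketch, assuming you can sample record ages and know which record IDs were expected (both function names and the sample data are hypothetical):

```python
def freshness_sli(ages_seconds, max_age_seconds=300):
    """Share of sampled records younger than the limit (5 minutes by default)."""
    fresh = sum(1 for age in ages_seconds if age < max_age_seconds)
    return fresh / len(ages_seconds)


def completeness_sli(expected_ids, arrived_ids):
    """Share of expected records that actually arrived in the window."""
    expected = set(expected_ids)
    return len(expected & set(arrived_ids)) / len(expected)


# Hypothetical samples: one record is 3 hours stale, one record never arrived.
print(f"freshness    = {freshness_sli([10, 200, 299, 10_800]):.0%}")
print(f"completeness = {completeness_sli(range(1000), range(999)):.1%}")
```

Completeness needs an independent source of truth for what was expected, such as source-side counts or sequence numbers. Without one, silent loss stays invisible.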
Storage services (databases, object storage)
Storage is where people store their most important data. The failure modes are different from stateless services - durability failures are often permanent, and reads and writes can fail independently.
- Durability: "99.999999999% of stored objects remain readable." This is the one SLI where you actually want many nines. Data loss is usually permanent. Measure it by tracking expected vs. returned objects in regular read probes.
- Read availability vs. write availability: Track these separately. A database under write pressure will often start failing writes while reads stay healthy. A single availability SLI hides this. Separate read and write success rates give you much earlier signal.
- Read latency at percentile: "P99 read latency under 20ms." For databases, latency degradation under load is usually the first warning sign before availability starts dropping. Use it as an early indicator.
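To illustrate why reads and writes deserve separate SLIs, here is a small sketch that splits success rate by operation type. The event format, a list of `(operation, succeeded)` pairs, is an assumption made for the example.

```python
from collections import Counter


def availability_by_operation(events):
    """Success rate per operation type from (operation, succeeded) pairs."""
    total = Counter()
    good = Counter()
    for operation, succeeded in events:
        total[operation] += 1
        if succeeded:
            good[operation] += 1
    return {op: good[op] / total[op] for op in total}


# Hypothetical traffic: reads are healthy, one in five writes is failing.
events = (
    [("read", True)] * 1000
    + [("write", True)] * 80
    + [("write", False)] * 20
)

combined = sum(1 for _, ok in events if ok) / len(events)
print(f"combined = {combined:.1%}")
print(availability_by_operation(events))
```

The combined number here is still above 98%, which hides the fact that 20% of writes are failing.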
Observability stack (metrics, logs, traces)
This one is often skipped - teams build SLOs for their product but not for the reliability infrastructure itself. If your observability stack is unreliable, you lose visibility exactly when you need it most: during incidents.
- Metric ingestion completeness: "99.9% of emitted metrics are queryable within 60 seconds." Dropped metrics mean blind spots. You won't know about them unless you're explicitly measuring ingestion success rate.
- Trace coverage: "95% of requests have a complete distributed trace." Incomplete traces make RCA much harder. Measure coverage by sampling requests and verifying trace completeness end to end.
- Alert delivery latency: "Critical alerts delivered within 60 seconds of threshold breach." An alert that fires 8 minutes after a threshold was breached is not a functioning alert. This SLI directly affects your MTTD.
How to set SLOs step by step
- Pick 1-2 SLIs per service. Start with the signal that, if degraded, would be the first thing a user complains about. For most services that's availability and latency. Don't add more SLIs until you've operated the first two for at least a quarter.
- Look at 90 days of historical SLI data. Find your actual baseline. If you've been hitting 99.7% availability for three months, your SLO should start at 99.7%. Setting it at 99.95% without system changes is just setting yourself up to be in breach constantly.
- Set the SLO slightly looser than your baseline. If your P95 latency has averaged 180ms over 90 days, set the SLO target at 200ms. This leaves room for variance without breaching, while still flagging genuine regressions.
- Define your error budget from the SLO. An SLO of 99.9% over a 30-day window means 0.1% of requests can fail. That's your error budget. Burn it wisely - risky deploys, experiments, and migrations all draw from it.
- Alert on error budget burn rate, not raw thresholds. Instead of alerting when latency exceeds 200ms, alert when you've burned 10% of your monthly error budget in the last hour. This gives you time to respond before the month is gone, not after.
- Set your SLA looser than your SLO. If your SLO is 99.9%, your SLA should be 99.5%. The gap is your contractual buffer. Never let your SLA be tighter than your SLO - that's how contract violations happen during routine operations.
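The burn-rate alert from the steps above can be sketched as follows. This is a simplified single-window version that assumes a 30-day SLO window and roughly uniform traffic across it; the function names are hypothetical, and the 10% threshold is the one used in the step above.

```python
WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window = 720 hours


def burn_rate(failed: int, total: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget would last exactly one full window.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = (100 - slo_percent) / 100
    return (failed / total) / allowed_error_rate


def budget_burned(failed: int, total: int, slo_percent: float,
                  interval_hours: float = 1.0) -> float:
    """Fraction of the whole window's budget consumed during the interval,
    assuming traffic is spread evenly over the window."""
    return burn_rate(failed, total, slo_percent) * interval_hours / WINDOW_HOURS


def should_page(failed: int, total: int, slo_percent: float,
                threshold: float = 0.10) -> bool:
    """Page when the last hour alone consumed at least `threshold` of the budget."""
    return budget_burned(failed, total, slo_percent) >= threshold


# Last hour: 8% of requests failed against a 99.9% SLO.
print(should_page(failed=800, total=10_000, slo_percent=99.9))
# Last hour: 0.3% of requests failed. Over budget pace, but not page-worthy.
print(should_page(failed=30, total=10_000, slo_percent=99.9))
```

A single fast-burn rule like this will not catch a slow, sustained regression on its own. Pairing it with a second rule over a longer interval and a lower threshold is a common way to cover both cases.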
Error budget calculator
Your error budget follows from two inputs: your monthly request volume and your target SLO. At 10 million requests a month and a 99.9% SLO, you can afford 10,000 failed requests. How quickly you burn through them depends on your failure rate.
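A minimal Python sketch of such a calculator, assuming a 30-day month and evenly spread traffic. The function names and the sample volume of 10 million requests are illustrative only.

```python
def error_budget(monthly_requests: int, slo_percent: float) -> int:
    """Number of requests that may fail in a month without breaching the SLO."""
    return round(monthly_requests * (100 - slo_percent) / 100)


def days_until_exhausted(monthly_requests: int, slo_percent: float,
                         failure_rate_percent: float,
                         days_in_month: int = 30) -> float:
    """Days until the budget is gone at a constant failure rate."""
    budget = monthly_requests * (100 - slo_percent) / 100
    failures_per_day = (monthly_requests / days_in_month
                        * failure_rate_percent / 100)
    if failures_per_day == 0:
        return float("inf")
    return budget / failures_per_day


requests, slo = 10_000_000, 99.9
print(f"Budget: {error_budget(requests, slo):,} failed requests per month")
for rate in (0.1, 0.2, 0.4, 1.0):
    days = days_until_exhausted(requests, slo, rate)
    print(f"  at {rate}% failures: budget gone in {days:.1f} days")
```

At a 0.4% failure rate, which corresponds to a system running at 99.6%, a 99.9% SLO is exhausted in about 7.5 days. Any result above 30 days means the budget outlasts the month.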
Common mistakes and how they play out
- Setting aspirational SLOs with no historical basis: You haven't measured your current P99 latency. You set an SLO of "P99 under 100ms" because it sounds good. Your actual P99 turns out to be 380ms. You're in breach on day one. Your on-call fires constantly. Engineers stop trusting the alerts because they're always firing. The SLO becomes noise.
- SLO equals SLA with no buffer: You promise customers 99.95% in your contract, then set your internal SLO to 99.95%. During a routine deploy, you have a 4-minute partial outage. You've immediately breached your SLA. Customer success is fielding calls. Legal is involved. Engineering is scrambling to write an RCA for a 4-minute blip that would have been completely unremarkable if you'd had a buffer.
- Too many SLIs leading to alert fatigue: You instrument 12 SLIs per service because you want full coverage. Now you have 84 active SLOs across 7 services. When something goes wrong, 20 alerts fire simultaneously. The on-call can't determine which ones are signal and which are symptoms of the same root cause. Alert fatigue sets in and engineers start silencing alerts without investigating. The monitoring system has become invisible.
- Alerting on raw thresholds instead of error budget burn rate: You set an alert for "latency over 200ms." It fires 40 times a month during brief spikes, most of which resolve in under a minute. Engineers get paged at 3am for issues that self-healed before they even checked their phone. Meanwhile, a slow, sustained latency regression that eats 60% of your error budget over 3 days generates no alerts because it never crosses the threshold. You find out when your SLO report runs at the end of the month.
- Not using the error budget to make shipping decisions: You have an SLO and you track it, but nobody looks at the error budget before planning releases. A team ships three risky deploys in the first week of the month, burns 80% of the error budget, and then everyone wonders why there's no room for the infrastructure migration planned for week three. Error budgets only work if they're an input to planning, not just a retrospective report.
- Measuring the wrong thing as your SLI: You measure server uptime as your availability SLI. Your server is up 99.98% of the time. But your CDN is misconfigured and 8% of users in Europe are getting errors. Your SLI shows green while your users are having a bad time. SLIs need to measure what the user actually experiences, not what's convenient to instrument internally.
The takeaway
SLIs are what you measure. SLOs are what you target internally. SLAs are what you promise externally with consequences attached.
The relationship between them matters as much as the definitions. Your SLO should be tighter than your SLA. Your SLO should be grounded in your SLI history, not in what you wish were true. And your SLIs should measure what users actually experience, not what's convenient to track.
The teams that get this right use error budgets as a decision-making tool, not just a reporting metric. Error budget left? Move fast. Error budget gone? Stop shipping features and fix the system. That's not bureaucracy. That's a clear framework for when to take risk and when to stop.
Most teams only need to get two things right to start: pick two SLIs per service based on user impact, and set SLOs based on actual historical data, with a buffer between the SLO and your SLA. Everything else follows from there.