Discovery Questions
- Do all critical services have defined SLIs and SLOs?
- Who owns defining and reviewing them: product, engineering, or operations?
- How are SLOs measured, tracked, and reported?
- Are SLOs visible to teams in real time?
- How often are SLOs revisited or recalibrated?
- Are SLO violations linked to error budgets that inform roadmap decisions?
- How are trade-offs between velocity and reliability made?
Evidence to Collect
- SLO dashboards and reports
- SLI query definitions
- Reliability review notes
Implementation Patterns
SLI/SLO Framework
Design SLIs around user journeys and automate SLO compliance reporting.
Tools: Prometheus, Grafana, Sloth, OpenSLO
Steps
- Instrument availability and latency SLIs with Prometheus recording rules.
- Use Sloth or Pyrra to codify SLOs and generate burn-rate alerting policies.
- Publish shared dashboards showing real-time error budget status.
- Automate compliance reports for stakeholders and product teams.
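The arithmetic behind error budget status and burn-rate alerting can be sketched in a few lines of Python. This is a minimal illustration, not the way Sloth, Pyrra, or Prometheus compute it internally: the `SLO` class, the function names, and the request counts are all hypothetical, and the good/total event counts are assumed to come from whatever SLI queries are already in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO definition: a name and a target ratio of good events."""
    name: str
    target: float  # e.g. 0.999 for a 99.9% objective

    def __post_init__(self) -> None:
        # A target of 1.0 leaves no error budget and would divide by zero below.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget left for the window.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if total == 0:
        return 1.0  # no traffic, so no budget has been consumed
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def burn_rate(slo: SLO, good: int, total: int) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    error_ratio = (total - good) / total
    return error_ratio / (1.0 - slo.target)


if __name__ == "__main__":
    # Illustrative numbers only: 500 failed requests out of 1,000,000.
    slo = SLO(name="checkout-availability", target=0.999)
    print(f"budget remaining: {error_budget_remaining(slo, 999_500, 1_000_000):.2f}")
    print(f"burn rate: {burn_rate(slo, 999_500, 1_000_000):.2f}")
```

A burn rate of 1.0 means the budget would be used up exactly at the end of the SLO window; sustained values above 1.0 mean it will run out early, which is the condition burn-rate alerts are built to catch.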
Error Budget Policy
Align release velocity with error budget consumption through explicit policy gates.
Steps
- Define budget states (healthy, watch, exhausted) with clear actions.
- Freeze feature work and trigger a reliability swarm when a budget is exhausted.
- Integrate budget checks into deployment pipelines and change approvals.
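One way to make the policy steps above concrete is a small gate that a deployment pipeline could call before a release. The sketch below is an assumption-laden illustration: the 25% watch threshold, the `classify_budget` and `change_allowed` names, and the rule that only reliability fixes ship when the budget is exhausted are all hypothetical choices that a real policy would need to agree on with product and engineering.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "healthy"
    WATCH = "watch"
    EXHAUSTED = "exhausted"


# Assumed threshold: enter "watch" when 25% or less of the budget remains.
WATCH_THRESHOLD = 0.25


def classify_budget(remaining: float, watch_threshold: float = WATCH_THRESHOLD) -> BudgetState:
    """Map the remaining error budget fraction (1.0 = untouched) to a policy state."""
    if remaining <= 0.0:
        return BudgetState.EXHAUSTED
    if remaining <= watch_threshold:
        return BudgetState.WATCH
    return BudgetState.HEALTHY


def change_allowed(state: BudgetState, is_reliability_fix: bool) -> bool:
    """Policy gate: once the budget is exhausted, only reliability fixes may ship."""
    if state is BudgetState.EXHAUSTED:
        return is_reliability_fix
    # In this sketch, healthy and watch both allow releases; a stricter policy
    # might require an extra approval while in the watch state.
    return True


if __name__ == "__main__":
    state = classify_budget(remaining=0.10)
    print(state.value, change_allowed(state, is_reliability_fix=False))
```

A pipeline step could call `classify_budget` with the current remaining budget and fail the job when `change_allowed` returns `False`, which keeps the freeze decision mechanical rather than negotiated per release.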
