⚙️ Reliability Engineering

Balance reliability with velocity through measurable guardrails.

Evaluate SLO management, observability, incident response, and resilience practices. Each section maps discovery questions to implementation patterns.

Discovery Questions

• Do all critical services have defined SLIs and SLOs?
• Who owns defining and reviewing them (product, engineering, or operations)?
• How are SLOs measured, tracked, and reported?
• Are SLOs visible to teams in real time?
• How often are SLOs revisited or recalibrated?
• Are SLO violations linked to error budgets that inform roadmap decisions?
• How are trade-offs between velocity and reliability made?

Evidence to Collect

• SLO dashboards and reports.
• SLI query definitions.
• Reliability review notes.

Implementation Patterns

SLI/SLO Framework

Design SLIs around user journeys and automate SLO compliance reporting.

Prometheus, Grafana, Sloth, OpenSLO

Steps

Instrument availability and latency SLIs with Prometheus recording rules.
Use Sloth or Pyrra to codify SLOs and generate burn-rate alerting rules.
Publish shared dashboards showing real-time error budget status.
Automate compliance reports for stakeholders and product teams.
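
The arithmetic behind the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the output of Sloth, Pyrra, or Prometheus: the `SloWindow` type, the function names, and the sample request counts are all hypothetical, and it assumes a simple request-based availability SLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def _failure_ratio(window: SloWindow) -> float:
    """Observed share of failed requests; 0.0 when there was no traffic."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def _allowed_failure_ratio(slo_target: float) -> float:
    """The error budget as a ratio, e.g. roughly 0.001 for a 99.9% target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests that succeeded."""
    return 1.0 - _failure_ratio(window)


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Budget spending speed: 1.0 means failures arrive exactly at the allowed rate."""
    return _failure_ratio(window) / _allowed_failure_ratio(slo_target)


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Unspent share of the budget: 1.0 is untouched, below 0.0 is overspent."""
    return 1.0 - burn_rate(window, slo_target)


if __name__ == "__main__":
    # Assumed sample data: 500 failures out of 1,000,000 requests, 99.9% target.
    window = SloWindow(total_requests=1_000_000, failed_requests=500)
    print(f"SLI: {availability_sli(window):.4%}")
    print(f"Budget remaining: {error_budget_remaining(window, 0.999):.1%}")
    print(f"Burn rate: {burn_rate(window, 0.999):.2f}")
```

With the assumed sample numbers, the service has spent about half of its budget for the window. In practice the counts would come from the recorded SLI series rather than hard-coded values.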

Error Budget Policy

Align release velocity with error budget consumption through explicit policy gates.

Steps

Define budget states (healthy, watch, exhausted) with clear actions.
Freeze feature work and trigger a reliability swarm when a budget is exhausted.
Integrate budget checks into deployment pipelines and change approvals.
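
One way to picture the policy steps above is as a small gate function that a pipeline could call before a deploy. This is a sketch under stated assumptions: the 25% "watch" threshold, the `is_reliability_fix` exemption, and every name below are hypothetical choices for illustration, not part of the playbook.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "healthy"
    WATCH = "watch"
    EXHAUSTED = "exhausted"


# Hypothetical threshold: below 25% of the budget remaining, move to "watch".
WATCH_THRESHOLD = 0.25


def classify_budget(remaining: float,
                    watch_threshold: float = WATCH_THRESHOLD) -> BudgetState:
    """Map the unspent budget share (1.0 = untouched) to a policy state."""
    if remaining <= 0.0:
        return BudgetState.EXHAUSTED
    if remaining < watch_threshold:
        return BudgetState.WATCH
    return BudgetState.HEALTHY


def deployment_allowed(state: BudgetState,
                       is_reliability_fix: bool = False) -> bool:
    """Gate: once the budget is exhausted, only reliability fixes may ship."""
    if state is BudgetState.EXHAUSTED:
        return is_reliability_fix
    return True


if __name__ == "__main__":
    # Assumed sample values for the remaining budget share.
    for remaining in (0.60, 0.10, -0.05):
        state = classify_budget(remaining)
        print(remaining, state.value, deployment_allowed(state))
```

A pipeline step could exit non-zero when `deployment_allowed` returns `False`, which turns the feature freeze into an enforced check instead of a convention. The "watch" state does not block deploys in this sketch; a team might attach extra review or a slower rollout to it instead.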

