Writing SLOs Your Engineers Will Actually Respect
SLOs fail when they are vanity metrics handed down from above. Here is how to define ones that drive real decisions and reduce on-call burnout.
A wall of dashboards isn't observability, and "99.99% uptime" on a slide isn't an SLO. A good SLO changes what your team does on a Tuesday.
Measure what the user feels
Pick SLIs that track user experience, not server internals:
- Request success rate (the user got a valid response).
- Latency at p95 and p99 (it was fast enough).
- Freshness for data pipelines (the data wasn't stale).
High CPU usage is a possible cause, not a symptom: users notice failed or slow requests, not a busy server. Alert on symptoms.
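As a rough illustration, the three SLIs above can be computed from raw request records. This is a minimal sketch, not a production pipeline: the `Request` fields, the "non-5xx counts as valid" rule, the nearest-rank percentile method, and the sample data are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user (assumed field)
    latency_ms: float  # time the user waited for the response (assumed field)


def success_rate(requests):
    """Fraction of requests that got a valid response (assumed here: any non-5xx)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def freshness_seconds(now_ts, last_success_ts):
    """Age of the newest successfully processed data, in seconds."""
    return now_ts - last_success_ts


# Hypothetical sample: 100 requests, 2 server errors, latencies from 1 ms to 100 ms.
requests = [
    Request(status=500 if i < 2 else 200, latency_ms=float(i + 1))
    for i in range(100)
]
latencies = [r.latency_ms for r in requests]

sli_success = success_rate(requests)   # 0.98
sli_p95 = percentile(latencies, 95)    # 95.0
sli_p99 = percentile(latencies, 99)    # 99.0
```

Note that nothing here looks at CPU, memory, or queue depth. Every number is something a user could have felt.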
Set the target with the business, not for them
An SLO is a negotiation, not a decree. Sit down with product and ask: "What level of failure is acceptable before we stop shipping features and fix reliability?" The answer sets your SLO target, and whatever is left over is your error budget: a 99.9% target leaves a budget of 0.1% of requests.
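It helps to bring concrete numbers to that conversation. The sketch below converts a target into what it allows, assuming a 30-day window and a hypothetical volume of one million requests; both figures are placeholders, and the function names are invented for the example.

```python
def error_budget(slo_target):
    """Fraction of events allowed to fail: one minus the SLO target."""
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target, window_days=30):
    """Minutes of total outage the budget covers over the window."""
    return error_budget(slo_target) * window_days * 24 * 60


def allowed_failed_requests(slo_target, total_requests):
    """Number of failed requests the budget covers for a given traffic volume."""
    return error_budget(slo_target) * total_requests


# A 99.9% target over 30 days (assumed window and traffic volume).
minutes = allowed_bad_minutes(0.999)                   # about 43.2 minutes
failures = allowed_failed_requests(0.999, 1_000_000)   # about 1,000 requests

print(f"{minutes:.1f} minutes, {failures:.0f} failed requests")
```

"Forty-three minutes a month" is far easier for product to reason about than "three nines".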
Let the error budget drive decisions
This is where SLOs earn their keep:
- Budget healthy? Ship fast, take risks.
- Budget burning? Freeze risky changes, invest in reliability.
The error budget turns "are we reliable enough?" from an argument into a number.
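Here is one way that number could feed a release decision. Treat it as a sketch under stated assumptions: the 25% freeze threshold is a hypothetical policy choice your team would negotiate, and the request counts are made up.

```python
def budget_remaining(slo_target, total_requests, failed_requests):
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining, freeze_below=0.25):
    """Map remaining budget to an action. The 25% threshold is an assumption."""
    if remaining < freeze_below:
        return "freeze"  # pause risky changes, invest in reliability
    return "ship"        # budget is healthy, keep shipping


# Hypothetical month: 99.9% target, 1,000,000 requests.
healthy = budget_remaining(0.999, 1_000_000, failed_requests=400)  # roughly 0.6
burning = budget_remaining(0.999, 1_000_000, failed_requests=900)  # roughly 0.1

print(release_policy(healthy))  # ship
print(release_policy(burning))  # freeze
```

The value of writing the policy down, even this crudely, is that the freeze decision is agreed in advance rather than argued during an incident.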
Kill the noise
If an alert doesn't require a human to act now, it is a dashboard, not a page. Page only for what needs immediate action and ruthlessly downgrade everything else. On-call sustainability is a feature.
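An alert audit can start as something this simple. The sketch assumes you can label each existing alert with whether a human must act immediately; the `Alert` type and the alert names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    needs_action_now: bool  # must a human do something immediately? (assumed label)


def route(alert):
    """Page only when a human has to act now; everything else goes to a dashboard."""
    return "page" if alert.needs_action_now else "dashboard"


def to_downgrade(alerts):
    """Names of alerts that should stop paging."""
    return [a.name for a in alerts if route(a) == "dashboard"]


# Hypothetical alert inventory.
alerts = [
    Alert("checkout success rate below SLO", needs_action_now=True),
    Alert("CPU above 80% on one host", needs_action_now=False),
    Alert("disk 60% full", needs_action_now=False),
]

print(to_downgrade(alerts))
```

The labeling is the real work. Once each alert carries an honest answer to "does someone need to act now?", the routing is trivial.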
Want this applied to your stack?
Book a 30-minute call and we’ll find the highest-impact next step together.