SREPrometheusGrafanaOpenTelemetryPagerDuty
The problem
On-call was drowning in low-signal alerts and incidents took hours to diagnose. There was no shared definition of healthy, so reliability work stayed reactive.
What I built
- 1Instrumented services with OpenTelemetry for traces, metrics and logs.
- 2Defined SLIs, SLOs and error budgets with product and engineering together.
- 3Rebuilt dashboards and alerting around symptoms rather than causes, which killed the noise.
- 4Wrote runbooks and ran incident game-days to build muscle memory.
The outcome
MTTR dropped 58% and alert noise fell 70%. On-call became sustainable, and reliability decisions are now driven by error budgets instead of gut feel.
Want an outcome like this?
Book a call and let’s scope what it would take for your stack.
Book a consulting call →