SRE Practices
Site Reliability Engineering (SRE) is Google's approach to DevOps. It applies software engineering principles to infrastructure and operations problems.
SLIs, SLOs, and SLAs
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โ SLI (Service Level Indicator) โ
โ What you measure: โ
โ - Availability: successful requests / total โ
โ - Latency: percentage of requests under 200ms โ
โ - Throughput: requests per second โ
โ โ
โ SLO (Service Level Objective) โ
โ What you promise internally: โ
โ - "99.9% of requests succeed" โ
โ - "99% of requests under 200ms" โ
โ โ
โ SLA (Service Level Agreement) โ
โ What you promise customers: โ
โ - Contract with financial penalties โ
โ - Usually less strict than SLOs โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Error Budgets
SLO: 99.9% availability = 0.1% error budget
Month: 30 days = 43,200 minutes
Error budget: 43,200 ร 0.001 = 43.2 minutes of downtime
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Error Budget Status: July โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 65% โ
โ Used: 28 min | Remaining: 15.2 min โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
If budget is consumed:
โ Freeze deployments
โ Focus on reliability work
โ No new features until budget resets
Blameless Postmortems
When incidents occur, SRE culture focuses on systemic improvement:
Postmortem Template:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 1. Summary โ What happened? โ
โ 2. Impact โ Users affected, duration โ
โ 3. Timeline โ Minute-by-minute events โ
โ 4. Root Cause โ What actually failed โ
โ 5. Action Items โ Prevention & fixes โ
โ - [ ] Add monitoring for X โ
โ - [ ] Fix the bug in Y โ
โ - [ ] Improve runbook for Z โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ