SRE Practices

Site Reliability Engineering principles: SLIs, SLOs, and error budgets

Site Reliability Engineering (SRE) is an approach to operations that originated at Google and is often described as a concrete implementation of DevOps. It applies software engineering principles to infrastructure and operations problems.

SLIs, SLOs, and SLAs


  ┌──────────────────────────────────────────────────┐
  │                                                  │
  │  SLI (Service Level Indicator)                   │
  │  What you measure:                               │
  │  - Availability: successful requests / total     │
  │  - Latency: percentage of requests under 200ms   │
  │  - Throughput: requests per second               │
  │                                                  │
  │  SLO (Service Level Objective)                   │
  │  What you promise internally:                    │
  │  - "99.9% of requests succeed"                   │
  │  - "99% of requests under 200ms"                 │
  │                                                  │
  │  SLA (Service Level Agreement)                   │
  │  What you promise customers:                     │
  │  - Contract with financial penalties             │
  │  - Usually less strict than SLOs                 │
  │                                                  │
  └──────────────────────────────────────────────────┘
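The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the `Request` record, the function names, and the sample data are all hypothetical, and a real system would compute these ratios from metrics or logs rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    succeeded: bool
    latency_ms: float


def availability_sli(requests):
    """Availability SLI: successful requests / total requests."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.succeeded for r in requests) / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Latency SLI: fraction of requests that finished under the threshold."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= target


# Illustrative sample: 1,000 requests, 1 failure, 5 slow responses.
sample = (
    [Request(True, 120.0)] * 994
    + [Request(True, 350.0)] * 5
    + [Request(False, 90.0)]
)

availability = availability_sli(sample)  # 999 of 1,000 succeeded
fast_fraction = latency_sli(sample)      # 995 of 1,000 were under 200 ms

print(f"availability SLI: {availability:.3%}, meets 99.9% SLO: {meets_slo(availability, 0.999)}")
print(f"latency SLI:      {fast_fraction:.3%}, meets 99% SLO:   {meets_slo(fast_fraction, 0.99)}")
```

The SLI is the measurement and the SLO is the threshold it is compared against. The SLA does not appear in the code because it is a contractual commitment rather than a computation.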

Error Budgets


  SLO: 99.9% availability = 0.1% error budget

  Month: 30 days = 43,200 minutes
  Error budget: 43,200 × 0.001 = 43.2 minutes of downtime

  ┌─────────────────────────────────────┐
  │ Error Budget Status: July           │
  │ ████████████████░░░░░░░░░░░░░ 65%   │
  │ Used: 28 min | Remaining: 15.2 min  │
  └─────────────────────────────────────┘
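The arithmetic above can be reproduced with a short Python sketch. The function name `error_budget_minutes` and the 30-day default window are assumptions made for illustration. The 28 minutes of downtime is the figure from the status box.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)


budget = error_budget_minutes(0.999)   # 43,200 minutes x 0.001
used = 28.0                            # downtime so far, from the status box
remaining = budget - used
used_fraction = used / budget

print(
    f"budget {budget:.1f} min, used {used:.0f} min "
    f"({used_fraction:.0%}), remaining {remaining:.1f} min"
)
```

Tightening the SLO shrinks the budget quickly: under the same assumptions, a 99.99% target leaves roughly 4.3 minutes per 30-day window.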

  If budget is consumed:
  → Freeze deployments
  → Focus on reliability work
  → No new features until budget resets
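The policy above amounts to a simple decision rule, sketched here in Python. The `ReleasePolicy` enum and `release_policy` function are hypothetical names. Real error budget policies are usually written documents agreed between teams, and they often include exceptions that this sketch omits.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "ship features as usual"
    FREEZE = "freeze deployments; reliability work only until the budget resets"


def release_policy(budget_minutes, used_minutes):
    """Return FREEZE once the error budget is fully consumed, else NORMAL."""
    remaining = budget_minutes - used_minutes
    return ReleasePolicy.FREEZE if remaining <= 0 else ReleasePolicy.NORMAL


# 28 of 43.2 minutes used: budget remains, so releases continue.
print(release_policy(43.2, 28.0).value)

# 45 of 43.2 minutes used: budget exhausted, so deployments freeze.
print(release_policy(43.2, 45.0).value)
```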

Blameless Postmortems

When incidents occur, SRE culture focuses on systemic improvement rather than on assigning blame to individuals:


  Postmortem Template:
  ┌─────────────────────────────────────────┐
  │  1. Summary — What happened?            │
  │  2. Impact — Users affected, duration   │
  │  3. Timeline — Minute-by-minute events  │
  │  4. Root Cause — What actually failed   │
  │  5. Action Items — Prevention & fixes   │
  │     - [ ] Add monitoring for X          │
  │     - [ ] Fix the bug in Y              │
  │     - [ ] Improve runbook for Z         │
  └─────────────────────────────────────────┘
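Teams that want every postmortem to follow the same five sections sometimes capture the template as a small data structure. The sketch below is one possible shape, assuming a hypothetical `Postmortem` class. The incident details are invented purely to show the rendered output.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """The five template sections as fields (hypothetical structure)."""
    summary: str
    impact: str
    timeline: list = field(default_factory=list)      # (time, event) pairs
    root_cause: str = ""
    action_items: list = field(default_factory=list)  # open follow-up tasks

    def render(self):
        """Render the postmortem as plain text in template order."""
        lines = [
            f"1. Summary: {self.summary}",
            f"2. Impact: {self.impact}",
            "3. Timeline:",
        ]
        lines += [f"   {time} {event}" for time, event in self.timeline]
        lines.append(f"4. Root Cause: {self.root_cause}")
        lines.append("5. Action Items:")
        lines += [f"   - [ ] {item}" for item in self.action_items]
        return "\n".join(lines)


# Fictional incident, for illustration only.
report = Postmortem(
    summary="Checkout requests failed after a config change",
    impact="About 4% of users affected for 28 minutes",
    timeline=[("14:02", "Config deployed"), ("14:30", "Config rolled back")],
    root_cause="Connection pool size was set below peak demand",
    action_items=["Add monitoring for X", "Improve runbook for Z"],
)

print(report.render())
```

The root cause field describes what failed in the system, not who made the change, which keeps the record consistent with the blameless approach.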
