
SRE Practices

Site Reliability Engineering principles: SLIs, SLOs, and error budgets

Site Reliability Engineering (SRE) is a discipline that originated at Google and is often described as a concrete implementation of DevOps. It applies software engineering principles to infrastructure and operations problems.

SLIs, SLOs, and SLAs


  ┌──────────────────────────────────────────────────┐
  │                                                  │
  │  SLI (Service Level Indicator)                   │
  │  What you measure:                               │
  │  - Availability: successful requests / total     │
  │  - Latency: percentage of requests under 200ms   │
  │  - Throughput: requests per second               │
  │                                                  │
  │  SLO (Service Level Objective)                   │
  │  What you promise internally:                    │
  │  - "99.9% of requests succeed"                   │
  │  - "99% of requests under 200ms"                 │
  │                                                  │
  │  SLA (Service Level Agreement)                   │
  │  What you promise customers:                     │
  │  - Contract with financial penalties             │
  │  - Usually less strict than SLOs                 │
  │                                                  │
  └──────────────────────────────────────────────────┘
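
To make the SLI/SLO relationship concrete, here is a minimal Python sketch that computes the availability and latency SLIs from a list of request records and checks them against the example SLO targets above. The `Request` record shape, the function names, and the sample traffic are all invented for illustration; a real system would read these numbers from its monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    ok: bool           # True if the request succeeded
    latency_ms: float  # time taken to serve it, in milliseconds


def availability_sli(requests):
    """Availability SLI: successful requests / total requests."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Latency SLI: fraction of requests served under the threshold."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= target


# Made-up sample window: 10,000 requests, 5 failures, 150 slow responses.
requests = (
    [Request(ok=True, latency_ms=80.0)] * 9845
    + [Request(ok=True, latency_ms=350.0)] * 150
    + [Request(ok=False, latency_ms=120.0)] * 5
)

availability = availability_sli(requests)  # 9,995 successes out of 10,000
latency = latency_sli(requests)            # 9,850 fast responses out of 10,000

print(f"availability SLI {availability:.2%}, meets 99.9% SLO: {meets_slo(availability, 0.999)}")
print(f"latency SLI {latency:.2%}, meets 99% SLO: {meets_slo(latency, 0.99)}")
```

In this sample the availability SLO is met while the latency SLO is missed, which shows why each SLI is tracked against its own objective.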

Error Budgets


  SLO: 99.9% availability = 0.1% error budget

  Month: 30 days = 43,200 minutes
  Error budget: 43,200 × 0.001 = 43.2 minutes of downtime

  ┌─────────────────────────────────────┐
  │ Error Budget Status: July           │
  │ █████████████░░░░░░░ 65% used       │
  │ Used: 28 min | Remaining: 15.2 min  │
  └─────────────────────────────────────┘

  If budget is consumed:
  → Freeze deployments
  → Focus on reliability work
  → No new features until budget resets
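
The arithmetic and the freeze rule above can be sketched in a few lines of Python. The function names and the returned dictionary keys are assumptions made for this example, and the policy is reduced to a single rule: freeze deployments once no budget remains.

```python
def error_budget_minutes(slo_target, period_days=30):
    """Allowed downtime for the period: total minutes * (1 - SLO)."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_target)


def budget_status(slo_target, downtime_minutes, period_days=30):
    """Summarize how much of the error budget has been spent."""
    budget = error_budget_minutes(slo_target, period_days)
    remaining = budget - downtime_minutes
    return {
        "budget_min": budget,
        "used_fraction": downtime_minutes / budget,
        "remaining_min": remaining,
        # Simplified policy: freeze once the budget is fully consumed.
        "freeze_deployments": remaining <= 0,
    }


# The figures from the status box: 99.9% SLO, 28 minutes of downtime so far.
status = budget_status(slo_target=0.999, downtime_minutes=28)
print(
    f"budget {status['budget_min']:.1f} min, "
    f"used {status['used_fraction']:.0%}, "
    f"remaining {status['remaining_min']:.1f} min, "
    f"freeze: {status['freeze_deployments']}"
)
# prints: budget 43.2 min, used 65%, remaining 15.2 min, freeze: False
```

With 50 minutes of downtime instead of 28, `remaining_min` goes negative and `freeze_deployments` becomes `True`.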

Blameless Postmortems

When incidents occur, SRE culture focuses on fixing the systems and processes that allowed the failure rather than on blaming individuals:


  Postmortem Template:
  ┌─────────────────────────────────────────┐
  │  1. Summary: What happened?             │
  │  2. Impact: Users affected, duration    │
  │  3. Timeline: Minute-by-minute events   │
  │  4. Root Cause: What actually failed    │
  │  5. Action Items: Prevention & fixes    │
  │     - [ ] Add monitoring for X          │
  │     - [ ] Fix the bug in Y              │
  │     - [ ] Improve runbook for Z         │
  └─────────────────────────────────────────┘
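
If postmortems are kept as structured data rather than free text, the template could be modeled roughly as follows. This is only a sketch: the `Postmortem` and `ActionItem` classes and the sample incident are hypothetical and not part of any standard tool.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    done: bool = False


@dataclass
class Postmortem:
    summary: str
    impact: str
    timeline: list       # (time, event) pairs in chronological order
    root_cause: str
    action_items: list = field(default_factory=list)

    def render(self):
        """Render the five template sections as plain text."""
        lines = [
            f"1. Summary: {self.summary}",
            f"2. Impact: {self.impact}",
            "3. Timeline:",
        ]
        lines += [f"   {time} {event}" for time, event in self.timeline]
        lines.append(f"4. Root Cause: {self.root_cause}")
        lines.append("5. Action Items:")
        for item in self.action_items:
            box = "[x]" if item.done else "[ ]"
            lines.append(f"   - {box} {item.description}")
        return "\n".join(lines)


# Invented incident, used only to show the rendered shape.
report = Postmortem(
    summary="Checkout API returned errors",
    impact="About 4% of users affected for 28 minutes",
    timeline=[("14:02", "Alert fired"), ("14:30", "Rollback completed")],
    root_cause="Connection pool exhausted after a config change",
    action_items=[
        ActionItem("Add monitoring for pool saturation"),
        ActionItem("Improve rollback runbook", done=True),
    ],
).render()

print(report)
```

Note that the fields describe systems and events, not people, which keeps the record consistent with the blameless approach.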
