Alerting Strategies

Designing effective alerting rules and escalation policies

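Effective alerting means the right people are notified about problems early, ideally before users are affected. Good alerts are actionable (someone can do something about them), specific (they say what is wrong and where), and few enough that each one still gets attention.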

Alert Severity Levels


  ┌────────────┬────────────────────────┬────────┐
  │  Severity  │  Example               │ Action │
  ├────────────┼────────────────────────┼────────┤
  │  Critical  │  Service is down       │ Page   │
  │  Warning   │  High error rate       │ Slack  │
  │  Info      │  Deployment completed  │ Log    │
  └────────────┴────────────────────────┴────────┘

  Critical → On-call engineer → Immediate response
  Warning  → Team channel    → Investigate soon
  Info     → Dashboard/Log   → Review later

Prometheus Alerting Rules


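  # alerts.yml
  groups:
  - name: application
    rules:
    # Fire when more than 5% of requests per service return 5xx for 5 minutes.
    # Dividing by the total request rate gives a ratio, not a raw count.
    - alert: HighErrorRate
      expr: |
        sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
        sum by (service) (rate(http_requests_total[5m]))
          > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        # $value is a ratio (0.05 = 5%); humanizePercentage renders it as a percentage.
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"

    # Fire when p99 latency per service stays above 2 seconds for 10 minutes.
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99,
          sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
        ) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High latency on {{ $labels.service }}"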
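
Rules only take effect once Prometheus loads the rule file and knows where to send firing alerts. The fragment below is a minimal sketch of the relevant part of prometheus.yml. It assumes alerts.yml sits next to the main config and that Alertmanager listens on localhost:9093 (its default port). Both are assumptions, so adjust them to your setup.

  # prometheus.yml (fragment)
  global:
    evaluation_interval: 1m        # how often alerting rules are evaluated

  rule_files:
    - alerts.yml                   # assumed path, relative to this file

  alerting:
    alertmanagers:
      - static_configs:
          - targets: ['localhost:9093']   # assumed Alertmanager address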

Alerting Best Practices

Alert on symptoms, not causes: alert on what users experience, such as high latency, not on CPU usage
Use multi-window alerting: combine short and long time windows so alerts catch sustained problems and clear quickly once they are resolved
Write a runbook for every alert: each alert should link to remediation steps
Reduce noise: group related alerts and use inhibition rules
Test your alerts: unit-test rules with promtool and check routing with Alertmanager's amtool
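
As an illustration of the multi-window and runbook practices, here is one possible rule. It is a sketch, not a recommended production setting. The alert name, the 1h/5m window pair, and the runbook URL are hypothetical, and the `service` label is assumed to exist on the metric.

  # alerts.yml (additional group, illustrative only)
  groups:
  - name: application-multiwindow
    rules:
    - alert: HighErrorRateSustained
      # Fires only when the error ratio exceeds 5% over BOTH windows:
      # the 1h window shows the problem is significant, and the 5m
      # window shows it is still happening right now.
      expr: |
        (
          sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
            / sum by (service) (rate(http_requests_total[1h])) > 0.05
        )
        and
        (
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.05
        )
      labels:
        severity: critical
      annotations:
        summary: "Sustained high error rate on {{ $labels.service }}"
        runbook_url: "https://runbooks.example.com/high-error-rate"   # hypothetical URL

Because the short window drops below the threshold soon after recovery, the alert stops firing quickly once the problem is fixed.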

Alert Flow


  Prometheus ──▶ Alertmanager ──▶ Notification
                    │
                    ├── Route by severity
                    ├── Group related alerts
                    ├── Inhibit low-priority
                    ├── Silence during maintenance
                    └── Send to:
                        ├── PagerDuty (Critical)
                        ├── Slack (Warning)
                        ├── Email (Info)
                        └── Webhook (Custom)
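
The flow above maps fairly directly onto an Alertmanager configuration file. The following is a minimal sketch under stated assumptions: the receiver names, email addresses, SMTP host, Slack channel, and the placeholder keys and URLs are all hypothetical and must be replaced with real values.

  # alertmanager.yml (illustrative sketch)
  global:
    smtp_smarthost: 'smtp.example.com:587'      # hypothetical SMTP server
    smtp_from: 'alertmanager@example.com'

  route:
    receiver: team-email                        # default: anything not matched below
    group_by: ['alertname', 'service']          # group related alerts together
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - matchers:
          - 'severity="critical"'
        receiver: oncall-pagerduty
      - matchers:
          - 'severity="warning"'
        receiver: team-slack

  # Suppress warnings for a service while a critical alert is firing for it.
  inhibit_rules:
    - source_matchers:
        - 'severity="critical"'
      target_matchers:
        - 'severity="warning"'
      equal: ['service']

  receivers:
    - name: oncall-pagerduty
      pagerduty_configs:
        - routing_key: '<pagerduty-integration-key>'   # placeholder
    - name: team-slack
      slack_configs:
        - api_url: '<slack-webhook-url>'               # placeholder
          channel: '#alerts'
    - name: team-email
      email_configs:
        - to: 'team@example.com'

Silences are not part of this file. They are created at runtime, for example through the Alertmanager web UI or amtool, when maintenance is planned.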
