Alerting Strategies
Effective alerting means being notified about problems before users are affected. Good alerts are actionable, specific, and not overwhelming.
Alert Severity Levels
┌──────────┬──────────────────────┬────────┐
│ Severity │ Example              │ Action │
├──────────┼──────────────────────┼────────┤
│ Critical │ Service is down      │ Page   │
│ Warning  │ High error rate      │ Slack  │
│ Info     │ Deployment completed │ Log    │
└──────────┴──────────────────────┴────────┘
Critical → On-call engineer → Immediate response
Warning  → Team channel     → Investigate soon
Info     → Dashboard/Log    → Review later
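
The severity mapping above could be expressed as an Alertmanager routing tree. This is a minimal sketch, not a complete configuration: the receiver names (default-log, oncall-pager, team-slack) are hypothetical, and it assumes every alert carries a `severity` label as in the rules in this section. The actual notification settings for each receiver are left out.

```yaml
# alertmanager.yml (sketch; receiver names are hypothetical)
route:
  receiver: default-log            # fallback for anything unmatched, e.g. info
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pager       # pages the on-call engineer
    - matchers:
        - severity="warning"
      receiver: team-slack         # posts to the team channel

receivers:
  - name: default-log              # no integrations configured: nothing is sent
  - name: oncall-pager
    # pagerduty_configs would go here
  - name: team-slack
    # slack_configs would go here
```

Routes are evaluated in order and the first match wins unless `continue: true` is set, so the more urgent severities are listed first.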
Prometheus Alerting Rules
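# alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # Fraction of requests answered with a 5xx status, per service.
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
      - alert: HighLatency
        # p99 latency per service; histogram buckets must be aggregated by le.
        expr: histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"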
Alerting Best Practices
- Alert on symptoms, not causes: alert on high latency, not on CPU usage
- Use multi-window alerting: combine short and long time windows
- Write a runbook for every alert: each alert should link to remediation steps
- Reduce noise: group related alerts and use inhibition rules
- Test your alerts: use tools like Prometheus's promtool and Alertmanager's amtool
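
To make the multi-window and runbook practices concrete, here is a hedged sketch of a rule that fires only when the error ratio exceeds the threshold over both a short (5m) and a long (1h) window. The window lengths, the 5% threshold, the alert name, and the runbook URL are illustrative assumptions, and the metrics are assumed to carry a `service` label. `runbook_url` is a common annotation naming convention, not a field Prometheus treats specially.

```yaml
# Sketch of a multi-window rule; thresholds and windows are assumptions.
groups:
  - name: application-multiwindow
    rules:
      - alert: HighErrorRateMultiWindow
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
              / sum by (service) (rate(http_requests_total[5m])) > 0.05
          )
          and
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
              / sum by (service) (rate(http_requests_total[1h])) > 0.05
          )
        labels:
          severity: critical
        annotations:
          summary: "Sustained high error rate on {{ $labels.service }}"
          # Hypothetical URL; point this at the real remediation steps.
          runbook_url: "https://runbooks.example.com/high-error-rate"
```

The long window keeps a brief spike from paging anyone, while the short window lets the alert resolve quickly once the errors stop.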
Alert Flow
Prometheus ──▶ Alertmanager ──▶ Notification
                    │
                    ├── Route by severity
                    ├── Group related alerts
                    ├── Inhibit low-priority
                    ├── Silence during maintenance
                    └── Send to:
                        ├── PagerDuty (Critical)
                        ├── Slack (Warning)
                        ├── Email (Info)
                        └── Webhook (Custom)
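
The grouping and inhibition steps in this flow are configured in Alertmanager itself. The fragment below is a sketch under stated assumptions: the timings are illustrative values, the `service` label is assumed to be present on alerts, and the receiver name `default` is hypothetical. Silences are not part of this file; they are created at runtime, for example through the Alertmanager UI or amtool.

```yaml
# alertmanager.yml fragment (sketch; label names and timings are assumptions)
route:
  receiver: default
  group_by: ['alertname', 'service']   # one notification per alert name and service
  group_wait: 30s        # wait briefly to batch alerts that fire together
  group_interval: 5m     # minimum gap before sending updates for a group
  repeat_interval: 4h    # re-notify if the group is still firing

inhibit_rules:
  # While a critical alert fires for a service, mute that service's warnings.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['service']

receivers:
  - name: default
```

The `equal` list limits inhibition to alerts that share the same `service` value, so a critical alert on one service does not hide warnings on another.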