Labs ICT
โญ Pro Login

Alerting Strategies

Designing effective alerting rules and escalation policies

Effective alerting means learning about problems early, ideally before users are affected. Good alerts are actionable, specific, and infrequent enough that each one gets attention.

Alert Severity Levels


  ┌────────────┬────────────────────────┬────────┐
  │  Severity  │  Example               │ Action │
  ├────────────┼────────────────────────┼────────┤
  │  Critical  │  Service is down       │ Page   │
  │  Warning   │  High error rate       │ Slack  │
  │  Info      │  Deployment completed  │ Log    │
  └────────────┴────────────────────────┴────────┘

  Critical → On-call engineer → Immediate response
  Warning  → Team channel     → Investigate soon
  Info     → Dashboard/Log    → Review later

Prometheus Alerting Rules


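  # alerts.yml
  groups:
  - name: application
    rules:
    # Fires when more than 5% of a service's requests return a 5xx status.
    # Dividing by the total request rate yields a ratio; the bare 5xx rate
    # would be requests per second, not a percentage.
    - alert: HighErrorRate
      expr: |
        sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
        sum by (service) (rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"

    # Fires when 99th-percentile latency stays above 2 seconds.
    # Buckets are summed per service, keeping the le label that
    # histogram_quantile needs.
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99,
          sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High latency on {{ $labels.service }}"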

Alerting Best Practices

  • Alert on symptoms, not causes: alert on high latency, not on CPU usage
  • Use multi-window alerting: combine short and long time windows so an alert fires only when a problem is both sustained and still happening
  • Write a runbook for every alert: each alert should link to remediation steps
  • Reduce noise: group related alerts and use inhibition rules
  • Test your alerts: use promtool to unit-test rules and Alertmanager's amtool to check routing
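
The multi-window and runbook practices above could be combined in a rule like the following. This is a minimal sketch, not a tuned production rule: it reuses the http_requests_total metric and the service label from the earlier rules, while the 1h and 5m window pair, the alert name, and the runbook URL are assumptions chosen for illustration. The runbook_url annotation is a widely used convention rather than something Prometheus interprets itself.

```yaml
# alerts-multiwindow.yml (hypothetical file name)
groups:
- name: application-multiwindow
  rules:
  - alert: HighErrorRateMultiWindow
    # Both windows must exceed the threshold. The 1h window shows the
    # problem is sustained; the 5m window shows it is still happening,
    # so the alert clears quickly once the service recovers.
    expr: |
      (
        sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
          /
        sum by (service) (rate(http_requests_total[1h])) > 0.05
      )
      and
      (
        sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
        sum by (service) (rate(http_requests_total[5m])) > 0.05
      )
    labels:
      severity: critical
    annotations:
      summary: "Sustained high error rate on {{ $labels.service }}"
      # Hypothetical URL: point this at your own remediation steps.
      runbook_url: "https://runbooks.example.com/high-error-rate"
```

Because the long window already filters out brief spikes, this sketch leaves out the for: clause; whether to add one back is a judgment call that depends on how noisy the service is.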

Alert Flow


  Prometheus ──▶ Alertmanager ──▶ Notification
                    │
                    ├── Route by severity
                    ├── Group related alerts
                    ├── Inhibit low-priority
                    ├── Silence during maintenance
                    └── Send to:
                        ├── PagerDuty (Critical)
                        ├── Slack (Warning)
                        ├── Email (Info)
                        └── Webhook (Custom)
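
One way to express the flow above as an Alertmanager configuration is sketched below. Treat it as an illustration under stated assumptions: the receiver names, SMTP host, email addresses, Slack channel, and the placeholder keys and URLs are all hypothetical, and the matchers syntax assumes a reasonably recent Alertmanager release (0.22 or later). The custom webhook receiver is left out to keep the example short.

```yaml
# alertmanager.yml
global:
  # Hypothetical SMTP settings, needed by the email receiver below.
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

route:
  # Default receiver: anything not matched below (e.g. info) goes to email.
  receiver: email-info
  # Group related alerts into a single notification.
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Route by severity.
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
    - matchers:
        - severity="warning"
      receiver: slack-warning

inhibit_rules:
  # While a critical alert fires for a service, suppress that
  # service's warning alerts.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['service']

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder
  - name: slack-warning
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"
        channel: "#alerts"
  - name: email-info
    email_configs:
      - to: "team@example.com"
```

Silences do not appear in this file because they are created at runtime, through the Alertmanager UI or amtool, for the duration of a maintenance window.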
