Cloud Monitoring
You can't fix what you can't see. Cloud monitoring gives you visibility into your infrastructure, applications, and costs. Without it, you're flying blind, discovering problems only when users complain.
The Three Pillars of Observability
┌────────────────────────────────────────────────────────┐
│                 OBSERVABILITY PILLARS                  │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │     LOGS     │  │   METRICS    │  │    TRACES    │  │
│  │              │  │              │  │              │  │
│  │ Event records│  │ Numeric data │  │ Request flow │  │
│  │ with context │  │ over time    │  │ across       │  │
│  │              │  │              │  │ services     │  │
│  │ "What        │  │ "How is      │  │ "Where is    │  │
│  │  happened"   │  │  the system  │  │  latency"    │  │
│  │              │  │  performing" │  │              │  │
│  │ CloudWatch   │  │ CloudWatch   │  │ X-Ray,       │  │
│  │ Logs,        │  │ Metrics,     │  │ Jaeger,      │  │
│  │ CloudTrail   │  │ Prometheus   │  │ Zipkin       │  │
│  │              │  │              │  │              │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│                                                        │
│  Together they answer: What happened, why, and where?  │
└────────────────────────────────────────────────────────┘
AWS CloudWatch
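CloudWatch is AWS's native monitoring service. It collects metrics from most AWS services automatically, stores logs, and lets you set alarms. Some metrics, such as memory usage on EC2 instances, are not collected by default and require the CloudWatch agent or custom metrics. Think of it as the nervous system of your cloud environment: it senses what is happening and signals when something needs attention.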
┌────────────────────────────────────────────────────┐
│               CLOUDWATCH COMPONENTS                │
│                                                    │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐  │
│  │  Metrics   │   │   Logs     │   │  Alarms    │  │
│  │            │   │            │   │            │  │
│  │ CPU Util%  │   │ Application│   │ Threshold  │  │
│  │ Memory     │   │ logs       │   │ breached?  │  │
│  │ Network I/O│   │ VPC Flow   │   │    ┌────┐  │  │
│  │ Custom     │   │ logs       │   │ ──▶│SNS │  │  │
│  │ metrics    │   │ CloudTrail │   │    └────┘  │  │
│  └────────────┘   └────────────┘   │ Send alert │  │
│                                    └────────────┘  │
│                                                    │
│  ┌────────────┐   ┌────────────┐                   │
│  │ Dashboards │   │  Events    │                   │
│  │            │   │            │                   │
│  │ Visualize  │   │ React to   │                   │
│  │ everything │   │ changes    │                   │
│  │ on one     │   │ (auto-     │                   │
│  │ screen     │   │ remediate) │                   │
│  └────────────┘   └────────────┘                   │
└────────────────────────────────────────────────────┘
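Custom metrics are the one component above that your own code has to feed. As a minimal sketch, assuming the boto3 SDK and a hypothetical `MyApp` namespace and `QueueDepth` metric (neither comes from this guide), the payload for CloudWatch's `put_metric_data` call could be built like this:

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, timestamp=None):
    """Build one entry for the MetricData list of put_metric_data."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        # Use a timezone-aware UTC timestamp; default to "now".
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }
    if dimensions:
        # Dimensions are name/value pairs used to filter and group metrics.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return datum


def publish(namespace, data):
    """Send the datums to CloudWatch. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical example values, for illustration only.
datum = build_metric_datum("QueueDepth", 42, dimensions={"Environment": "staging"})
# publish("MyApp", [datum])  # uncomment in an environment with AWS access
```

Keeping the payload builder separate from the network call makes the metric shape easy to unit-test without touching AWS.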
AWS CloudTrail
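CloudTrail records API activity in your AWS account: who did what, when, and from where. Management events are logged by default, while data events (such as S3 object-level reads and writes) must be enabled explicitly. It's your audit trail, essential for security investigations, compliance, and debugging access issues.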
API Call Flow with CloudTrail:
User/Service ──▶ AWS API ──▶ CloudTrail ──▶ S3 Bucket
                    │                           │
                    │                           ▼
                    │                ┌─────────────────────┐
                    │                │ Log File            │
                    │                │                     │
                    │                │ Event: RunInstances │
                    │                │ User: admin@com     │
                    │                │ Time: 2024-...      │
                    │                │ SourceIP: 1.2.3.4   │
                    │                └─────────────────────┘
                    ▼
           AWS performs action
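To show how the "who, what, when, where" questions map onto a log file, here is a rough sketch that summarizes CloudTrail records. It assumes the usual file layout (a JSON object with a `Records` list) and the standard field names `eventTime`, `eventSource`, `eventName`, `sourceIPAddress`, and `userIdentity`. The sample record and its values are invented for illustration, and the `summarize` helpers are hypothetical names.

```python
import json

# Invented sample in the shape of a CloudTrail log file. The values are
# placeholders, not real events.
SAMPLE_LOG = json.dumps({
    "Records": [
        {
            "eventTime": "2024-01-01T00:00:00Z",
            "eventSource": "ec2.amazonaws.com",
            "eventName": "RunInstances",
            "sourceIPAddress": "1.2.3.4",
            "userIdentity": {"type": "IAMUser", "userName": "admin"},
        }
    ]
})


def summarize(record):
    """Reduce one CloudTrail record to who / what / when / where."""
    identity = record.get("userIdentity", {})
    # Not every identity type carries a userName, so fall back gradually.
    who = identity.get("userName") or identity.get("arn") or identity.get("type", "unknown")
    return {
        "who": who,
        "what": f'{record.get("eventSource", "?")}:{record.get("eventName", "?")}',
        "when": record.get("eventTime"),
        "where": record.get("sourceIPAddress"),
    }


def summarize_log(text):
    """Summarize every record in one (already decompressed) log file."""
    return [summarize(record) for record in json.loads(text).get("Records", [])]


for entry in summarize_log(SAMPLE_LOG):
    print(entry)
```

Log files delivered to S3 are gzip-compressed, so a real pipeline would decompress each object before passing its text to `summarize_log`.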
Logging Strategy
A good logging strategy captures the right data at the right level. Too little logging and you can't debug issues. Too much and you're paying for noise. Aim for structured logs with consistent formats.
┌──────────────────────────────────────────────────┐
│                 LOG LEVELS GUIDE                 │
│                                                  │
│  Level  When to Use            Retention         │
│  ─────  ───────────            ─────────         │
│  ERROR  Failures that need     Long-term         │
│         immediate attention                      │
│                                                  │
│  WARN   Unexpected but not     Medium-term       │
│         critical                                 │
│                                                  │
│  INFO   Key business events    Short-term        │
│         and milestones                           │
│                                                  │
│  DEBUG  Detailed diagnostic    Very short        │
│         info (dev/staging)     (hours/days)      │
│                                                  │
│  TRACE  Everything             Debug only        │
│                                                  │
│  Best Practice:                                  │
│  Production:  ERROR + WARN + INFO                │
│  Staging:     + DEBUG                            │
│  Dev:         + TRACE                            │
└──────────────────────────────────────────────────┘
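As one possible way to apply the guide above, the sketch below emits structured JSON logs and picks the minimum level from the environment name. The `LEVEL_BY_ENV` mapping, the `JsonFormatter` class, and the `context` field are illustrative assumptions, not a prescribed setup. Python's `logging` module has no built-in TRACE level, so this sketch uses DEBUG for dev as well.

```python
import io
import json
import logging

# Hypothetical mapping from environment to minimum level, following the guide.
LEVEL_BY_ENV = {
    "production": logging.INFO,   # ERROR + WARN + INFO
    "staging": logging.DEBUG,     # + DEBUG
    "dev": logging.DEBUG,         # no built-in TRACE in Python's logging
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become record attributes.
        context = getattr(record, "context", None)
        if context:
            payload.update(context)
        return json.dumps(payload)


def build_logger(env, stream):
    """Create a logger for `env` that writes JSON lines to `stream`."""
    logger = logging.getLogger(f"app.{env}")
    logger.setLevel(LEVEL_BY_ENV[env])
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("production", buffer)
log.debug("cache miss")  # dropped: below INFO in production
log.info("order placed", extra={"context": {"order_id": "A1"}})
print(buffer.getvalue(), end="")
```

Consistent field names such as `level` and `message` are what make these logs filterable once they are stored centrally.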
Alerting and Dashboards
Alerts notify you when something goes wrong. Dashboards show you the big picture. Together they keep you informed and in control. The key is setting alerts that matter: too many false alarms and you'll start ignoring them.
┌────────────────────────────────────────────────────┐
│                 DASHBOARD EXAMPLE                  │
│                                                    │
│  CPU Utilization  Memory Usage     Request Count   │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐  │
│  │    ╱╲      │   │            │   │  ███████   │  │
│  │   ╱  ╲     │   │ ▄▄▄▄▄▄▄▄▄▄ │   │  ███████   │  │
│  │  ╱    ╲╱╲  │   │ ██████████ │   │  ███████   │  │
│  │ ╱        ╲ │   │ ██████████ │   │  ███████   │  │
│  └────────────┘   └────────────┘   └────────────┘  │
│  Current: 62%     Current: 4.1GB   Current: 1.2k   │
│  Status: OK       Status: OK       Status: OK      │
│                                                    │
│  Error Rate       Latency (p99)    Cost Today      │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐  │
│  │            │   │   ╱────╲   │   │  $142.50   │  │
│  │ ────────── │   │  ╱      ╲  │   │ Budget:$200│  │
│  │            │   │ ╱        ╲ │   │ ███████░░░ │  │
│  └────────────┘   └────────────┘   └────────────┘  │
│  Current: 0.1%    Current: 234ms   Current: 71%    │
│  Status: OK       Status: OK       Status: OK      │
└────────────────────────────────────────────────────┘
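One way to cut false alarms is to require several breaching datapoints before an alarm fires. The sketch below builds parameters for CloudWatch's `put_metric_alarm` call against an Application Load Balancer's p99 `TargetResponseTime`. The alarm name, load balancer identifier, SNS topic ARN, and 0.5 second threshold are hypothetical values chosen for illustration.

```python
def build_latency_alarm(alarm_name, load_balancer, sns_topic_arn, threshold_seconds=0.5):
    """Build keyword arguments for put_metric_alarm on p99 latency."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "ExtendedStatistic": "p99",
        "Period": 60,                 # seconds per datapoint
        "EvaluationPeriods": 5,       # look at the last 5 datapoints
        "DatapointsToAlarm": 3,       # fire only if 3 of those 5 breach
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create the alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical identifiers, for illustration only.
params = build_latency_alarm(
    "checkout-p99-latency-high",
    "app/checkout-alb/0123456789abcdef",
    "arn:aws:sns:us-east-1:123456789012:oncall",
)
# create_alarm(params)  # uncomment in an environment with AWS access
```

The "3 out of 5" setting tolerates a single slow minute while still catching sustained degradation; the right numbers depend on your traffic.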
Monitoring Best Practices
[x] Monitor the four golden signals: latency, traffic,
errors, and saturation
[x] Set up alerts for symptoms (high latency) not just
causes (high CPU)
[x] Use composite alarms to reduce noise (alert only when
multiple conditions are true)
[x] Store logs centrally (CloudWatch Logs, OpenSearch)
[x] Retain metrics for at least 13 months for
    year-over-year trending
[x] Use synthetic monitoring to catch issues before users do
[x] Tag everything so you can filter dashboards by
environment, service, or team
[x] Run chaos experiments to validate your monitoring
actually detects failures
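
The composite alarm item in the checklist above can be sketched as follows. CloudWatch composite alarms take an `AlarmRule` expression that combines the states of other alarms. The child alarm names and SNS topic ARN here are hypothetical, and `all_in_alarm` is an illustrative helper rather than part of any SDK.

```python
def all_in_alarm(*alarm_names):
    """Return an AlarmRule that is true only when every named alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    for name in alarm_names:
        if '"' in name:
            raise ValueError(f"alarm name must not contain a double quote: {name!r}")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def build_composite_alarm(name, child_alarms, sns_topic_arn):
    """Build keyword arguments for put_composite_alarm."""
    return {
        "AlarmName": name,
        "AlarmRule": all_in_alarm(*child_alarms),
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical names: page only when latency AND error rate are both alarming.
composite = build_composite_alarm(
    "checkout-degraded",
    ["checkout-p99-latency-high", "checkout-error-rate-high"],
    "arn:aws:sns:us-east-1:123456789012:oncall",
)
print(composite["AlarmRule"])
# With boto3 and credentials:
#   boto3.client("cloudwatch").put_composite_alarm(**composite)
```

Routing notifications only from the composite alarm, and leaving the child alarms without actions, is what reduces the noise.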