Disaster Recovery in Cloud

Planning for failures and recovering from disasters

Disaster recovery (DR) is about planning for the worst. Whether the cause is a hardware failure, a natural disaster, or a cyberattack, a solid DR plan lets your business recover quickly with minimal data loss.

Key DR Metrics


  ┌──────────────────────────────────────────────────────┐
  │  RTO (Recovery Time Objective)                       │
  │  ─────────────────────────────                       │
  │  Maximum acceptable time to restore service          │
  │  "How fast do we need to be back up?"                │
  │                                                      │
  │  RPO (Recovery Point Objective)                      │
  │  ─────────────────────────────                       │
  │  Maximum acceptable data loss (measured in time)     │
  │  "How much data can we afford to lose?"              │
  │                                                      │
  │  Example:                                            │
  │  RTO = 4 hours, RPO = 1 hour                         │
  │  → Must recover within 4 hours, losing max 1 hour    │
  │    of data                                           │
  └──────────────────────────────────────────────────────┘
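
To make the two metrics concrete, the sketch below checks a single incident against the example targets from the box above. The function name `meets_objectives` and all timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Targets from the example above: RTO = 4 hours, RPO = 1 hour.
RTO = timedelta(hours=4)
RPO = timedelta(hours=1)


def meets_objectives(last_backup, outage_start, service_restored,
                     rto=RTO, rpo=RPO):
    """Return (rto_met, rpo_met) for a single incident.

    Downtime is measured from the outage to the restored service.
    Data loss is measured from the last recoverable copy to the outage.
    """
    downtime = service_restored - outage_start
    data_loss = outage_start - last_backup
    return downtime <= rto, data_loss <= rpo


# Hypothetical incident: backup at 09:30, outage at 10:00, restored at 13:15.
result = meets_objectives(
    last_backup=datetime(2024, 1, 1, 9, 30),
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 13, 15),
)
print(result)  # (True, True): 3h15m of downtime, 30 minutes of lost data
```

Note that the two checks are independent: a recovery can be fast enough to meet the RTO and still lose more data than the RPO allows.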

DR Strategies


  ┌──────────────────┬─────────┬─────────┬─────────────┐
  │  Strategy        │  RTO    │  RPO    │  Cost       │
  ├──────────────────┼─────────┼─────────┼─────────────┤
  │  Backup &        │  Hours  │  Hours  │  Lowest     │
  │  Restore         │         │         │             │
  ├──────────────────┼─────────┼─────────┼─────────────┤
  │  Pilot Light     │  10s of │  Minutes│  Moderate   │
  │                  │  mins   │         │             │
  ├──────────────────┼─────────┼─────────┼─────────────┤
  │  Warm Standby    │  Minutes│  Seconds│  Higher     │
  │                  │         │         │             │
  ├──────────────────┼─────────┼─────────┼─────────────┤
  │  Multi-Site      │  Near   │  Near   │  Highest    │
  │  Active-Active   │  zero   │  zero   │             │
  └──────────────────┴─────────┴─────────┴─────────────┘
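
One way to read the table is as a lookup: pick the cheapest strategy that still fits your targets. The sketch below does that in Python. The numeric bounds are assumptions that only approximate the table's "Hours", "10s of mins", and "Near zero" entries, and the helper `cheapest_strategy` is hypothetical; real figures depend on your workload and should come from measured drills.

```python
from datetime import timedelta

# (name, assumed RTO bound, assumed RPO bound), ordered cheapest first.
# These numbers are illustrative assumptions, not guarantees.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(minutes=60), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=15), timedelta(seconds=60)),
    ("Multi-Site Active-Active", timedelta(seconds=60), timedelta(seconds=5)),
]


def cheapest_strategy(rto_target, rpo_target):
    """Return the first (cheapest) strategy whose assumed bounds fit both targets."""
    for name, rto_bound, rpo_bound in STRATEGIES:
        if rto_bound <= rto_target and rpo_bound <= rpo_target:
            return name
    return None  # no listed strategy is fast enough


# Targets from the earlier example: RTO = 4 hours, RPO = 1 hour.
print(cheapest_strategy(timedelta(hours=4), timedelta(hours=1)))  # Pilot Light
```

Under these assumed bounds, Backup & Restore is rejected because it cannot guarantee a 4-hour recovery, so the next cheapest option is chosen.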

Backup and Restore

The simplest DR strategy: take regular backups (for example EBS snapshots or RDS automated backups) and restore them when disaster strikes. It suits non-critical workloads with lenient RTO/RPO requirements.
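
With this strategy, the backup schedule itself sets the data loss you can expect: a failure just before the next backup loses everything written since the previous one. The sketch below measures that worst-case gap from a list of backup timestamps. The function name and the schedule are hypothetical.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times):
    """Return the largest gap between consecutive backups.

    A failure just before the next backup loses everything written since
    the previous one, so this gap bounds the data loss for the schedule.
    """
    ordered = sorted(backup_times)
    if len(ordered) < 2:
        raise ValueError("need at least two backups to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical schedule: a nightly backup, with one missed run on 3 January.
backups = [
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 2, 2, 0),
    datetime(2024, 1, 4, 2, 0),
]
gap = worst_case_data_loss(backups)
print(gap)                         # 2 days, 0:00:00
print(gap <= timedelta(hours=24))  # False: the missed run breaks a 24-hour RPO
```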

Pilot Light

Keep only the core of your environment, typically the data layer, running in another region. When disaster strikes, launch the remaining components and scale up. Example: a continuously replicated database plus pre-configured AMIs ready to launch EC2 instances.

Warm Standby

Run a scaled-down but fully functional copy of your environment in another region and scale it up when disaster strikes. Because the application is already running, recovery is faster than with Pilot Light.

Multi-Site Active-Active

Run full copies of your environment in multiple regions, all serving production traffic simultaneously. This gives the best RTO and RPO, at the highest cost.

Testing Your DR Plan

A DR plan you haven't tested is just a document. Regularly simulate failures, practice failover procedures, and time your recovery against your RTO. On AWS, Fault Injection Service (formerly Fault Injection Simulator) supports chaos engineering experiments. Document lessons learned and improve continuously.
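
Timing a drill can be as simple as wrapping the recovery procedure in a stopwatch and comparing the result to the RTO. The sketch below is a minimal illustration: `run_drill` and `simulated_failover` are hypothetical names, and the sleep stands in for real steps such as promoting a replica or switching DNS.

```python
import time
from datetime import timedelta


def run_drill(recover, rto):
    """Run a recovery procedure, time it, and compare the result to the RTO."""
    start = time.monotonic()
    recover()
    elapsed = timedelta(seconds=time.monotonic() - start)
    return {"elapsed": elapsed, "rto": rto, "passed": elapsed <= rto}


def simulated_failover():
    # Placeholder for real recovery steps; here it only waits briefly.
    time.sleep(0.1)


report = run_drill(simulated_failover, rto=timedelta(hours=4))
print(report["passed"], report["elapsed"])
```

Keeping the report from each drill gives you a record of how recovery time changes as the environment and the runbook evolve.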
