High Availability and Fault Tolerance

Designing systems that stay up even when things fail

High Availability (HA) means a system is operational for a high percentage of the time, usually expressed as a number of "nines". Fault Tolerance means the system keeps operating when individual components fail. Together, they reduce the chance that a single failure takes your application down.

Key Concepts


  ┌──────────────────────────────────────────────────────┐
  │  Availability = Uptime / Total Time × 100            │
  │                                                      │
  │  99.9%   = 8.76 hours downtime/year   (Three 9s)     │
  │  99.99%  = 52.6 minutes downtime/year (Four 9s)      │
  │  99.999% = 5.26 minutes downtime/year (Five 9s)      │
  │                                                      │
  │  Each extra "9" cuts allowed downtime by 10x and     │
  │  typically costs far more to achieve                 │
  └──────────────────────────────────────────────────────┘
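The downtime figures in the box follow directly from the availability formula. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `downtime_hours_per_year` is made up for this example):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # Downtime is the fraction of the year the system is allowed to be down.
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

Turning a target into a concrete downtime budget like this makes it easier to judge whether a planned maintenance window or a slow failover would already use it up.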

AWS High Availability Architecture


  ┌──────────────────────────────────────────────────────┐
  │                                                      │
  │           Region: us-east-1                          │
  │  ┌─────────────────┐    ┌─────────────────┐         │
  │  │  AZ-a           │    │  AZ-b           │         │
  │  │  ┌───────────┐  │    │  ┌───────────┐  │         │
  │  │  │ ALB node  │  │    │  │ ALB node  │  │         │
  │  │  │  (active) │  │    │  │  (active) │  │         │
  │  │  └─────┬─────┘  │    │  └─────┬─────┘  │         │
  │  │  ┌─────┴─────┐  │    │  ┌─────┴─────┐  │         │
  │  │  │  EC2      │  │    │  │  EC2      │  │         │
  │  │  │  Instance │  │    │  │  Instance │  │         │
  │  │  └───────────┘  │    │  └───────────┘  │         │
  │  │  ┌───────────┐  │    │  ┌───────────┐  │         │
  │  │  │   RDS     │  │◄───│─▶│   RDS     │  │         │
  │  │  │  Primary  │  │sync│  │  Standby  │  │         │
  │  │  └───────────┘  │    │  └───────────┘  │         │
  │  └─────────────────┘    └─────────────────┘         │
  │                                                      │
  │  If AZ-a fails → ALB routes to AZ-b                 │
  │  RDS fails over to standby automatically             │
  └──────────────────────────────────────────────────────┘
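One rough way to see why the diagram duplicates each tier across two AZs is to combine component availabilities. The sketch below assumes failures are independent, which real systems only approximate, and uses made-up per-component figures rather than any AWS SLA numbers. The helper names `parallel` and `serial` are hypothetical.

```python
def parallel(*availabilities: float) -> float:
    """Redundant components: the group is up if at least one member is up."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a  # probability that every member is down at once
    return 1.0 - all_down


def serial(*availabilities: float) -> float:
    """Chained dependencies: the chain is up only if every link is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figure: each single instance is up 99% of the time.
single_instance = 0.99

# One tier duplicated across two AZs (assumes independent failures).
two_az_tier = parallel(single_instance, single_instance)

# A request needs both the app tier and the database tier.
single_az_stack = serial(single_instance, single_instance)
two_az_stack = serial(two_az_tier, two_az_tier)

print(f"single-AZ stack: {single_az_stack:.4%}")
print(f"two-AZ stack:    {two_az_stack:.4%}")
```

Under these assumptions, redundancy raises the availability of each tier, while every extra dependency in the request path lowers the total, which is why each tier in the diagram is duplicated rather than just one.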

Strategies for High Availability

Redundancy: Run multiple instances of everything. If one fails, another takes over.

Multi-AZ: Spread resources across availability zones within a region.

Multi-Region: Deploy across regions for disaster recovery and global performance.

Health Checks: Automatically detect unhealthy instances so they can be taken out of rotation and replaced.

Load Balancing: Distribute traffic so no single instance is a bottleneck.
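Health checks and load balancing work together: the balancer only sends traffic to targets that are passing their checks. The following is a simplified, in-memory sketch of that idea, not how any particular load balancer is implemented. The class names, the `unhealthy_threshold` parameter, and the rule that a single passing check restores a target are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    consecutive_failures: int = 0
    healthy: bool = True


class HealthAwareBalancer:
    """Round-robin over targets, skipping any that are marked unhealthy."""

    def __init__(self, targets, unhealthy_threshold=3):
        self.targets = list(targets)
        self.unhealthy_threshold = unhealthy_threshold
        self._next = 0  # index of the next target to try

    def record_check(self, target, passed):
        """Update a target's state from one health check result."""
        if passed:
            # Simplification: one passing check puts the target back in rotation.
            target.consecutive_failures = 0
            target.healthy = True
        else:
            target.consecutive_failures += 1
            if target.consecutive_failures >= self.unhealthy_threshold:
                target.healthy = False

    def pick(self):
        """Return the next healthy target, or raise if none are healthy."""
        for _ in range(len(self.targets)):
            target = self.targets[self._next % len(self.targets)]
            self._next += 1
            if target.healthy:
                return target
        raise RuntimeError("no healthy targets")


lb = HealthAwareBalancer([Target("az-a"), Target("az-b")], unhealthy_threshold=2)
print([lb.pick().name for _ in range(4)])  # traffic alternates between AZs

az_a = lb.targets[0]
lb.record_check(az_a, passed=False)
lb.record_check(az_a, passed=False)  # second failure reaches the threshold
print([lb.pick().name for _ in range(3)])  # all traffic now goes to az-b
```

Requiring several consecutive failures before removing a target trades slower detection for fewer false alarms from a single dropped check.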

Common Anti-Patterns

Running everything in a single AZ, which makes that AZ a single point of failure.

Not testing failover procedures.

Assuming a single database instance is enough.

Not monitoring system health.

Ignoring DNS TTLs: a long TTL keeps clients pointed at a failed endpoint, which slows failover.
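To see why DNS TTL matters for failover, it helps to estimate how long clients might keep using a failed endpoint. The sketch below is a rough upper bound under stated assumptions: failure is detected after a fixed number of failed health checks, the DNS record is updated immediately after detection, and clients honor the TTL. The function name and the example values are hypothetical.

```python
def worst_case_failover_seconds(check_interval_s: float,
                                failure_threshold: int,
                                dns_ttl_s: float) -> float:
    """Estimate the longest time a client may keep hitting a failed endpoint."""
    # Time for the health checker to declare the endpoint unhealthy.
    detection = check_interval_s * failure_threshold
    # A client that cached the old record just before the update
    # keeps using it for up to one full TTL.
    return detection + dns_ttl_s


# Same health check settings, two different TTLs (illustrative values).
long_ttl = worst_case_failover_seconds(30, 3, 300)
short_ttl = worst_case_failover_seconds(30, 3, 60)
print(f"TTL 300s: up to {long_ttl:.0f}s; TTL 60s: up to {short_ttl:.0f}s")
```

The estimate ignores the time needed to promote a standby and any resolvers that cache longer than the TTL, so treat it as a planning aid rather than a guarantee.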
