High Availability and Fault Tolerance
High Availability (HA) means your system remains operational for a high percentage of time. Fault Tolerance means the system continues operating even when components fail. Together, they ensure your applications stay up when things go wrong.
Key Concepts
┌──────────────────────────────────────────────────────┐
│  Availability = Uptime / Total Time × 100            │
│                                                      │
│  99.9%   = 8.76 hours downtime/year   (Three 9s)     │
│  99.99%  = 52.6 minutes downtime/year (Four 9s)      │
│  99.999% = 5.26 minutes downtime/year (Five 9s)      │
│                                                      │
│  Each "9" is 10x more expensive to achieve           │
└──────────────────────────────────────────────────────┘
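The downtime figures follow directly from the availability formula. Here is a minimal sketch of that arithmetic, assuming a 365-day year; the function name `downtime_hours_per_year` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The unavailable fraction of the year is whatever the target leaves over.
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

You can use the same function to turn any availability target into a concrete downtime budget before you commit to it.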
AWS High Availability Architecture
┌──────────────────────────────────────────────────────┐
│                                                      │
│  Region: us-east-1                                   │
│    ┌─────────────────┐        ┌─────────────────┐    │
│    │      AZ-a       │        │      AZ-b       │    │
│    │  ┌───────────┐  │        │  ┌───────────┐  │    │
│    │  │ ALB node  │  │        │  │ ALB node  │  │    │
│    │  │ (active)  │  │        │  │ (active)  │  │    │
│    │  └─────┬─────┘  │        │  └─────┬─────┘  │    │
│    │  ┌─────┴─────┐  │        │  ┌─────┴─────┐  │    │
│    │  │    EC2    │  │        │  │    EC2    │  │    │
│    │  │ Instance  │  │        │  │ Instance  │  │    │
│    │  └───────────┘  │        │  └───────────┘  │    │
│    │  ┌───────────┐  │        │  ┌───────────┐  │    │
│    │  │    RDS    │  │───────▶│  │    RDS    │  │    │
│    │  │  Primary  │  │  sync  │  │  Standby  │  │    │
│    │  └───────────┘  │        │  └───────────┘  │    │
│    └─────────────────┘        └─────────────────┘    │
│                                                      │
│  If AZ-a fails → ALB routes to AZ-b                  │
│  RDS fails over to standby automatically             │
└──────────────────────────────────────────────────────┘
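To see why the second AZ matters, here is a rough calculation of combined availability for redundant AZs. It is a sketch under a strong assumption: AZ failures are independent and failover is instant, which real systems only approximate. The 99% per-AZ figure is a made-up input, not an AWS number.

```python
def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Availability of redundant AZs, assuming independent failures.

    The system counts as down only when every AZ is down at the same time.
    """
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("per_az_availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    return 1 - (1 - per_az_availability) ** az_count


single_az = 0.99  # hypothetical per-AZ availability, for illustration only
for count in (1, 2, 3):
    print(f"{count} AZ(s): {combined_availability(single_az, count):.6f}")
```

Correlated failures, such as a bad deployment pushed to both AZs, break the independence assumption, so treat the result as an upper bound rather than a prediction.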
Strategies for High Availability
Redundancy: Run multiple instances of everything. If one fails, another takes over.
Multi-AZ: Spread resources across availability zones within a region.
Multi-Region: Deploy across regions for disaster recovery and global performance.
Health Checks: Automatically detect and replace unhealthy instances.
Load Balancing: Distribute traffic so no single instance is a bottleneck.
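Health checks and load balancing work together: the balancer sends traffic only to targets that pass their checks. The sketch below is a toy round-robin balancer, not how an ALB is implemented. The `Target` and `RoundRobinBalancer` names and the threshold of 2 are assumptions for illustration, and a single passing check restores a target here, whereas real load balancers usually require several.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0


class RoundRobinBalancer:
    def __init__(self, targets, unhealthy_threshold=2):
        self.targets = list(targets)
        self.unhealthy_threshold = unhealthy_threshold
        self._next = 0  # index of the next target to try

    def record_health_check(self, target, passed):
        """Update a target's health from one health check result."""
        if passed:
            target.consecutive_failures = 0
            target.healthy = True
        else:
            target.consecutive_failures += 1
            if target.consecutive_failures >= self.unhealthy_threshold:
                target.healthy = False

    def pick(self):
        """Return the next healthy target in round-robin order."""
        for _ in range(len(self.targets)):
            target = self.targets[self._next]
            self._next = (self._next + 1) % len(self.targets)
            if target.healthy:
                return target
        raise RuntimeError("no healthy targets available")


balancer = RoundRobinBalancer([Target("ec2-az-a"), Target("ec2-az-b")])
az_a = balancer.targets[0]

# Two failed checks in a row take the AZ-a instance out of rotation.
balancer.record_health_check(az_a, passed=False)
balancer.record_health_check(az_a, passed=False)
print([balancer.pick().name for _ in range(3)])
```

With the AZ-a instance marked unhealthy, every request in the example goes to the AZ-b instance. That is the redundancy strategy in action: the second instance absorbs traffic until the first one passes a check again.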
Common Anti-Patterns
Running everything in a single AZ, which makes that AZ a single point of failure.
Not testing failover procedures.
Assuming a single database instance is enough.
Not monitoring system health.
Ignoring DNS TTLs: long TTLs keep clients resolving to a failed endpoint and slow down failover.
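The DNS TTL point is easiest to see with numbers. This is a back-of-the-envelope sketch, assuming failover time is roughly the health check detection time plus the DNS TTL. It ignores check timeouts and resolvers that cache longer than the TTL, and all parameter values are hypothetical.

```python
def worst_case_failover_seconds(
    check_interval_s: int, failure_threshold: int, dns_ttl_s: int
) -> int:
    """Rough upper bound on how long clients may keep hitting a failed endpoint."""
    if min(check_interval_s, failure_threshold, dns_ttl_s) < 0:
        raise ValueError("all inputs must be non-negative")
    # Time for enough consecutive health checks to fail.
    detection_s = check_interval_s * failure_threshold
    # Clients may keep using a cached DNS answer until its TTL expires.
    return detection_s + dns_ttl_s


# Hypothetical settings: slow checks with a long TTL vs. fast checks with a short TTL.
print(worst_case_failover_seconds(30, 3, 300))
print(worst_case_failover_seconds(10, 3, 60))
```

Shorter TTLs and check intervals reduce the window, at the cost of more DNS queries and more health check traffic.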