AWS US-EAST-1 Outage Disrupts Major Apps; Amazon Cites DNS Issue With DynamoDB (Now Resolved)
A widespread Amazon Web Services outage in the US-EAST-1 region on October 20 (ET) temporarily knocked many popular apps and services offline, including Snapchat, Venmo, Lyft, Fortnite and even Amazon’s own Alexa. Amazon says the root trigger was a DNS resolution issue affecting DynamoDB endpoints, which caused elevated error rates and cascading service failures. By mid-afternoon, AWS reported services had returned to normal, with backlogs clearing into the evening.
Timeline highlights
- 3:11 AM ET: AWS reports increased error rates and latencies across multiple services in US-EAST-1.
- ~5:01 AM ET: Root cause identified as DNS resolution issues for DynamoDB APIs; mitigations begin.
- 6:35 AM ET: DNS issue mitigated; residual impacts persist, notably with new EC2 instance launches.
- 8:48–10:14 AM ET: Elevated API errors continue; AWS rate-limits new EC2 instance launches to aid recovery.
- 3:01 PM ET: AWS says services are back to normal operations, with queues/backlogs processing thereafter.
- 6:53 PM ET: Amazon confirms resolution of widespread errors and latencies.
What happened and why it mattered
US-EAST-1 is one of AWS’s most heavily used regions, so outages there ripple across a large share of the internet. DNS failures prevented clients from reliably reaching DynamoDB endpoints, effectively separating many applications from their data and control planes for hours. Incidents like this highlight the concentration risk of relying on a small number of hyperscale providers and a single region for critical workloads.
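In application code, a failure like this typically surfaces as an endpoint/connection error from the SDK rather than a normal service error. A minimal sketch, assuming boto3 and a purely illustrative table name, of how a DNS-level failure on the DynamoDB endpoint shows up to a Python client:

```python
# Hypothetical sketch: the table name and key are illustrative, not from the incident.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

try:
    dynamodb.get_item(
        TableName="example-table",      # placeholder table name
        Key={"pk": {"S": "user#123"}},  # placeholder key
    )
except EndpointConnectionError as err:
    # Raised when the SDK cannot reach dynamodb.us-east-1.amazonaws.com at all,
    # e.g. because DNS resolution for the endpoint fails.
    print(f"Endpoint unreachable: {err}")
except ClientError as err:
    # Server-side problems (throttling, internal errors) arrive here instead.
    print(f"DynamoDB returned an error: {err}")
```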
Broader context
- AWS held about 30% of the global cloud infrastructure market as of mid-2025.
- Knock-on effects extended to services like EC2, where new instance launches were temporarily rate-limited during recovery.
Builder takeaways
- Map and reduce single-region dependencies (e.g., DynamoDB, Lambda, ECS), and evaluate multi-region architectures where your RTO/RPO targets require it (see the multi-region read sketch below).
- Avoid hard-coding deployments to specific Availability Zones; design for flexible placement and failover (see the multi-AZ launch sketch below).
- Implement exponential backoff, circuit breakers and graceful degradation to handle upstream errors (see the backoff and circuit-breaker sketch below).
- Review DNS and resolver caching strategies, and continuously test disaster recovery and chaos drills (see the resolver sketch below).
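For the first takeaway, here is a minimal read-path sketch that prefers US-EAST-1 but fails over to a replica region. It assumes boto3 and a DynamoDB Global Table replicated to us-west-2; the table and key names are hypothetical placeholders, and it covers reads only, since multi-region writes bring their own conflict-resolution tradeoffs.

```python
# Read-path sketch assuming a Global Table replicated to us-west-2.
# Table/key names are hypothetical placeholders.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then replica(s)

_clients = {
    region: boto3.client(
        "dynamodb",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
    )
    for region in REGIONS
}

def get_item_multi_region(table: str, key: dict):
    """Try each region in order and return the first successful result."""
    last_error = None
    for region in REGIONS:
        try:
            resp = _clients[region].get_item(TableName=table, Key=key)
            return resp.get("Item"), region
        except (BotoCoreError, ClientError) as err:
            last_error = err  # endpoint unreachable or erroring; try the next region
    raise last_error

# Example (placeholder table/key):
# item, served_from = get_item_multi_region("example-profiles", {"pk": {"S": "user#123"}})
```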
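On Availability Zones, one pattern is to keep a list of subnets spread across AZs and let the launch path try each in turn rather than pinning to a single zone. The sketch below assumes boto3; the subnet and AMI IDs are placeholders.

```python
# Launch sketch: subnet IDs (one per AZ) and the AMI ID are placeholders.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

SUBNETS_ACROSS_AZS = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]

def launch_anywhere(image_id: str, instance_type: str = "t3.micro") -> str:
    """Try each subnet (hence each AZ) until a launch succeeds."""
    last_error = None
    for subnet_id in SUBNETS_ACROSS_AZS:
        try:
            resp = ec2.run_instances(
                ImageId=image_id,
                InstanceType=instance_type,
                SubnetId=subnet_id,
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            # e.g. capacity or throttling problems in that AZ; try the next one.
            last_error = err
    raise last_error

# Example (placeholder AMI):
# instance_id = launch_anywhere("ami-0123456789abcdef0")
```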
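The retry guidance can be as simple as wrapping dependency calls in exponential backoff with jitter behind a basic circuit breaker. This is a sketch, not a production library: thresholds, table and key names are illustrative, and note that boto3 also ships its own "standard" and "adaptive" retry modes.

```python
# Resilience sketch: thresholds, table and key names are illustrative.
import random
import time

import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Cap the SDK's own retries so this wrapper controls total attempt time.
dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 2, "mode": "standard"}),
)

FAILURE_THRESHOLD = 5   # consecutive failures before the breaker opens
COOLDOWN_SECONDS = 30   # how long the breaker stays open
_failures = 0
_opened_at = 0.0

def _circuit_open() -> bool:
    return _failures >= FAILURE_THRESHOLD and time.time() - _opened_at < COOLDOWN_SECONDS

def get_item_resilient(table: str, key: dict, max_attempts: int = 5):
    """Backoff-with-jitter retries behind a simplified circuit breaker."""
    global _failures, _opened_at
    if _circuit_open():
        return None  # degrade gracefully instead of hammering a failing dependency
    for attempt in range(max_attempts):
        try:
            item = dynamodb.get_item(TableName=table, Key=key).get("Item")
            _failures = 0
            return item
        except (BotoCoreError, ClientError):
            _failures += 1
            _opened_at = time.time()
            if attempt == max_attempts - 1:
                return None  # caller falls back to cached or default data
            # Exponential backoff with full jitter: 0..(0.2 * 2^attempt) seconds.
            time.sleep(random.uniform(0, 0.2 * 2 ** attempt))

# Example (placeholder table/key): returns None if DynamoDB stays unreachable.
# profile = get_item_resilient("example-profiles", {"pk": {"S": "user#123"}})
```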
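On DNS, a useful first step is observing the TTLs your resolvers actually see for critical endpoints and deciding what clients should do when resolution fails. A minimal sketch, assuming the dnspython library; the last-known-good fallback is illustrative, and whether serving stale addresses is safe depends on the failure mode.

```python
# DNS observability sketch: the endpoint name is real, the fallback policy is illustrative.
import dns.exception
import dns.resolver

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

resolver = dns.resolver.Resolver()
resolver.lifetime = 2.0            # total seconds to spend on a lookup
_last_known_good: list[str] = []   # addresses from the most recent success

def resolve_with_fallback(name: str) -> list[str]:
    global _last_known_good
    try:
        answer = resolver.resolve(name, "A")
        addrs = [rr.address for rr in answer]
        print(f"{name}: TTL={answer.rrset.ttl}s addresses={addrs}")
        _last_known_good = addrs
        return addrs
    except dns.exception.DNSException:
        # Resolution failed or timed out; fall back to the last answer seen, if any.
        print(f"{name}: resolution failed, using last-known-good {_last_known_good}")
        return _last_known_good

resolve_with_fallback(ENDPOINT)
```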
Discussion: Will this incident push more teams to adopt multi-region or multi-cloud strategies, or are the costs and complexity still too high for most workloads?
