AWS us-east-1 Outage Disrupted Major Apps; Amazon Cites DNS Issue With DynamoDB (Now Resolved)
A widespread incident in Amazon Web Services’ us-east-1 region caused increased error rates and latency across multiple services, temporarily disrupting major apps including Snapchat, Venmo, Lyft, and Fortnite, as well as Amazon’s own Alexa. AWS attributed the trigger to DNS resolution issues affecting DynamoDB endpoints. Amazon later said services had returned to normal operations, with backlogs clearing through the afternoon and evening.
Key timeline
- 3:11 AM ET: AWS reports elevated errors/latencies in us-east-1.
- ~5:01 AM ET: Root cause identified: DNS resolution issue for DynamoDB APIs; mitigations begin.
- 6:35 AM ET: DNS issue mitigated; residual impacts persist (notably new EC2 instance launches).
- 8:48–10:14 AM ET: Progress continues; AWS rate-limits new EC2 instance launches to aid recovery.
- 3:01 PM ET: AWS states services have returned to normal operations; backlogs processing.
- Evening update: Amazon notes resolution of widespread errors and latencies.
Why this mattered
us-east-1 is among AWS’s most heavily utilized regions. Because clients could not resolve DynamoDB endpoints, many apps were effectively cut off from their data and control planes, and the failures cascaded into dependent services. An outage in a single hyperscale region can ripple across large portions of the internet.
What we’re watching
- How organizations revisit multi-region or multi-cloud strategies for critical workloads.
- Improvements to DNS resilience, resolver caching, and failover patterns.
- Operational backlogs and delayed deployments following rate-limited EC2 launches.
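The resolver-caching item above can be sketched as a client-side cache that keeps serving the last known-good answer when a lookup fails. This is a minimal illustration, not AWS's design or a fix for this specific incident: `StaleOkResolver`, `system_resolve`, and the `ttl` and `max_stale` values are hypothetical names and numbers chosen for the example, and serving stale addresses is only safe if the endpoints behind them have not moved.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def system_resolve(host: str) -> List[str]:
    """Resolve a hostname to IP addresses using the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


class StaleOkResolver:
    """Cache lookups and fall back to the last good answer if resolution fails."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]] = system_resolve,
        ttl: float = 30.0,         # assumed freshness window, in seconds
        max_stale: float = 3600.0,  # assumed limit on how old a fallback answer may be
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl = ttl
        self._max_stale = max_stale
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, host: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(host)
        # Fresh entry: answer from cache without touching DNS.
        if cached and now - cached[0] < self._ttl:
            return cached[1]
        try:
            addrs = self._resolve(host)
            if not addrs:
                raise OSError("empty DNS answer")
        except OSError:  # socket.gaierror is a subclass of OSError
            # Resolution failed: serve the stale answer if it is not too old.
            if cached and now - cached[0] < self._max_stale:
                return cached[1]
            raise
        self._cache[host] = (now, addrs)
        return addrs
```

A caller would create one `StaleOkResolver()` per process and call `lookup(hostname)` before opening connections. The trade-off is deliberate: a stale address may be wrong, but it gives the client a chance to keep working during a resolution outage instead of failing immediately.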
Builder takeaways
- Design for regional failure: evaluate multi-region architectures where RTO/RPO demand it.
- Avoid hard-coding specific Availability Zones; enable flexible placement and failover.
- Implement exponential backoff, circuit breakers, and graceful degradation.
- Regularly test disaster recovery plans and run chaos drills; review DNS and dependency maps.
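As a rough illustration of the backoff and circuit-breaker takeaway, the sketch below retries a failing call with capped exponential backoff and full jitter, and fails fast once a dependency has failed repeatedly. All names (`CircuitBreaker`, `CircuitOpenError`, `retry_with_backoff`) and all thresholds are assumptions made for this example; production code would more likely rely on an SDK's built-in retry settings or an established resilience library.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Open after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(fn: Callable[[], T], attempts: int = 5, base: float = 0.1,
                       cap: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Callable[[], float] = random.random) -> T:
    """Retry fn, sleeping a random time up to a capped exponential delay."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(rng() * min(cap, base * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

The two pieces compose as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is whatever call reaches the dependency. Graceful degradation then amounts to catching `CircuitOpenError` at the call site and returning a cached or reduced response rather than an error.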
Discussion: Will this incident meaningfully accelerate multi-region adoption, or do cost/complexity barriers still outweigh the risk for most teams?
