AWS us-east-1 Outage Disrupted Major Apps; Amazon Cites DNS Issue With DynamoDB (Now Resolved)
A widespread incident in Amazon Web Services’ us-east-1 region on October 20 (ET) caused increased error rates and latency across multiple services, temporarily disrupting major apps including Snapchat, Venmo, Lyft, Fortnite, and even Amazon’s own Alexa. By the afternoon, AWS said most services had returned to normal operations, with request backlogs clearing into the evening, and Amazon later confirmed that the issue had been resolved.
What happened
AWS identified the trigger as a DNS resolution issue affecting the DynamoDB API endpoints in us-east-1. As a result, many applications could not reliably reach their databases, leading to timeouts, elevated API error rates, and cascading failures across dependent services.
Timeline highlights (ET)
- 3:11 AM: AWS reports increased error rates/latencies for multiple services in us-east-1.
- ~5:01 AM: Root cause identified as a DNS resolution issue affecting the DynamoDB APIs; mitigations begin.
- 6:35 AM: DNS issue mitigated; residual impacts remain, especially for new EC2 instance launches.
- 8:48–10:14 AM: Progress continues; AWS rate-limits new EC2 launches to stabilize recovery.
- 3:01 PM: AWS reports services back to normal operations; backlogs processing.
- 6:53 PM: Amazon notes resolution of the widespread errors and latencies.
Why it mattered
us-east-1 is one of AWS’s most heavily used regions. When DNS cannot resolve a critical service endpoint like DynamoDB’s, applications are effectively cut off from their data, and the impact ripples across a large portion of the internet, hitting consumer apps and enterprise systems alike.
Reportedly affected during the incident
- Alexa voice requests and routines
- Snapchat, Venmo, Lyft
- Fortnite, Roblox
- Streaming and media apps (e.g., Disney+) and numerous websites
Builder takeaways
- Evaluate multi-region designs where RTO/RPO demand regional resilience; avoid single-region dependencies on critical data paths (see the global-table failover sketch after this list).
- Avoid hard-coding deployments to specific Availability Zones to maximize failover flexibility.
- Harden clients with exponential backoff, circuit breakers, and graceful degradation (see the retry and circuit-breaker sketch after this list).
- Review DNS and resolver caching strategies; regularly run DR and chaos drills.
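For the multi-region point, one common pattern is to replicate the critical table with DynamoDB global tables and fail reads over to a replica region when the primary is unreachable. The sketch below is a minimal illustration, not AWS reference code: the `orders` table name, key schema, and region pair are hypothetical, and boto3’s `standard` retry mode already applies exponential backoff with jitter on each client.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Bounded timeouts plus botocore's built-in retries (exponential backoff with jitter).
_CFG = Config(connect_timeout=2, read_timeout=5,
              retries={"max_attempts": 4, "mode": "standard"})

# Hypothetical setup: an "orders" global table replicated to us-east-1 and us-west-2.
_CLIENTS = [
    boto3.client("dynamodb", region_name="us-east-1", config=_CFG),  # primary
    boto3.client("dynamodb", region_name="us-west-2", config=_CFG),  # replica
]

def get_order(order_id: str) -> dict:
    """Read from the primary region; fall back to the replica if it is unreachable."""
    key = {"order_id": {"S": order_id}}
    last_err = None
    for client in _CLIENTS:
        try:
            return client.get_item(TableName="orders", Key=key)
        except (ClientError, BotoCoreError) as err:
            last_err = err  # try the next region's replica
    raise RuntimeError("orders table unreachable in all configured regions") from last_err
```

Cross-region reads against a replica are eventually consistent, so this trade-off only makes sense for data paths where serving slightly stale data beats serving an error.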
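For the client-hardening point, here is a minimal sketch of capped exponential backoff with full jitter, a simple circuit breaker, and a stale-cache fallback for graceful degradation. The class, function names, and thresholds are illustrative; `fetch_fn` stands in for whatever call reaches the flaky dependency.

```python
import random
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are failed fast."""

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a trial call after `reset_after` seconds."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def with_backoff(fn, attempts: int = 5, base: float = 0.2, cap: float = 5.0):
    """Retry `fn` with capped exponential backoff and full jitter between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # no point retrying while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

_cache: dict = {}  # last known good values, used for graceful degradation

def get_profile(user_id: str, fetch_fn, breaker: CircuitBreaker):
    """Try the live dependency; on failure, serve the last cached value if one exists."""
    try:
        value = with_backoff(lambda: breaker.call(fetch_fn, user_id))
        _cache[user_id] = value
        return value
    except Exception:
        if user_id in _cache:
            return _cache[user_id]  # stale but usable while the dependency recovers
        raise
```

The point of the fallback is that during an incident like this one, returning a slightly stale profile (or a reduced feature set) usually beats returning a hard error to the user.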
Discussion: Will this outage accelerate multi-region or multi-cloud adoption, or do cost and complexity still outweigh the resilience benefits for most teams?
