AWS Outage: DNS issue in us‑east‑1 knocked major services offline — now resolved
A widespread Amazon Web Services (AWS) incident on October 20 temporarily disrupted a long list of apps and sites, including Venmo, Snapchat, Canva, Fortnite and even Amazon’s own Alexa smart assistant. AWS traced the trigger to a DNS resolution issue affecting DynamoDB service endpoints in the US‑EAST‑1 (N. Virginia) region. By the evening, Amazon said services had returned to normal and the incident was resolved.
What happened — a concise timeline
- 3:11 AM ET (12:11 AM PT): AWS reports increased error rates and latencies across multiple services in US‑EAST‑1.
- 5:01 AM ET (2:01 AM PT): Root cause identified as DNS resolution issues for DynamoDB API endpoints.
- 5:24 AM ET (2:24 AM PT): AWS says the underlying DNS issue is fully mitigated, though recovery work continues.
- Morning–midday: Knock‑on effects hit other services, especially new EC2 instance launches; AWS rate‑limits launches to aid recovery.
- 6:01 PM ET (3:01 PM PT): AWS reports all services have returned to normal operations; backlogs continue clearing afterward.
Why it mattered
US‑EAST‑1 is one of AWS’s most heavily used regions. When it stumbles, the blast radius is large. As one expert told CNN, “Amazon had the data safely stored, but nobody else could find it for several hours,” leaving apps temporarily separated from their data. The incident underscored how central cloud providers are to the modern web — and the risks of concentration in a single region.
Who was affected (partial list)
- Amazon services (Alexa), Amazon.com operations
- Venmo, Lyft, DoorDash, Grubhub
- Snapchat, Reddit, Pinterest
- Disney+, Hulu, Apple Music, Apple TV
- Fortnite, Roblox, PlayStation services
- Banks and airline websites (varied reports)
Resilience takeaways
- Architect beyond one region: Treat multi‑AZ as the baseline and consider active‑active multi‑region designs for critical apps.
- Harden DNS paths: Validate resolver redundancy, TTLs and fallback behavior when service endpoints fail (see the resolver‑fallback sketch after this list).
- Graceful degradation: Support read‑only modes, local caches and queueing to keep core UX alive during partial outages (see the stale‑cache sketch below).
- Back‑pressure & retries: Expect throttling during recovery; implement exponential backoff with jitter and idempotency keys (see the retry sketch below).
- Exercise DR plans: Run regular failover and chaos drills to expose weak links before real incidents.
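On the DNS point, here is a minimal resolver‑fallback sketch using the dnspython library. The resolver IPs are illustrative placeholders (a VPC resolver first, then public fallbacks), and this only covers the client‑side resolution path, not the provider‑side records.

```python
import dns.resolver  # dnspython (pip install dnspython)

def resolve_with_fallback(hostname, resolvers=("10.0.0.2", "1.1.1.1", "8.8.8.8")):
    """Try each resolver in turn and return the first set of A records."""
    last_error = None
    for server in resolvers:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]  # query only this resolver
        resolver.lifetime = 2.0          # seconds before moving on to the next one
        try:
            answer = resolver.resolve(hostname, "A")
            return [record.to_text() for record in answer]
        except Exception as exc:
            last_error = exc             # resolver unreachable or name not resolving; try the next
    raise last_error
```

Resolver fallback only helps when your own resolution path is the weak link; when the provider's endpoint records themselves fail to resolve, as in this incident, caching and degradation (below) matter more.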
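For graceful degradation, a minimal sketch of a read‑through cache that serves the last known value when the primary store is unreachable. `fetch` is a stand‑in for whatever call reads your primary data store; the TTL and error handling are deliberately simplistic.

```python
import time

class StaleOkCache:
    """Read-through cache that degrades gracefully: serve the last known
    value (flagged as stale) when the primary store cannot be reached."""

    def __init__(self, fetch, ttl_seconds=60):
        self._fetch = fetch              # callable that reads the primary store (assumption)
        self._ttl = ttl_seconds
        self._store = {}                 # key -> (value, fetched_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry and time.time() - entry[1] < self._ttl:
            return entry[0], False       # fresh cache hit
        try:
            value = self._fetch(key)
            self._store[key] = (value, time.time())
            return value, False
        except Exception:
            if entry:
                return entry[0], True    # primary down: serve stale data rather than an error
            raise                        # nothing cached; caller should fall back to read-only UX
```

The second return value lets the caller mark data as possibly stale, which pairs naturally with a read‑only mode while writes are queued.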
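And for retries, a sketch of capped exponential backoff with full jitter plus a client‑generated idempotency key. `TransientError` and `api.create_payment` are hypothetical placeholders for your own retryable error type and write call.

```python
import random
import time
import uuid

class TransientError(Exception):
    """Placeholder for throttling or temporary-failure errors from a downstream service."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing call with capped exponential backoff and full jitter,
    so synchronized retries don't hammer a service that is trying to recover."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                                     # out of attempts; surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))          # full jitter: anywhere up to the cap

def submit_payment(api, amount_cents):
    """Attach a client-generated idempotency key so a retried write is applied at most once."""
    token = str(uuid.uuid4())
    return call_with_backoff(
        lambda: api.create_payment(amount_cents, idempotency_key=token)
    )
```

Idempotency keys matter because a retry may fire after the original request actually succeeded but its response was lost.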
Status and references
- AWS Service Health Dashboard
- Incident coverage and timeline
- Downdetector (third‑party outage reports)
Bottom line: The outage is resolved, but it’s a fresh reminder to audit single‑region dependencies and validate failover plans before the next incident.
Discussion: What’s your strategy for the next regional cloud outage — multi‑region, multi‑cloud, or smarter edge caching?
