AWS outage traced to a DynamoDB automation bug that broke DNS and took many services offline
Amazon has published a detailed report on the Oct. 20 outage that disrupted many websites, apps and online services. According to the report, the incident began with a bug in the automation that manages DynamoDB’s DNS records for the US East (Northern Virginia) region. The bug produced an empty DNS record where key data‑center addresses should have been, and the automated repair process failed to fix it, forcing Amazon to intervene manually.
Because DynamoDB’s DNS automation manages hundreds of thousands of DNS records, and many other AWS systems and customer workloads depend on DynamoDB, the failure cascaded across dependent systems: services that could not resolve DynamoDB’s endpoints could not reach the database, leading to partial or total outages for numerous popular services.
Who was affected
- Major consumer services experienced slowdowns or outages — examples include Amazon’s own sites and Alexa devices, Bank of America, Snapchat, Canva, Reddit, Apple Music, Apple TV, Lyft, Duolingo, Fortnite, Disney+, Venmo, DoorDash, Hulu, PlayStation and IoT vendors such as Eight Sleep.
Root cause and technical summary
- Automation bug: A defect in DynamoDB’s DNS management automation created empty DNS records for data‑center endpoints.
- Failed self‑healing: The automation was expected to detect and repair the issue automatically but did not, requiring manual remediation.
- Cascading DNS failures: Systems and customer applications that relied on those records could not resolve the endpoints they needed, causing widespread disruption (illustrated in the sketch below).
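To make that failure mode concrete, here is a minimal sketch, not taken from Amazon’s report, of what a dependent service effectively sees: when the published record set for an endpoint is empty or missing, name resolution fails and the error surfaces to every caller that assumed the lookup would succeed. The endpoint name below is hypothetical.

```python
import socket

# Hypothetical endpoint name; it stands in for a regional endpoint whose
# DNS record set became empty during the incident.
ENDPOINT = "dynamodb.example.internal"

def call_dependent_service(host: str, port: int = 443) -> str:
    """Resolve the endpoint before connecting, as most SDK clients do."""
    try:
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # With an empty or missing record set, resolution fails here and the
        # error surfaces to every caller that depends on this service.
        return f"resolution failed for {host}: {exc}"
    return f"resolved {host} to {addrs[0][4][0]}"

if __name__ == "__main__":
    print(call_dependent_service(ENDPOINT))
```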
Why the outage was so wide‑ranging
Modern cloud platforms are highly interconnected. A failure in a central control or metadata service (like a DNS management system) can propagate rapidly because many other components assume that metadata is correct and available. When self‑healing mechanisms fail, human intervention can be slower than automatic recovery, increasing outage scope and duration.
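One practical way to soften that assumption on the client side is a last‑known‑good cache: keep the most recent successful answer and serve it, stale, when a fresh lookup fails. The sketch below is a generic illustration of that idea, not anything described in AWS’s report; the TTL value and cache structure are invented for the example.

```python
import socket
import time

# Hypothetical stale-on-error resolver cache: keep the last successful answer
# and serve it when a fresh lookup fails, so a DNS or control-plane fault does
# not instantly break every dependent call path.
_cache: dict[str, tuple[list[str], float]] = {}
FRESH_TTL = 60.0  # seconds to treat a cached answer as fresh (made-up value)

def resolve_with_fallback(host: str, port: int = 443) -> list[str]:
    entry = _cache.get(host)
    if entry and time.monotonic() - entry[1] < FRESH_TTL:
        return entry[0]  # recent answer, skip the lookup
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        addrs = [info[4][0] for info in infos]
        _cache[host] = (addrs, time.monotonic())
        return addrs
    except socket.gaierror:
        if entry:
            # Fresh lookup failed: fall back to the stale, last-known-good answer.
            return entry[0]
        raise  # nothing cached yet; surface the failure to the caller
```

This only buys time while the addresses behind the name remain valid, but that window is often enough for automation or operators to repair the records.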
Mitigations and lessons learned
- Improve automation safeguards: Add more robust testing, canarying and circuit breakers so automation faults are detected and isolated before they affect production records (a minimal validation guard of this kind is sketched after this list).
- Redundancy & isolation: Reduce single points of failure for critical control planes (e.g., split DNS management or add orthogonal validation paths).
- Faster runbooks & tooling: Streamline manual fallback procedures and tooling to reduce time to remediation when automation fails.
- Transparent post‑incident reporting: Clear reports help customers learn impacts and adjust their architectures to be more resilient.
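As a concrete illustration of the first two points above, a DNS publisher can refuse to apply any change that would leave a record set empty or shrink it sharply, acting as a circuit breaker in front of the actual write. This is a generic sketch of such a guard, not Amazon’s actual automation; the record‑set shape and threshold are invented for the example.

```python
# Hypothetical pre-publish guard for a DNS automation pipeline: reject any
# plan that would leave a record set empty or drop too many addresses at once.
MAX_SHRINK_FRACTION = 0.5  # made-up threshold for the example

class UnsafeDnsChange(Exception):
    pass

def validate_plan(name: str, current: list[str], proposed: list[str]) -> None:
    if not proposed:
        raise UnsafeDnsChange(f"{name}: plan would publish an empty record set")
    if current and len(proposed) < len(current) * (1 - MAX_SHRINK_FRACTION):
        raise UnsafeDnsChange(
            f"{name}: plan removes {len(current) - len(proposed)} of "
            f"{len(current)} addresses; requires manual approval"
        )

def publish(name: str, current: list[str], proposed: list[str]) -> list[str]:
    validate_plan(name, current, proposed)  # circuit breaker before any write
    # ... canary the change against a small slice of resolvers first,
    # then apply it to production (omitted in this sketch).
    return proposed

if __name__ == "__main__":
    try:
        publish("service.example.internal", ["10.0.0.1", "10.0.0.2"], [])
    except UnsafeDnsChange as exc:
        print("blocked:", exc)
```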
Amazon acknowledged the impact and apologized, committing to learn from the event and improve availability. For the official post‑mortem and technical details, see Amazon’s incident report or AWS Service Health communications.
Discussion: If you run services in the cloud, what changes will you make to protect against similar cascading failures — multi‑region DNS strategies, additional caching, or different dependency topologies? Share your approach.
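As one starting point for that discussion, here is a minimal sketch of regional endpoint failover: a client that tries a primary region first and falls back to a secondary region when resolution or connection fails. The endpoint names are hypothetical and the pattern is generic rather than tied to any AWS SDK; real failover also needs data and credentials to be available in the secondary region, which this sketch does not cover.

```python
import socket

# Hypothetical regional endpoints for a service; in practice these would be
# the per-region hostnames your provider or your own DNS exposes.
ENDPOINTS = [
    "myservice.us-east-1.example.internal",  # primary region
    "myservice.us-west-2.example.internal",  # secondary region
]

def connect_with_regional_failover(port: int = 443, timeout: float = 2.0) -> socket.socket:
    """Try each regional endpoint in order; return the first live connection."""
    last_error = None
    for host in ENDPOINTS:
        try:
            # create_connection resolves the name and connects in one step,
            # so both DNS failures and unreachable hosts trigger failover.
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:  # covers gaierror, timeouts, refused connections
            last_error = exc
    raise ConnectionError(f"all regional endpoints failed: {last_error}")
```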
