Massive AWS Outage Disrupts Major Apps; DNS Issue in us-east-1 Blamed
A widespread Amazon Web Services (AWS) outage took many apps and sites offline or made them sluggish, including Alexa, Venmo, Lyft, Snapchat, Canva, Fortnite, Reddit, Disney+, Apple Music, Pinterest, Roblox, banks, airlines, and news outlets. The incident centered on the heavily used us-east-1 (N. Virginia) region.
AWS identified the trigger as DNS resolution failures affecting the regional DynamoDB API endpoint. Although the DNS problem was mitigated early in the day, cascading dependencies left lingering issues, most notably with EC2 instance launches, and AWS temporarily rate-limited new launches in us-east-1 to aid recovery. Amazon later said services were restored, with backlogs clearing over time.
- Root cause: DNS resolution failures affecting the regional DynamoDB endpoint in us-east-1 (a defensive client-side sketch follows this list).
- Timeline (ET): Errors began overnight; the DNS issue was mitigated in the morning; EC2 launch problems persisted into the afternoon; AWS reported full resolution by evening.
- Impact: Increased error rates and connectivity problems across multiple AWS services and popular consumer apps.
- Remediation: Temporary rate-limiting of new EC2 launches, plus guidance to avoid tying deployments to a specific Availability Zone (AZ) during restoration.
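To make the failure mode concrete, here is a minimal sketch, in Python with boto3, of a read path that fails fast when a regional endpoint cannot be reached and falls back to a replica in another region. The "orders" table, its key schema, and the assumption of a DynamoDB Global Table replica in us-west-2 are hypothetical, not details from the incident.

```python
# Sketch only: a defensive DynamoDB read path for an endpoint/DNS incident.
# Assumptions (not from the outage report): a Global Table named "orders"
# replicated to us-west-2, and that stale-but-available reads are acceptable.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ConnectTimeoutError, EndpointConnectionError

# Fail fast instead of hanging on an unreachable endpoint, and cap SDK retries.
_CFG = Config(connect_timeout=2, read_timeout=2,
              retries={"max_attempts": 2, "mode": "standard"})

_PRIMARY = boto3.resource("dynamodb", region_name="us-east-1", config=_CFG)
_FALLBACK = boto3.resource("dynamodb", region_name="us-west-2", config=_CFG)


def get_order(order_id: str) -> dict | None:
    """Read from the primary region; fall back to the replica if it is unreachable."""
    for db in (_PRIMARY, _FALLBACK):
        try:
            resp = db.Table("orders").get_item(Key={"order_id": order_id})
            return resp.get("Item")
        except (EndpointConnectionError, ConnectTimeoutError, ClientError):
            continue  # try the next region; production code would log and emit metrics
    return None  # degrade gracefully instead of failing the whole request
```

Capped timeouts and retries keep a struggling dependency from tying up request threads, and the per-region fallback degrades one feature rather than failing every request that touches it.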
Why it matters
As of mid-2025, AWS holds roughly 30% of the global cloud infrastructure market. Outages in us-east-1 can ripple across the internet due to concentration risk. The incident highlights the need for multi-region architectures, graceful degradation, and tested failover plans for critical workloads.
Helpful links
- AWS Service Health Dashboard
- Reddit Status · Epic Games (Fortnite) Status · Snapchat Status
- Downdetector (service reports)
Takeaways for teams on AWS
- Design for region-level failures (active-active or warm standby across regions); a DNS-failover sketch follows this list.
- Avoid single-AZ coupling; use health checks and deployment tooling that can route around a failing zone.
- Harden DNS and service discovery to tolerate resolver or endpoint issues.
- Run regular game-day failover drills and understand provider rate limits during recoveries.
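As an illustration of the first takeaway, here is a sketch of active-passive, DNS-level regional failover using Route 53 via boto3. The hosted zone ID, domain names, health-check path, and regional endpoints are placeholders rather than anything tied to this outage, and failover routing is only one option (Global Accelerator or active-active latency-based routing are others).

```python
# Sketch only: Route 53 active-passive failover between two regional endpoints.
# All identifiers below are placeholders for illustration.
import uuid
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000EXAMPLE"   # hypothetical public hosted zone
PRIMARY = "api-use1.example.com"        # endpoint served from us-east-1
SECONDARY = "api-usw2.example.com"      # warm standby served from us-west-2

# Health check that continuously probes the primary region's endpoint.
health_check_id = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": PRIMARY,
        "ResourcePath": "/healthz",      # placeholder health endpoint
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

# Failover record pair: clients resolve api.example.com to the primary while it
# is healthy, and shift to the secondary only when its health check fails.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": PRIMARY}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": SECONDARY}],
        }},
    ]},
)
```

The low TTL and the health check on the primary are what make the failover useful in practice; pairing the setup with regular game-day drills confirms the secondary region can actually absorb the redirected traffic.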
One expert analogy: the data was safe, but apps temporarily “couldn’t find it,” showing how DNS failures can sever access to otherwise healthy backends. Even resilient designs can feel the impact when a widely used region stumbles.
Discussion: What resiliency improvements will you prioritize after this outage—multi-region, alternative cloud providers, or better incident communications?
