
AWS Just Proved It's the Internet's Biggest Single Point of Failure

Mandy Sidana
Exploring the intersection of business strategy and open source software development

tl;dr: AWS broke everything with a typo. Supabase ran out of servers for 10 days after raising millions. The cloud is held together with duct tape and hope.

A single DNS typo—one character—pwned 2,500+ companies simultaneously. Netflix. Reddit. Coinbase. Disney+. PlayStation. The digital economy’s entire Jenga tower collapsed because someone fat-fingered a config file.

The 130-Minute Catastrophe

The Problem: DynamoDB’s DNS broke. The Real Problem: Every Lambda function in existence depended on it. The Actual Problem: SQS queues backed up like a Black Friday checkout line.

The Failure Chain:

  1. A DNS misconfiguration in US-EAST-1 broke DynamoDB endpoint resolution
  2. Lambda functions couldn’t find DynamoDB → started timing out
  3. SQS queues filled up because Lambdas couldn’t process messages
  4. Dead Letter Queues (DLQs) overflowed, creating a secondary backlog
  5. CloudWatch alarms triggered everywhere; PagerDuty melted down
  6. Auto-scaling groups spun up more instances to handle the “load” (spoiler: it didn’t help)

Here’s the thing nobody tells you about serverless: It’s only as reliable as its stateful dependencies. Lambda scales infinitely (in theory). DynamoDB is your bottleneck. When DynamoDB’s DNS dies, your entire serverless architecture becomes a very expensive retry machine.
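To make that concrete, here’s a minimal sketch of the pattern that becomes a retry machine: an SQS-triggered Lambda whose only real dependency is a DynamoDB write. The table name, handler, and timeout values are hypothetical; the point is that once the DynamoDB endpoint stops resolving, every invocation raises, the whole batch goes back to the queue, and SQS keeps redelivering it until maxReceiveCount shunts the messages to the DLQ.

```python
import os
import boto3
from botocore.config import Config

# Hypothetical table name. The timeout/retry settings matter: each failed
# call burns roughly (attempts x timeout) seconds before the invocation dies.
TABLE_NAME = os.environ.get("TABLE_NAME", "orders")

dynamodb = boto3.client(
    "dynamodb",
    config=Config(
        connect_timeout=2,
        read_timeout=2,
        retries={"max_attempts": 3, "mode": "standard"},
    ),
)


def handler(event, context):
    """SQS-triggered Lambda: one DynamoDB write per message.

    If the DynamoDB endpoint stops resolving, put_item raises an
    EndpointConnectionError, the exception fails the whole batch,
    and SQS redelivers it until maxReceiveCount routes it to the DLQ.
    """
    for record in event["Records"]:
        dynamodb.put_item(
            TableName=TABLE_NAME,
            Item={
                "pk": {"S": record["messageId"]},
                "body": {"S": record["body"]},
            },
        )
    # Reached only if every write succeeded (partial batch responses enabled).
    return {"batchItemFailures": []}
```

Multiply that by every Lambda in the account and you get the 12-hour backlog.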

Fixed in 2 hours. Recovered in 12. Why? Because fixing the typo was easy. Clearing millions of backed-up Lambda jobs? That’s like trying to unclog the entire internet with a plunger.

AWS engineers: “Fixed it in 130 minutes!” The queue: “Hold my 47 million pending tasks.”

[Image: AWS outage queue]

Digital Monoculture = Digital Extinction Event

Here’s the based take: We’ve built a system where one region going down creates a cascading failure across the entire planet. That’s not resilience. That’s a single point of failure with extra steps.

[Image: Digital monoculture]

Supabase: The 10-Day Cope Session

Right after Supabase announced their massive Series E funding round—like, immediately after—their EU-2 region just… stopped working.

For 10 days. The excuse? “We ran out of nano and micro instances, and that’s mostly AWS’s fault.”

Here’s what happened: Supabase relies on AWS EC2 for their hosted PostgreSQL instances. They use t4g.nano and t4g.micro instances (ARM-based, cheap, efficient) for dev branches and smaller projects.

The problem:

  1. AWS has capacity pools per instance type per availability zone
  2. Popular instance types can get exhausted during high-demand periods
  3. There’s no SLA guaranteeing availability of any specific instance type
  4. Supabase’s entire branch-creation workflow depended on these instances being available (see the sketch after this list)
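There’s no magic fix here, but here is a hedged sketch of what a capacity-aware launch path could look like. It is an illustration, not Supabase’s actual code: catch InsufficientInstanceCapacity and walk through fallback instance types and availability zones instead of hard-failing the workflow. The region, AMI, and fallback list are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="eu-west-2")  # placeholder region

# Ordered fallbacks: cheapest ARM burstables first, then an x86 equivalent.
# Purely illustrative -- not Supabase's actual strategy.
FALLBACK_TYPES = ["t4g.nano", "t4g.micro", "t4g.small", "t3.micro"]
AVAILABILITY_ZONES = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]


def launch_with_fallback(ami_id: str) -> str:
    """Try each (instance type, AZ) pair until one has capacity.

    EC2 returns InsufficientInstanceCapacity when the capacity pool for a
    specific type in a specific AZ is exhausted, and no SLA guarantees that
    any particular type is available, so the caller has to adapt.
    """
    for instance_type in FALLBACK_TYPES:
        for az in AVAILABILITY_ZONES:
            try:
                resp = ec2.run_instances(
                    ImageId=ami_id,
                    InstanceType=instance_type,
                    Placement={"AvailabilityZone": az},
                    MinCount=1,
                    MaxCount=1,
                )
                return resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                    continue  # this pool is empty; try the next one
                raise
    raise RuntimeError("No capacity for any instance type / AZ combination")
```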

Meanwhile, the customers:

  • Can’t restore backups ❌
  • Can’t restart instances ❌
  • Can’t create branches ❌
  • Can’t do literally any dev work ❌

Even paying customers were bricked. You could stare at your production database, but you couldn’t touch it. It’s like having a Ferrari with no gas stations within 1,000 miles.

[Image: Supabase - not stonks]

Blaming AWS didn’t land well when you’re a managed platform company whose entire value prop is “we handle the infra.” Plus, you just raised millions of dollars and paying customers couldn’t work for a week and a half!

A smarter move would’ve been multi-region architecture from day one. Instead, we see companies running on monoculture dependencies, single-vendor lock-in, and zero redundancy.
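As a hedged illustration of what redundancy can mean at the application layer, here’s a small sketch of client-side regional failover for reads, assuming the data is already replicated across regions (for example with DynamoDB Global Tables). The regions and table name are placeholders; the hard part is the replication, not this loop.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # placeholder regions

# One client per region, with short timeouts so failover is fast.
_clients = {
    region: boto3.client(
        "dynamodb",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=2,
                      retries={"max_attempts": 1}),
    )
    for region in REGIONS
}


def get_item_any_region(table: str, key: dict) -> dict:
    """Read from the first region that answers; fail over on endpoint errors."""
    last_err = None
    for region in REGIONS:
        try:
            return _clients[region].get_item(TableName=table, Key=key)
        except (EndpointConnectionError, ClientError) as err:
            last_err = err  # region unreachable or erroring; try the next one
    raise last_err
```

It is not free (writes, consistency, and cost all get harder), but it is the difference between a degraded afternoon and a dead week.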


The Bottom Line

Managed services != managed risk — Most of the time managed services do manage risk, right up until your vendor runs out of servers.

DNS is still the internet’s Achilles heel — One typo can glass an entire region.

Your SLA is only as good as your weakest dependency — And that’s probably AWS US-EAST-1.

Observability > Optimization — You can’t fix what you can’t see.

The pendulum is swinging. Hard. Self-hosted infrastructure with actual resilience engineering is starting to look less like paranoia and more like basic hygiene. When a DNS typo can detonate half the internet, and a capacity shortage can paralyze production for 10 days, maybe—just maybe—we need to stop pretending “the cloud” is a magical solution and start treating it like what it is: Someone else’s computer. And it can brick at any moment.


This article was originally published on Substack as part of the BoFOSS publication.