
AWS Just Proved It's the Internet's Biggest Single Point of Failure

Mandy Sidana
Exploring the intersection of business strategy and open source software development

tl;dr: AWS broke everything with a typo. Supabase ran out of servers for 10 days after raising millions. The cloud is held together with duct tape and hope.

A single DNS typo—one character—pwned 2,500+ companies simultaneously. Netflix. Reddit. Coinbase. Disney+. PlayStation. The digital economy’s entire Jenga tower collapsed because someone fat-fingered a config file.

The 130-Minute Catastrophe

The Problem: DynamoDB’s DNS broke. The Real Problem: Every Lambda function in existence depended on it. The Actual Problem: SQS queues backed up like a Black Friday checkout line.

The Failure Chain:

  1. A DNS misconfiguration in US-EAST-1 broke DynamoDB endpoint resolution
  2. Lambda functions couldn’t find DynamoDB → started timing out
  3. SQS queues filled up because Lambdas couldn’t process messages
  4. Dead Letter Queues (DLQs) overflowed, creating a secondary backlog
  5. CloudWatch alarms triggered everywhere; PagerDuty melted down
  6. Auto-scaling groups spun up more instances to handle the “load” (spoiler: it didn’t help)

Here’s the thing nobody tells you about serverless: It’s only as reliable as its stateful dependencies. Lambda scales infinitely (in theory). DynamoDB is your bottleneck. When DynamoDB’s DNS dies, your entire serverless architecture becomes a very expensive retry machine.
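To make that concrete, here’s a minimal sketch of the pattern that becomes a retry machine: an SQS-triggered Lambda whose only real dependency is a DynamoDB write. The table name, handler, and timeout values are hypothetical; the point is that once the DynamoDB endpoint stops resolving, every invocation raises, the whole batch goes back to the queue, and SQS keeps redelivering it until maxReceiveCount shunts the messages to the DLQ.

```python
import os
import boto3
from botocore.config import Config

# Hypothetical table name. The timeout/retry settings matter: each failed
# call burns roughly (attempts x timeout) seconds before the invocation dies.
TABLE_NAME = os.environ.get("TABLE_NAME", "orders")

dynamodb = boto3.client(
    "dynamodb",
    config=Config(
        connect_timeout=2,
        read_timeout=2,
        retries={"max_attempts": 3, "mode": "standard"},
    ),
)


def handler(event, context):
    """SQS-triggered Lambda: one DynamoDB write per message.

    If the DynamoDB endpoint stops resolving, put_item raises an
    EndpointConnectionError, the exception fails the whole batch,
    and SQS redelivers it until maxReceiveCount routes it to the DLQ.
    """
    for record in event["Records"]:
        dynamodb.put_item(
            TableName=TABLE_NAME,
            Item={
                "pk": {"S": record["messageId"]},
                "body": {"S": record["body"]},
            },
        )
    # Reached only if every write succeeded (partial batch responses enabled).
    return {"batchItemFailures": []}
```

Multiply that by every Lambda in the account and you get the 12-hour backlog.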

Fixed in 2 hours. Recovered in 12. Why? Because fixing the typo was easy. Clearing millions of backed-up Lambda jobs? That’s like trying to unclog the entire internet with a plunger.

AWS engineers: “Fixed it in 130 minutes!” The queue: “Hold my 47 million pending tasks.”

[Image: AWS outage queue]

Digital Monoculture = Digital Extinction Event

Here’s the based take: We’ve built a system where one region going down creates a cascading failure across the entire planet. That’s not resilience. That’s a single point of failure with extra steps.

[Image: Digital monoculture]

Supabase: The 10-Day Cope Session

Right after Supabase announced their massive Series E funding round—like, immediately after—their EU-2 region just… stopped working.

For 10 days. The excuse? “We ran out of nano and micro instances, and that’s mostly AWS’s fault.”

Here’s what happened: Supabase relies on AWS EC2 for their hosted PostgreSQL instances. They use t4g.nano and t4g.micro instances (ARM-based, cheap, efficient) for dev branches and smaller projects.

The problem:

  1. AWS has capacity pools per instance type per availability zone
  2. Popular instance types can get exhausted during high-demand periods
  3. There’s no SLA guaranteeing availability of any specific instance type
  4. Supabase’s entire branch-creation workflow depended on these instances being available (see the sketch after this list)
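There’s no magic fix here, but here is a hedged sketch of what a capacity-aware launch path could look like. It is an illustration, not Supabase’s actual code: catch InsufficientInstanceCapacity and walk through fallback instance types and availability zones instead of hard-failing the workflow. The region, AMI, and fallback list are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="eu-west-2")  # placeholder region

# Ordered fallbacks: cheapest ARM burstables first, then an x86 equivalent.
# Purely illustrative -- not Supabase's actual strategy.
FALLBACK_TYPES = ["t4g.nano", "t4g.micro", "t4g.small", "t3.micro"]
AVAILABILITY_ZONES = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]


def launch_with_fallback(ami_id: str) -> str:
    """Try each (instance type, AZ) pair until one has capacity.

    EC2 returns InsufficientInstanceCapacity when the capacity pool for a
    specific type in a specific AZ is exhausted, and no SLA guarantees that
    any particular type is available, so the caller has to adapt.
    """
    for instance_type in FALLBACK_TYPES:
        for az in AVAILABILITY_ZONES:
            try:
                resp = ec2.run_instances(
                    ImageId=ami_id,
                    InstanceType=instance_type,
                    Placement={"AvailabilityZone": az},
                    MinCount=1,
                    MaxCount=1,
                )
                return resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                    continue  # this pool is empty; try the next one
                raise
    raise RuntimeError("No capacity for any instance type / AZ combination")
```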

Meanwhile, the customers:

  • Can’t restore backups ❌
  • Can’t restart instances ❌
  • Can’t create branches ❌
  • Can’t do literally any dev work ❌

Even paying customers were bricked. You could stare at your production database, but you couldn’t touch it. It’s like having a Ferrari with no gas stations within 1,000 miles.

[Image: Supabase - not stonks]

Blaming AWS didn’t land well when you’re a managed platform company whose entire value prop is “we handle the infra.” Plus, you just raised millions of dollars and paying customers couldn’t work for a week and a half!

A smarter move would’ve been multi-region architecture from day one. Instead, we see companies running on monoculture dependencies, single-vendor lock-in, and zero redundancy.
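As a hedged illustration of what redundancy can mean at the application layer, here’s a small sketch of client-side regional failover for reads, assuming the data is already replicated across regions (for example with DynamoDB Global Tables). The regions and table name are placeholders; the hard part is the replication, not this loop.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # placeholder regions

# One client per region, with short timeouts so failover is fast.
_clients = {
    region: boto3.client(
        "dynamodb",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=2,
                      retries={"max_attempts": 1}),
    )
    for region in REGIONS
}


def get_item_any_region(table: str, key: dict) -> dict:
    """Read from the first region that answers; fail over on endpoint errors."""
    last_err = None
    for region in REGIONS:
        try:
            return _clients[region].get_item(TableName=table, Key=key)
        except (EndpointConnectionError, ClientError) as err:
            last_err = err  # region unreachable or erroring; try the next one
    raise last_err
```

It is not free (writes, consistency, and cost all get harder), but it is the difference between a degraded afternoon and a dead week.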


The Bottom Line

Managed services != managed risk — Most of the time managed services do manage risk, right up until your vendor runs out of servers.

DNS is still the internet’s Achilles heel — One typo can glass an entire region.

Your SLA is only as good as your weakest dependency — And that’s probably AWS US-EAST-1.

Observability > Optimization — You can’t fix what you can’t see.

The pendulum is swinging. Hard. Self-hosted infrastructure with actual resilience engineering is starting to look less like paranoia and more like basic hygiene. When a DNS typo can detonate half the internet, and a capacity shortage can paralyze production for 10 days, maybe—just maybe—we need to stop pretending “the cloud” is a magical solution and start treating it like what it is: Someone else’s computer. And it can brick at any moment.


This article was originally published on Substack as part of the BoFOSS publication.