When the Cloud Falls
What the 2025 AWS & Azure Outages Reveal About Our Digital Dependencies
“Reliability isn’t uptime. It’s controlled degradation.”
— Glo circa 2020 during Nextdoor’s traffic spike
TL;DR — What You’ll Learn
Why multi-AZ ≠ independence
How retry storms become self-inflicted DDoS
Why DNS caching physics drag out recovery long after the "fix"
How to build for brownouts, not blackouts
What leadership looks like in the first 30 minutes of chaos
Note: While this guide is written for senior technology leaders in SRE, DevOps, or infrastructure, its lessons apply to any team for whom keeping the lights on is a competitive advantage.
The Incident
On the morning of October 20, 2025, a regional DNS/control-plane fault at AWS rippled across dozens of services: banking, commerce, comms, and games. Some of them are still down, hours after the root issue was identified.
This isn’t a post-mortem. It’s a Socratic dissection and runbook of what really failed: our assumptions.
How we couple systems. How we retry. How leaders steer teams mid-meltdown.
Because resilience isn’t a service feature—it’s a culture and a contract.
I. What Does “Reliable” Really Mean?
Q: If three AZs fail together, were they ever independent?
A: No. Physically separate ≠ logically isolated. Multi-AZ protects from fire and fiber, not from shared control planes or DNS.
Leadership takeaway:
When a vendor says “multi-AZ,” ask:
Which planes are actually independent: control, data, DNS, IAM?
What’s the blast-radius contract?
Hiring signal: Engineers who can draw the boundary between control and data planes will save you seven figures in downtime.
II. Why Did “Small” Become “Catastrophic”?
Q: If one endpoint wobbles, why do fifty others topple?
A: Retry storms. With K attempts per call across N layers, worst-case traffic grows as K^N: three layers, three attempts each, and one request can become 27.
Your own clients DDoS your dependencies.
Engineering moves:
Replace blind retries with backoff + jitter + budgets (sketch below)
Add token buckets and circuit breakers
Prefer stale-but-served results to timeouts
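A minimal sketch of those moves in Python, assuming a generic fn callable; the budget ratio, attempt count, and backoff constants are illustrative, not any particular library's defaults:

```python
import random
import time

class RetryBudget:
    """Caps retries to a fraction of recent requests so retries cannot snowball."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio        # e.g. at most 1 retry per 10 requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def allow_retry(self):
        if self.retries < self.requests * self.ratio:
            self.retries += 1
            return True
        return False              # budget spent: fail fast instead of storming

def call_with_backoff(fn, budget, attempts=3, base=0.1, cap=2.0):
    """One initial try plus budgeted retries, exponential backoff with full jitter."""
    budget.record_request()
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            last = attempt == attempts - 1
            if last or not budget.allow_retry():
                raise             # surface the failure rather than retry blindly
            # full jitter: sleep a random amount up to the exponential ceiling
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```

Share one budget per downstream dependency across your callers; when that dependency browns out, the retries stop compounding instead of multiplying through every layer.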
Metric to own: Retry-budget burn rate + time-to-safe-mode (TTSM)
Reliability is not about never failing—it’s about failing predictably.
III. If AWS “Fixed DNS,” Why Were We Still Down?
Q: Why does recovery lag after DNS repair?
A: Caching physics. TTLs + negative caching = bad data that lingers.
Design dials:
Short TTLs for criticals, grace caches for safety (sketch below)
Client-side discovery or service mesh fallback
Ship a cache-flush runbook—and rehearse it
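A rough sketch of a grace cache in Python, wrapping socket.getaddrinfo with a serve-stale fallback; the TTL and grace-window values are assumptions to tune, not recommendations:

```python
import socket
import time

FRESH_TTL = 30        # seconds we trust a cached answer outright (assumed value)
GRACE_TTL = 3600      # how long we will serve a stale answer if lookups fail

_cache = {}           # hostname -> (addresses, fetched_at)

def resolve(host, port=443):
    """Resolve with a grace cache: prefer fresh answers, fall back to stale ones."""
    now = time.time()
    cached = _cache.get(host)
    if cached and now - cached[1] < FRESH_TTL:
        return cached[0]                          # fresh enough, skip the lookup
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        addrs = [info[4][0] for info in infos]
        _cache[host] = (addrs, now)
        return addrs
    except OSError:
        # DNS is unhappy: serve the stale answer inside the grace window
        if cached and now - cached[1] < GRACE_TTL:
            return cached[0]
        raise                                     # no usable answer, surface the error
```

Stale-but-reachable beats fresh-but-down for most read paths, and the grace window gives the cache-flush runbook something deliberate to purge.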
IV. Are Regions Truly Isolated?
Q: If us-east-1 hiccups, why does Europe flinch?
A: Global anchors. IAM, Global Tables, and replication controls still hinge on one region. Geography ≠ independence.
Decision:
Pay for true multi-region control planes (latency ↑, complexity ↑)
Or accept anchor risk and design brownouts into your ops
Procurement question:
“Which AWS services backhaul to US-EAST-1 for control—and what’s our degraded mode when that anchor dies?”
V. Who’s Accountable?
Q: Is this on AWS or on us?
A: Both.
AWS sells reliability; we buy simplicity. SLA credits don’t replace trust or revenue.
Board takeaway: Resilience isn’t a line item. It’s a leadership behavior.
You pay it in architecture, culture, or reputation.
VI. What Changes on Tuesday?
Q: What does “self-healing” look like in practice?
A: Build for controlled imperfection.
🧩 Resilience Checklist (Ship This Quarter)
1. Brownouts > Blackouts
Tier features. Auto-disable the optional.
2. Backpressure Everywhere
Token buckets, queues, bulkheads.
3. Retry Safety
Idempotency keys, dedupe at sinks, enforce budgets.
4. Automatic Safing
If SLO burn > X% in Y minutes → open circuit, shed load, toggle flags (sketch after this checklist).
5. Portable State
Dual-home critical data; warm-standby region; validate topology drift.
6. Client Patterns
Stale-on-error caches; request coalescing; per-call deadlines.
7. DNS Strategy
Short TTLs on criticals; alternate resolvers; mesh discovery fallback.
8. Game Days
Simulate DNS and control-plane loss. Score MTTD, MTTR, TTSM, and % served degraded.
9. Observability
Degrade telemetry gracefully; tracing shouldn’t DDoS your app.
10. Runbooks + AI Co-pilot
Automate the known. Let AI summarize state, not push buttons. Humans make the decisions and own the process.
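Items 2 and 4 above compose naturally: a token bucket admits work, and a burn-rate monitor flips into safe mode before the error budget is gone. A hedged Python sketch; the thresholds and the disable_optional_features() hook are hypothetical stand-ins for your own flags and breakers:

```python
import time

class TokenBucket:
    """Backpressure at the edge: admit work only while tokens remain."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.stamp = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False              # shed load instead of queueing forever

class BurnRateSafing:
    """Automatic safing: if error-budget burn exceeds the threshold, enter safe mode."""
    def __init__(self, slo=0.999, threshold=14.4, window=300):
        self.slo, self.threshold, self.window = slo, threshold, window
        self.events = []          # (timestamp, ok) pairs inside the window
        self.safe_mode = False

    def record(self, ok):
        now = time.monotonic()
        self.events.append((now, ok))
        self.events = [(t, o) for t, o in self.events if now - t <= self.window]
        errors = sum(1 for _, o in self.events if not o)
        error_rate = errors / max(len(self.events), 1)
        burn = error_rate / (1 - self.slo)        # multiples of the allowed error budget
        if burn > self.threshold and not self.safe_mode:
            self.safe_mode = True
            disable_optional_features()           # hypothetical hook: brownout, flags, breakers

def disable_optional_features():
    """Placeholder for your real flag toggles, load shedding, and circuit breakers."""
    print("SAFE MODE: shedding optional work")
```

Gate requests through the bucket, record every outcome into the monitor, and the brownout starts without waiting for a human to notice the graph.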
VII. The Leadership Reset
Q: What should execs do in the first 30 minutes of an outage?
A: Reset the room.
Name the failure mode. No cowboy deploys.
Assign roles: IM / OPS / COMMS / LIAISON.
Set the first safe target: “Brownout to 70% in 15 min.”
Communicate every 20–30 min: show delta, not adjectives.
Culture tell:
Teams that can say “I don’t know—yet” recover faster than teams that fake certainty.
Leadership during failure isn’t authority—it’s compression. Turning chaos into clarity in one sentence.
VIII. What AI Can—and Cannot—Do Here
Can: Summarize telemetry, predict saturation, surface runbook steps, reduce page fatigue.
Cannot: Override CAP, erase TTLs, or conjure independence where control planes are shared.
Treat AI as muscle memory, not magic.
IX. The Uncomfortable Trade-Off
You can buy simplicity, or you can buy resilience to catastrophe, but not both.
Make the trade explicit. Price it. Revisit quarterly.
“Resilience is architecture, culture, and math. Pick two, and fund the third.”
Sources
Condensed from AWS post-event notes, Google SRE texts on cascading failures, and Oct 20, 2025 coverage (Forbes, Reuters, The Guardian).
Prefer canonical docs over hot takes.
About the Author
Glo Maldonado builds systems that don’t hallucinate under pressure.
CTO & co-founder, previously scaled infra for Nextdoor, Yammer (Microsoft), and Oracle Cloud.
Writes about the craft where engineering meets economics, and where punk pragmatism meets platform leadership.
Next on SansCourier.ai
The Daughters of Martha — A field note on the quiet experts who turn 3 AM pages into non-events: midnight debuggers, maintainers, platform stewards. It’s about credit, budgets, and decision rights for the people who make the gear engage and the switches lock.
Dropping soon. If this resonated, subscribe below!


