
Yesterday’s outage showed how dependent the modern web is on a handful of core infrastructure providers.
In fact, it’s so dependent that a single configuration error made large parts of the internet totally unreachable for several hours.
Many of us work in crypto because we understand the dangers of centralization in finance, but the events of yesterday were a clear reminder that centralization at the internet’s core is just as urgent a problem to solve.
The obvious giants like Amazon, Google, and Microsoft run enormous chunks of cloud infrastructure.
But equally critical are firms like Cloudflare, Fastly, Akamai, and DigitalOcean, along with CDN providers (networks of servers that deliver websites faster around the world) and DNS providers (the “address book” of the internet) such as UltraDNS and Dyn.
Most people barely know their names, yet their outages can be just as crippling, as we saw yesterday.
To start with, here’s a list of companies, some of which you may never have heard of, that are critical to keeping the internet running as expected.
| Category | Company | What They Control | Impact If They Go Down |
| --- | --- | --- | --- |
| Core Infra (DNS/CDN/DDoS) | Cloudflare | CDN, DNS, DDoS protection, Zero Trust, Workers | Huge portions of global web traffic fail; thousands of sites become unreachable. |
| Core Infra (CDN) | Akamai | Enterprise CDN for banks, logins, commerce | Major enterprise services, banks, and login systems break. |
| Core Infra (CDN) | Fastly | CDN, edge compute | Global outage potential (as seen in 2021: Reddit, Shopify, gov.uk, NYT). |
| Cloud Provider | AWS | Compute, hosting, storage, APIs | SaaS apps, streaming platforms, fintech, and IoT networks fail. |
| Cloud Provider | Google Cloud | YouTube, Gmail, enterprise backends | Massive disruption across Google services and dependent apps. |
| Cloud Provider | Microsoft Azure | Enterprise & government clouds | Office365, Teams, Outlook, and Xbox Live outages. |
| DNS Infrastructure | Verisign | .com & .net TLDs, root DNS | Catastrophic global routing failures for large parts of the web. |
| DNS Providers | GoDaddy / Cloudflare / Squarespace | DNS management for millions of domains | Entire companies vanish from the internet. |
| Certificate Authority | Let’s Encrypt | TLS certificates for most of the web | HTTPS breaks globally; users see security errors everywhere. |
| Certificate Authority | DigiCert / GlobalSign | Enterprise SSL | Large corporate sites lose HTTPS trust. |
| Security / CDN | Imperva | DDoS, WAF, CDN | Protected sites become inaccessible or vulnerable. |
| Load Balancers | F5 Networks | Enterprise load balancing | Banking, hospitals, and government services can fail nationwide. |
| Tier-1 Backbone | Lumen (Level 3) | Global internet backbone | Routing issues cause global latency spikes and regional outages. |
| Tier-1 Backbone | Cogent / Zayo / Telia | Transit and peering | Regional or country-level internet disruptions. |
| App Distribution | Apple App Store | iOS app updates & installs | iOS app ecosystem effectively freezes. |
| App Distribution | Google Play Store | Android app distribution | Android apps cannot install or update globally. |
| Payments | Stripe | Web payments infrastructure | Thousands of apps lose the ability to accept payments. |
| Identity / Login | Auth0 / Okta | Authentication & SSO | Logins break for thousands of apps. |
| Communications | Twilio | 2FA SMS, OTP, messaging | Large portion of global 2FA and OTP codes fail. |
What happened yesterday
Yesterday’s culprit was Cloudflare, a company that routes almost 20% of all web traffic.
It now says the outage started with a small database configuration change that accidentally caused a bot-detection file to include duplicate items.
That file suddenly grew beyond a strict size limit. When Cloudflare’s servers tried to load it, they failed, and many websites that use Cloudflare began returning HTTP 5xx errors (error codes users see when a server breaks).
Here’s the simple chain:
Chain of events
A Small Database Tweak Sets Off a Big Chain Reaction.
The trouble began at 11:05 UTC when a permissions update made the system pull extra, duplicate information while building the file used to score bots.
That file normally includes about sixty items. The duplicates pushed it past a hard cap of 200. When machines across the network loaded the oversized file, the bot component failed to start, and the servers returned errors.
According to Cloudflare, both the current and older server paths were affected. One returned 5xx errors. The other assigned a bot score of zero, which could have falsely flagged traffic for customers who block based on bot score (Cloudflare’s bot vs. human detection).
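To see why a bot score of zero matters, here is a minimal sketch of the kind of rule a customer might run. It assumes low scores mean "likely a bot" and uses a made-up threshold of 30; the function name and threshold are illustrative and are not Cloudflare's API.

```python
# Hypothetical customer rule: block requests whose bot score falls below a threshold.
# The threshold of 30 and the function name are illustrative assumptions.
BLOCK_BELOW = 30


def should_block(bot_score: int, threshold: int = BLOCK_BELOW) -> bool:
    """Return True when a score-based rule would block the request."""
    return bot_score < threshold


# A visitor with a high (human-like) score passes the rule.
normal_visitor = should_block(85)
# On the affected path every request was scored zero, so the same rule blocks it.
during_incident = should_block(0)
```

Under these assumptions, a rule that normally stops only suspicious traffic ends up blocking everyone, because a score of zero always falls below the threshold.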
Diagnosis was tricky because the bad file was rebuilt every five minutes from a database cluster being updated piece by piece.
If the system pulled from an updated piece, the file was bad. If not, it was good. The network would recover, then fail again, as versions switched.
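The on-off behavior can be sketched as a tiny simulation. The node names and the order in which nodes serve each rebuild are assumptions for illustration; only the five-minute cadence comes from Cloudflare's account.

```python
from dataclasses import dataclass

REBUILD_INTERVAL_MIN = 5  # rebuild cadence described in the article


@dataclass
class Node:
    name: str      # hypothetical node name
    updated: bool  # True once the permissions change has reached this node


def rebuild_status(node: Node) -> str:
    """A rebuild served by an updated node yields a bad file; otherwise a good one."""
    return "bad" if node.updated else "good"


def simulate(picks: list[Node]) -> list[tuple[int, str]]:
    """Return (minutes elapsed, file status) for each rebuild cycle."""
    return [(i * REBUILD_INTERVAL_MIN, rebuild_status(n)) for i, n in enumerate(picks)]


old = Node("node-a", updated=False)
new = Node("node-b", updated=True)

# The pick order below is made up; it only shows how the status can flip each cycle.
timeline = simulate([new, old, new, new, old])
```

In this sketch the file status flips between bad and good from one cycle to the next, which is the kind of pattern that can look like an attack coming and going.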
According to Cloudflare, this on-off pattern initially looked like a possible DDoS, especially since its status page, which is hosted by a third party, also failed around the same time. Focus shifted once teams linked errors to the bot-detection configuration.
By 13:05 UTC, Cloudflare applied a bypass for Workers KV (its key-value data store) and Cloudflare Access (its authentication system), routing around the failing behavior to cut impact.
The main fix came when teams stopped generating and distributing new bot files, pushed a known good file, and restarted core servers.
Cloudflare says core traffic began flowing by 14:30 UTC, and all downstream services recovered by 17:06 UTC.
The failure highlights some design tradeoffs.
Cloudflare’s systems enforce strict limits to keep performance predictable. That helps avoid runaway resource use, but it also means a malformed internal file can trigger a hard stop instead of a graceful fallback.
Because bot detection sits on the main path for many services, one module’s failure cascaded into the CDN, security features, Turnstile (CAPTCHA alternative), Workers KV, Access, and dashboard logins. Cloudflare also noted extra latency as debugging tools consumed CPU while adding context to errors.
On the database side, a narrow permissions tweak had wide effects.
The change made the system “see” more tables than before. The job that builds the bot-detection file did not filter tightly enough, so it grabbed duplicate column names and expanded the file beyond the 200-item cap.
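A rough sketch of how a loosely filtered lookup can balloon once more tables become visible. Every name here (the table, the databases, the feature names) is hypothetical, and the number of visible copies is an assumption chosen only so the total crosses the cap; the article does not state it.

```python
# Hypothetical metadata rows of the form (database, table, column).
FEATURE_TABLE = "bot_features"  # made-up table name
FEATURE_COUNT = 60              # "about sixty items" per the article
HARD_CAP = 200                  # cap stated in the article


def visible_columns(databases: list[str]) -> list[tuple[str, str, str]]:
    """Columns the build job can see: one copy of the table per visible database."""
    return [
        (db, FEATURE_TABLE, f"feature_{i}")
        for db in databases
        for i in range(FEATURE_COUNT)
    ]


def loose_query(rows):
    """Filters by table name only, so every visible copy contributes rows."""
    return [col for _, table, col in rows if table == FEATURE_TABLE]


def tight_query(rows, database):
    """Also filters by database, then drops any remaining duplicates in order."""
    cols = [col for db, table, col in rows if table == FEATURE_TABLE and db == database]
    return list(dict.fromkeys(cols))


before = loose_query(visible_columns(["default"]))
# After the permissions change, assume three extra copies become visible.
wider_view = visible_columns(["default", "shard_1", "shard_2", "shard_3"])
after = loose_query(wider_view)
fixed = tight_query(wider_view, "default")
```

With these made-up numbers the loose lookup returns 240 entries, past the cap of 200, while the tighter lookup still returns the original 60.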
The loading error then triggered server failures and 5xx responses on affected paths.
Impact varied by product. Core CDN and security services threw server errors.
Workers KV saw elevated 5xx rates because requests to its gateway passed through the failing path. Cloudflare Access had authentication failures until the 13:05 bypass, and dashboard logins broke when Turnstile could not load.
Cloudflare Email Security temporarily lost an IP reputation source, reducing spam detection accuracy for a period, though the company said there was no critical customer impact. After the good file was restored, a backlog of login attempts briefly strained internal APIs before normalizing.
The timeline is straightforward.
The database change landed at 11:05 UTC. First customer-facing errors appeared around 11:20–11:28.
Teams opened an incident at 11:35, applied the Workers KV and Access bypass at 13:05, stopped creating and spreading new files around 14:24, pushed a known good file and saw global recovery by 14:30, and marked full restoration at 17:06.
According to Cloudflare, automated tests flagged anomalies at 11:31, and manual investigation began at 11:32, which helps explain how teams pivoted from a suspected attack to preparing a configuration rollback by 13:37, roughly two hours later.
| Time (UTC) | Status | Action or Impact |
| --- | --- | --- |
| 11:05 | Change deployed | Database permissions update led to duplicate entries |
| 11:20–11:28 | Impact starts | HTTP 5xx surge as the bot file exceeds the 200-item limit |
| 13:05 | Mitigation | Bypass for Workers KV and Access reduces error surface |
| 13:37–14:24 | Rollback prep | Stop bad file propagation, validate known good file |
| 14:30 | Core recovery | Good file deployed, core traffic routes normally |
| 17:06 | Resolved | Downstream services fully restored |
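For a sense of scale, the gaps between the timestamps above can be worked out with a few lines of Python. The helper name is my own, and 11:20 is used as the start of impact, which is the earlier end of the reported range.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the timeline above.
CHANGE = "11:05"
FIRST_ERRORS = "11:20"
BYPASS = "13:05"
CORE_RECOVERY = "14:30"
RESOLVED = "17:06"

change_to_impact = minutes_between(CHANGE, FIRST_ERRORS)       # 15
impact_to_bypass = minutes_between(FIRST_ERRORS, BYPASS)       # 105
impact_to_core = minutes_between(FIRST_ERRORS, CORE_RECOVERY)  # 190
impact_to_resolved = minutes_between(FIRST_ERRORS, RESOLVED)   # 346
```

In other words, errors started about 15 minutes after the change, core traffic took a little over three hours to recover, and full restoration took just under six hours.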
The numbers explain both cause and containment.
A five-minute rebuild cycle repeatedly reintroduced bad files as different database pieces updated.
A 200-item cap protects memory use, and a typical count near sixty left comfortable headroom, until the duplicate entries arrived.
The cap worked as designed, but the lack of a tolerant “safe load” for internal files turned a bad config into a crash instead of a soft failure with a fallback model. According to Cloudflare, that’s a key area to harden.
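The difference between a hard stop and a tolerant load can be sketched as follows. This is an illustration of the general idea, not Cloudflare's code; the function names, the exception, and the oversized example file are all assumptions.

```python
HARD_CAP = 200  # item limit stated in the article


class ConfigTooLarge(Exception):
    """Raised when a candidate file exceeds the item cap."""


def strict_load(items: list[str]) -> list[str]:
    """Models the hard-stop behavior: an oversized file raises and the caller fails."""
    if len(items) > HARD_CAP:
        raise ConfigTooLarge(f"{len(items)} items exceeds cap of {HARD_CAP}")
    return items


def safe_load(candidate: list[str], last_known_good: list[str]) -> tuple[list[str], bool]:
    """Try the candidate; on failure keep serving with the last known good file.

    Returns the active items and a flag saying whether a fallback happened.
    """
    try:
        return strict_load(candidate), False
    except ConfigTooLarge:
        return last_known_good, True


good = [f"feature_{i}" for i in range(60)]
oversized = good * 4  # 240 items; the multiplier is an illustrative assumption
active, fell_back = safe_load(oversized, good)
```

Here the oversized file is rejected but the service keeps running on the previous file, and the fallback flag gives operators something to alert on instead of a crash.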
Cloudflare says it will harden how internal configuration is validated, add more global kill switches for feature pipelines, stop error reporting from consuming large CPU during incidents, review error handling across modules, and improve how configuration is distributed.
The company called this its worst incident since 2019 and apologized for the impact. According to Cloudflare, there was no attack; recovery came from halting the bad file, restoring a known good file, and restarting server processes.
The post How one computer file accidentally took down 20% of the internet yesterday – in plain English appeared first on CryptoCho







