10 Biggest IT Outages in History: Who Pulled the Plug?

Modern business continuity hinges on the reliability of technology.

When critical systems go down, the impact isn’t theoretical; it’s operational, financial, and reputational. IT outages have cost companies upwards of $740 million, with ripple effects that extend far beyond immediate downtime.

When such incidents occur, the response becomes crucial: not just to create an evidence trail that supports your legal position, but to look after customers and keep the impact on them minimal or nonexistent. Some companies use incident response software to make strategic decisions amid the chaos.

The software platform will provide support; your approach will define how effectively you control the damage caused by any IT outage. To build a practical approach, it’s essential to understand the “why” and “how” of IT outages.

Below are examples of the biggest IT outages and their impact in detail. These incidents will help you identify the common weak points that cause IT outages so you can plan a more realistic approach.

Biggest IT outages in history at a glance

Here are the biggest IT outages of recent years, their causes, and their impact, many of which took popular websites and services offline:

Year	Incident	Cause	Impact
2024	CrowdStrike update crash	A faulty security software update containing a bug in the kernel driver.	Affected 8.5 million Microsoft Windows devices, or less than 1% of all Windows machines.
2022	Southwest Airlines meltdown	Outdated crew scheduling software	59% of Southwest Airlines flights were canceled. The company paid $600 million in reimbursements and $140 million in fines.
2022	Rogers Canada blackout	Internal routing failure	More than 12 million customers lost wireless and wireline services.
2021	Facebook/Meta outage	Faulty network config	Services were unavailable to the 3.5 billion users of its combined platforms.
2021	Fastly CDN outage	A customer config change triggered a software bug	85% of Fastly’s services returned errors.
2020	Google services outage	Internal storage quota issue (auth system)	Gmail, YouTube, Maps, and other services went offline globally, and users couldn’t log in.
2019	Verizon BGP route leak	Routing misconfiguration	Cloudflare lost about 15% of its global traffic at peak.
2017	AWS S3 outage	A mistyped parameter in a maintenance command	Hundreds of websites/apps went down. Close to $150 million was lost by S&P 500 firms alone.
2016	Dyn Domain Name System (DNS) attack	Distributed denial-of-service (DDoS) attack with Mirai botnet	Major websites like Twitter, Netflix, and CNN were down across the US/EU.
2011	PlayStation Network outage	External hack causing a security breach	About 77 million accounts were affected.

The biggest IT outages in history by year

Below is an overview of the IT outages that have made history, starting with the most recent.

2024: CrowdStrike global IT outage

Cause: A flawed CrowdStrike Falcon Sensor update caused Windows devices to crash to a blue screen on reboot.
Impact: Around 8.5 million Windows systems crashed worldwide on July 19, 2024.

In July 2024, a routine CrowdStrike software update containing a critical logic error triggered a global IT outage. The culprit was a flawed channel file for Windows. When it was pushed to customer devices worldwide, it caused any Microsoft Windows device running Falcon, CrowdStrike’s security agent, to crash immediately. Users saw a blue screen upon reboot.

Within hours, the impact reached 8.5 million Windows PCs and servers. The scale was unprecedented. Major U.S. airlines grounded flights, and institutions including banks, hospitals, and government offices experienced outages.

Although it’s tricky to put a number on the actual economic impact, Nir Perry, the CEO of cyber insurance risk platform Cyberwrite, said, “the damages could reach tens of billions of dollars.”

Interestingly, this wasn’t a cyber attack but a lapse in software quality control. It exposes the risks associated with centralized software updates and underscores the need for better patch management practices.
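One common way to limit the risk of a centralized update is to release it in stages and stop as soon as early recipients show problems. The sketch below is illustrative only and is not a description of CrowdStrike’s process: the stage names and sizes, the `crash_rate_for` health probe, and the 0.1% crash threshold are all assumptions made for the example.

```python
# Hypothetical rollout plan: (stage name, percent of the fleet updated by the
# end of that stage). The names and sizes are illustrative assumptions.
STAGES = [("canary", 1), ("early", 10), ("broad", 100)]


def staged_rollout(fleet_size, crash_rate_for, max_crash_rate=0.001):
    """Roll an update out stage by stage and halt on an unhealthy stage.

    crash_rate_for(stage_name, hosts_updated) is a caller-supplied
    (hypothetical) health probe returning the observed crash rate, 0.0 to 1.0.
    Returns (hosts_updated, halted_at); halted_at is None if every stage passed.
    """
    updated = 0
    for name, percent in STAGES:
        # Integer arithmetic avoids floating-point rounding on large fleets.
        updated = max(1, fleet_size * percent // 100)
        if crash_rate_for(name, updated) > max_crash_rate:
            return updated, name  # stop here; later stages never get the update
    return updated, None


if __name__ == "__main__":
    # An update that crashes every device is caught at the canary stage.
    print(staged_rollout(1_000_000, lambda name, hosts: 1.0))
```

With a fleet of 1,000,000 devices and an update that crashes every machine, this sketch halts after the canary stage, having touched 10,000 devices rather than the whole fleet.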

2022: Southwest Airlines meltdown

Cause: A legacy computer system failure in Southwest’s crew scheduling software.
Impact: 59% of Southwest Airlines flights were canceled.

While weather conditions can understandably cause flight delays, the mass cancellations by Southwest were not primarily weather-related. Other airlines facing the same winter storm managed to resume normal operations relatively quickly, unlike Southwest.

To illustrate, Southwest canceled 59% of its flights, compared to only 3% canceled by other major carriers. Southwest itself admitted that the widespread cancellations and delays that began on December 24 stemmed from internal issues within the airline’s control.

The meltdown affected countless travelers, causing many to miss family gatherings. Customers grew frustrated because they were unable to reach Southwest representatives for assistance. In the aftermath, Southwest accelerated plans to upgrade its technology.

The company was fined a record $140 million (£110 million) by the US Department of Transportation (DOT). In addition, it paid around $600 million in reimbursements to passengers.

2022: Rogers Canada blackout

Cause: A maintenance update deleted a routing filter, causing Rogers’ core IP routers to overload and crash.
Impact: More than 12 million customers lost wireless and wireline services.

On July 8, 2022, Canada’s largest telecom provider, Rogers Communications, experienced a catastrophic outage that impacted a wide range of services across the country. It began when Rogers was implementing a scheduled update to upgrade its core IP network.

A technician mistakenly removed a critical BGP routing filter on the core network distribution routers, which led to the complete collapse of the Rogers network. Because Rogers is such a prominent provider, the failure disrupted connectivity across much of Canada. Even some 911 emergency calls failed, raising concerns about public safety.

A later government report found the company lacked proper network redundancy and had tied both wireless and broadband services to the same core infrastructure, making the failure “extreme”.

2021: Meta/Facebook global outage

Cause: A faulty configuration change on Facebook’s backbone network disconnected Facebook’s data centers from the internet.
Impact: Services were unavailable to the 3.5 billion users of its combined platforms.

On October 4, 2021, the social media giant Facebook (now Meta) experienced a historic outage, affecting users worldwide. During routine maintenance on Facebook’s internal network, an engineer issued a command to update the network configuration and unintentionally took down all BGP routes to Facebook’s DNS servers.

The outage lasted about 5.5 hours before the team could manually restore the networking equipment. Close to 3.5 billion users of Facebook, Instagram, and WhatsApp were cut off from the platforms.

The cause of this incident was the misconfiguration of backbone routers and the failure of an auditing tool that should have caught the mistake.

2021: Fastly CDN outage

Cause: Software bug in Fastly’s CDN code, triggered by a customer’s valid configuration change.
Impact: 85% of Fastly’s services returned errors, taking down high-profile websites like The Guardian, CNN, and some streaming platforms.

A cloud content delivery network (CDN) outage demonstrated how one glitch could take down a significant chunk of the web. Fastly, a top CDN provider, experienced a global outage on June 8, 2021. It affected high-profile websites as well as some UK government websites.

The engineers were able to detect the problem and identify the culprit configuration. The incident raised awareness of the reliance on a few CDN providers and prompted companies to revisit redundancy for their web infrastructure.

Here are some quick takeaways and learnings from the Fastly CDN outage:

Diversify delivery services. Consider two or more CDNs for optimal delivery. This reduces the impact on your site when one CDN faces a service disruption.
Create a backup plan. Ensure visibility into indicators of issues and know when to activate backup procedures.
Understand your dependencies. Consider the hidden ones and even the indirect dependencies. If you rely on external services for site or app components, understand dependencies like DNS, hosting, etc.
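The first two takeaways can be combined into a simple failover rule: keep an ordered list of CDNs and send traffic to the first one that passes a health check. The sketch below is a minimal illustration under stated assumptions: the hostnames are placeholders, and `is_healthy` stands in for whatever probe you actually run (for example, fetching a known object over HTTPS). In practice this decision is usually made at the DNS or load-balancer layer rather than in application code.

```python
def pick_cdn(cdns, is_healthy):
    """Return the first healthy CDN in priority order, or None if all are down.

    cdns is an ordered list of hostnames (placeholders in the example below).
    is_healthy is a caller-supplied probe that returns True or False.
    """
    for host in cdns:
        try:
            if is_healthy(host):
                return host
        except Exception:
            # A probe that raises is treated as unhealthy; try the next CDN.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical hostnames; the primary is simulated as being down.
    cdns = ["cdn-a.example.com", "cdn-b.example.com"]
    print(pick_cdn(cdns, lambda host: host != "cdn-a.example.com"))
```

Returning `None` when every provider fails is deliberate: it gives the caller an explicit signal to activate the backup plan, such as serving directly from the origin.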

There are several other aspects to consider when taking steps to manage IT outages. Most importantly, how you handle the incident speaks volumes about your commitment to serving your customers.

On the tech side, incident management software helps you respond to, report on, and investigate digital incidents, and keeps everyone in sync when things suddenly descend into chaos.

2020: Google service outage

Cause: An authentication system bug was caused by an internal storage quota issue.
Impact: Google services that require a login were unreachable worldwide.

On December 14, 2020, Google services suddenly went down. Gmail, YouTube, Docs, Maps, Calendar, and even Nest smart home services stopped working. Google later confirmed the cause of the outage: an issue in its central identity management system.

The internal storage quota was exhausted in the system that handles user authentication. As a result, Google’s login and account APIs failed, and services that depend on them became globally inaccessible for roughly 45 minutes. During this time, only the search engine remained up, as it didn’t require a login.

Although it lasted under an hour, the outage’s impact was immense due to Google’s ubiquity. It underscored how a hidden single point of failure, in this case an internal quota configuration, could disrupt the daily workflow of billions, from businesses to schools and consumers.
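A quota that quietly runs out is easier to catch when usage is checked against thresholds well before the limit. The following is a minimal sketch of such a check, not a description of Google’s internal tooling; the function name, the 80% and 95% thresholds, and the three alert levels are assumptions chosen for illustration.

```python
def quota_alert(used_bytes, quota_bytes, warn_at=0.80, page_at=0.95):
    """Classify quota usage as "ok", "warn", or "page".

    The thresholds are illustrative assumptions. A zero or negative quota is
    treated as already exhausted, because a misreported limit is itself a
    failure worth paging on.
    """
    if quota_bytes <= 0:
        return "page"
    ratio = used_bytes / quota_bytes
    if ratio >= page_at:
        return "page"
    if ratio >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    for used in (50, 85, 96):
        print(used, quota_alert(used, 100))
```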

2019: Verizon BGP Route Leak

Cause: A misconfigured BGP optimizer at a small internet service provider (ISP), compounded by Verizon’s lack of route filters.
Impact: Internet traffic misrouted, causing outages and slowdowns for Cloudflare, Amazon, Facebook, and others. Cloudflare saw a 15% drop in global traffic during the incident.

On June 24, 2019, a BGP mishap demonstrated how fragile the Internet’s routing system can be. It started when a small Pennsylvania ISP, using a BGP optimization tool (Noction), leaked thousands of improper routes to its upstream provider, Verizon.

Verizon, one of the largest internet backbone providers, propagated these routes globally instead of filtering them out. The result was an internet traffic jam: large portions of traffic destined for big services were erroneously routed through the networks of DQE (the small ISP) and Verizon, and then dropped or slowed because those networks couldn’t handle the load. Cloudflare reported a 15% loss of its global traffic at its worst point.

The incident lasted a few hours on Monday morning before the bad routes were corrected.
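The missing safeguard in this incident was route filtering at the upstream provider. Real routers express this in BGP configuration, not Python, so the sketch below only illustrates the logic of two common protections: accepting a neighbor’s announcements only if they fall inside the address blocks that neighbor is allowed to originate, and rejecting everything if the neighbor suddenly announces far more prefixes than expected. The function name, the limit of 100, and the documentation-range prefixes are assumptions for the example.

```python
import ipaddress


def filter_routes(announced, allowed, max_prefixes=100):
    """Apply a prefix allow-list and a maximum-prefix limit to announcements.

    announced: prefixes received from a neighbor, as strings.
    allowed: address blocks the neighbor is permitted to originate.
    Returns (accepted, session_up). If the neighbor announces more than
    max_prefixes routes, nothing is accepted and session_up is False,
    mimicking a session being shut down by a maximum-prefix limit.
    """
    if len(announced) > max_prefixes:
        return [], False
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    accepted = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # Compare only like with like: subnet_of raises on mixed IPv4/IPv6.
        if any(net.version == a.version and net.subnet_of(a) for a in allowed_nets):
            accepted.append(prefix)
    return accepted, True


if __name__ == "__main__":
    # Documentation prefixes (RFC 5737); the second one is outside the allow-list.
    print(filter_routes(["192.0.2.0/25", "203.0.113.0/24"], ["192.0.2.0/24"]))
```

In this sketch, a leak of thousands of routes would trip the prefix limit before any of them were propagated further.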

2017: Amazon Web Services S3 Outage

Cause: An AWS engineer mistakenly removed too many servers during a routine procedure.
Impact: 4-hour outage of AWS S3 storage in one region, cascading failures across many apps.

On February 28, 2017, Amazon’s widely used cloud storage service, Simple Storage Service (S3), went down in the Northern Virginia region due to a simple mistake. An AWS team member, while debugging the billing system, ran a maintenance command with the wrong parameter, removing a far larger set of servers than intended. This triggered a cascade: critical S3 index and placement subsystems lost capacity and had to be restarted, a process that took hours.

For about four hours, S3 was unable to serve requests in that region. Popular websites and apps like Quora, Slack, Medium, Trello, Business Insider, and Docker Hub became unavailable or severely degraded. Even AWS’s own status dashboard failed, since its icons were stored on S3. The economic impact was substantial.

One analysis estimated that S&P 500 companies alone lost $150 million due to the incident, not counting the numerous startups and third-party services also affected.
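A mistyped parameter is less dangerous when the tool itself refuses requests that are obviously too large. The sketch below shows one way to add such a guard to a capacity-removal command. It is a hypothetical illustration, not AWS’s tooling: the function name, the batch limit, and the minimum-capacity floor are assumptions.

```python
def plan_removal(active, requested, min_active, max_batch):
    """Validate a request to remove servers and return the approved count.

    Raises ValueError if the request is negative, exceeds the per-command
    batch limit, or would leave fewer than min_active servers running.
    All limits here are illustrative assumptions.
    """
    if requested < 0:
        raise ValueError("requested must not be negative")
    if requested > max_batch:
        raise ValueError(
            f"refusing to remove {requested} servers at once (limit {max_batch})"
        )
    if active - requested < min_active:
        raise ValueError(
            f"removal would leave {active - requested} servers, below the "
            f"minimum of {min_active}"
        )
    return requested


if __name__ == "__main__":
    print(plan_removal(active=100, requested=5, min_active=80, max_batch=10))
    try:
        # A typo turns 5 into 50; the guard rejects it instead of executing.
        plan_removal(active=100, requested=50, min_active=80, max_batch=10)
    except ValueError as err:
        print("blocked:", err)
```

Failing loudly is the point of the design: an operator who sees a refusal can recheck the command, whereas a tool that silently complies turns a typo into an outage.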

2016: Dyn DNS Attack

Cause: DDoS attack by the Mirai IoT botnet.
Impact: Major websites, including Twitter, Netflix, and Reddit, went down across the US and Europe.

On October 21, 2016, a DDoS attack on DNS provider Dyn disrupted internet access on a massive scale. Dyn’s role was to translate domain names to IP addresses for many popular sites. Beginning that morning, a Mirai botnet of malware-infected IoT devices bombarded Dyn’s DNS servers with fake lookup requests, overwhelming them.

Dyn estimated that approximately 100,000 malicious endpoints were targeting its infrastructure, with traffic peaking at 1.2 Tbps, roughly twice the size of any previous DDoS on record at the time. The attack came in waves and knocked major services like Twitter, Netflix, Reddit, PayPal, CNN, and even The Guardian’s site offline for users across the U.S. and Europe.

2011: PlayStation Network Outage

Cause: External hack and data breach, forcing Sony to shut down PSN servers.
Impact: More than 77 million users had their personal data compromised.

In April 2011, Sony’s PlayStation Network (PSN) suffered one of the longest gaming service outages in history. Hackers infiltrated PSN between April 17 and 19, stealing account data like usernames, passwords, and possibly credit card info from over 77 million users.

In response, Sony completely shut down PSN on April 20 to contain the breach. The network remained down until May 14, leaving PlayStation gamers unable to access online games.

This incident, essentially a massive cyberattack, cost Sony an estimated $171 million in remediation and security improvements.

A quick takeaway, before your service takes away

Taken together, these incidents suggest that cyber attacks aren’t the primary cause of most IT outages, despite how often they’re blamed. Of the ten outages above, only two were caused by attackers; the rest came down to human error, misconfiguration, software bugs, or outdated systems. You don’t need a malicious attacker when an unintentional mistake or a misconfigured setting can do the same damage.

This highlights the importance of addressing internal gaps as well as defending against external threats. Neither takes priority over the other: take both equally seriously to keep your systems up.

Learn more about incident response and make security incidents less chaotic.
