How quality is created, maintained and lost in complex software systems

Published on October 19, 2025

The CrowdStrike outage of July 2024 is one of the clearest examples of how a single flaw can ripple through an entire ecosystem. In this post, I explore how it happened, why root cause analysis falls short, and what quality engineering can teach us about preventing the next one.


Content

The Crash

Recorded Talk

What Happened?

Discovering The Issue

How Did It Happen?

  • The Micro View: What Went Wrong in the Code

  • The Macro View: Why One Bug Crashed the World

Why Were China And Russia Not Affected?

What Was The Root Cause?

So What Do We Do?

  • Building Quality In Through Quality Engineering

Summary

Conclusion


Please note that this is a long post with quite a few images, so it is best viewed in a browser/substack app rather than your email client, which is likely to truncate the post.


The Crash

On the 19th of July, at around 2:15 in the afternoon, Sydney Airport staff started to notice odd issues with some of their computers.

It was around 5 in the morning when the automatic barcode scanners at Gatwick Airport stopped working, and staff had to verify passengers manually and let them through.

It was around midnight when some airports in the US began issuing ground stops, which prevent all planes at those airports from taking off.

At first, it looked like whatever was happening was only affecting airports. But as more countries woke up, reports started coming in from hospitals, banks, and supermarkets. Even a few TV stations were unable to broadcast due to their IT systems all failing simultaneously.

With so many PCs going offline at roughly the same time across the globe, it was starting to look like a state-sanctioned cyber attack.

Oddly enough, China and Russia hadn’t reported any incidents. All their airports, hospitals, and banks seemed fine. But rumours started circulating online that a software update had gone wrong.

But many of the affected companies said they hadn’t performed any updates. Their PCs had simply restarted and started showing blue screens, or got stuck in endless boot loops.

Then someone purporting to work for one of these security companies posted a tweet: “First day at CrowdStrike, pushed a little update, taking the afternoon off”, followed by: “Fired. Totally unfair”. It turned out to be a joke.

Well, partly a joke. The company named, at least, was correct: CrowdStrike had pushed out an update to its Falcon sensor security software, which caused it to crash.

The result: nearly 24,000 customers affected, around 8.5 million PCs crashed or disabled, and an estimated cost of nearly $10 billion. It is currently considered one of the biggest IT outages ever seen.

So what went wrong? How did a company that has released several updates a day for years, and is trusted by some of the most heavily regulated industries, cause so much damage?
