Learning from CrowdStrike with Taguchi

Published on September 10, 2024

The recent CrowdStrike incident is estimated to have “affected 8.5 million Windows devices” [1] and may have been “the worst cyber event in history” [1]. How should we understand its impact on quality?

Genichi Taguchi’s definition of quality helps us understand how the CrowdStrike incident affected quality. He wrote that “quality is the loss a product causes to society after being shipped, other than any losses caused by its intrinsic functions”[2] and that “loss should be restricted to two categories:

  1. Loss caused by variability of function
  2. Loss caused by harmful side effects”[3]

W. Edwards Deming described who bears this loss: “Somebody must pay for this loss – Dr Taguchi called it a loss to society …. We all help to pay for a mistake, a breakdown, failure (bankruptcy) of a company, inept management.”[4]

The CrowdStrike incident can be seen as causing loss to society because it showed extreme variability of function and harmful side effects on “8.5 million Windows devices”[1].

Defining quality as the loss caused to society gives us insight into the effects of an incident like CrowdStrike. The software that we develop and test also affects society. If a business or person that uses our software experiences problems because of a bug in it, that is a loss to society. By Taguchi’s definition, one measure of the quality of software is the loss it causes to society.

Taguchi won the Deming Prize, Japan’s highest award for quality, for his ‘loss function’, which enables the loss to society due to variability of quality to be quantified[5].
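To make the idea of quantifying loss concrete, here is a minimal sketch of the loss function in its commonly cited quadratic ("nominal-the-best") form, L(y) = k(y - m)^2, where m is the target value and k is a cost coefficient. The formula is the standard textbook form rather than something quoted in this article, and the function names and the response-time figures below are hypothetical, chosen only for illustration.

```python
def loss_coefficient(cost_at_limit, tolerance):
    """Derive k from the cost incurred when the value sits exactly at the tolerance limit."""
    return cost_at_limit / tolerance ** 2


def taguchi_loss(y, target, k):
    """Quadratic loss for one observed value y: zero at the target, growing with the squared deviation."""
    return k * (y - target) ** 2


def average_loss(values, target, k):
    """Mean loss per unit across a set of observed values."""
    return sum(taguchi_loss(v, target, k) for v in values) / len(values)


# Hypothetical example: a response-time target of 200 ms, where a deviation
# of 20 ms is assumed to cost 100 (in arbitrary currency units).
k = loss_coefficient(cost_at_limit=100, tolerance=20)   # 0.25
on_target = taguchi_loss(200, target=200, k=k)          # 0.0
slightly_off = taguchi_loss(210, target=200, k=k)       # 25.0
at_limit = taguchi_loss(220, target=200, k=k)           # 100.0
```

The point of the quadratic shape is that loss is not zero merely because a value is "within specification": any deviation from the target carries some cost, and that cost rises quickly as variability increases.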

John Hunter gives practical advice on how to use Taguchi’s insight: “I have seen the concept of the Taguchi Loss Function used quite a bit. I have never actually seen any losses quantified and totaled and shown on a graph. I think focusing specifically on who suffers a loss and what that loss could be, can help. I think actually quantifying the losses to society can be daunting. So, while I see the value in framing the concept that way I think to actually get the losses quantified you are best served by starting with those closest to the process and then adding additional losses to those results”[6].
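Hunter's suggestion of starting with those closest to the process and then adding further losses can be sketched as a simple running total. The stakeholder groups and amounts below are made-up placeholders in arbitrary units, not figures from the CrowdStrike incident or any other real event, and the function name is hypothetical.

```python
# Hypothetical loss estimates, ordered from closest to the process outward.
losses = [
    ("development team: rework and hotfix", 10),
    ("support team: handling tickets", 5),
    ("direct customers: downtime", 40),
    ("customers' customers: missed services", 100),
]


def cumulative_losses(items):
    """Return (who, amount, running_total) rows, widening the circle one group at a time."""
    total = 0
    rows = []
    for who, amount in items:
        total += amount
        rows.append((who, amount, total))
    return rows


for who, amount, running_total in cumulative_losses(losses):
    print(f"{who:<45} {amount:>5} {running_total:>6}")
```

Listing the groups in this order keeps the first rows grounded in losses the team can measure directly, while the later rows make the wider loss to society visible even when the estimates are rough.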

One lesson we can learn from the CrowdStrike incident is to use Taguchi’s insight to consider how faults in the software we develop and test can cause losses to society and, from there, how to avoid those faults.

References

[1] CrowdStrike IT outage affected 8.5 million Windows devices, Microsoft says

[2] Introduction to Quality Engineering by Genichi Taguchi (1986, p1)

[3] Introduction to Quality Engineering by Genichi Taguchi (1986, p2)

[4] The New Economics by W. Edwards Deming (1994, p218)

[5] Introduction to Quality Engineering by Genichi Taguchi (1986, p19)

[6] Taguchi Loss Function blog post by John Hunter
