Chaos engineering and sociotechnical systems

Published on July 24, 2024

This article covers a few things: chaos engineering, an analogy that explains it, and a closer look at the relationship between software and the team that produced it. It was sparked by a conversation with Stuart Day, for which I’m very grateful.

Chaos engineering

Chaos engineering is a technique for improving the resilience of a software system.  One of its best-known enthusiasts is Netflix, which open-sourced a chaos engineering toolkit known as the Simian Army.  (The name comes from the parts of the toolkit, which are called things like Chaos Monkey and Chaos Gorilla.)

The idea is to run programs that deliberately attack your production system in some way.  One kind of attack kills part of the system, such as one of the instances of an API.  Another inserts itself as a go-between for two parts of the system and deliberately passes messages from one to the other very slowly, to force timeouts in one or both parts.
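To make the slow go-between concrete, here is a minimal Python sketch.  It is not how the Simian Army works.  The names backend, slow_go_between and call_with_timeout are made up for illustration, and a thread plus time.sleep stands in for a real network hop.

```python
import concurrent.futures
import time


def backend(request: str) -> str:
    """Hypothetical downstream service that normally answers straight away."""
    return f"ok:{request}"


def slow_go_between(request: str, delay_seconds: float) -> str:
    """Chaos go-between: forwards the request, but only after a deliberate delay."""
    time.sleep(delay_seconds)
    return backend(request)


def call_with_timeout(request: str, delay_seconds: float, timeout_seconds: float) -> str:
    """Caller that gives up and uses a fallback if the reply takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_go_between, request, delay_seconds)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The injected delay exceeded the caller's patience: this is the
        # code path the attack is meant to exercise.
        return "fallback"
    finally:
        # Do not block the caller while the slow request finishes.
        pool.shutdown(wait=False)
```

With no injected delay the caller gets the backend's reply.  With a delay longer than the timeout (say 0.3 seconds against a 0.05 second timeout) it gets the fallback instead, which is exactly the behaviour you want to see working before a real slowdown happens.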

It might seem that attacking your system is the last thing you should do in the name of making it more resilient.  However, it forces programmers to deal with problems that they hope will never happen, because now those problems will happen.  What if this bit of the system that my code depends on crashes?  It also tests the monitoring and related parts of the system – can you detect that the bad thing has happened? Finally, it acts as a fire drill at a convenient time, rather than at 3 a.m.

A crucial part of this is that it’s done carefully.  There is a risk involved – customers could get a worse service.  There are several ways to manage the risk, such as trying an attack in a staging environment first, running it out of hours first, and starting with only small attacks on less important parts of the system.  Then there’s the reflection afterwards – how did it go? Did we learn anything?  Do we need to change anything?
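One way to picture “starting small” is as a rule for choosing what to attack.  The Python sketch below is only an illustration under assumptions of my own: the Instance record, its fields and the pick_victims function are hypothetical, not part of any real chaos toolkit.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    environment: str  # for example "staging" or "production"
    critical: bool    # True for parts of the system we are not ready to attack


def pick_victims(instances, environment, max_fraction, rng):
    """Pick a small random set of non-critical instances in one environment.

    max_fraction caps how much of the eligible fleet may be attacked at once.
    The count is rounded down, so a very small fleet may get no attack at all.
    """
    candidates = [
        inst for inst in instances
        if inst.environment == environment and not inst.critical
    ]
    limit = min(len(candidates), int(len(candidates) * max_fraction))
    return rng.sample(candidates, limit)
```

A first experiment might call pick_victims(instances, "staging", 0.25, random.Random()) and only move on to "production", or to a larger fraction, once the team has reflected on how the earlier rounds went.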

An analogy – vaccines

One way to understand chaos engineering is through the analogy of vaccines.  Vaccines also aim to improve the resilience of a system, where the system is a human or animal.  They work by attacking the body in a small and controlled way.  The dose is big enough that the body learns how to deal with that kind of attack (by developing antibodies to the virus or other pathogen that the vaccine targets).  At the same time, the dose is small enough that the patient can cope with the attack – most of the time they will get only mild aches or similar symptoms.

The analogy breaks

Once I thought about this analogy I noticed where it breaks.  With a vaccine, the thing being protected is something like your respiratory system, and the thing that becomes more resilient via the vaccine is a different system – the immune system.  However, these two systems are bound together in one body – when you get on a bus, your immune system and respiratory system get on it together.

With chaos engineering, the thing being attacked is production code, but the thing that improves is the skill of the people who build it.  When a new release is pushed to production, it’s only the code that’s deployed and not the people.  (I assume you don’t depend on a Mechanical Turk, where members of your team are hidden away inside empty server racks, manually responding to e.g. HTTP requests that come in over the network.)

This appears to be where the analogy falls down, because of the fundamental separation of team and code.

Digging deeper reveals the analogy isn’t so broken

However, I then remembered Conway’s law:

“Organisations which design systems … are constrained to produce designs which are copies of the communication structures of these organisations”

Code and team are definitely different kinds of thing, but they are strongly bound together.  I was reminded of the term I’ve heard people like Dan Terhorst-North use: a sociotechnical system.  I take this to mean a system where you need to consider both technical things (the software, hardware etc.) and social things (the individuals in the team and their interactions) to understand the system properly.

The code in production reflects the skills, blind spots, fears, preferences, habits, prejudices and experience of the individuals who produced it.  It also reflects the team – their goals, the ways they work together, what they consider important etc.  For instance, do the people with power hog conversations, and so miss a contribution from someone lower in the pecking order that might have prevented a disaster?

It’s as if the code and team are part of one iceberg.  The code is the part of the iceberg above the water – visible to users via its interfaces and then via the behaviour behind those interfaces.  The team is the part of the iceberg below the surface – hidden from users, but supporting the visible part.

The system as a whole – code and team combined – is both attacked by chaos engineering and improved by it, so vaccines are once again a suitable analogy for chaos engineering.

Iceberg floating in the sea.  The part below the water is visible.
AWeith, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons