
Complex systems are bound to fail
When a system works, we humans too often push it to the brink of failure.
Many systems around us operate at the edge of failure. Any sufficiently severe issue makes them fail. Reliability seems to be an afterthought, robustness just too expensive. We want to make everything better, more efficient, more effective, just more. We always want more. We accept that failures happen. We extend systems until we no longer understand them. We seem incapable of accepting a simple solution as is.
Look at ecosystems. Humans intervene in them all the time. And then we wonder why the system collapses. Either we don’t care or we don’t understand it.
There are many ways in which a system may fail.
Adding capabilities, forgetting the core problem to solve
I have been an MS Word user for 30+ years. The last time I used Word, I probably needed less than 0.1% of its functionality. Maybe even less.
Programs and apps become convoluted just from adding more and more functionality. Some applications became popular because they were simple, solved an immediate problem, and were good at it. But then the need for more kicks in. Of course, if something is good enough, we cannot leave it at that. We need to add more to make it even better. Just to make it worse.
I rarely know of a product that got better by adding more and more capabilities. Not every product needs to be a Swiss Army Knife. If a product solves a problem very well and very efficiently, there is no need to make it do more. Is it inventors, business people, egos? I don’t know. When someone has success with a solution, for whatever reason, they want more. There are not many inventors who made one thing with huge impact and then created the next brilliant thing. In software it’s even worse. When you have one good application, the owners start putting in more and more functionality. Because only more is more.
Look at cars. The majority of people who want a car need it to move themselves, their family or friends, and a bunch of goods from A to B. In the first world we build cars so luxurious and overpowered that only the top 1% of the population can actually afford them, even though 5% or so buy them. Are we building cars for the wrong reasons these days? What happened to the Volkswagen Beetle?
Reaching capacity limits
Systems that cannot handle the load anymore. It starts with a simple solution; maybe you thought ahead a bit. Over time, more and more was added. The system needs to do more, or there are more actors in the system, and finally the architecture starts to fail. The system has reached its capacity limit.
There are examples in software applications, hardware infrastructure, or systems like your local public transportation network. Look at the suburban trains in Munich. The system is designed to bring people from the suburbs into downtown. As Munich has a distinct center, the rails were designed to lead from west to east through the middle of town, and every suburban line is attached at one of the two ends. Yes, they all meet at both ends and go through the middle, on one track in each direction. You can imagine how fragile this system is. The capacity limit of trains going through the tunnel was reached long ago. Oh, and yes, the central part with one track in each direction is underground.
Optimizing for failure
Trying to make things lean (like processes or people) until too much has been removed from the system. When key elements that took care of the stability of the solution are removed, the system fails. Teams, groups, companies, but also many other types of systems are continuously optimized.
We start with a system to solve a certain purpose. In the beginning we might not know exactly what is needed, so we add a bit extra to be on the safe side and build a good solution. At a certain point the system is ready to start solving the problem at hand. And now comes the point where the optimizing starts. We take away elements that were superfluous. Bit by bit, we reduce the nodes of the system. When you do it slowly and carefully, you can react to malfunctions of the system and reintroduce an element. When you do it too quickly, it will be harder to understand what just happened and why the system starts failing. You have removed a service that was doing one more function than you thought it was, and now the system no longer works. As you don’t know that it was that service, it will take a while to understand.
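The careful version of this process can be sketched in a few lines: remove one element at a time, check the system's health after each removal, and put the element back as soon as something degrades. This is a toy illustration under assumed names (`System`, `slim_down`, the component names are all hypothetical), not a real deployment procedure.

```python
class System:
    """Toy system: healthy as long as every *required* component is present."""

    def __init__(self, required, optional):
        self.required = set(required)
        self.parts = set(required) | set(optional)

    def remove(self, part):
        self.parts.discard(part)

    def add(self, part):
        self.parts.add(part)

    def healthy(self):
        # In reality this would be monitoring, tests, or user feedback.
        return self.required <= self.parts

def slim_down(system, candidates):
    """Remove candidate parts one at a time; roll back any removal that breaks things."""
    removed = []
    for part in candidates:
        system.remove(part)
        if system.healthy():
            removed.append(part)   # removal was safe, keep going
        else:
            system.add(part)       # the system degraded: reintroduce the element
    return removed

s = System(required={"auth", "db"}, optional={"cache", "legacy-export"})
print(slim_down(s, ["cache", "auth", "legacy-export"]))
```

The point of the one-at-a-time loop is exactly the argument above: when only one thing changed between a healthy and an unhealthy state, you know what to put back. Remove several elements at once and that knowledge is gone.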
Or you start removing people from teams, redistributing them to other projects. People are not clearly defined resources though, even if they are often treated like that. There is knowledge, enthusiasm, engagement, relations, and much more that defines a person. You cannot remove a person and assume the only impact is that the person’s key role or job is no longer done. That is short-sighted.
Don’t forget resilience and robustness, if you have the chance
Ecosystems often entail a certain resilience and ability to react to change. Humans can get away with a lot of change until the system collapses, more or less from one day to the next. But things don’t just happen. Remember the web of causation?
When you are part of a system and you can design it, keep resilience and robustness in mind. You have only so much influence on your system. You never know what will happen. The most pesky pests of all: users! Users! The world could be so nice without them. Users are creative, stupid, intelligent, malicious, arrogant, and so many other things. If you don’t want your system to fail, keep that in mind.
Prepare for tomorrow. If the system is not part of a very well defined larger system, things will change. Know what your limits are, or implement monitoring that shows you early when you are reaching them. Design the system in a way that makes it easy to maintain and extend. To take an IT example: writing down 100 lines of code to implement a certain functionality is easy. But implementing it in a way that you can still understand tomorrow what it does, or so that it is easy to adjust or remove, is not that easy. Making a system scale, making it fault tolerant and recoverable, these are the hard parts.
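"Monitoring that shows you early when you reach your limits" can be as simple as comparing utilization against thresholds that sit well below the hard capacity. A minimal sketch; the threshold values here are illustrative assumptions, not recommendations:

```python
def utilization_status(current_load, capacity, warn_at=0.7, critical_at=0.9):
    """Classify load against a known capacity limit.

    The whole point is to raise a warning long before the hard limit,
    so there is still time to plan capacity instead of reacting to failure.
    """
    ratio = current_load / capacity
    if ratio >= critical_at:
        return "critical"   # at the edge of failure: act now
    if ratio >= warn_at:
        return "warning"    # approaching the limit: plan ahead
    return "ok"

print(utilization_status(50, 100))   # well below the limit
print(utilization_status(75, 100))   # time to plan
print(utilization_status(95, 100))   # the Munich tunnel at rush hour
```

The design choice worth noting: the early-warning threshold is the valuable one. A system that only alerts at 100% tells you it has already failed.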
One last rant!
Because it just came up in the latest episode of the Cabrera Labs Podcast.
The whole is more than the sum of its parts.
No! It is not! If the whole is more than the sum of its parts, you just have not understood or found all of the parts, relations, and perspectives.