
What can a mistake cost in the age of automation?

We all know the IT industry is going through a difficult period. Companies compete fiercely, cut corners, and automate everything within reach. And anyone who has worked in QA for a while has probably seen at least once how such internal compromises can severely limit testing.
But are the risks of reduced quality worth the savings? Of course, in each specific situation this is a question of risk management and of the people responsible for that risk. But here I want to revisit an old but striking case in which a critical error went undetected and cost more than $460 million.
In this article, I will talk about the large stock broker Knight Capital Americas LLC and the events of 2012 described in the report of the U.S. Securities and Exchange Commission.

Here’s what happened:
- On August 1, 2012, the old Power Peg code was replaced with the new SMARS code. However, one of the eight servers was not updated.
- The update was performed without proper testing.
- At 8:01 AM, the system started sending so-called BNET reject emails.
- The company’s employees had no procedure for responding to such messages, and all 97 failure notifications were ignored.
- When the market opened at 9:30 AM, the system began sending streams of erroneous orders.
- Having discovered the problem, the employees rolled back, removing the newly deployed SMARS code.
- No one had tested this scenario, and it turned out that removing SMARS reactivated Power Peg. Now all of the company’s servers were sending erroneous orders, and the situation got worse.
- In 45 minutes, the system sent orders totaling $6.65 billion and executed about 4 million transactions; the company’s loss exceeded $460 million.
A story worthy of a movie, but let’s look at the U.S. Securities and Exchange Commission report and trace the causes of this failure.
The first problem behind these events is Technical Debt.
Of course, this problem could have been avoided, but here’s what happened:
- Almost nine years before the incident, the company stopped using the old Power Peg code on its servers. However, the code remained in place and could still be executed.
- About seven years before the incident, this code was partially modified, but someone decided the change did not need to be tested.
- During the 2012 rollout, Power Peg was accidentally reactivated, because no one had tested what would happen if the new SMARS code were removed.
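The danger of this kind of dead code can be shown with a small sketch. This is purely illustrative, not Knight Capital’s actual code: the handler names and the runaway behavior are hypothetical, but the pattern is the same, a "retired" code path that is still wired into routing and silently takes over on a stale server.

```python
def power_peg(order):
    # Legacy handler, retired years ago but never deleted.
    # Illustrative runaway behavior: keeps emitting child orders.
    return ["child order"] * 100

def smars(order):
    # New handler: sends exactly one child order per request.
    return ["child order"]

# The old path is still reachable through a flag.
HANDLERS = {True: power_peg, False: smars}

def route(order, legacy_flag=False):
    # On a server where the new code was removed but the flag is set,
    # the dead code path silently becomes live again.
    return HANDLERS[legacy_flag](order)

# Updated server: the new code handles the order...
assert len(route({"qty": 1})) == 1
# ...stale server: the same request activates the dead code.
assert len(route({"qty": 1}, legacy_flag=True)) == 100
```

The fix is not to test dead code harder, it is to delete it; code that cannot run cannot be accidentally reactivated.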
The second problem is Processes.
I always say that a large company handling a lot of money has to run on well-defined processes; without them, its chances of survival shrink. Here is how things stood at Knight Capital:
- There were no formalized code-deployment procedures. A single technician copied the code to seven of the eight servers, with no colleague verifying the rollout.
- There were no processes for testing unused code.
- The documentation was incomplete and inaccurate, which prevented the risks from being identified in advance.
The third problem is poor communication between the business and the IT department.
This is how it was at Knight Capital:
- Employees received almost a hundred notifications, but did not know how to act on them and dismissed them as a formality.
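An alert that no one knows how to act on is noise. A minimal sketch of the missing piece, an escalation rule that turns repeated failure messages into a forced human response (thresholds and action names are hypothetical):

```python
def escalation_level(reject_count: int) -> str:
    # Map the number of reject notifications to a required action.
    if reject_count == 0:
        return "ok"
    if reject_count < 5:
        return "notify-on-call"   # a few rejects: someone investigates
    return "halt-and-page"        # a flood of rejects: stop trading, page everyone

assert escalation_level(0) == "ok"
assert escalation_level(3) == "notify-on-call"
assert escalation_level(97) == "halt-and-page"  # Knight received 97 such emails
```

Whatever the exact thresholds, the rule must exist in advance and name a concrete action, so that 97 messages can never again be read as a formality.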
This report shows that quality is never just about one specific line of code or application feature. Quality is a systemic issue and must be addressed accordingly.
Can every error lead to such huge losses? Of course not.
Does anyone want to be next on the list? Also, of course, not.