
5 steps to a deterministic test suite
If you’re writing complex software, and if your automated testing is comprehensive enough to catch real-world defects, then you probably also have a flaky test problem.
Flaky tests are a problem because, if left unchecked, they undermine confidence in your software quality. New flakiness isn’t noticed quickly, and you get into a pattern of either spending all your time investigating test failures or releasing without fully investigating them. You might start out by thinking that “it’s okay because those tests don’t matter”, but eventually a test will fail that does matter, so you end up releasing regressions. In its early days, Undo went through that phase, and it wasn’t a good place to be.
So how did we do better?
One prong of our approach was to write better tests. I’ve previously written about how finding the sources of flakiness in our tests allowed us to reduce our regression rate.
But knowing how to improve your tests is only part of the solution:
- it’s too big a job for one person, so you need an efficient way to distribute the work;
- often, the reason for the nondeterministic behavior is in product code rather than test code, so you can’t look at the problem entirely through a testing lens.
Much has been written about writing high-quality software in general, so the question I want to answer here is: how do you convince a team of software engineers with lots of other important work to do that they need to spend their time fixing flaky tests? Here’s what worked at Undo.
Step 1: Choose metrics wisely to communicate the scale of your problem
Knowing and communicating how big a problem you have is much easier if you have a metric for it.
Initially we tracked the number of failing test runs per month, but that was too coarse: by the time the metric had regressed, we were already in a hole that was hard to dig ourselves out of.

We needed something more sensitive, so we started measuring the number of regressions per test run instead and presented the results monthly to the whole development team. This required more work, since every failure needed to be logged, but it was a necessary step to improve.
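To make that concrete, here is a minimal sketch of the calculation, assuming each run’s triaged failures are logged somewhere countable. The data format is hypothetical; the metric itself is just a ratio.

```python
def regressions_per_run(regressions_by_run: dict[str, int]) -> float:
    """Average number of logged regressions per test run.

    `regressions_by_run` maps a run identifier to the number of distinct
    regressions logged against that run (a hypothetical format; the point
    is only that every failure gets logged somewhere you can count).
    """
    if not regressions_by_run:
        return 0.0
    return sum(regressions_by_run.values()) / len(regressions_by_run)


# Example: three runs in the reporting period, four regressions logged.
print(regressions_per_run({"run-001": 2, "run-002": 0, "run-003": 2}))  # ~1.33
```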

Step 2: Divide your product into named modules and identify maintainers
For software to be reliable, it’s crucial to have a robust process for identifying owners for test failures. When I joined Undo many years ago, we had a big pool of tests which often exercised functionality cutting across a number of product modules, and those modules were identified only by test tags. When one of our tests failed, it was difficult to find an owner: it wasn’t clear which module was primarily being tested, and the modules didn’t have explicit maintainers.
For the first of those problems, it turned out that the vast majority of our tests, even if they tested multiple modules, could be associated with a single primary module.
For the second, we sought to find maintainers for every module. The ideal number in my experience is two: you want sufficient bus factor while minimizing diffusion of responsibility. We recorded the maintainers in our CODEOWNERS file in git.
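For illustration, an entry per module with two maintainers might look something like this in CODEOWNERS (the paths and usernames here are made up, not our real layout):

```
# Hypothetical layout: one directory per named module, two maintainers each.
/src/recorder/        @alice @bob
/src/translator/      @carol @dave
/tests/recorder/      @alice @bob
```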
Now, when a test fails, the accountability for resolving it is much clearer.
Step 3: Weaponize bureaucracy to force change
When the rate of test regressions was high, we needed a daily meeting first thing in the morning to understand each failure and find an owner for it. When the situation improved, we didn’t need the meeting any more, but the risk of backsliding remained. So we needed a self-correcting approach (sketched in code after this list):
- When the pass rate was above 50% we cancelled the meeting.
- When the pass rate was between 30% and 50% we would cancel the meeting only if the metric was going in the right direction.
- When the pass rate was below 30% we held the meeting as scheduled every day.
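Expressed as code, the policy is just a small decision function. Here is a rough Python sketch; treating “going in the right direction” as a comparison with the previous period’s pass rate is an assumption of the sketch, not a statement of our exact rule.

```python
def hold_daily_meeting(pass_rate: float, previous_pass_rate: float) -> bool:
    """Decide whether today's triage meeting goes ahead.

    Rates are fractions of fully-passing test runs, e.g. 0.45 for 45%.
    """
    if pass_rate > 0.50:
        return False  # healthy enough: cancel the meeting
    if pass_rate >= 0.30:
        # Cancel only if the metric is improving (assumed to mean
        # "better than last period"); otherwise the meeting happens.
        return pass_rate <= previous_pass_rate
    return True  # below 30%: hold the meeting every day


assert hold_daily_meeting(0.55, previous_pass_rate=0.40) is False
assert hold_daily_meeting(0.40, previous_pass_rate=0.35) is False  # improving
assert hold_daily_meeting(0.40, previous_pass_rate=0.45) is True   # regressing
assert hold_daily_meeting(0.25, previous_pass_rate=0.20) is True
```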
Being able to say that if our regressions metric deteriorated beyond a certain value we’d need to have lots of meetings was a good incentive for us to collectively own failures and get them resolved.
Step 4: Focus on fixing flakiness, not improving product quality
Sometimes, flaky test failures aren’t caused by real product defects, so fixing them isn’t going to directly improve a customer’s experience of your product. It’s tempting to focus your efforts on the failures that are going to affect your users.
The problem with this line of reasoning is that understanding the impact of every failure takes a lot of investigation time. If you try to rule out customer impact for every test failure, you’ll perversely waste a lot of engineering time on failures that were never going to affect real users. All else being equal, it is better to prioritize the failures that are happening most often, regardless of their potential user impact.
At Undo we use three priority buckets (sketched in code after this list):
- High priority: failing persistently. These must be owned, and resolution must be a top priority within the next working day.
- Medium priority: more than one failure seen in the last month. These really should be owned, and progressing them at a sufficient rate is critical to reducing overall flakiness levels.
- Low priority: fewer than one failure seen in the last month. It’s good for these to be owned, but we shouldn’t fool ourselves that they will substantially affect overall flakiness levels.
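A toy version of the triage logic might look like this. How “persistent” is measured, and which bucket a test with exactly one recent failure lands in, are assumptions of the sketch rather than hard rules from our process.

```python
def priority_bucket(failures_last_month: int, failures_in_last_5_runs: int) -> str:
    """Assign a flaky-test failure to a priority bucket.

    "Persistent" is approximated here as failing in most of the last five
    runs; in practice that's a judgment call.
    """
    if failures_in_last_5_runs >= 3:
        return "high"    # failing persistently: resolve within the next working day
    if failures_last_month > 1:
        return "medium"  # more than one failure in the last month
    return "low"         # rare failure: worth owning, but won't move the metric much


print(priority_bucket(failures_last_month=4, failures_in_last_5_runs=4))  # high
print(priority_bucket(failures_last_month=3, failures_in_last_5_runs=1))  # medium
print(priority_bucket(failures_last_month=1, failures_in_last_5_runs=0))  # low
```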
Organizationally, the High and Low buckets tend to look after themselves. Paying attention to the Medium bucket is what allowed us to see the wood for the trees and identify the real problems in our test results, resulting in the quality improvements that we wanted all along.
(If you’re concerned about ignoring low-priority failures: candidly, at Undo we have a longish tail of intermittent failures that have only ever happened once, many months ago — I bet you do too. And you know what? In 10+ years I’m not aware of any customer running into one of them. Prioritizing these failures behind the ones that fail more often has allowed us to spend our valuable engineering time more effectively.)
Step 5: Save good artifacts when tests fail
When a test fails, it’s often impossible to determine from the output what went wrong. But you need to progress it somehow.
Sometimes we’d need to add extra logging that would be printed the next time we saw a failure; other times a process listing might show where the CPU time was going. So we configured our test system to dump this kind of information as a matter of course.
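For example, the harness’s failure path can grab this state unconditionally. Here’s a minimal sketch of such a hook; the function name and artifact layout are hypothetical, and you’d add whatever is cheap and useful on your systems.

```python
import subprocess
from pathlib import Path


def save_failure_artifacts(test_name: str, artifact_dir: Path) -> None:
    """Dump a process listing alongside the test's logs when it fails."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    listing = subprocess.run(
        ["ps", "auxww"], capture_output=True, text=True, check=False
    ).stdout
    (artifact_dir / f"{test_name}.process-listing.txt").write_text(listing)


# Called from the harness's failure path, e.g.:
# save_failure_artifacts("test_reverse_step", Path("artifacts/run-001"))
```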
Another option available to us was to dogfood our own Undo suite: when a test fails, we re-run it with our process recording technology enabled, and this dumps a recording file with a full history of our product’s behavior. If this is an option for you too, try it out; you might find it’s exactly what you need.

Conclusion
I’ve listed five steps that each played a crucial role in making our test suites more reliable. Though I’ve presented them as a series of sequential steps, the reality is that we progressed each one gradually over many years with the balance of work shifting as our requirements evolved.
Not every software product will have the same testing challenges as ours, but if your problems look similar, I hope that some of the material here will be useful.