Code Coverage And Other Metrics

Published on July 15, 2025

I’ve already talked about code coverage in unit testing, and why the metric sucks. In short, code coverage sucks because of two things: as a measurement it doesn’t tell us much about quality, and some people (I will not name names) abuse it to hit other people on the head with it.

So you may ask, how has it become such a popular quality measurement?

Assuming we’re not evil, and not looking to hit people on the head, we use coverage as a proxy, a placeholder, for the answer we’re really interested in: Is the software ready for release?

We’re looking for a quality metric that can be used to decide about a release.

Now, the reason code coverage took off is that it’s easy to measure: it runs on our code. We’re trying to project from our code to actual quality. If we wrote tests for our code, and they pass, our code must be good.

Unfortunately, there are a few problems with that idea.

First, it doesn’t take into account whether the tests are good, for any definition of good. They may not check the right thing, or may not check it correctly. It also doesn’t tell us if the code does everything it needs to do. We can have 100% coverage of code that’s doing 50% of the job.
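To make that concrete, here’s a minimal sketch in Python (the discount function and its capping requirement are made up for illustration). Every line runs, so a coverage tool reports 100%, yet the test verifies almost nothing, and half the requirement isn’t even implemented.

```python
# Hypothetical requirement: apply a discount, but cap it at 50%.
def discounted_price(price, percent):
    # The cap is missing - the code does only half the job.
    return price * (1 - percent / 100)

# This test executes every line of discounted_price, so line coverage is 100%.
def test_discounted_price_runs():
    result = discounted_price(100, 80)
    assert result is not None  # a weak assertion; the cap is never checked
```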

More importantly, our code is a very small piece of the puzzle. Our application runs on servers, operating systems, libraries, 3rd party software, other teams’ software – all this before the first line of our code even runs. So we may feel good about our code coverage, but it only tells us about a fraction of what’s really running.

Code coverage doesn’t reflect anything meaningful in terms of quality. It can’t be used as a release metric.

Code Coverage Fails. What Else Have We Got?

So what else do we have to tell us if we’re ready to release? Next is the test plan. Or rather the test report. We can look at passing tests and get an idea from that, right?

Let’s say we’ve put our best brains together and come up with 100 workflows: if (and when) they all work, we can release the software.

That is a lot better, because it ignores how the software is built. We can build everything ourselves, or buy/rent/steal everything and not write a single line of code. Either way, the quality criterion stands: 100 passing workflows – good, 0 – bad.
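As a sketch (the workflow checks themselves are hypothetical), the criterion can be written down without any reference to how the software is built:

```python
# Each workflow check drives the application from the outside - inputs and
# outputs only - and returns True if the workflow behaved as expected.
def release_ready(workflow_checks) -> bool:
    passing = sum(1 for check in workflow_checks if check())
    print(f"{passing}/{len(workflow_checks)} workflows passing")
    return passing == len(workflow_checks)  # 100 passing - good, fewer - bad
```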

That’s where black-box testing comes from, by the way. The idea that software structure doesn’t matter, only inputs and outputs. But, this method has a few issues.

First, are all workflows equal in importance? Let’s say I’m testing a text editor. One of the workflows in question is saving and loading the document. Another is using keyboard shortcuts for editing.

Do these workflows have the same level of importance? For example, if the first one doesn’t work, and the second one does, will we release the software?
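Here’s the problem in miniature (the results are made up): two runs of those same two workflows, identical scores, and only one of them is releasable.

```python
# Two hypothetical runs of the same two workflows, each with one failure.
run_a = {"save and load document": False, "keyboard shortcuts for editing": True}
run_b = {"save and load document": True, "keyboard shortcuts for editing": False}

for name, results in [("run A", run_a), ("run B", run_b)]:
    passed = sum(results.values())
    # Same number, very different meaning for a release decision.
    print(f"{name}: {passed}/{len(results)} workflows passing")
```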

Just a number, or percentage, is missing the context we need. But that’s not the only problem. We’re looking at a moving target.

These 100 workflows are just the beginning. Next release, these 100 become 110. The inequality grows, because it’s hard to compare new to old. Old workflows seem less important because they are stable by now; the new ones, less so.

Automation or Manual?

Automation confuses us even more. Of course, we’ve automated the main workflows, but not all of them, so we run the other workflows manually. But only occasionally, because it’s hard and takes time. And we need to fit 20 more cases, which are important, but obviously not as important as the ones we didn’t run. Which are important, because we said so in the beginning.

Are you confused yet?

In other words, we bend the rules all the time. The numbers don’t lie; they just project an image that is hard to understand.

And then we go and make decisions based on them.

One more thing. In a lot of cases, the configuration we’re testing our app on may not be the same as the real production environment. Our staging environment is not exactly the same as production – sometimes it’s completely different.

We see a set of quality measurements that may or may not apply to the real world. That’s why we need to test in production too. But when we do, does that count as increasing the coverage? Does it really matter by now?

Do we really need metrics?

There is no one magic number (or metric) that can tell us if we are ready to release. And there won’t be. We’re still going to look at several measurements, and opinions, to make a decision about a release.

Even in automatic continuous delivery-based systems, we still need to define what works, and what’s good enough to push forward. Or outward.

The truth is, the word “coverage” is misleading and not really useful. It creates a feeling of safety. But in order to get that safety, we can’t rely on tools spitting out numbers to keep us confident.

Yes, tools can help. And some decisions can be encoded. But we need to remember what these numbers mean, and ask: are they still applicable, and how?
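For example, part of a release decision can be encoded as a simple gate (all the names here are made up): the critical workflows block the push, and the rest get flagged for a human to look at.

```python
# A sketch of an encoded release decision, not a real pipeline step.
CRITICAL = {"save and load document"}

def release_gate(results: dict[str, bool]) -> bool:
    critical_failures = [w for w in CRITICAL if not results.get(w, False)]
    other_failures = [w for w, ok in results.items() if not ok and w not in CRITICAL]
    if other_failures:
        print("Worth a human look:", ", ".join(other_failures))
    return not critical_failures  # only critical failures block the release
```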

Until then, check out my Unit Testing and TDD workshop. Yes, I talk about coverage there. I still say the same thing there. But at least, you’ll learn what the tests really give you, and how you can make the most of them.
