
Measuring the impact of LLM-based tools
At a recent DORA community Metrics Monday, one of our Lean Coffee topics was about ways to measure the impact of LLM-based tools. The person proposing the topic meant specifically IDE coding assistants such as GitHub Copilot, and possibly command-line code generation tools. That got me thinking, and my thinking extends to the other widely used LLM-based tools such as ChatGPT and Gemini. Lots of people are diving right into using these tools. There's talk about the time-saving benefits. I'm curious: how much time can we save using these tools? And what do people do with that freed-up time?
Note: Terminology trips me up at this stage of my learning journey. Some people say "AI-based" or "Generative AI", and maybe those are good generic terms. I'm told that tools like ChatGPT, Gemini, and GitHub Copilot are Large Language Models (LLMs).
There are studies!
GitHub conducted its own research into quantifying GitHub Copilot's effect on developer productivity and happiness. Personally, I avoid the word "productivity"; it brings back scar tissue that started forming when I was in MBA school. It means different things to different people, and it's often weaponized against innocent employees.

That said, I like this study because it looks at both measurable numbers and human feelings. They used Dr. Nicole Forsgren's SPACE framework, which covers Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. The results showed that developers completed tasks faster, conserved mental energy, focused on more satisfying work, and generally experienced more joy.
Other research found similar results. This PDF from Google Cloud, "Gen AI's impact on developer productivity", recommends focusing on business outcomes rather than individual developer effort and output metrics. It provides a survey based on DORA and the SPACE framework to get a baseline on your organization's level of developer productivity.
Anecdotal evidence
Talking with people who've been using LLM coding assistants like Copilot, I've heard some interesting outcomes. Some of these match my own quite limited experience. One feature I value a lot is that these tools generally do a good job of explaining code. I was a coder once, and I've paired and ensembled with coders for many years. And, I'm in a Code Reading Club where I'm learning a lot of useful techniques. Using the code explanation feature of a coding assistant is a valuable addition to my toolbox.
Developers I know also find the code explanation useful. They say it helps point out opportunities for refactoring. Another benefit is improved documentation and more tests.
These tools are not a replacement for learning the skills yourself. Results of using them vary with level of experience, programming language, project maturity and other factors. Context matters, and we should keep that in mind when trying to measure the impact of using these tools.
Potential metrics
I find that the DORA key metrics are often a good place to start when measuring the effect of any experiment. These are: change lead time, deployment frequency, change failure rate, and failed deployment recovery time. Janet Gregory and I wrote a blog post about applying these key metrics to process quality.
Lead time for changes is a good example. This is the time from when a merge or pull request is opened to merge a change into the code repository trunk to when that change is running in production. Let's say developers are able to produce code much faster with the help of an AI coding assistant. They might commit larger changes, and larger changes could mean a longer code review process, slowing down the lead time.
Conversely, let's say the coding assistant helps developers refactor the code and greatly improve its maintainability, testability, and operability. This could help prevent bugs from making it into production and reduce the change failure rate, the percentage of changes put into production that resulted in impaired service or outages.
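To make those two metrics concrete, here's a minimal sketch of how a team might compute them from its own deployment records. Everything in it – the field names, timestamps, and incident flags – is made up for illustration; in practice you'd pull this data from your version control system and deployment pipeline.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records: when each change was merged to trunk,
# when it was running in production, and whether it caused an incident.
deployments = [
    {"merged": datetime(2024, 5, 1, 9, 0),  "deployed": datetime(2024, 5, 1, 15, 0), "caused_incident": False},
    {"merged": datetime(2024, 5, 2, 10, 0), "deployed": datetime(2024, 5, 3, 11, 0), "caused_incident": True},
    {"merged": datetime(2024, 5, 6, 14, 0), "deployed": datetime(2024, 5, 6, 18, 30), "caused_incident": False},
]

# Lead time for changes: elapsed time from merge to running in production.
lead_times = [d["deployed"] - d["merged"] for d in deployments]
median_lead_time = median(lead_times)

# Change failure rate: percentage of production changes that impaired service.
failures = sum(1 for d in deployments if d["caused_incident"])
change_failure_rate = 100.0 * failures / len(deployments)

print(f"Median lead time for changes: {median_lead_time}")
print(f"Change failure rate: {change_failure_rate:.0f}%")
```

Tracked over time, numbers like these show whether a trend is moving in the right direction before, during, and after an experiment with a coding assistant.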
Design experiments
Shiny new technology! We all want to dive in! Before we get too excited, we might want to take a step back and consider what we hope to achieve. Get the team together to talk about the biggest problems getting in the way of your goals for satisfaction, performance, or other aspects of the SPACE model. Then consider whether a GenAI tool like Gemini or GitHub Copilot could help make that problem smaller. Design a small experiment, something to try for a couple of weeks or a month. Create a hypothesis.
Here's an example. Let's say your team would like to improve code and test quality so that pull/merge requests are approved more quickly. Your hypothesis might be something like: "We believe that using an AI coding assistant will help us get changes to production more frequently. We'll know we have succeeded when our lead time for changes is reduced by 5% and developer joy is increased by 10% in the next two weeks."
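As one way to close the loop on a hypothesis like that, here's a small sketch of the end-of-experiment check. The baseline and experiment numbers are hypothetical, and "developer joy" here is assumed to come from a simple recurring team survey score; substitute whatever measures your team agreed on.

```python
# Hypothetical numbers; real values would come from your delivery pipeline
# and from a short team survey taken before and during the experiment.
baseline_lead_time_hours = 52.0     # median lead time before the experiment
experiment_lead_time_hours = 47.5   # median lead time during the two weeks
baseline_joy_score = 6.0            # average "developer joy" survey score, 1-10 scale
experiment_joy_score = 6.9

lead_time_reduction_pct = 100.0 * (baseline_lead_time_hours - experiment_lead_time_hours) / baseline_lead_time_hours
joy_increase_pct = 100.0 * (experiment_joy_score - baseline_joy_score) / baseline_joy_score

# Targets from the hypothesis: 5% shorter lead time, 10% more developer joy.
print(f"Lead time reduced by {lead_time_reduction_pct:.1f}% (target: 5%)")
print(f"Developer joy increased by {joy_increase_pct:.1f}% (target: 10%)")

if lead_time_reduction_pct >= 5 and joy_increase_pct >= 10:
    print("Hypothesis supported: continue the experiment.")
else:
    print("Not there yet: tweak the experiment or try something else.")
```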
Of course there are lots of other ways you can measure. I'm giving the DORA metrics as an example and a starting place. The important thing is: identify a problem, set a short-term goal, choose an LLM-based tool that you think will help achieve that goal, and design a small experiment that includes a way to measure results.
Evaluate the results frequently. If you’re moving towards the goal, continue the experiment. If not, tweak the experiment or try something new. Perhaps team members need to learn ways to use the tool more effectively. Perhaps another tool or technique would work better. Small experiments will help you learn quickly whether all this mind-blowing new technology is moving your team in the right direction.
And don’t forget that the most important metrics are well-being and job satisfaction. Focus on outcomes and trends rather than hard numbers.
Use caution
Generative AI and LLMs have lots of potential to help us work better. They can take over some of the drudge or repetitive work we don't like. And, they present many serious dangers. They can pose huge security risks. They may be biased or poorly trained and provide you with bad information. That could be a whole 'nother blog post. Instead, I point you to this one by Birgitta Böckeler on Martin Fowler's website.
I’d love to hear about your and your team’s experiences! Please share how you are gauging the impact of AI technology on your team.