Why I’m Betting on LLMs for UI Testing

Published on June 29, 2025

Right now, we have two GenAI camps — the All-In-Folks and the Skeptics. Some people view GenAI as the solution to every problem in the world; others are absolutely refusing to use it or trust it and are clinging to the traditional ways of doing things. I tend to be a pragmatic dude, somewhere in the middle. I am excited to use it (every day), but I also try to understand the limitations and be realistic. The more I use it, the more excited I get about its future. Like any other new technology, I try to look ahead — a lot of the current limitations will eventually be worked out. Skate to where the puck is going, not where it has been.

For all the time we spend talking about GenAI generating code, we spend very little time talking about GenAI generating and executing tests. I believe LLMs are much better at the latter than the former.

Specifically, I have been playing with GenAI for UI testing. There’s much literature about GenAI for generating unit tests, but not nearly as much for integration and end-to-end testing. I had theorized that LLMs would be pretty good at that.

There are lots of inputs you could feed an LLM to have it generate tests: specs, design docs, production code, pre-existing tests, code coverage information (so it can aim at paths that haven’t been exercised yet), operational issues (to give it hints about where problems may lurk), etc.

The main question for me was: in what format should those auto-generated integration tests be?

  • Path #1 is having the LLM generate tests as code (e.g., Selenium or Appium), then executing those tests as usual.
  • Path #2 is taking a bigger leap — what if we used an LLM to generate the tests as natural language, then used another LLM to execute them? (A sketch of both paths follows.)
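To make the contrast concrete, here is a minimal sketch of the same test in both formats. The Selenium code is illustrative only; the URL and element IDs are hypothetical, as is the natural-language variant.

```python
# Path #1: the test as code. A minimal Selenium sketch; the URL and
# element IDs ("search-box", "search-button") are made up.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://store.example.com")
driver.find_element(By.ID, "search-box").send_keys("Harry Potter book")
driver.find_element(By.ID, "search-button").click()
assert "Harry Potter" in driver.page_source  # brittle: breaks if IDs change
driver.quit()

# Path #2: the same test as natural language, handed to an executing LLM.
NATURAL_LANGUAGE_TEST = """
1. Search for "Harry Potter book".
2. Verify the right book appears in the search results.
3. Add the book to the cart.
4. Verify the cart contains the book.
"""
```

Notice that the natural-language version carries no locators at all, which matters for the pros and cons below.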

There are pros and cons to both paths.

  • Using LLMs to generate test code means you can inspect the code, and once it’s generated it becomes “deterministic” — meaning you know exactly what gets executed on every single test run. This is within people’s comfort zone. [Not to be pedantic but I would challenge that if you are testing a distributed system with retries, your tests are already not deterministic, but I digress…]
  • But if we’re going to live in a world where both LLMs and humans write integration tests, it’s much easier for a human to write a test in natural language than to write one with a traditional UI framework like Selenium or Appium. So test authoring in natural language wins.
  • Generating tests as code in a traditional framework also means that whoever is inspecting that code needs to be knowledgeable in that particular framework. In contrast, anybody can inspect a natural language test and immediately understand what it’s doing. We all speak natural languages!
  • Generating tests as code means you are now on the hook for maintaining and evolving that code for the rest of your life, like any other piece of code. Software upgrade? Your problem. Production code changes, so you have to change the test code to reflect that? Also your problem.
  • Most interestingly, though, generating tests as natural language and having an LLM execute them creates an opportunity to think differently. UI tests are notoriously flaky. Two of the reasons: [1] unexpected message boxes can pop up at random; [2] tests often drive a UI by navigating the DOM and looking for element IDs, and if those change, the automation doesn’t know what to do. An LLM handles both of those easily.

To illustrate this, I wrote a simple UI test and fed it to an LLM for execution. As I walk you through this example, I encourage you to focus not just on the Action the LLM took, but on its Reasoning. The little thinking bubbles in my screenshots are verbatim copies of the LLM’s reasoning field, which gives you great insight into how LLMs reason through their world.

Step 1: It typed “Harry Potter book” into the search box.

Step 2: It clicked on the Search button.

Step 3: It verified that the right book was present in the search results.

Step 4: It clicked on the correct book. Amusingly, it even tried to save me money: it picked that version because “it was the most economical option!”

Step 5: This is where it goes off the rails a bit. The “Add to cart” button is below the fold here, so it’s not visible in the current screenshot. The bot found an “Add Prime to get fast, free delivery” button and theorized that clicking it would add the book to the cart. It was wrong, but to be fair, it was not a terrible guess; the bot was trying to find a path forward.

Step 6: It figured out on its own that the click hadn’t worked, and theorized that scrolling down would reveal the button, which is correct.

Step 7: It found the “Add to Cart” button. It’s happy now!

Step 8: It validated that the book was added to the cart. Notice it validates in four different ways, which is better than how I would have validated if I were writing this test!

Verification #1: Cart subtotal shows the right amount of money
Verification #2: “Added to cart” confirmation message appears
Verification #3: Cart icon shows 1 item
Verification #4: The Harry Potter book is visible in the cart preview

This simple test illustrates the power of LLMs in the domain of UI Testing.
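For the curious, here is a rough mental model of the loop that drove steps like the ones above. This is a hypothetical sketch, not the actual tool I used; the llm and browser objects and the response fields are assumptions.

```python
import json

def run_step(llm, browser, test_instructions, history):
    """One observe-reason-act iteration against a hypothetical interface."""
    screenshot = browser.take_screenshot()
    raw = llm.complete(
        instructions=test_instructions,
        screenshot=screenshot,
        history=history,  # prior actions and their reasoning
    )
    # e.g. {"reasoning": "...", "action": {"type": "click", ...}}
    step = json.loads(raw)
    print("Reasoning:", step["reasoning"])  # the "thinking bubbles" above
    browser.perform(step["action"])         # e.g. type, click, scroll
    return step
```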

The elephant in the room is non-determinism. How do you guarantee that the LLM takes the correct path navigating through your app as it is testing it? In my little example, the bot clicking on the “Add Prime” button was a hint that when navigating a UI, a bot could definitely get sidetracked.

More amusingly, in another execution, the bot couldn’t login with the given username and password, so it attempted to create a brand new account all on its own. When that too failed, it actually attempted to chat with customer service to work things out. This is actually really impressive — that was one determined little bot!

To address that, one idea we’ve been kicking around among some friends is giving each test an “execution budget.” It took 8 steps to execute my little example. So if a bot is still trying to accomplish the task after 20 steps, it probably veered off course and it’s doing something it isn’t supposed to be doing. So tests could have a budget as a guardrail.
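In code, the budget guardrail is almost trivial. A sketch, continuing the hypothetical run_step() interface from the earlier loop:

```python
MAX_STEPS = 20  # my example took 8 steps; past ~20, the bot is likely lost

def run_test(llm, browser, test_instructions):
    history = []
    for _ in range(MAX_STEPS):
        step = run_step(llm, browser, test_instructions, history)
        history.append(step)
        if step["action"]["type"] == "done":  # the bot declares success
            return "PASS", history
    return "FAIL: execution budget exhausted", history
```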

More broadly, as we shift more towards depending on LLMs for test execution, we need to spend a lot more time thinking about guardrails — what actions should the bot simply never take?
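One concrete form of guardrail is a denylist checked before any action executes. The action names below are my own illustration, inspired by the adventures above:

```python
# Actions the bot should never take on its own, however determined it gets.
FORBIDDEN_ACTIONS = {"place_order", "create_account", "contact_support"}

def enforce_guardrails(action):
    if action["type"] in FORBIDDEN_ACTIONS:
        raise RuntimeError(f"Guardrail violation: {action['type']}")
```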

Another idea is to have a “judge LLM” that analyzes the steps the “executing LLM” took, along with its reasoning, and decides whether the run was correct. I see the LLM-as-a-Judge pattern being used more and more these days across a lot of tasks.
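Sketched out, the judge is simply a second model reviewing the transcript after the fact; judge_llm and its interface are, again, assumptions:

```python
def judge_run(judge_llm, test_instructions, history):
    """Ask a second model to grade the executing LLM's run."""
    return judge_llm.complete(
        prompt="Given this test and the steps taken (with their reasoning), "
               "answer PASS or FAIL and briefly justify your verdict.",
        test=test_instructions,
        steps=history,
    )
```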

We also have the ability to gather data objectively. We have hundreds of thousands of legacy tests written in frameworks like Appium and Selenium. Could we auto-generate a natural-language test suite with the exact same tests, and run both suites in parallel for a while, comparing their results? Ideally, at some point we would have concrete evidence that we can replace the traditional tests with natural-language tests.
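The comparison itself is cheap to build. A sketch, assuming both suites report outcomes keyed by test name:

```python
def compare_suites(legacy_results, nl_results):
    """Return every test where the two suites disagree."""
    return {
        name: (legacy_results[name], nl_results.get(name))
        for name in legacy_results
        if legacy_results[name] != nl_results.get(name)
    }
```

An empty diff, sustained over weeks of runs, is exactly that concrete evidence.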

Will it also replace human testers? I don’t like to use the word “replace,” but it will definitely shift the work they do. In some orgs, we employ armies of manual testers who perform the same repetitive tasks to certify every release candidate. Sometimes that’s because writing automation is too expensive; sometimes because the product changes too quickly to write test automation for; sometimes it’s a lack of forward thinking and an unwillingness to invest in engineering excellence. But any situation where a release is gated by a human being is not scalable or sustainable. It is also non-deterministic… humans are notoriously non-deterministic and make mistakes too. I would like GenAI to perform those repetitive tests, so that testers can apply their intuition and hard-earned knowledge of how products fail to explore the surface more freely and more creatively.

One more consideration is that LLMs are slower at running these tests than traditional automation, and GPUs are expensive. So bots lose on both latency and cost today. But that will not always be the case. We’re looking into ways to cache things, to process in different ways, etc., to tackle both of those. I don’t want to wait until latency and cost are fully solved to take a bet on this, because if we do, we’ll simply get started way too late. Another way of thinking about it: the savings in human authoring and maintenance are worth the latency/cost even as of right now.

Lastly, LLM-based test execution opens up doors — for us to do things we simply couldn’t do before. A fascinating whitepaper from some researchers at Amazon describes how they created agents with diverse personas and goals for testing purposes. You can have a test with a set of instructions that can be interpreted differently depending on the persona — which allows you to discover bugs that you may not otherwise find. For example, what if one of your personas was blind? We don’t need to write special tests to validate whether the Amazon Store is accessible to people with disabilities; we simply should run all our tests, but with personas that reflect different disabilities. This detaches the test steps from the execution behavior — that is something we could never truly do before. We can add all kinds of non-functional guardrails to our already existing functional tests.
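Because the persona lives outside the test steps, adding one is just a prefix on the instructions. A sketch, reusing the hypothetical run_test() from earlier; the persona texts are my own illustration, not from the paper:

```python
PERSONAS = {
    "default": "You are a typical shopper using a mouse and keyboard.",
    "screen_reader": "You are blind and navigate only with a screen reader.",
}

def run_test_as(llm, browser, test_instructions, persona):
    # Same test steps, different execution behavior.
    return run_test(llm, browser, PERSONAS[persona] + "\n" + test_instructions)
```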

To me, the advancement of LLM-based UI-testing is not a matter of IF, it’s a matter of WHEN.

My first experience testing professional software was working at Microsoft in 1997. I installed a veeeery early debug build of Microsoft Office 2000. Excitedly, I double-clicked on the Word icon. The hard-drive frantically spun for 5 minutes, then I got a message box that it had crashed. The thing didn’t even start. It was unimpressive. The next day, I installed that morning’s build, and it opened, but crashed within 3 minutes of me playing with the app. Yet somehow, a couple of years later, we shipped that codebase (much improved) to millions of households. It took a lot of very determined engineers to try things out, learn, iterate, improve, relentlessly. Today’s world is no different.

I am both excited and terrified. But I think it’s time to jump in.