AI Skeptic Testers Are All Nuts

Published on June 3, 2025
A psychedelic testing landscape

A heartfelt provocation about AI-assisted software testing.

Tech execs are mandating LLM adoption in QA. That’s bad strategy. But I get where they’re coming from.

Some of the smartest testers I know share a bone-deep belief that AI is a fad — just the latest snake oil for test automation. I’ve been reluctant to push back on them, because, well, they’re smarter than me. But their arguments are unserious, and worth confronting. Extraordinarily talented people are doing work that LLMs already do better, out of spite.

All progress on LLMs could halt today, and LLMs would still be the second most important thing to happen in software testing during my career.

Important caveat: I’m discussing only the implications of LLMs for testing. For art, music, and writing? I got nothing. For code generation? There’s nuance. But for my field — finding bugs, verifying products, making quality visible — I’m not buying the skeptics’ story.

Bona fides

I’ve been testing software since the late ’90s. Boxed C++ shrink-wrap, mainframes, operating systems, search, browsers, mobile, APIs; automation, manual, exploratory, regulated, unreleased, “too big to fail.” I’ve reported bugs no one believed, broken builds with one click, and yes, spent way too many hours on Selenium. However you define “real tester,” I qualify. Even if only on your lower tiers.

Level Setting

First, let’s get on the same page. If you tried using an LLM for testing 6 months ago and gave up, you’re not seeing what serious LLM-powered testers are doing today.

Testers using LLMs now leverage agents. These agents poke around your app or site on their own. They crawl pages, submit forms, click through flows, create data, and record results. They:

  • Extract information from the UI, APIs, logs, and docs
  • Run test harnesses and scripts
  • Execute accessibility, security, and performance scans
  • Summarize failures, group issues, and even suggest root causes

Most of the “brains” in these agents are surprisingly simple code — some Playwright or Cypress, some requests, some glue code, all orchestrated to establish ground truth. You could build a proof-of-concept agent in a weekend. The secret sauce isn’t the AI model — it’s how you wrap it around your app and your test data.
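To give a flavor of how thin that glue layer can be, here is a minimal sketch of an agent loop. Everything here is a stub I invented for illustration — `call_llm` stands in for a real LLM API call, and `execute_action` stands in for Playwright/Cypress/requests driving the actual app:

```python
# Minimal agent-loop sketch: observe the app, ask a model for the next
# action, execute it, record the outcome. The "brain" and the executor
# are both stubs; a real agent swaps in an LLM call and a browser driver.

def call_llm(observation, history):
    # Stubbed model: a real agent would send the observation plus history
    # to an LLM and parse its chosen action. Here we walk a fixed script.
    script = ["open_login", "submit_bad_password", "check_error_message"]
    step = len(history)
    return script[step] if step < len(script) else "done"

def execute_action(action):
    # Stubbed executor: a real agent would click, type, and assert via a
    # browser driver, then return the result it actually observed.
    return {"action": action, "status": "passed"}

def run_agent(max_steps=10):
    history, results = [], []
    for _ in range(max_steps):
        action = call_llm(observation="login page", history=history)
        if action == "done":
            break
        result = execute_action(action)
        history.append(action)
        results.append(result)
    return results

report = run_agent()
print([r["action"] for r in report])
# → ['open_login', 'submit_bad_password', 'check_error_message']
```

The loop is the whole trick: the model proposes, the harness executes against ground truth, and the transcript becomes your test report.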

If you’re typing “find all bugs in my app” into ChatGPT and pasting output into Jira, you’re not doing what AI-driven testing looks like. No wonder you’re missing the point.

The Positive Case: Four Quadrants of Tedium and Importance

LLMs can now do a large fraction of the tedious, repeatable work of software testing — and most test execution is, in fact, tedious. LLMs reduce the number of things you have to check by hand or cross-check in test case spreadsheets. They never get tired. They don’t skip steps.

Think of the regression runs you meant to do, but didn’t. You tried to start — but who wants to click through 80 login scenarios on three browsers? You meant to recheck all the error messages or accessibility settings, but you had to ship.

LLM agents can be instructed to just do that grunt work. They’ll drop you right at the golden moment: a clean test report, a short list of actionable bugs, screenshots and logs in tow. The dopamine hit isn’t green checkmarks — it’s that all the “I should really test that” stuff is just done.

Sure, sometimes there’s gnarly, subtle stuff to test. But it’s easy to procrastinate — so you “refactor” test cases or “improve” scripts, convincing yourself it’s useful. LLMs can chew through the yak-shaving for you, so you can focus on the real questions: What’s missing? What’s weird?

But You Have No Idea What the Test Is Doing

If you’re a “vibe testing” YouTuber, or can’t read test output, maybe this is a concern. Otherwise: what’s the problem?

You’ve always been responsible for what gets released. You were five years ago; you are now, whether you use AI or not.

If you let an LLM generate your test scripts or bug summaries, you need to read them. You’ll probably tweak, delete, or rephrase them. LLMs are adapting to your style — but they’re not there yet.

People complain about LLM-generated test cases being “probabilistic.” No, they aren’t. They’re tests. You can read and critique them, just like you do with other people’s work. If you can’t metabolize the boring, repetitive test cases an LLM spits out, skills issue! How do you handle “best practices” test plans from the vendor’s QA team?

But Hallucination

If hallucination (i.e., false positives/negatives) matters, your test process has already let you down.

Agents validate. They check UI, API, and logs. If the LLM invents a nonexistent field or misinterprets an error message, the framework feeds the result back, and the LLM tries again. Most frameworks run everything headlessly and check every outcome.
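That feed-the-result-back loop is simple to sketch. The model and the validation check below are stubbed placeholders of my own invention; a real harness wires in an actual LLM and real assertions against the UI or API:

```python
# Hallucination-guard sketch: a proposed check is only accepted once it
# executes successfully against ground truth. Failures are fed back to
# the (stubbed) model, which revises its guess and tries again.

REAL_FIELDS = {"email", "password", "remember_me"}

def propose_field(attempt):
    # Stubbed model: the first guess hallucinates a nonexistent field;
    # the retry, informed by the failure, picks a real one.
    guesses = ["username", "email"]
    return guesses[min(attempt, len(guesses) - 1)]

def validate_against_app(field):
    # Ground-truth check: does the field actually exist in the app?
    return field in REAL_FIELDS

def resolve_field(max_attempts=3):
    for attempt in range(max_attempts):
        field = propose_field(attempt)
        if validate_against_app(field):
            return field, attempt + 1
    raise RuntimeError("no valid field found")

field, attempts = resolve_field()
print(field, attempts)  # → email 2: the hallucinated field was rejected
```

The point: hallucinations that never survive execution never reach your bug list.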

You’ll only notice this if you watch the chain-of-thought logs your agent generates. Don’t. The best LLM-powered test agents ask you to go get coffee and ping you when they’re done.

Hallucination is what every tester brings up first when someone suggests AI testing, but it’s a largely solved problem — if your test harness actually executes.

But the Tests Are Shitty, Like a Junior Tester’s

Does an intern cost $20/month? Because that’s what a QA Copilot subscription costs.

Being a senior tester means making less-able testers productive — whether they’re human or AI. Using LLM agents well is itself a skill, an engineering project, and a QA leadership opportunity. LLMs only produce low-quality results if you let them.

Most of what LLMs do today: run through basic flows, reproduce bugs, validate checklists, write up summaries, and grind through variations. Even the most AI-poisoned testing orgs still need humans to curate, guide, and make sense of results.

Let’s stop kidding ourselves about how “senior” our test cases really are.

But It’s Bad at (Insert Test Thing Here)

A lot of LLM skepticism is just projection. People say “AI can’t test my app” — but that’s usually because your test harness sucks, your app is impossible to automate, or your requirements are a mess. Fair! But let’s not make LLMs the scapegoat.

Some frameworks, stacks, or in-house tools are harder to automate. LLMs are better at web UI than mainframes. But the state-of-the-art improves weekly, and there’s real progress on mobile, APIs, accessibility, and even security fuzzing.

Honestly, most testers who claim AI “can’t do something” just don’t know how to get it done with an LLM — or don’t want it to work. Credit to the testing contortionists who manage to write the only prompts that fail. The real cringe is testers refusing to let AI take a shot at their “only humans can do this” work, just because they’re afraid of seeing a better answer from an LLM. This circus won’t last much longer.

If you’re running legacy SAP on Citrix, sure, LLMs won’t save you (yet). But if you’re blocking AI because “it doesn’t work with our bespoke framework,” that’s your argument, not a refutation of AI in testing.

But the Craft

Do you love writing hand-crafted exploratory charters, designing clever oracles, building the perfect test harness by hand? Me too. Do it on your own time.

I have a test lab in my garage. I could get a lot of joy from hand-writing every test. But if I need a test for “can 10,000 users log in at once?” or “does the forgot password flow work for all locales?”, I’ll take the agent every time.

Professional testers are in the business of finding risks, surfacing issues, protecting users. We’re not, day-to-day, artisans. Nobody cares if your bug report is perfectly formatted. If anything we do endures, it won’t be because the test case was beautiful.

If you’re obsessively crafting edge-case test scripts, ask yourself if you’re doing real work or just self-soothing. LLMs clear the schlep so you can dig into the judgment calls — the stuff humans are still better at.

But the Mediocrity

As a mid-to-late-career tester, I’ve come to appreciate mediocrity. We should be so lucky to have it produced automatically, on demand.

We all write mediocre tests sometimes. Mediocre tests: often fine. Not every test needs to be a gem. If you’re spending hours perfecting login tests, you’re doing something wrong. The floor matters.

LLM-generated test cases aren’t perfect, but their “floor” is often higher than the average stale test plan in Jira. AI doesn’t skip steps, forget corner cases, or tire out at 3am before a release. Sure, their “ceiling” is lower — but the floor is way, way higher.

LLMs aren’t mediocre on every axis. They can discover and try edge cases, accessibility checks, and security vectors you might forget. They’re tireless, not imaginative.

But if all we get is reliable, repetitive coverage — that’s huge. It’s that much less schlep for human testers.

But It’ll Never Be AGI

Don’t care.

Testers get wound up by AGI/VC hype. But it’s not an argument. Things either work or they don’t. Hype doesn’t ship quality.

But They Take-rr Jerbs

So did Selenium. So did offshoring. We’re in a business premised on automating away grunt work.

“Productivity gains,” say the execs. You know what that means: fewer testers doing more testing. Talked to a travel agent lately? Or a night-shift data-entry clerk? Or a mainframe operator?

Testing jobs will change. LLMs will displace some of us. That’s not a high horse we get to ride. We’re as much in tech’s line of fire as anyone.

But the Plagiarism

I get why AI is scary for visual artists. For testers? The median test case is not some unique expression — it’s “try invalid input,” “check response time,” “verify translation.”

LLMs easily clear the industry bar for basic test content. Gallingly, they’re great at churning out just-good-enough test cases, checklists, and summaries. If you’re worried about test case plagiarism, let’s just say testers have never been paragons of IP virtue. Most “best practices” are copy-paste from StackOverflow or the ISTQB syllabus anyway.

If you don’t believe a font designer owns the curve of an “R”, you can’t get too precious about the structure of a “forgot password” test.

Positive Case Redux

Kids today don’t just use agents — they use asynchronous, parallel agents. They queue up 13 test runs, sip coffee, check notifications, triage bugs. Five are real. Five are false positives, and three get reprompted.
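Queuing up parallel runs and triaging the outcomes is a few lines of orchestration. The runs below are stubs standing in for real agent processes, and the three verdict buckets are my illustrative labels:

```python
# Parallel-run triage sketch: fire off several independent test runs at
# once, then bucket the outcomes into real bugs, false positives, and
# runs worth reprompting. Each run is a stub standing in for an agent.
from concurrent.futures import ThreadPoolExecutor

def run_suite(run_id):
    # Stubbed agent run: cycles through the three outcomes
    # deterministically so the sketch stays reproducible.
    outcomes = ["real_bug", "false_positive", "needs_reprompt"]
    return {"run": run_id, "verdict": outcomes[run_id % 3]}

def triage(n_runs):
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_suite, range(n_runs)))
    buckets = {"real_bug": [], "false_positive": [], "needs_reprompt": []}
    for r in results:
        buckets[r["verdict"]].append(r["run"])
    return buckets

buckets = triage(13)
print({verdict: len(runs) for verdict, runs in buckets.items()})
```

You sip coffee while the pool drains, then spend your attention only on the buckets that deserve it.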

“My team members who aren’t using AI? It’s like they’re standing still,” a friend tells me. He doesn’t work in the Bay Area. He’s not exaggerating.

I don’t trust AI with prod access, but I’ve fed logs, test results, and even customer feedback to LLM agents, and watched them spot patterns and issues I missed.

I’m not a radical. I’m a QA classicist. But something real is happening, and my smartest friends are blowing it off. Maybe I persuade you, maybe not. But let’s be done making space for bad arguments.

But I’m Tired of Hearing About It

Me too. I read Michael Bolton, and that’s all I really need. But AI in QA is as important as Selenium was in 2010, or test automation frameworks were in 2005.

I think it’ll get clearer over the next year. The cool-kid disdain for “AI can’t test” can’t survive much more reality. I snark about the skeptics, but I mean this: they’re smart. When they get over the affectation, they’ll make test agents profoundly more effective than they are today.

TL;DR:
AI in testing isn’t perfect, but it’s not a toy. The floor is higher than before, and humans who ignore it are standing still. Use it, guide it, and get back to the fun, weird, human parts of testing — the judgment, the puzzles, the advocacy. The machines can handle the rest.

I Didn’t Write This — AI Did

Yeah, we all got a little too excited back when ChatGPT first dropped and thought it was clever to whip up a post or a slide and, only after the fact, reveal that “hey, AI wrote this.” The trick’s tired.

These ideas were originally written for software engineers, but they apply just as much to testers. I figured testers would read the original, assume they’re somehow different — specially exempt from this wake-up call — and maintain their willful idiocy if it wasn’t written directly for them.

Thanks (retroactively) to Thomas Ptacek for the inspiration: https://fly.io/blog/youre-all-nuts/. OK, more accurately, for the text to copy/paste into an LLM and ask for a Software Tester remix. Sorry I didn’t wait for the article to be indexed by the LLMs. And thanks to LLMs for handling my blog-schlep, so I can get back to building those dangerous, good-for-nothing AI testing agents to help testers and developers with their own schlep work.

Happy testing — with LLMs if you’re smart enough to use them.

— Jason Arbon