When I first started talking about resultful testing years ago, people thought I had a typo and meant restful testing. I did not. There is a significant results gap between the level of quality a developer team produces without a tester like me and the level it reaches with one, even when the team isn't sloppy about its quality practices. Any tester feels useful when quality is low enough that testing is garbage collection, but when baseline quality is decent, it takes a bit more skill to produce results that still live up to expectations.
So results are a list of bugs - anything that might bug a stakeholder - that we could consider starting conversations on. If we don't see them, we can't actively decide on them.
A well-maintained illusion exists both when we don't test and when we don't test well. The latter is a hard one for testers to deal with, because when your career is centered around doing something, you tend to believe you do it well. Even if you don't.
To know how you do, you benchmark your results against known results. There are plenty of test targets out there, and I have dedicated a good 20 years to teaching testing by testing against various test targets.
Today I cycled back to run a benchmark, comparing an application with known issues as I tested it a year ago; the same application as I test it today with AI; and the results of 57 human testers who are not me but were assessed with this application to decide on their entry to particular client projects.
- human tester (past me) found 45 things to report -> 62%
- human tester with AI (today's me) found 73 things to report -> 100%
- AI with human (today's me constrained to prompting) found 40 issues to report -> 55%
- AI without human (asked to test with Playwright) found 4 issues to report -> 5%
- 57 human testers found on average 13.5 things to report -> 18%
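As far as I can reconstruct them, the percentages are simply each count divided by the 73 findings of the best run. A minimal sketch of that arithmetic in Python, using the numbers above:

```python
# Findings per approach; percentages are relative to the 73-issue best run.
findings = {
    "human tester (past me)": 45,
    "human tester with AI (today's me)": 73,
    "AI with human (prompting only)": 40,
    "AI without human (single Playwright prompt)": 4,
    "57 human testers, on average": 13.5,
}

baseline = findings["human tester with AI (today's me)"]  # 73 -> 100%

for approach, count in findings.items():
    share = count / baseline * 100
    print(f"{approach}: {count} -> {share:.0f}%")
```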
"AI will not replace humans, but those who use AI will replace those who don’t." - Ginni Rometty, former CEO of IBM
The tasks weren't done in exactly the same way.
- The human tester (past me) had to explore to list features while testing.
- The 57 human testers were given a requirement specification based on the exploring that the human tester before them had modeled, which made their work essentially different and, I would argue, easier.
- AI with human got to use the requirement specification and was able to find 3 more issues with it.
- AI with human used 14 prompts to get to the reported level. It was prompted to use the app, the code, and the generated test automation, and each angle surfaced more unique issues.
- AI without human relied on a single prompt asking it to use Playwright to list problems; a rough sketch of what that kind of run amounts to follows after this list.
- GitHub Copilot with Claude Opus 4.5
- Playwright Agents out-of-box files installed today
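For a sense of why the single-prompt run finds so little: what it produces is essentially a surface-level script. Here is a minimal, hypothetical sketch (not the actual agent output, and the URL is a placeholder for the test target) of the kind of shallow Playwright check such a prompt tends to generate:

```python
# Hypothetical sketch of a shallow, single-prompt-style Playwright check.
# TARGET_URL is a placeholder, not the benchmark application.
from playwright.sync_api import sync_playwright

TARGET_URL = "https://example.com"

issues = []

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    def on_console(msg):
        # Keep only console errors; a crude proxy for "problems".
        if msg.type == "error":
            issues.append(f"console error: {msg.text}")

    page.on("console", on_console)
    page.on("requestfailed", lambda req: issues.append(f"request failed: {req.url}"))

    page.goto(TARGET_URL)

    # A few generic checks: page title present, links have targets.
    if not page.title():
        issues.append("page has no title")
    for link in page.locator("a").all():
        if not link.get_attribute("href"):
            issues.append(f"link without href: {link.inner_text()!r}")

    browser.close()

print(f"{len(issues)} issues found")
for issue in issues:
    print("-", issue)
```

Checks like this surface console errors, failed requests, and broken links, but little of the judgment-heavy work behind the other lists.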
