Friday, February 13, 2026

Do requirements lead to worse testing?

In the exercise of testing a version of a ToDo app, any constraint handed to people as part of the assignment seems to prime them to do worse. 

If I give people a list of requirements, they stop once they have verified the requirements and reported something they consider of value, without considering the completeness of their perspective or complementary constraints. 

If I give people a list of test cases or automation, they run those, and their entire judgement of completeness centers on whether those pass. 

If I give them the application without a constraint, they end up constrained by whatever people before me have asked them to do in the name of testing. All too often, that is the ask for "requirements", as a source of someone else's constraint for the task at hand. 

When a colleague saw that this year I would hand out the requirements, their feedback was that this would make the task easier. It made it harder, I think. What I find most fascinating, though, is why people would think it makes the task easier when it creates an incorrect illusion of a criterion for stopping. Another colleague said they vary what they do first, with or without the requirements, but they will do both. That is hard-earned experience people don't come with when they graduate. 

Given the assignment with requirements, one person came back with 7 findings and certainty of having completed the assignment, where the hidden goal is at 73. Should I have primed people for success by telling them they are looking for 100 pieces of feedback, prioritized by what they consider me most likely to care about? 

That is, after all, the hidden assignment in testing: find them, and decide what matters. Pass forward what matters, for that particular context at hand. 


Thursday, February 12, 2026

Benchmarking results - Human, Human with AI, AI with Human and where we land

When I first started talking about resultful testing years ago, people thought I had a typo and meant restful testing. I did not. There is a significant results gap in the level of quality a developer team produces without a tester like me, even when they aren't sloppy about their quality practices. Any tester feels useful when quality is low enough that testing is garbage collection, but when baseline quality is decent, it takes more skill to produce the results that are still expected.

So results are a list of bugs - anything that might bug a stakeholder - that we could consider starting conversations on. If we don't see them, we can't actively decide on them. 

A well-maintained illusion results both when we don't test and when we don't test well. The latter is a hard one for testers to deal with, because when your career is centered around doing something, you tend to believe you do it well. Even if you don't. 

To know how you do, you would benchmark your results against known results. There are plenty of test targets out there, and I have dedicated a good 20 years to teaching testing by testing against various test targets. 

Today I cycled back to run a benchmark comparing an application with known issues from a year ago, when I tested it; the same application today, when I test it with AI; and the results of 57 human testers who are not me but were assessed with this application to decide their entry into particular client projects. 

  • human tester (past me) found 45 things to report -> 62%
  • human tester with AI (today's me) found 73 things to report -> 100%
  • AI with human (today's me constrained to prompting) found 40 issues to report -> 55%
  • AI without human (ask to test with playwright) found 4 issues to report -> 5%
  • 57 human testers found on average 13.5 things to report -> 18%
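The percentages are simply each count divided by the 73-item answer key, rounded to whole numbers; a quick sketch to recompute them:

```python
# Recompute the benchmark percentages against the 73-item answer key.
ANSWER_KEY_SIZE = 73

results = {
    "human tester (past me)": 45,
    "human tester with AI (today's me)": 73,
    "AI with human (prompting only)": 40,
    "AI without human (Playwright, one prompt)": 4,
    "57 human testers, on average": 13.5,
}

for who, found in results.items():
    print(f"{who}: {round(found / ANSWER_KEY_SIZE * 100)}%")
```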
The test target for this benchmark is a buggy version of ToDo App, hosted at https://todoapp.james.am/ with source code at https://github.com/TheJambo/ToDoInterviewTest and my answer key at https://github.com/QE-at-CGI-FI/todoapp-solution. The whole experiment reminds me of this quote: 

"AI will not replace humans, but those who use AI will replace those who don’t." - Ginni Rometty, former CEO of IBM

The way the tasks were done isn't exactly the same.

  • The human tester (past me) had to explore to list features while testing. 
  • The 57 human testers were given a requirements specification based on what the human tester before them had modeled while exploring, making their work essentially different and, I would argue, easier. 
  • AI with human got to use the requirements specification and was able to find 3 more issues with it. 
  • AI with human used 14 prompts to get to the reported level. It was prompted to use the app, the code, and the generated test automation, and each source surfaced more unique issues. 
  • AI without human relied on a single prompt to use Playwright to list problems. 
The AI in question is: 
  • GitHub Copilot with Claude Opus 4.5
  • Playwright Agents out-of-the-box files installed today
My conclusion: start using AI to do better; even without automation, GitHub Copilot is helpful. That alone won't suffice: you need to learn to ask the right things (explore). You need to be skilled beyond asking, to find the things AI won't find for you. The bar on this benchmark moves as I discover even more, or realize I should classify items differently. Not all things worth noting have yet been noted. Testing is never a promise of completeness, but we do need to raise the bar from the 18% level where my control group of 57 human testers lands. 

Thursday, February 5, 2026

Quality adventures vibe coding for light production

The infamous challenge to a geek with a tool: "we don't have a way of doing that". This time, I was that geek and decided it was time to vibe code an enrollment application for a single women-to-women vibe coding event with a diversity quota for men. That was a week ago, and it's been in "production" since, bringing me tester-joy, allowing 38 people to enroll and 2 people to run into slight trouble with it. 

What we did not have was a way of enrolling in this event so that the participant list was visible, the quota of how many people had joined was monitored, and a diversity quota was possible. Honestly, we do. I just decided to fly with the problem description and go meta: vibe code an enrollment application for a vibe coding event. 

The first version emerged while watching an episode of Bridgerton. 

While letting the agent code it, I did some testing to discover: 

  • the added logo was reading top to bottom instead of left to right in Safari. I know because I tested. 
  • the boundary of 17 is 17, but the boundary of 3 is 4. Oh wait, no? Two things compared the same way, yet one is different and wrong.
  • removing a feature, emails, was harder than it should be. When it was removed, there were three places to take it out of, and of course it was left in one, and momentarily nothing worked. 
  • there was no retry logic for saving to the database - who needs that, right? Oh, the two people whose enrollments got lost and who told me about it.
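That boundary bug reads cryptically above, so here is a minimal sketch of the pattern. The quota values match the story, but the function names and the exact comparison operators are my illustrative assumptions, not the app's actual code:

```python
EVENT_QUOTA = 17      # event capacity
DIVERSITY_QUOTA = 3   # diversity quota for men

def event_full(enrolled: int) -> bool:
    # Correct boundary: full at exactly 17.
    return enrolled >= EVENT_QUOTA

def diversity_full(enrolled: int) -> bool:
    # Off-by-one: > instead of >= lets a 4th person into a quota of 3.
    return enrolled > DIVERSITY_QUOTA
```

Testing the same boundary the same way on both checks exposes the inconsistency: `event_full(17)` is True, while `diversity_full(3)` is still False.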
With that information, off to production we go. True to the vibe coding mentality, there is one environment: production. People enroll. Two people reach out, sure that they enrolled but their names vanished. I was sure that was true, and lovely that they let me know; they had managed to enroll, just a few places lower in the waiting list than what would have been their fair position. 
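The missing retry logic behind those lost enrollments could have been a few lines. A hedged sketch, where the attempt count and backoff schedule are arbitrary choices of mine:

```python
import time

def save_with_retry(write, attempts=3, base_delay=0.5):
    """Call `write` until it succeeds, retrying on any exception."""
    for attempt in range(attempts):
        try:
            return write()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)  # simple exponential backoff
```

A transient network blip then costs a short delay instead of a silently dropped row.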

Less than a week later, the first event is almost fully booked and the queue is longer than what fits in an event. We decide a second session is in order, and that means back to more vibe coding. Adding two features: selection of a day from two options, and time-limited priority enrollment for those already in the queue for the first event who want to join the second. 


Armed with some extra motivation for the final episode of Bridgerton, I decided to go for the known bug too. Asking for a fix requires knowing what fix to ask for. And while at it, I asked for some programmatic tests too. And read the security warnings on Supabase. Turns out that combo made me lose all my production data in an 'oops': one browser held a full list of participants while another showed none, and as it happens, the application removed all rows from the database and inserted new rows based on what was in local storage. For a moment I thought this was due to removing anonymous access to delete in the database. 

So there were more bugs that needed addressing while at this a second time: 
  • every user's local storage contents overwrote whatever was in the database by then. Almost as if it were a single-user application!
  • there were many error cases that needed handling when the write to the database failed in the first place, so as not to lose people's enrollment information. People did hit the failures (not surprising) and got in touch about them (surprising). I guess a DDoS day was not in my plans when expecting network reliability. 
  • security alerts (that I read) in Supabase warned me about security misconfigurations in the out-of-the-box version. 
  • I lost all data while testing in production! At least I had a backup copy of all but one row of data. 
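The first bug in that list boils down to a delete-all-then-insert pattern. A minimal sketch of the failure next to one safer alternative, with plain lists standing in for the database and the `email` key as my illustrative assumption:

```python
def sync_naive(db_rows: list, local_rows: list) -> list:
    # Buggy pattern: wipe the table and rewrite it from this browser's
    # local storage -- a browser with an empty cache deletes everyone.
    db_rows.clear()
    db_rows.extend(local_rows)
    return db_rows

def sync_additive(db_rows: list, local_rows: list) -> list:
    # Safer sketch: the database stays the source of truth; only insert
    # rows it does not already have.
    known = {row["email"] for row in db_rows}
    db_rows.extend(r for r in local_rows if r["email"] not in known)
    return db_rows
```

With `sync_naive`, one visitor with empty local storage is enough to empty the production table; `sync_additive` can at worst add duplicates-by-other-fields, never drop other people's rows.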
Funny how it took me two sessions to start missing the discipline of testing: a separate test environment; repeatable tests in the repo. 

I'd like to think that I am more thoughtful when turning from vibe coding in hobby time to doing AI-driven software development at work. The difference between those two feels a whole lot like great exploratory testing.