Monday, March 2, 2026

4x improvement, 13x improvement

The world is awfully abstract these days. Everyone and their uncle seems to know what "testing" is, and a good half of them would volunteer to explain it to me, in detail. From those conversations, I still conclude it's awfully abstract. 

When one person talks of testing, their days actually look like coordination. Ensuring pieces of the whole become "ready" is a pattern that allows for seeing if they work together. Addressing feedback that does not need a change of state for the ticket, feedback that needs returning the ticket to the developer, and feedback that needs to be addressed separately, either as a change management proposal or as a problem we could wait on for a while. Now that we are moving from "testing" to "quality engineering", and "shifting left" + "shifting right", it's increasingly unclear which of it is testing. And frankly, that does not matter as much. 

What matters is quality results, and our signal of knowing. Calling that "abracadabra" might sometimes be helpful. 

A similar conversation is the one around productivity, which is awfully abstract. When we compare productivity, particularly for improvement, the comparisons are local to a task rather than evidence of approaches at scale. But I wanted to share two pieces of those comparisons. 

The 4x story

As a consultant, a thing that sometimes comes up is a client needing to react to a diminishing budget. They used to have a team of 5, and now they can have a team of 3.5. People don't come in halves. You can budget and pay them half, but they tend to find it hard to do an equal amount of work with their other half. You might also think that the half includes you being able to ask any time, effectively pulling full-time service at half the price. I have found these situations particularly fascinating while in consulting. 

From the fascination, I asked a colleague to revisit their output / value with numbers from a sampled timeframe. Two months, separated by a year: one month where their work was 100%, and one month where their work was 50%. We know the incoming money is cut by half. What happened to the value delivered to the client? 

Turns out that the monthly output / value has doubled. In the two months compared, roughly the same amount of changes were made to the software and tested. Roughly the same number of changes were returned to sender for undone work. Same work for half the price! 

But there were additional things. In the 50% timeframe, the team's tester had volunteered for specification work, and that too fit the time. With the clarity and ease of testing those features where beginning and end meet, we could argue a factor of 2x on output / value. 

What really changed: 

  • The later time includes less learning of the system, as the system is incrementally being changed and the same person has been acquiring the layered knowledge. 
  • Learning happens on the other 50% that is not paid, and the value of the growth of that other 50% shows up in the paid work. The whole quality engineering transformation thinking enabling the improvements is now unpaid. 
  • Risk-based change testing, assigning varied-size timeboxes of effort rather than applying a standard filter, was necessary for the budget cuts, but no risks of missing things through incorrect decisions have yet been noticed. 
  • Finding issues on software at large rather than on the specific change has been cut down, but it may also be naturally cut away by the continued years on the same application. 
  • Releases aren't the same in the two months: one had a hotfix and one a release. We know those involve testing where visible results are not expected but effort is significant. Smaller batch size keeps all releases in the hotfix kind of process. 
  • Active time on testing is the only real way to talk about coverage of testing. That really needed optimizing to not lose out on results. 
In analyzing this, we concluded a 4x improvement that is only possible when the cognitive load of the full person's other assignment is in support of this assignment. Improvement comes from growth: learning risk-based change testing, specifications as a means over task expansion, better forced boundaries of entry quality. It is likely not sustainable when a second project requiring application-specific learning comes through, and it is operating on a risk limit. 

I also wonder: should we have cut the output / value in half, and communicated the investments we ended up making in improving the work? If we don't, people end up thinking the best way to improve testing is to basically force a 50% fee reduction. While it worked this time, it worked because a significant collaboration and coaching effort on updating the ways we think about testing is ongoing. 

The 13x story

Another output / value -based comparison, on test automation work, shows a 13x improvement. This comes from comparing 24 weeks of test automation without GitHub Copilot Agents to 3 weeks of test automation with them. The amount of tests automated per time unit: a 13x improvement. 

Again, we are comparing things with a lot more variables than just GitHub Copilot Agents. The main variables are:
  1. Learning. It slows you down. The first number includes learning what test automation is, what the test target is, what the team is. 
  2. Pressure. On the latter timeframe, it was either finding success or end of the work. 
While the 13x sounds great and fancy, it comes from intentionally comparing apples and oranges, so to speak. 

Having people's attention with the number, let's address the real deal:
  • GitHub Copilot agentic use was far from obvious. It needed metrics to guide it, and expected behaviors of the tool user addressed in pair and ensemble testing sessions. And it needed time to sink in. 
  • Learning to walk while having a bike next to you just won't work. We are really seeing the power of learning making a 13x impact, rather than GitHub Copilot itself.
  • GitHub Copilot agentic use relied on the hyper-reusability built in the first segment of time. Creating the next tests when code can be reused is a lot more straightforward. 
  • Drift of ready. On the first batch, I was around to analyze in detail whether this matched the expected test. On the second batch, I haven't looked at the steps yet. There is a chance of an appearance of progress through shortcuts that easily happen because of pressure.
Conclusions

I won't make the best tool salesperson with how critically I look at the improvements. But I will be great at delivering improving value. Perhaps that should be the target, after all? 

Friday, February 13, 2026

Requirements lead to worse testing?

With the exercise on testing a version of ToDo app, any constraint handed to people as part of the assignment looks like it primes people to do worse. 

If I give people a list of requirements, they stop at having verified the requirements and reported something they consider of value, without considering the completeness of their perspective or complementing constraints. 

If I give people a list of test cases or automation, they run those, and their entire judgement on completeness is centered around whether those pass. 

If I give them the application without a constraint, they end up constrained by whatever people before me have asked them to do in the name of testing. All too often, it's the ask for "requirements", as a source of someone else's constraint for the task at hand. 

When hearing that this year I would hand out the requirements, the feedback from a colleague was that this would make the task easier. It made it harder, I think. What I find most fascinating, though, is why people would think this makes it easier when it creates an incorrect illusion of a criterion to stop at. Another colleague said they vary what they do first, with or without the requirements, but they will do both. That is hard-earned experience people don't come with when they graduate. 

Giving the assignment with requirements to someone, they came back with 7 findings and certainty of having completed the assignment, where the hidden goal is at 73. Should I have primed people for success by telling them they are looking for 100 pieces of feedback, prioritized by what they consider me most likely to care about? 

That is, after all, the hidden assignment when testing. Find them and decide what matters. Tell forward about what matters, for that particular context at hand. 


Thursday, February 12, 2026

Benchmarking results - Human, Human with AI, AI with Human and where we land

When I first started talking about resultful testing years ago, people thought I had a typo and meant restful testing. I did not. There is a significant results gap between the level of quality a developer team produces without a tester like me and with one, even when they aren't sloppy on their quality practices. Any tester feels useful when quality is low enough that testing is garbage collection, but when baseline quality is decent, it takes a bit more skill to produce the results that are still expected.

So results are a list of bugs - anything that might bug a stakeholder - that we could consider starting conversations on. If we don't see them, we can't actively decide on them. 

A well-maintained illusion exists both when we don't test, and when we don't test well. The latter is a hard one for testers to deal with, because when your career is centered around doing something, you tend to believe you do it well. Even if you don't. 

To know how you do, you would benchmark your results against known results. There's plenty of test targets out there, and I have dedicated a good 20 years to teaching testing by testing against various test targets. 

Today I cycled back to run a benchmark comparing an application with known issues from a year ago when I tested it; to the same application today when I test it with AI; to results of 57 human testers who are not me but were assessed with this application to decide on their entry to particular client projects. 

  • human tester (past me) found 45 things to report -> 62%
  • human tester with AI (today's me) found 73 things to report -> 100%
  • AI with human (today's me constrained to prompting) found 40 issues to report -> 55%
  • AI without human (ask to test with playwright) found 4 issues to report -> 5%
  • 57 human testers found on average 13.5 things to report -> 18%
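
The percentages above are simply each finding count divided by the size of my 73-item answer key. A minimal sketch of that arithmetic, with the counts taken from the benchmark (the helper function and its name are mine, not part of any benchmark tooling):

```python
# Report rates against the 73-item answer key for the ToDo app benchmark.
# The counts come from the benchmark results above; the helper is my own.
ANSWER_KEY_SIZE = 73

def report_rate(found: float, total: int = ANSWER_KEY_SIZE) -> int:
    """Findings as a rounded percentage of the known answer key."""
    return round(found / total * 100)

results = {
    "human tester (past me)": 45,
    "human tester with AI (today's me)": 73,
    "AI with human (prompting only)": 40,
    "AI without human (Playwright)": 4,
    "57 human testers, on average": 13.5,
}

for who, found in results.items():
    print(f"{who}: {report_rate(found)}%")
```

Note the asterisk in the numbers: the bar moves whenever the answer key grows, so the denominator is a snapshot, not a fixed truth.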
The test target for this benchmark is a buggy version of ToDo App, hosted at https://todoapp.james.am/ with source code at https://github.com/TheJambo/ToDoInterviewTest and my answer key at https://github.com/QE-at-CGI-FI/todoapp-solution. The whole experiment reminds me of this quote: 

"AI will not replace humans, but those who use AI will replace those who don’t." - Ginni Rometty, former CEO of IBM

The ways the tasks were done aren't exactly the same.

  • The human tester (past me) had to explore to list features while testing. 
  • The 57 human testers were given a requirement specification based on the exploring the human tester before them had done and modeled, making their work essentially different and, I would argue, easier. 
  • AI with human got to use the requirements specification and was able to find 3 more issues with it. 
  • AI with human used 14 prompts to get to the reported level. It was prompted to use the app, the code, the generated test automation, and each one found more unique issues. 
  • AI without human relied on a single prompt to use playwright to list problems. 
The AI in question is: 
  • GitHub Copilot with Claude Opus 4.5
  • Playwright Agents out-of-box files installed today
My conclusion: start using AI to do better; even without automation, GitHub Copilot is helpful. That alone won't suffice: you need to learn to ask the right things (explore). You need to be skilled beyond asking to find the things AI won't find for you. The bar on this benchmark moves as I discover even more, or realize I should classify items differently. Not all things worth noting have yet been noted. Testing is never a promise of completeness, but we do need to raise the bar from the 18% level where my control group of 57 human testers lands. 

Thursday, February 5, 2026

Quality adventures vibe coding for light production

The infamous challenge to a geek with a tool: "we don't have a way of doing that". This time, I was that geek and decided it was time to vibe code an enrollment application for a women-to-women vibe coding event with a diversity quota for men. That was a week ago, and it's been in "production", bringing me tester-joy, allowing 38 people to enroll and 2 people to run into slight trouble with it. 

What we did not have was a way of enrolling in this event so that the participant list was visible and the quota of how many people joined was monitored, with the possibility of a diversity quota. Honestly, we do. I just decided to fly with the problem description and go meta: vibe code an enrollment application for a vibe coding event. 

The first version emerged while watching an episode of Bridgerton. 

While letting the agent code it, I did some testing to discover: 

  • the added logo was reading top to bottom instead of left to right in Safari. I know because I tested. 
  • boundary of 17 is 17, and boundary of 3 is 4. Oh wait, no? Comparing two things the same way, and one is different and wrong.
  • removing a feature, emails, was harder than it should be. When it was removed, there were three places to take it away from, and of course it was left in one, and nothing worked momentarily. 
  • there was no retry logic for saving to database - who needs that right? Oh, the two people whose enrollment will get lost and they tell me about it.
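
The boundary bug above is the classic inconsistent-comparison kind. A hypothetical sketch of how that can happen (the names and checks are mine for illustration, not the app's actual code):

```python
# Hypothetical sketch of the boundary bug: two quota checks that look
# "the same" at a glance, but one comparison is off by one, so one
# boundary is where intended and the other is not.
MAIN_QUOTA = 17   # intended: the event closes at 17 participants
MEN_QUOTA = 3     # intended: the diversity quota closes at 3

def main_quota_open(enrolled: int) -> bool:
    return enrolled < MAIN_QUOTA      # closes at 17, as intended

def men_quota_open(enrolled: int) -> bool:
    return enrolled <= MEN_QUOTA      # off by one: closes at 4, not 3
```

Testing both boundaries the same way is exactly what surfaces the asymmetry.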
With that information, off to production we go. True to vibe coding mentality, there is one environment: production. People enroll. Two people reach out, sure that they enrolled but their names vanished. I was sure that was true, and it was lovely that they let me know; they had managed to enroll, just a few places lower in the waiting list than what would have been their fair positioning. 

Less than a week later, the first event is almost fully booked and the queue is longer than what fits in an event. We decide a second session is in order, and that's back to more vibe coding. Adding two features: selection of day from two options, and time-limited priority enrollment for those already in the queue for the first event who want to join the second. 


Armed with some extra motivation for the final episode of Bridgerton, I decided to go for the known bug too. Asking for a fix requires knowing what fix to ask for. And while at it, I asked for some programmatic tests too. And read the security warnings on Supabase. Turns out that combo made me lose all my production data in an 'oops': one browser held a full list of participants while another showed none, and as it happens, the application removed all lines from the database to insert new lines based on what was in local storage. For a moment I thought this was due to removing anonymous access to delete in the database. 

So I added more bugs that needed addressing while at this a second time: 
  • every user's local storage contents overwrote whatever was in the database by then. Almost as if it were a single-user application!
  • there were many error cases that needed handling when the write to the database failed in the first place, to not lose people's enrollment information. They did find it (not surprising) and got in touch about it (surprising). I guess DDoS day was not in my plans when expecting network reliability. 
  • security alerts (that I read) in Supabase alerted me to a security misconfiguration in the out-of-box version. 
  • I lost all data for testing in production! At least I had a backup copy of all but one row of data. 
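
The missing retry logic from the first version is a small pattern. A minimal sketch of what "don't silently drop an enrollment on one network hiccup" could look like, assuming a flaky save callable (the function and its signature are my illustration, not the app's code or the Supabase client API):

```python
# Hypothetical sketch of retry logic for a flaky save: try a few times,
# then fail loudly so the user can retry, instead of silently losing
# their enrollment on a single network failure.
import time

def save_with_retry(save, enrollment, attempts: int = 3, delay: float = 0.5):
    """Call save(enrollment); retry on connection failure before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return save(enrollment)
        except ConnectionError:
            if attempt == attempts:
                raise        # surface the failure instead of swallowing it
            time.sleep(delay)
```

The other fix, replacing the delete-all-then-insert-from-local-storage pattern with writes keyed to individual rows, is what stops one browser's stale view from erasing everyone else's data.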
Funny how it took me two sessions to start missing the discipline of testing: a separate test environment; repeatable tests in the repo. 

I'd like to think that I am more thoughtful when turning from vibe coding on hobby time to doing AI-driven software development at work. The difference between those two feels a whole lot like great exploratory testing. 





Wednesday, January 28, 2026

The Box and The Arrow

A lot of what I used to write in my blog, I find myself writing as a LinkedIn post. That is not the greatest of strategies given the lack of permanence to anything you post there, so I try to do better.  

My big insight last week: the box and the arrow. Well, this is actually D and R from the DSRP toolset for systems thinking, applied to value generation. In an earlier place of work, when I hired consultants, they came with very different ideals of hour reporting that then mapped to cost. 

One always invoiced full hours unless sick. As they were the CEO of their own company, I am sure their days included something (if nothing else, tiredness) other than our work, but I had their attention and was happy with the outcomes. I framed it as "pay for value". 

One always invoiced half hours, but produced as much value as the first one. They wanted to emphasize that the other work they did for service sales, upskilling themselves, and developing common tools they used but held the IP for wasn't ours to pay for. That was fine too, and I framed it as "unique access to top talent for our benefit". 

One invoiced 100% of hours, including hours their company used on managing them. No other responsibilities; their work existence was in service of us. That too was fine, and it made it simple for them to make sense of shaping their capabilities. I framed that as "do you even want to keep them if they are not getting trained/coached". 

The cost to value for these three was very different, depending on what I got in the service for free. 

So I have now modeled the box and the arrow. The box is the service. The arrow is the relationship that transforms the service. One of the things in our arrow is how we collect information: in 2025 we interviewed more than 1,800 business and technology executives. This is not a "fill this questionnaire". This is us sending our top management to discuss things with a lot of people, every day of the year, systematically. Listing things I consider our arrow that are easy to copy wouldn't be fair for me to do. But the list is long. And a lot of that work in the arrow is paid from whatever the margins enable. 




We need better ways of comparing services. Maybe, just maybe, the question of "what you give us for free" is more insightful than I first thought. 

A simplified analogy to why people shop at Prisma or Citymarket: the smiling greeting. It does not change the sausage they buy, but it changes their experience while buying the sausage.

Friday, January 16, 2026

The Results Gap

Imagine you are given an application to test, no particular instructions. Your task, implicitly, is to find some of what others have missed. If quality is great, you have nothing to find. If the testing done before is great, none of the things you find surprise anyone. Your work, given that application to test, is to figure out that results gap, and whether it exists in the first place.

You can think of the assignment as being given a paper with text written in invisible ink. The text is there, but it takes special skill to make it visible. If no one cares what is written on the paper, the intellectual challenge alone makes little sense. Finding some of what others have missed, of relevance to the audience asking you to find information, is key. Anything extra is noise.

Back in the days of some projects, the results gap that we testers got to work with was very significant, and we learned to believe developers are unable to deliver quality and test their own things. That was a self-fulfilling prophecy. The developers "saving time" by "using your time" did not actually save time; it was akin to a group of friends eating pizza and leaving the boxes around when no one walks around pointing at and reminding them of the boxes. We know we can do better on basic hygiene, and anyone can point out pizza boxes. It may be that there is other information everyone won't notice, but one reminder turned into a rule works nicely for making those agreements in our social groups. With that, the results gap got to be the surprises.

The results gap is the space between two groups having roughly the same assignment but providing different results. Use of time leads to the gap, because 5-minute unit testing and 50-minute unit testing tend to allow for different activity. Availability of knowledge leads to the gap, because even with time you might not note problems without a specific context of knowledge. Access to production-like environments and experiences leads to the gap, both by not recognizing what is relevant for the business domain and by not even being able to see it due to missing integrations or data.

Working with the results gap can be difficult. We don't want to use so much time on testing that was already someone else's responsibility. Yet we don't want to leak the problems to production, and we expect the last group assigned responsibility for testing to filter out as much of what the others missed as possible. And we do this best by sizing the results gap, and making it smaller, usually through coaching and team agreements.

For example, realizing that by testing and reporting bugs, our group was feeding the existence of the results gap led to a systemic change. Reporting bugs by pairing to fix them helped fix the root cause of the bugs. It may have been extra testing effort for our group, but it saved significant time in avoiding rework.

The results gap is a framing used for multiple groups' agreed responsibilities towards quality and testing. If no new information surprises you come production time, your layered feedback mechanisms bring you good enough quality (scoping and fixing enough) with good enough testing (testing enough). Meanwhile, my assignments as a testing professional are framed in contemporary exploratory testing, where I combine testing, programming and collaboration to create a system of people and responsibilities where quality and testing leave less of a results gap for us to deal with.

Finally, I want to leave you with this idea: bad testing, without results, is still testing. It just does not give much of any of the benefits you could get with testing. Exploratory testing and learning actively transform bad testing into better. Coverage is focused on walking with the potential to see, but for results, you really need to look and see the details that the sightseeing checklist did not detail.

Tuesday, January 6, 2026

Learning, and why agency matters

Some days Mastodon turns out to be a place of inspiration. Today was one of those. 

It started with me sharing a note from day-to-day work that I was pondering on. We had a 3-hour Basic and Advanced GitHub Copilot training organized at work that I missed, and I turned to my immediate team asking for 1-3 insights from what they learned at the session. I knew they were at the session because I had approved hours that included being in that session. 

I asked as a curious colleague, but I can never help being their manager at the same time. The question was met with silence. So I asked a few of the people one on one, to learn that they had been in the session but zoned out for various reasons. Some of the reasons included having a hard time relating to the content as it was presented, the French-English accent of the presenters, getting inspired by details that came in too slowly and taking time to search for information online on the side, and just that the content / delivery was not particularly good. 

I found it fascinating. People take 'training' and end up not being trained on the topic they were trained on, to a degree where they can't share one insight the training brought them. 

For years, I have been speaking on the idea of agency, the sense of being in control, and how important that is for learning-intensive work like software testing. Taking hours for training and thinking about what you are learning is a great way of observing agency in practice. You have a budget you control, and a goal of learning. What do you do with that budget? How do you come out, having used that budget, as someone who now has learned? It is up to you.

In job interviews, when people don't know test automation, they always say "but I would want to learn". Yet when looking at their past learning in the space of test automation, I often find that "I have been learning in the past six months" ends up meaning they have invested time in watching videos, without having been able to change anything in their behaviors or attain knowledge. They've learned awareness, not skills or habits. My response to claims of learning in the past is to ask for something specific they have been learning, and then to ask to see if they now know how to do it in practice. The most recent example in this space was me asking four senior test automator candidates how to run Robot Framework test cases I had in the IDE - 50% did not know how. We should care a bit more about whether our approaches to learning are impactful. 

So these people, now including me, had the opportunity of investing 3 hours in learning GitHub Copilot. Their learning approach was heavily biased toward the course made available. But with a strong sense of agency, they could do more.

They could:

  • actively seek the 1-3 things to mention from their memories 
  • say they didn't do the thing, and that in the same time they did Y and learned 1-3 things to mention
  • not report the hours into training even if the video was playing while they did something completely unrelated
  • stop watching the online session and wait for the video, to have control over speed and fast-forwarding to relevant pieces
  • ...

In the conversations on Mastodon, I learned a few things myself. I was reminded that information intake is a variable I can control with a high sense of agency in my learning process. And I learned there is a concept of 'knowledge exposure grazing', where you are snacking on information, and it is a deliberate strategy for a particular style of learning. 

Like with testing, being able to name our strategies and techniques gives us control and explainability over what we are doing. And while I ask as a curious colleague / manager, what I really seek is more value for the time investment. If your learning teaches others in a nutshell, you are more valuable. If your learning does not even teach you, you are making poor choices. 

Because it's not your company giving you the right trainings; it's you choosing to take the kinds of trainings, in the style, that you know work for you. Through experimentation you learn which variables you should tweak. And that makes you a better learner, and a better tester.