Thursday, April 2, 2026

Making teams awesome - or getting hurt trying?

28 years a tester. Sure, I have held all kinds of fancy monikers on top of that, but that is what I find myself to be at the core. I learned testers break illusions, not software. I learned that I could still hold the idea of being a tester when the illusions I break go beyond the product, to ways of working, to people, and even to people's patience. I mean to say I test patience with a lighthearted smile, but it holds a piece of truth. Testers aren't always easy people. I have been many things, but easy is not one of those. Persistent. Reflective. Endlessly curious. For a long time, I believed the value I brought as a tester was founded on two things: 

  • serendipity: "The more I practice, the luckier I get."
  • perseverance: "It's not that I'm so smart, I just stick with the problems longer."

I would get to travel the world to deliver talks with titles such as "Making teams awesome". I would observe the roles of catalyst, conscience, cheerleader and critical thinker in my work with teams. 

Today, I write this post wondering if the world changed, if I changed, or if the context changed. I received a fairly substantial piece of feedback that framed something I did to help as something that breaks the team. We need to talk about failures, and we can't talk about the failures of others. So I talk about my own failures, in the spirit of truly believing in blamelessness through seeing things in their systemic context. 

I worked with a particular team, set up for a short period of time. My title: test automation engineer. The task: planning. The framing: agile with named roles. 

Reflecting on what happened, I went for a generated illustration. While my intent may have been to help in making the team awesome, when that did not succeed within the constraints available, I became a canary in the coal mine. Powerless to fix, destined to get hurt, trapped in a purpose I could not escape. 


The feedback that threatens professional integrity is still raw and recent, yet I appreciate the learning experience. I am not yet sure if my conclusion would be to: 

  1. Invest differently in preparation as I see canary mode activated
  2. Walk away for greater good as I see canary mode activated
  3. Learn to block canary mode even when it feels necessary for a greater cause

I will continue working on finding better ways of interacting in teams, as a tester. Or as me. The feedback would have been very different for the same behavior in the team if they had not framed me as a test automation engineer. Authenticity is confrontational to a world built on masks, but for this team, in this context, I really needed a mask. 

The fascinating lessons work has to offer are manifold. This was definitely the most strenuous job interview I have ever been in, let alone failed. Failed together, for systemic reasons, but failed nonetheless. An experience richer. 

I'm sure I am not the only tester in the world who balances the feedback and actions that would lead to quality, only to end up nominated as the reason for the lack of it. 




Tuesday, March 24, 2026

Seeing systems in people's choices

LinkedIn as a platform annoyed me enough to break my "let people be wrong on the internet" principle, over some smaller-company leaders trash-talking people on other companies' payroll. Stories like this are popular fodder for a lot of toxicity: 

The leadership of a big consulting firm had gone to visit a major client to “sell” an AI transformation. The only problem was that the firm’s own consultants had spent the past year refusing to adopt AI tools, even though the client had been offering them training and tools 😂 A friend of mine at that company just said: “We already bought this a long time ago—go sell it to your own people so we can actually start using it.”

I felt like offering a perspective of experience: 

  1. I have been on the receiving end of a client manager boastfully telling me this. Thanks, I can see the prep you did on delivering that line of insult. 
  2. I have seen that the same client's organization expects a lot of uninvoiced work on planning future major changes. They consider it part of our sales work. I consider it some of our best people doing a lot of work on a promise of future profits. 
  3. I have seen that the same client's organization brings the hourly rate down through competition, to a level where I cannot work a single hour without losing money for my company. And no, I am not that expensive. Their hourly rates are that poor. We have a blended rate, so me losing money on every hour does not matter any more than the uninvoiced sales and management work does. 
  4. These are expected disproportionately from larger organizations. 
The conversation about offering them training and tools often misses the idea of making space to learn these things. Even when we pay for the training time, none of the other commitments that are paid are flexible. We are probably already on a tight delivery project. We are probably already accommodating growing understanding, because scope "does not creep" - it just creates more work. 

Expecting everything AND a change at the same time without making space for people just does not work. 

It might be "why can't you spend two weeks trying out Browserstack" or "why can't you just start using AI", usually from people who neither work regular hours (me!), nor at a regular pace (again, me!), nor within contractually micromanaged responsibilities (sometimes me!). 

There is this Knoster change management model: 

  • ❌ No vision → People don’t understand why → Wrong or chaotic change
  • ❌ No motivation/incentives → People don’t care → Slow change
  • ❌ No skills → People feel incapable → Anxiety
  • ❌ No resources → People want to change but can’t → Frustration

I had a picture of that too, but someone has decided that, on this computer, even linking images already uploaded online to my blog is unacceptable, to protect the large number of clients I care for. 

    I suspect more than one of these are missing, and the lack of empathy for the system you have contributed to creating, and are maintaining on social media, might be at play in your results. 

    People have reasons, and ideology is a relevant one, too. 

    Stop with the prompting, be clear on your intent

    I sat at a table with a group of developers, somewhere in California a decade ago. We were all keenly watching a piece of Java code we had just collegially written into a test, describing inputs and outputs to a function we were planning to implement for training purposes. What I remember is the length of the conversation about the design we could see just from the method signature. We could design all this before implementing any of it.

    It took a while and a lot of opportunities to reflect for me to eventually get to the idea of unit testing and test-driven development. There is a lot of power in expressing your intent, and reviewing it before taking steps further. 
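
    A minimal sketch of what that looks like, in Python rather than the Java of that session, with a hypothetical split_bill function chosen only for illustration: the inputs and outputs are named in a test first, and the implementation intentionally does not exist yet.

    # A sketch of expressing intent as a test before the implementation exists.
    # split_bill and its expected behavior are hypothetical, not from the session.
    def split_bill(total: float, people: int) -> list[float]:
        raise NotImplementedError  # intentionally unimplemented: intent comes first

    def test_split_bill_returns_one_share_per_person():
        shares = split_bill(total=100.00, people=3)
        assert len(shares) == 3
        assert round(sum(shares), 2) == 100.00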

    What made intentional programming really click for me was teaching kids in this style. If you could express what you want (what you really really want - could not resist) in your local language, translate it to English and translate that to code, it created a flow that made sense. So I spent some time teaching kids and women over forty, back when that felt like a cause I would dedicate time to. 

    From intentional programming, I moved to combining this with exploratory testing: I taught exploring with intent and shared notes on learning about the intent - location - details hierarchy for navigating in ensemble testing. With self-forced repetition, I learned to speak a little better to the intent I had; clearly I needed the practice. 

    Ensemble testing made me really great at exploratory testing. I got to learn from hundreds of people how they actually do things, not just how they say they would do things. I failed a lot, succeeded a lot, and learned even more. 

    When I sat at that table in California, I had no idea that expressing intent would soon be called "prompting", and the whole world would be obsessed with finding ways of clearly expressing intent in average language that, when enriched, would produce increments of code that fit the expression of intent. 

    We may call it "zero-shot prompting" when we don't give an example to illustrate our intent, and "one-shot prompting" when we give an example, but we should already remember how we as people discuss things: an example would be helpful, right about here. 
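
    So, taking my own advice, here is an illustration of the same intent expressed zero-shot and then one-shot. The prompts are made up for this post, not a recommendation of exact wording.

    # Illustrative prompts only; the wording is an assumption, not a recipe.
    zero_shot = "List risks you would explore in the login flow of a web shop."

    one_shot = (
        "List risks you would explore in the login flow of a web shop.\n"
        "One example of the kind of item I want:\n"
        "- Password managers fail to fill the field because it is recreated on focus."
    )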

    Express your intent. If that is best done by saying "As an expert exploratory tester...", do so. Paying attention to your own use and the average use of language feels relevant now. Don't search for prompt engineering techniques; that is what we have been building into AI tools for the last years - overriding your attempts at prompt engineering to give you things you did not know to ask for. We may call that context engineering, but it still feels like a bigger prompt, just not one you can fully control. 

    Express what you want with intent, and you may just get something in that neighborhood. And what you get may be good enough, or a good starting point. Sometimes getting recognizably bad things is just the external imagination you need to get your real intent out.  

    Monday, March 9, 2026

    Why would I even want to generate test cases with AI?

    There is a conversation I keep having on the use of AI, where people ask me about generating test cases with AI and I explain to them that test cases were never the thing we wanted in the first place. Puzzled? If you are, you might not yet be aware of what exploratory testing really is, and if you were, you would be better equipped for the AI transformation in testing we are currently in. 

    Testing is not about executing test cases. It is about finding information - some of it relevant information that others may not know. Test cases are not ideas of what to test, but ideas turned into step-by-step instructions. 

    Let's not confuse automated test cases with test cases, though. Automated test cases are captured programmatic steps that enable repeating the steps as executable documentation. Unlike test cases, automated test cases are not just a repetition of the same steps but also a foundation for extending with new data and mixing step orders, and thus for discovering bugs that merely stepping through the same instructions would miss. And their design shifts testing down to a faster cycle of feedback. 
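
    As a sketch of that foundation, assuming a pytest setup and a hypothetical add_item function standing in for real application code, one automated test case extends to new data without rewriting the steps.

    # A sketch assuming pytest; add_item(items, title) is a hypothetical stand-in
    # that returns a new list with the title appended. New data rows extend the
    # same executable steps instead of copying a written test case.
    import pytest

    def add_item(items, title):
        return items + [title]  # stand-in implementation for the sketch

    @pytest.mark.parametrize("title", [
        "buy milk",          # original happy path
        "",                  # new data: empty title
        "x" * 500,           # new data: very long title
        "  leading spaces",  # new data: whitespace handling
    ])
    def test_added_title_is_kept(title):
        assert add_item([], title) == [title]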

    We need testing to find information, but we don't need test cases. This is what we see with AI. Ask for bugs, and you get bugs. Why would you ask for test cases when what you wanted to know is what information could at least use addressing? Ask for bugs in many different ways, and all the layers together might add up to a good level of testing done. 
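
    As an illustration of the difference in the ask - the wording is mine, for this post, and not tool-specific syntax:

    # Two asks for the same feature; the second asks for the information we
    # actually wanted. Both wordings are illustrative assumptions.
    ask_for_test_cases = "Generate test cases for the shopping cart."

    ask_for_bugs = (
        "Use the shopping cart, its code and its automated tests, and report "
        "anything that might bug a stakeholder, ordered by how much it matters."
    )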

    When we need documentation for the future, we capture that as an output. And we treat automation as a first-class citizen in creating that asset: the documentation texts are generated from it, not the other way around. 

    When I joined CGI Finland as Director of AI-Assisted Application Testing nearly two years ago, I had a hunch that the transformation expected included reframing testing, and supporting it with new kinds of approaches. Turns out that some of the better outcomes are founded on ideas of contemporary exploratory testing. When packaged, we call it Test Intelligence Mesh. Fancy names aside, these are new rules on how testing is done, AI assisted, with packaged skills of great exploratory testing. One experiment at a time, collecting and sharing our toolbox that we can leave behind and scale. 

    I don't generate test cases, I generate valuable results of testing. And I would be a fool doing that without AI-support that helps me scale. 

    Monday, March 2, 2026

    4x improvement, 13x improvement

    The world is awfully abstract these days. Everyone and their uncle seems to know what "testing" is, and a good half of them would volunteer to explain it to me, in detail. From those conversations, I still conclude it's awfully abstract. 

    When one person talks of testing, their days actually look like coordination. Ensuring pieces of the whole become "ready" is a pattern that allows for seeing if they work together. So is addressing feedback that does not need a change of state for the ticket, feedback that needs returning the ticket to the developer, or feedback that needs to be addressed separately, either as a change management proposal or as a problem we could wait on for a while. Now that we are moving from "testing" to "quality engineering", and "shifting left" + "shifting right", it's increasingly unclear what of it is testing. And frankly, that does not matter as much. 

    What matters is quality results, and our signal of knowing. Calling that "abracadabra" might sometimes be helpful. 

    A similar conversation is the one around productivity; it too is awfully abstract. When we compare productivity, particularly for improvement, the comparisons are local to a task rather than evidence of approaches at scale. But I wanted to tell you about two of those comparisons. 

    The 4x story

    Being a consultant, a thing that sometimes comes up is that the client needs to react to a diminishing budget. They used to have a team of 5, and now they can have a team of 3.5. People aren't halves. You can budget and pay them half, but they tend to find it hard to do an equal amount with their other half. You might also think that the half includes being able to ask any time, effectively pulling full-time service at half the price. I have found these particularly fascinating while in consulting. 

    From the fascination, I asked a colleague to revisit their output / value with numbers from a sampled timeframe. Two months, separated by a year. One month where their work was 100%, and one month where their work was 50%. We know the incoming money is cut by half. What happened to the value delivered to the client? 

    Turns out that the monthly output / value has doubled. In the two months compared, roughly the same amount of changes were made to the software and tested. Roughly the same amount of changes were returned to sender for undone work. Same work for half the price! 

    But there were additional things. In the 50% timeframe, the team's tester had volunteered for specification work, and that too fit the time. With the clarity and ease of testing those features where the beginning and the end meet, we could argue another factor of 2x on output / value. 

    What really changed: 

    • The later timeframe includes less learning of the system, as the system is being changed incrementally and the same person has been acquiring the layered knowledge. 
    • Learning happens on the other 50% that is not paid, and the value of the growth in that other 50% shows up in the paid work. The whole quality engineering transformation thinking enabling the improvements is now unpaid. 
    • Risk-based change testing, assigning timeboxes of effort of varied size rather than applying a standard filter, was necessary because of the budget cuts, but no misses from incorrect decisions have been noticed yet. 
    • Finding issues in the software at large, rather than in the specific change, has been cut down, but it may also be naturally cut down by the continued years on the same application. 
    • Releases aren't the same in the two months: one had a hotfix and one a release. We know release testing is work where visible results are not expected but effort is significant. A smaller batch size keeps all releases in the hotfix kind of process. 
    • Active time on testing is the only real way to talk about coverage of testing. That really needed optimizing to not lose out on results. 
    In analyzing this, we concluded a 4x improvement that is only possible when the cognitive load of the person's other assignment supports this assignment. Improvement comes from growth: learning risk-based change testing, specifications as a means over task expansion, better enforced boundaries of entry quality. It is likely not sustainable when a second project requiring application-specific learning comes along, and it is operating on a risk limit. 
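
    As a back-of-the-envelope check of where the 4x comes from, assuming the 100% month's output at full price is the baseline:

    # Back-of-the-envelope only: same output at half the cost gives 2x value per
    # paid unit, and the argued 2x from the specification work doubles it again.
    baseline = 1.0 / 1.0                 # output / cost in the 100% month
    same_output_half_cost = 1.0 / 0.5    # 2x
    with_specification_work = 2.0 / 0.5  # 4x
    print(baseline, same_output_half_cost, with_specification_work)  # 1.0 2.0 4.0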

    I also wonder: should we have cut the output / value in half, and communicated the investments we ended up making in improving the work? If we don't, people end up thinking the best way to improve testing is to basically force a 50% fee reduction. While it worked this time, it worked this time because a significant collaboration and coaching effort on updating the ways we think about testing is ongoing. 

    The 13x story

    Another output / value -based comparison, on test automation work, shows a 13x improvement. This comes from comparing 24 weeks of test automation without GitHub Copilot Agents and 3 weeks of test automation with them. The amount of tests automated per time unit: 13x improvement. 

    Again, we are comparing things with a lot more variables than just GitHub Copilot Agents. The main variables are:
    1. Learning. It slows you down. The first number includes learning what test automation is, what the test target is, and what the team is. 
    2. Pressure. In the latter timeframe, it was either finding success or the end of the work. 
    While the 13x sounds great and fancy, it comes from intentionally comparing apples and oranges, so to speak. 

    Having people's attention with the number, let's address the real deal:
    • GitHub Copilot agentic use was far from obvious. It needed metrics to guide it, and expected behaviors of the tool user addressed in pair and ensemble testing sessions. And it needed time to sink in. 
    • Learning to walk while having a bike next to you just won't work. We are really seeing the power of learning making a 13x impact, rather than the power of GitHub Copilot.
    • GitHub Copilot agentic use relied on the hyper-reusability built during the first segment of time. Creating the next tests when code can be reused is a lot more straightforward. 
    • Drift of ready. On the first batch, I was around to analyze in detail whether each test matches what was expected. On the second batch, I haven't looked at the steps yet. There is a chance of an appearance of progress through shortcuts that easily happen because of pressure.

    Conclusions

    I won't make the best tool salesman with how critically I look at the improvements. But I will be great at delivering improving value. Perhaps that should be the target, after all? 

    Friday, February 13, 2026

    Requirements lead to worse testing?

    With the exercise of testing a version of the ToDo app, any constraint handed to people as part of the assignment looks like it primes them to do worse. 

    If I give people a list of requirements, they stop at having verified the requirements and reported something they consider of value, without considering the completeness of their perspective or complementing the given constraints. 

    If I give people a list of test cases or automation, they run those, and their entire judgement of completeness centers around whether those pass. 

    If I give them the application without a constraint, they end up constrained by whatever people before me have asked them to do in the name of testing. All too often, it's the ask for "requirements", as a source of someone else's constraint for the task at hand. 

    When a colleague saw that this year I would hand out the requirements, their feedback was that this would make the task easier. It made it harder, I think. What I find most fascinating, though, is why people would think this makes it easier when it creates an incorrect illusion of a criterion to stop at. Another colleague said they vary what they do first, with or without the requirements, but they will do both. That is hard-earned experience people don't come with when they graduate. 

    Giving the assignment with requirements to someone, they came back with 7 findings and certainty of having completed the assignment, where the hidden goal is at 73. Should I have primed people for success by telling them they are looking for 100 pieces of feedback, prioritized by what they consider me most likely to care about? 

    That is, after all, the hidden assignment when testing. Find them and decide what matters. Tell forward what matters, for the particular context at hand. 


    Thursday, February 12, 2026

    Benchmarking results - Human, Human with AI, AI with Human and where we land

    When I first started talking about resultful testing years ago, people thought I had a typo and meant restful testing. I did not. There is a significant results gap in the level of quality a developer team produces without a tester like me, even when they aren't sloppy about their quality practices. Any tester feels useful when quality is low enough for the work to be garbage collection, but when baseline quality is decent, it takes a bit more skill to produce the results that are still expected.

    So results are a list of bugs - anything that might bug a stakeholder - that we could consider starting conversations on. If we don't see them, we can't actively decide on them. 

    A well-maintained illusion exists both when we don't test and when we don't test well. The latter is a hard one for testers to deal with, because when your career is centered around doing something, you tend to believe you do it well. Even if you don't. 

    To know how you do, you would benchmark your results against known results. There are plenty of test targets out there, and I have dedicated a good 20 years to teaching testing by testing against various test targets. 

    Today I cycled back to run a benchmark comparing an application with known issues from a year ago when I tested it, to the same application today when I test it with AI, and to the results of 57 human testers who are not me but were assessed with this application to decide on their entry to particular client projects. 

    • human tester (past me) found 45 things to report -> 62%
    • human tester with AI (today's me) found 73 things to report -> 100%
    • AI with human (today's me constrained to prompting) found 40 issues to report -> 55%
    • AI without human (ask to test with playwright) found 4 issues to report -> 5%
    • 57 human testers found on average 13.5 things to report -> 18%
    The test target for this benchmark is a buggy version of ToDo App, hosted at https://todoapp.james.am/ with source code at https://github.com/TheJambo/ToDoInterviewTest and my answer key at https://github.com/QE-at-CGI-FI/todoapp-solution. The whole experiment reminds me of this quote: 

    "AI will not replace humans, but those who use AI will replace those who don’t." - Ginni Rometty, former CEO of IBM

    The ways the tasks were done aren't exactly the same.

    • The human tester (past me) had to explore to list features while testing. 
    • The 57 human testers were given a requirements specification based on the exploration the human tester before them had modeled, making their work essentially different and, I would argue, easier. 
    • AI with human got to use the requirements specification and was able to find 3 more issues with it. 
    • AI with human used 14 prompts to get to the reported level. It was prompted to use the app, the code, the generated test automation, and each one found more unique issues. 
    • AI without human relied on a single prompt to use Playwright to list problems (a minimal sketch of a scripted check of this kind follows after the tool list). 
    The AI in question is: 
    • GitHub Copilot with Claude Opus 4.5
    • Playwright Agents out-of-box files installed today
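
    For the "AI without human" end of the spectrum, a scripted check looks roughly like the sketch below. It is a minimal illustration only: the selectors are assumptions based on typical TodoMVC markup, not verified against this app, and this is not the Playwright Agents setup used for the benchmark.

    # Minimal Playwright (Python) sketch against the benchmark target. Selectors
    # are assumptions from the standard TodoMVC layout, not checked for this app.
    from playwright.sync_api import sync_playwright, expect

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://todoapp.james.am/")
        page.locator(".new-todo").fill("buy milk")
        page.locator(".new-todo").press("Enter")
        expect(page.locator(".todo-list li")).to_have_count(1)
        browser.close()
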
    My conclusion: start using AI to do better; even without automation, GitHub Copilot is helpful. That alone won't suffice: you need to learn to ask the right things (explore). You need to be skilled beyond asking, to find the things AI won't find for you. The bar on this benchmark moves as I discover even more, or realize I should classify items differently. Not all things worth noting have yet been noted. Testing is never a promise of completeness, but we do need to raise the bar from the 18% level where my control group of 57 human testers lands.