A Seasoned Tester's Crystal Ball

Learning the hard way, experience

2024-04-22T21:50:00.001+03:00

Over my career, I have done my fair share of good results in testing. There are two signature moves I wanted to write a post on, and what I learned while failing at both of my two signature moves after I first thought I had them pocketed.

These two signature moves are changes I have been driving through successfully over multiple organizations and teams:

Frequent releases
Resultful contemporary exploratory testing with good automation at heart of it

Frequent releases

Turning up the release frequency is an improvement initiative I have done now for over 10 different teams, products, and for four organizations. And the products/teams/orgs I have done this at, only of them are the optimal case of a web application with control over distribution channels. I've been through this the hard hard way - removing reboots from upgrade cycles, integrating processes that were designed for non-continuous such as localization and legal, including telemetry to know if we leave customers hanging on versions - all of that is an article of its own.

Succeeding with this is not the interesting part, except for the latest success of my current team making releases routine after dropping the ball and suffering through a 4 month stabilization period last year. I dare to call it again a success since we made our 5th release this year last week, and included the best scope and process we have had in that.

Why did we fail with a stabilization phase then? We can surely learn something from it.

The experiences point us to failing at communicating and collaborating with a non-technical product owner. How can that create a four month stabilization phase? Leaking uncertainty, and pushing uncertainty to a time it piles up. We leaked uncertainty in functional scope, but also in parafunctional scope, and when pushed to address performance and reliability, we found functional scope we had not recognized existed.

From breaking the main and not realizing how much we had broken it (automation was still passing!) cost us extra stress and uncertainty. We would look at tasks as "I did what you asked", not as "We together did what you did not know to ask".

And when we tested, we failed at communicating our results effectively. New people, new troubles, and they accumulated quickly when we did not want to stop the line to fix before moving forward. We optimized for keeping people busy, over getting things done to a shape where they could be released.

Having my team fail with my signature move taught me that there is much more of implicit knowledge on continuous timely testing than what we capture with automation. Designing for feedback, and making it timely takes learning, and this year we have been learning that with a short leash of releases.

Resultful contemporary exploratory testing with good automation at heart of it

Framing test automation as worthwhile notetaking mechanism while exploring has become so much of a signature move of mine, that I have given it a label of its own: contemporary exploratory testing. When we explore, some of it is attended and some unattended. Test automation, when failing, calls for us to attend. Usually pretty much continuously.

Working with developers, we have learned to capture our notes as lovely English sentences describing the product capabilities in some tests; as developer intent captured in unit tests in some tests; as component tests ad as subsystem, and as system-of-systems tests when those make sense. We have 2500 tests we work with on the new tech stack side, some hand-crafted, and others model-based generated test we use as reliability tests allowing longer unique flows be generated and tracked.

We learn about information we have been missing, and we extend our tests as we fix the problems. We extend the tests with new capabilities. How can all this successful routine be a failure?

It turns out there was a difficult conversation we did not have on time, one of validation. And when a problem is that you are building your product so that it does not address the user needs, none of the other tests you may have built matter. Instead of being a help, they become friction where you will explain how changing those all will be work, at a time when work is less welcome.

This failure comes from two primary sources: carefulness on a cross-organizational communication due to project setup reasons delaying the learning, but as much as that, unavailability of more senior test/dev folks on a previous project delivery.

Both the two failures are successes if we learned and can apply what we learned going forward. For me they taught a career change is in order: I don't want to find myself spread so thin that I don't recognize these expensive mistakes happening. So I will be a tester again, fully. Well, surrounded by testers, as director, test consulting, with a hands on consulting role starting June. Looking forward to new mistakes, and correcting the mistakes I make. Because: mistakes were made, and by me. Only taking ownership allows us to move forward and learn.

* recommendation of a book:

Urban Legends, Fact Checking and Speaking in Conferences

2024-03-17T14:55:00.001+02:00

I consume a lot of material in form of conference talks, and I know exactly the moment when conference talks changed for me forever. It was Scan Agile conference in Helsinki many years ago, and I had just listened to a talk from an American speaker. I enjoyed their experience as told from stage so much that I shared what I had learned with my family. Only to learn the story was fabricated.

In one go, I became suspicious of all stories told from stage. I started recognizing that my stories are lies too, they are me-sided recantations of actual events. While I don't actively go and fabricate them, those who do will tell me that is how human experience and memory work anyway. And the responsibility of taking everything with a grain of salt is on the consumer.

The stage seeks memorable, and impactful. And if that includes creating urban legends, it bothers some popular speakers less than it bothers me.

With that mindset, I still enjoy conference talks but they cause me extra work of fact-checking. I may go and have a conversation to link a personal story to a reference. But more often, it's the other people's stories they share that I searching online on.

In the last two weeks of conferences, I have been fact-checking two stories told from stage. They were told by my fellow testing professionals. And evidence points to word of mouth urban legends on AI.

The first story was about some people sticking tiny traffic signs around to fool machine vision based models. The technique is called adversial attack. There's articles of how small changes with stickers mess up recognition models. There's ideas that this could be done. But with my searches, I did not find the conclusive evidence that someone actually did this in production, in live environments, risking others people's health. I found the story dangerous as it came without warnings of not testing this in production, of the liability of doing something like this.

In addition to searching online for this, I asked a colleague at work with experience of scanning millions of kilometers of imagery of roads where this could be happening. It just so happens I have worked very close to a product with machine vision, and could assess if this was something those in the problem space knew of. I was unable to confirm the story. The evidence points to misdirections from stickers in busses, but not tiny stickers computer sees but human doesn't.

The story was told to about 50 people. Those 50 people, trusting that presenter from the stage would get to apply their fact checking skills before they do what I do now: tell the story forward as it was told, with their flavor of it.

The second story was told by two separate presenters. The story was about someone getting a binding contract on the company letterhead for buying a car for $1 where justice system is still out to decide whether the contract must be withheld. One presented showed it as a reimagined chat conversation translated to Finnish but made no claims on letterhead or legal mitigation, as the more colorful version was already out there presented to this audience.

Fact checking suggests that this is something that someone, specifically Chris Bakke asked from a ChatGPT-based car sales bot, to give an offer for 2024 Chevy Tahoe with "no takesies backsies", far from official offer on company letterhead.

So no binding contract. No company letterhead. No car bought for this price. No litigation pending.

The third story told was a personal one. You can find it as a practical example to analyze a screenshot for bugs. It suggested ChatGPT4 does better on recognizing bugs from an image than people do. While it may not be entirely incorrect on scope of that individual picture, the danger of this story is that the example used in the picture is a testing classic. Myers points out 7.8/14 is what a typical developer scored in 1979, and there's more detailed listings around in literature that show we do even worse in analyzing it and that it is technology-dependent.

Someone else on the conference also suggested we should not read books but ask for summaries from ChatGPT, completely missing a frame of reference on how well then the model would do compared to a human that has read many of these references. Reading less would not help us.

Starting an urban legend, especially now that people are keen on hearing stories of good and bad in the real of AI is easy. Tell a story, they tell it forward. Educational and entertainment value over facts.

So let's finish with a recognition of what I just did with this blog post. I told you four stories. I provided you no names of the people leaving impact significant enough in me to take the time to write this. It's up to you to assess, filter, and fact check and choose which stories become part of what you tell forward. Most information we have available is folklore. There are no absolutes. But there is harm in telling stories from stage, and I would think speakers would need to start stepping up to that responsibility.

Then again, it's theater. Entertainment with a sprinkle of education.

A sample of attended testing

2024-03-06T23:43:00.002+02:00

Today prompted by day 6 of 30 days of AI testing, I tried a tool: Testar. My reasons for giving it a go are many:

Tanja Vos as project leader for the research that generated this would get my attention
I set up a research project at previous employer on AI in / for testing, and this tool's some generation was one of that project's outcomes
The day 6 challenge said I should
Open source over commercial for hobbyist learning attention all the way

I read the code, read the website, and tried the tool. The tool did not crash but survived over an hour "testing" our software with the standards of testing I learned some large software contractors still expect from testers, so I could say it did well. I would not go to as far as "beat manual testing" like the videos on the site did.

The tool was not the point for me though. While the tool run clicking 50 scenarios with 50 actions and what seems to be a lot more than 50 clicks I intertwined unattended testing (the tool doing what the tool does) and attended testing, and I was watching a user interface the tool did not watch for fascinating patterns.

In my test now I have two systems that depend on each other, and two ways of integrating the two systems for a sample value of pressure. And I have a view that allows me to look at the two ways side by side.

As Testar was doing it's best to conclude eventually Test verdict for this sequence: No problem detected, I was watching the impact of running a tool such as this on the integration with the second system.

I noted some patterns:

The values on the left were blinking between no data available and expected values, whereas the values on the right were not. I would expect blinking.
The values on the left were changed with delay compared to the values on the right. I would expect no difference in time of availability.
The bling-sounds of trying something with a warning was connected with the patterns I was observing, and directed me to make visual comparisons sampled across the hour of Testar running.

This is yet another example of why there is no *manual testing* and *automated testing*. With contemporary exploratory testing, this is a sample of attended testing with relevant results, while simultaneously doing unattended testing with no problem detected.

This was not the first time we use generated test cases, but the types of programmatic oracles general enough to note crash, hang, error, those we have had around for quite a while. As soon as we realized it's about programmatic tests, not automated testing. Programmatic tests provide some of their greatest value in attended modes of running them.

For me as a contemporary exploratory tester, automation gives four things:

Running Testar, I have now 50 scenarios of clicking with 50 actions with screenshots on my disk. That alone serves as documenting in ways I could theoretically benefit from if anyone asked me how we tested things. It would most definitely replace the testing that I cut from 30 days investment to 2 days investment last year, and replace one tester.

Testar gave me quick clicking on another system while my real interest was on the other one. Like many forms of automation, time and numbers are how we extend reach.

Testar did not tell me to look at things, but our applications sound alerting did. Looking at the screenshots, I saw a state I would not expect, and went back to investigate it manually. That too was helpful, sampling what I care to attend.

The last piece, guiding to detail I usually get when I don't rely on auto clickers, but actually have to understand to write a programmatic test.

A Bunnycode Case Study for AI in Testing

2024-03-05T22:05:00.001+02:00

It's day 5 of 30 days of AI testing, and they ask for reading a case study or sharing your experience. I did sharing experience already on an earlier day, and in the whim of a moment, set up a teaching example.

I google for obfuscated code in python to find https://pyobfusc.com/. I'm drawn to most reproducible, authored by mindiell and when I see the code, I'm sold. How would you test this?

Pretty little rabbit, right? Reminds me of reading some code at work, work is just less intentional with obfuscation. And really do not have the time or energy to read that. I could test it as black box, learning that given a parameter of a number, it gives me a number:

As if I didn’t know what the rabbit implements or recognize the pattern in the output, I was thinking of just asking ChatGPT about it. However, I did not get that far.

Instead, I wrote def function(): on my IDE while GitHub copilot is on, thinking I would have wrapped the program into a function. It reformatted it to something a bit more readable.

Prompting some more in the context of code.

Comment line “#This function” proposes “is obfuscated”. Duh.

Comment line “#This function imp” proposes "lements the Fibonacci sequence using Binet’s formula.

At this point, I ask chatGPT how to test a function that implements the Fibonacci sequence using Binet’s formula. I get long text saying try values I already tried, but in code, and a hint to consider edge cases and performance. I try a few formats to ask for a value that would make a good performance benchmark, and lose patience.

I google for performance benchmark to learn that this Binet’s formula is much faster than the recursive algorithm, and find performance benchmarks comparing the two.

I think of finalizing my work today with inserting the bunny code into chatGPT and asking “what algorithm does this use” to get second language model generate likely answer as the Binet’s formula. Given the risk and importance of this testing at this time, I conclude it’s time to close my case.

There are so many uses to figure out what it is I am testing (while being aware of what I can share with tool vendors when giving access to code) and this serves as a simulation of idea that you could ask about the pull request. This was the case I wanted to add to the world today.

I should write a real case study. After all, that was one of the accommodations we agreed with multiple levels above me in management when my team at work started GitHub Copilot tryouts some 6 months ago. I should publish this in action, with a real time. As soon as something generates me some time.

List Ways in Which AI is Used In Testing

2024-03-03T19:01:00.002+02:00

I have now been through 3 out of 30 days of AI in Testing by Ministry of Testing. I was awarded my fifth Anniversary Badge, meaning that I have not shown up in that community for a while.

The first day was introduction. Second day was to read an introductory article. Third day asked to list ways AI is used in testing. As with blogging, I filled the paper and wanted to leave the notes from personal experience behind as a blog post.

Practical applications, personal reflection rather than research:

Explaining code. Especially on particularly tired day while being aware that I cannot share secrets, I like to ask chatGPT to explain to me what changes with code in some pull request so that I would understand what to test. Answers vary from useful to hilarious, and overly extensive.

Test ideas for a feature. Whenever I have completed an analysis of a feature to brainstorm my ideas, I tend to ask how chatGPT would recommend me to test it. Works nicely on domain concepts that are not this company only, and I have a lot of that with standards and weather phenomena.

Manipulating statistics. I seem to be bad at remembering excel formulas, but I use a lot of cross-referencing test generated results in excels. ChatGPT has been most helpful in formulas to manipulate masses of data in excel.

Generating input/output data. Especially with Co-pilot, I get data values for parameterized tests. Same test, multiple inputs and outputs generated. More effort into reviewing if I like them and find them useful.

Generating (manual) test cases. I have seen multiple tools do this, and I hate hate hate it. I always turn off steps from test cases and write down only core insights I would want the future me to remember in 3 months.

Generating programmatic tests. Copilot does well on these on unit testing level, but I am not sure I would want all that stuff available. Sometimes helps in capturing intent. But I prefer approvals of inputs and outputs over handcrafted scripts anyway for unit level exploratory testing.

Generating tests based on models. Has nothing to do with AI, but is a pre-AI practice of avoiding writing scripts and working with more maintainable state-based models. Love this, especially for long running reliability testing.

Generating database full of test data. Liked tools of this. I think they are not AI though, even though they often claim they are. The problem of having realistic pattern but not real people’s data is a thing.

Refactoring test code. Works fine at least for robot framework and python tests as long as we first open the code we want to move things from. Trust things to be pre-aware and suffer the duplication. We’ve been measuring this a bit, and copilot seems to encourage duplication for us.

Wrote down few, will revisit when time is right.

Fooled by Microservices, APIs and Common Components

2024-03-02T21:01:00.003+02:00

These days writing software is not the problem. Reading software is the problem. And reading is a big part of the real problem, which is owning software. Last two years has been a particularly challenging experience in owning software, and navigating changes in owning software. I have not cracked it, I am not sure if I will crack it but I have learned a lot.

To set the stage of my experience. Imagine coming to a company with a product created over a period of 20 years. There's a lot of documentation, none of it particularly useful except code. While the shape of the existing product is invisible, you join a team dedicated to modernisation. And the team has already chosen a rewrite approach.

For the first year, the invisible is not a priority. After all, it will be replaced, and figuring out the new thing is enough work as is with a new product. With what feels like heroic effort, you complete the goal of the year with managed compromises. Instead of full rewrite, it's full rewrite of selected pieces. The release is called "Proof of Concept" and it does not survive the first customer contact.

Second year has goals set, to add more functionality on top of the first year. The customer feedback derails goals leading to an entire redesign of the user interface, addressing 9/19 individually listed pieces of feedback. Again what feels like heroic effort, you complete the goal of the year with managed compromises, but now an upwards rather than downwards trend.

The second year starts to give a bit of shape to the existing invisible product, with deliberate actions. You learn you own something with 852k lines of code. It has 6.2% duplication, and 16,4% unit test code coverage. The new thing you've been focusing on has grow into 34k lines of code, with 5% duplication and unit test code coverage of 70%, and great set of programmatic tests that don't hit the unit test coverage numbers.

Meanwhile, you start seeing other trends:

Management around you casually drops expectations of microservices, APIs that allow easily building other products than this one, and common components with expectations of organizational reuse.
You and your team are struggling with explaining that the compromises you took really may not have changed it all to what some people now seem to be expecting,
You realize that what you now live with is four different generations of technological choices no one told the invisible mass would bring you - time adds to your understanding

In one particularly difficult discussion, someone throws microservices at you as the key thing, and you realize that the two sides of the table aren't talking about the same thing.

One party is describing Scripts with flat file integration:

Promised by shadow R&D without promises of support, usually done in scale of days
Automate a specific task
Deployment is file drop on filesystem, and not available if not dropped
Can break with product changes
Output: file or database, usually a file
Expected to not have dependencies
Needs monitoring extended separately
Using files comes with inherent synchronisation problems
Needs improving if scale in insufficient
Can be modified by a user
Batch work, will not be real time
File based processing steps quickly increase complexity

The other party is describing microservices with API integration:

Promised by R&D assuming it will drive product perspective forward, usually in scale of months when deployment risks are addressed
Develop apps that scale
Deployment is with product and feature can be on or off
Protected by design as product capability with product changes
Output: well defined API
Deployed independently / separately with dependencies - API, DB, logic, UI
Follows a pattern that allows for common monitoring
Uses http and message queues as communication protocol
Can be scaled independently
Can't be modified by a user
Can be close to real time data synchronization
Plug in extra processing steps

One party thinks micro is "fast additions of features" and other thinks it is a way of creating common designs.

You start paying more attention of what people are saying and expecting, and realize the pattern is everywhere. Living with so many generations of expectations and lessons makes owning software particularly tricky.

We're often coming to organizations at a time. We're learning as we are moving along. And if we are lucky, we learn sooner rather than later. Meanwhile, we are fooled by the unknown unknowns, and confused by the promises of the future in relation to realities of today.

The best you can do is ground conversations to what there is now, and manage to better from there.

How AI changes Software Testing?

2024-03-01T22:14:00.005+02:00

This week Wednesday, two things happened.

I received an email from Tieturi, a Finnish training company, to respond to the question "How AI changes software testing?".

I went to Finnish Testing Meetup group to a session themed on AI & Testing.

These two events make me want to write two pieces into a single post.

My answer to the question
My thinking behind answering the way I do

My Answer to the Question: How AI changes Software Testing

I know the question is asking me to speculate on the future, but the future is already here, it's not just equally divided - repurposing the quote from a sci-fi author.

Five years ago, AI changed *my software testing* by becoming a core part of it. I tested systems with machine learning. I networked with people creating and testing systems with machine learning. Machine learning was a test target, and it was a tool applied on testing problems for me, in a research project I set up at the employer I was with at that time.

Five years ago, I learned that AI -- effectively applications of machine learning -- are components in our systems, "algorithms" generated from data. I learned that treating systems with these kinds of components as if the entire system is "AI" is not the right way to think about testing, and AI changed my software testing with the reality that it is even more important than ever to decompose our systems into the pieces. These pieces are serving purpose in the overall flow and there's a lot of other things around.

Now that I understood that AI components are probabilistic and not hand-written, I also understood that

the problem is not testing of it, but fixing of it. We had a world where we could fix bugs before. With AI, we no longer had that. We had possibility of experimenting with parameters and data in case those created a fix. We had possibility of filtering out the worst results. But the control we had would never again be the same.

For five years, I have had the privilege of working to support teams with such systems. I was very close on focusing solely on one such team but felt there was another purpose to serve.

Two years ago, AI changed *my software testing* by giving me GitHub Copilot. I got it early on, and used it for hobby and teaching projects. I created a talk and a workshop on it based on Roman Numerals example, and paired and ensembles on use of it with some hundred people. I learned to make choices between what my IDE was capable of doing without it, and it, and reinforced my learning of intent in programming. If you have clarity of intent, you reject stupid proposals and let it fill in some text. I learned that my skills of exploratory testing (and intent in that) made me someone who would jump to identify bugs in talks showing copilot generated code.

These two years culminated 6 months ago into me and my whole team starting to use copilot out our production code after making agreements on accommodations for ethical considerations. I believe erasing attribution for the open source programmers may not be direct violation of copyright, but it is ethically shifting power balance in ways I don't support. We agreed on the accommodations: using work time to contribute on open source projects and using direct money to support open source projects.

One year ago AI changed *my software testing* by access to ChatGPT. I was on it since its first week, suffering through the scaling issues. I had my Testing Dozen mentoring group testing it as soon as it was out, and I learned that the thing I learned in 5 years of AI about decomposing systems, newbies were lost on. From watching that group and then teaching ensembles after that scaling to about 50 people including professional testers in the community, I realized the big change was that testers would need to skill up in their thinking. Noticing it has gender bias is too low a bar. Knowing how you would fix gender bias in data used to teaching would now be required. Saying there's a problem would not suffice for more than adding big blunders to filtering rules. Smart people at scale would fill social media with examples how your data and filtering fixes were failing.

One year ago, I also learned the problems of stupid testing -- test case writing would scale to unprecedented heights with this kind of genAI. I would be stuck in perpetual loop of someone writing text not worth reading. Instead of inheriting 5000 (manual) test cases a human wrote and throwing them away after calculating it would take me 11 full working days to read them with one minute each, I would now have this problem at scale with humans babysitting computers creating materials we should not create in testing in the first place.

Or creating code that is just flat out wrong even if the programmer does not notice with lack of intent on.

AI would change testing to be potentially stuck in the perpetual loop of copy-pasting mistakes in scale and pointing the same ones out in systems. We would be reviewing code not thought through algorithmically. And this testing would be part of our programmer lives because testing this without looking at the code would be non-sensical.

They ask how AI changes Software Testing - it already changed it. Next we ask how people change software testing, understanding what they have at hand.

I have laughed with AI, worked with tricky bugs making me feel sense of powerlessness like never before, learned tons with great people around AI and its use. I have welcomed it as a tool, and worried about what it does when people struggle with asking help, asking help from a source such as this without skills of understanding if what is given is good. I've concluded that faster writing of code or text is not the problem - reading is the problem. Some things are worth reading for a laugh.

AI changed software testing. Like all technology changes software testing. The most recent change is that we use word "AI" to talk about automation of things to get computers acting humanly.

natural language processing to communicate successfully in a human language.
knowledge representation to store what it knows or hears.
automated reasoning to answer questions and to draw new conclusions.
machine learning to adapt to new circumstances and to detect and extrapolate patterns.
computer vision and speech recognition to perceive the world.
robotics to manipulate objects and move about.

I feel like adding specific acting humanly uses cases like 'parroting to nonsense' or 'mansplaining as a service', to fill in the very human space of claims and stories that could be categorised as fake news or fake certainty.

What we really need to work on is problems (in testing) worth applying this for. Maybe it is the popular "routes a human would click" or the "changing locators" problems. Maybe it is the research-inspiring examples of combining bug reports from users with automated repro steps. Maybe it's the choices of not to test all for all changes. We should fill space more with decomposed problems over discussion about "AI".

My thinking behind answering the way I do

This week the people on stage at the meet-up said they are interested yet not experienced in this space. I was sharing some of my actual experience from the audience, as I am retired from public speaking. There is a chance I may have to unretire with a change of job I am considering, but until then I hold space for conversations as chair of events such as AI & Testing in a few weeks, or as loud audience member of events such as Finnish Testing Meetup this week. I don't speak from stage, but I occasionally write, and I always have meaningful 1:1 conversations with my peers over video, the modern global face to face.

I collaborate a lot of different parties in the industry as part of my work-like hobbies. It's kind of win-win for me to do my thing and write a blog post and for someone else to make business out of intertwining my content with their ads. I have said yes to many such request this last month, one of those allowing a Finnish training company Tieturi to nominate me for a competition for the title of "Tester of the Year 2023 in Finland". This award has been handed out 16 times before, and I have been nominated every single year for 16 years, I just asked not to include me by actively opting out after someone had nominated me for 4 or 5 years.

The criteria for this award I have never been considered worthy to win is:

inspiring colleagues or other organizations to better testing
bring testing thoughts and trends from the world to Finnish testing
influenced positively testing culture in their own organization
influenced positively to resultfulness of testing (coverage, found bugs etc) in own organization or community
created testing innovations, rationalising improvements or new kinds of ways to do testing
influenced the establishing of testing profession in Finland
influenced Finnish testing culture and testers profession development positively
OR in other ways improved possibilities to do testing

I guess my 26 years, 529 talks, 848 blog posts in this blog or the thousands of people I have taught testing don't match this criteria. It was really hard to keep going at 10 years of this award, and I worked hard to move past it.

So asking me to freely contribute "How AI changes Software Testing?" as marketing material may have made me a little edgy. But I hope the edginess resulted in something useful for you to read. Getting it out of my system at least helped me.

Contemporary Bug Advocacy

2024-02-27T23:35:00.002+02:00

A few weeks back in Mastodon, Bret Pettichord dusted up a conversation about something we talked about a lot in testing field years ago, Bug Advocacy. Bug advocacy was something Cem Kaner discussed a lot, and a word that I was struggling with translating to Finnish. It is a brilliant concept in English but does not translate. Just not.

Bug advocacy is this idea that there is work we must do to get the results of testing wrapped in their most useful package. A lot of the great stuff on BBST Bug Advocacy course leads one to think it is a bug reporting course, but no, it is a bug research course. A brilliant one at that. Bug advocacy is the idea that just saying it did not work for you does little. It actually has more of a negative impact. Do your research. Report the easiest route to the bug. Include the necessary logs. Make the case for someone wanting to invest time in reading the bug report, even under constraints of time and stress.

Bug advocacy was foundational for me as a learning tester 25 years ago. It was essential at time of publishing Lessons Learned in Software Testing. It was essential when Cem Kaner created the BBST training materials. And it is sometimes like a lost art these days. In other words, it is something people like me learned and practices so long, that we don't remember to teach it forward.

At the same time, I find myself broadcasting public notes in Mastodon and LinkedIn on theme that I would call Contemporary Bug Advocacy - an essential part of Contemporary Exploratory Testing. Like the quote from Kaner's and Pettichord's book says, we kind of need to do better than fridge lights that do all their work while no one is using the results.

Inspired by the last two weeks of testing with my team, I collected a listing of tactics I have employed in contemporary bug advocacy.

Drive by Data. I "reported" a bug by creating data in a shared test environment that made a bug visible. The bug vanished in a day. No report, no conversation. Totally intentional.
Power of Crowd. I organized an ensemble testing session. Bugs we all experienced vanished by end of the day. I have used this technique in the past to get complete API redesign needs by smart use of the right crowd.
Pull request with a fix. I fixed a bug, and sent the fix for review for the developer. Unsurprisingly, a fix can be more welcome than a task to fix something.
Silent fix. I just fix it so that we don't have to talk about it. People notice changes with their routines of looking at what is going on in the code.
Pairing on fix. I refused to report, and asked to pair on the fix for me to learn. Well, for me to teach too. Has been brilliant way of ramping up knowledge of problems dealing with root causes rather than repeated symptoms.
Holding space for fix to happen. A colleague sat next to me while I had not done a simple thing, making it clear they were available to help me but would not push me to pairing.
Timely one liner in Jira. I wrote title only bug report in Jira. That was all the developer needed to realize they could fix something, and the magic was this all happened within the day of the bug being created while they were still in context.
Whisper reporting. I mentioned a bug without reporting it. Developers look great when they don't have bug reports on them. I like the idea of best work winning when we care about the work over credit. Getting things fixed is work, claiming credit with report is sometimes necessary but often a smell.
Failing test. Add a failing test to report a bug, and shift work from there. Great for practicing very precise reporting.
Actual bug report. Writing great summary, minimal steps to repro and making clear your expected and actual results. Trust it comes around, or enhance your odds by other tactics.
Discuss with product owner. Your bug is more important when layered with other people's status. I apply this preferably before the report for less friction.
Discuss with developer. Showing you care about colleagues priorities and needs enhances a collaboration.
Praise and accolades. Focus your messaging on the progress of vanishing, not emerging. Show the great work your developer colleagues are doing. So much effort and so little recognition, use some of your energy in balancing to good.
Sharing your testing story. Fast-forward your learning and make it common learning. A story of struggles and insights is good. A shared experience is even better.
Time. Know when to report and how. Timing matters. Prioritise their time.

All of these tactics are ways to consider how to reduce friction for the developer. Advocate. Enable. Help. Do better than shine light when the door is closed.

I call this contemporary because writing a bug report is simple. It is basic. But adding layers of tactics to it, that is far from simple. It is not a recipe but a pack of recipes. And you need to figure out what to apply when.

I found nine problems yesterday. I applied four different tactics on those nine problems. And I do that because I care about the results. Results of testing is information we have acted upon. Getting the right things fixed, and getting the sense of accomplishment and pride for our shared work in building a product.

Everyone can test but their intent is off

2024-02-21T21:43:00.002+02:00

Over my 8 years of ensemble and pair testing as primary means of teaching testers, I have come to a sad conclusion. Many people who are professionally hired as testers don't know how to test. Well, they know how to test but from their testing, there is a gaping results gap. Invisible one. One they don't manage or direct. And the sad part is that they think it is normal.

If you were hired to do 'testing' and you spent all your days doing 'testing', how dare I show up to say your testing is off?!? I look at results, and the only way to look at results you provide is to test after you.

My (contemporary) exploratory testing foundations course starts from a premise of giving you a tiny opportunity to assess your own results, because the course comes with the tool of turning invisible ink to visible, that is listing of problems me (and ensembles with me) have found across some hundreds of people. I used to call it 'catch-22' but like usual with results of testing, more work on doing better has grown the list to 26.

Everyone can test, like everyone can sing. We can do some slice of the work that resembled doing the work. We may not produce good enough results. We may not produce professional results that lead into paying for that work. But we can do something. Doing something is often better than doing nothing. So the bar can be low.

An experience at work leads me to think that some testers can test but their intent is so far off that it is tempting to say they did not test. But they did test, just the wrong thing.

Let me explain the example.

The feature being tested is one of sending pressure measurement values from one subsystem to another, to be displayed there. We used to have a design where the value of the first subsystem that was used on its user interfaces after various processing steps was sent forward to the second subsystem. Then we mangled that value because it followed no modern principles of programmatically processable data so that we could reliably show the value. We wanted to shift the mangling of the data to the source of the data, with information about the data, just in a beautiful modern wrapper of a consistent data model.

Pressure measurement was the first in the line of beautiful modern wrappers. The assignment for the tester was to test this. Full access to conversations with the developer was available. And some days after the developer said "done and tested", the tester also came back with "tested!". I started asking questions.

I asked if one of the things they tested was to change configurations that impact pressure values so that they can see differences of pressure at sea level (a global comparison point) and at the measurement location. I got an affirmative response. Yet I had a nagging feeling, and built on the yes inviting the whole team to create a demo script for this pressure end to end. Long story short, not only did the tester not know how to test it, it was not working. So whatever they called testing was not what I was calling testing. The ensemble testing session also showed that NO ONE in the team knows the feature end to end. Everyone conveniently looks at a component or at most a subsystem. So we all learned a thing or two.

Equipped with the information from the ensemble testing experience, the tester said to take more time testing before coming back with "tested!". They did, and today they came back - 7 working days later. I am well aware that this was not the only thing they had on their agenda - even if their agenda is on their full control - and we repeated the dance. I started asking questions.

They told me they updated the numbers in the model we created for the ensemble testing session. I was confused - what numbers? That model sketched early ideas of how three height parameters would impact the measurements in the end, but it was a quick sketch from an hour of work, not a fill in the ready blanks model. So I asked for demo to understand.

I was shown how they changed three parameters of height. Restart the subsystem. Basic operations of the subsystem they have been testing for over a year. The interesting conversation was on what did they then test. It turns out they did the exact same moves on the subsystem on a build without the changes and on a build with the changes. They concluded that since the difference they see is in sending the data forward but the data is the same, it must be right. Regression oracle. But very much partial oracle, and partial intent.

In the next minutes, I pulled up a 3rd party reference and entered the parameters to have comparable values. We learned that they had the parameters wrong, because if the values aren't broken with the latest change, then the change of configuration is likely incorrect. They did not explore values for plausibility, and they were way off.

I asked to show what values they compared, to learn they chose an internal value. I asked to pull up the first subsystem's user interface for the comparable values. Turns out that the value they compared is likely missing multiple steps of processing to become the right value.

For junior testers such as this one, I expect I will coach them by having these conversations. I have been delighted with picking up information as new information comes, and I follow the trend of not having to cover same ground all the time. I understand how this blindness to result comes about: in most cases testing a legacy system, the regression oracle keeps them on path. In this case, it leads them to the wrong path. They only took ISTQB course which does not teach you testing. I am their first person to teach this stuff. But it is exhausting. Especially since there are courses like mine or like BBST that they could learn from, and provide the results for the right intent. Learn to control their intent. Learn to *explore*.

At the same time, I am thinking that all too often juniors to testing - regardless of their years of experience - could learn slightly more before becoming same costs as developers. This level of thinking would not work for a junior developer.

Testers get by because they can be just without value. Teamwork may make them learn or become negative value. But developers don't get by because the without value of their work gets flagged sooner.

Making Releases Routine

2024-02-14T20:28:00.002+02:00

Last year I experienced something I had not experienced for a while: a four month stabilisation period. A core of the work of testing-related transformations I had been doing with three different organizations was to bring down release timeframes, from these months long versions to less than an hour. Needless to say, I considered the four month stabilisation period a personal fail.

Just so that you don't think that you need to explain me that failing is ok, I am quite comfortable with failing. I like to think back to a phrase popularised by Bezos 'working on bigger failures right now' - a reminder that too safe means you won't find space to innovate. Failing is an opportunity for learning, and inevitable when experimenting in proportion to successes.

In a retrospecting session with the team, we inspected our ways and concluded that taking many steps away from a good known baseline with insufficient untimely testing, this is what you would get. This would best be fixed by making releases routine.

There is a fairly simple recipe to that:

Start from a known good baseline
Make changes that allow for the change you want for your users
Test the changes in a timely fashion
Release a new known good baseline

The simple recipe is far from easy. Change is not easy to understand. And it is particularly difficult if you only see the change in small scale (code line) and not in system (dependencies). And it is particularly difficult if you only see the system but not the small scale. In many teams developers have been pushed too small, and testers have not been pushed small enough. This leads to delayed feedback because the testing done in timely fashion misses results in testing that starts to lag behind from changes.

In addition to results gap on information we need and information we have and its time dimension, the recipe continues with release steps. Some include all of the results gap in release tests because testing can't learn to be timely, muddling the waters of how long it takes to do a release. But even when feature and release testing are properly separated, there can be many steps.

In our team's efforts of making releases routine, I just randomly decided this morning that today is a good day for release. We have a common agreement that we would do release AT LEAST once a month, but if practice is what we need, more is better. For various reasons, I had been feature testing changes as they come. I have two dedicated testers who were already two weeks behind on testing, and if I learned something from last year's failing, it's that junior testers have less developed sense of timing of feedback, partially rooted in the fact that skills in action need rehearsing at slower pace. Kind of like learning to drive a car, slow down while turning the wheel and looking around are hard to do at the same time! I was less than an hour away from completing feature testing at time of deciding for the release.

I completed testing - reviewed the last day of changes, planned tests I wanted, and tested those. All that was remaining then was the release.

It took me two hours more to get the release wrapped up. I listed the work I needed to do:

Write release notes - 26 individual changes to message worth saying
Create release checklist - while I know it by heart, others may find it useful to tick off what needs doing to say its done
Select / design title level tests for test execution (evidence in addition to TA - test automation)
Split epics to this release - other release so that epics reflect completed scope over aspirational scope. and can be closed for the release
Document per epic acceptance criteria, esp. out of scope things - documentation is an output not input, but if I was testing, it was a daily output not something to catch up at release time
Add Jira tasks into epics to match changes - this is totally unnecessary but I do that to keep a manager at bay, close them routinely since you already tested them at pull request stage
Link title level tests to epics - again something normally done daily as testing progresses, but this time was left outside the daily routine
Verify traceability matrix of epics ('requirements') to tests ('evidence') shows right status
Execute any tests in test execution - optimally one we call release testing and would take 15 minutes on the staging environment
Open Source license check - run license tool, compare to accepted OSS licenses and update licenses.txt to be compliant with attribution style licenses
Lock release version - Select release commit hash and lock exact version with a pull request
Review Artifactory Xray statistics for docker image licenses and vulnerabilities
Review TA (test automation) statistics to see it's staying and growing
Press Release-button in Jira so that issues get tagged - or work around reasons why you couldn't do just that
Run promotion that makes the release and confirm the package
Install to staging environment - this is something from 3 minute run a pipeline to 30 minutes do it like a customer does it
Announce the release - letting others know is usually useful
Change version for next release in configs

This took me about 2 hours. I skipped the install to staging though. And I have a significant routine in all these tasks. What I do in a few hours, the experience shows it takes about a week when moved forward and about a day for me in answering questions. Not a great process.

There are things that could be done to start a new release, in conjunction with Change version for next release:

Create release checklist

There are things that should become continuous on that list:

Select / design title level tests
Split epics
Document per epic acceptance criteria
Add Jira tasks into epics to match changes
Link title level tests to epics
Verify traceability matrix
Execute any tests in test execution

There's things that could happen on a cadence that have nothing to do with releases:

Review Artifactory Xray statistics
Review TA (test automation) statistics

And if we made these changes, the list would look a lot more reasonable:

Write release notes
Execute ONE test in test execution
Open Source license check
Lock release version
Press Release-button in Jira
Run promotion that makes the release
Install to staging environment
Announce the release
Change version for next release

And finally, looking at that list - there is no reason why all but the meaningful release notes message can't happen on ONE BUTTON.

I like to think of this type of task analysis visually:

This all leaves us with two challenges: extending the release promotion pipeline to all the chores and the real challenge: timely resultful testing by people who aren't me. Understanding that delta has proven a real challenge now that I am not a tester (with 26 years of experience coined into contemporary exploratory testing) and a significant chunk of my time is on my other role: being a manager.

The Power of Framing

2024-01-19T19:22:00.007+02:00

Sometimes, we write on topics we have not researched, but still have things to say on. This is how I frame this post: I am not an expert in framing. There is admirable levels of eloquence, excellent teaching materials I have seen, but my practice of this is one of a learner.

Me setting the stage of the post is framing. You put a perspective around a thing, that allows you to see the thing. It might be that you are framing to see things in a similar light, or you might use framing to change the narrative on a topic. Today I had two examples in mind that I wanted to make a public note of.

Example 1: "I'm a bad direct report" --> "I'm an employee with entrepreneurial touch"

In a conversation about managing up - managing your manager - we ended up talking about understanding what is important to you, and that what you seek may differ from what others seek. I expressed importance of agency, the sense that I am on the controls of my work and expectations for my work are things I negotiate, rather than take as givens. When someone violates my agency, I react strongly.

This lead to someone else sharing how they consider themselves "bad direct report" and have chosen entrepreneurship where traits like established structures and rules, obeying without the why, questioning why, asking questions for deeper understanding, and built in drive for the better are helpful and welcome.

I recognised the sentiment and similarity to how I think, yet noted my choices have turned very different. I have chosen to join organizations and swim against the stream. As reaction to the story, I realized I frame the same story as "I am an employee with entrepreneurial touch", and "rebel at work", who can move mountains for organizations if they appreciate the likes of me.

I did not even think of this as framing, I just shared how I have placed a frame on something that I could choose to frame, very realistically that I am many bosses nightmare. I don't obey, I seek goals and I motivate routines through playing fairly with others as I don't think I am entitled to break flows other people rely on, without taking them along for the change ride.

Example 2: "This company has not given me training for 2 years" --> "Learning matters"

Another sample of framing was inspired by noticing a frame of describing a true experience: "This company has not given me training for 2 years". To see the frame, we note the definition of training.

Training in this case is not taking the compulsory e-learning courses where even a manager is required to check you have completed the training.

Training is not taking online courses that have an immediate impact on the work ongoing.

Training is money out of companies' pocket to send me somewhere I could not go with my salary. It is a salary surplus.

To frame this for myself, I drew a picture 2 months ago, that I then shared on social media.

If I frame training to learning, I can see a variety of options. I can see that while the visible money on my learning may not match my expectations, my experience of being allowed to use invisible money (doing work slower) has definitely been an investment to my learning.

The picture includes a few "Done is 2023" ticks with visible money, as I got to go to EuroSTAR and HUSTEF (as keynote speaker, company paying my daily allowance). That's not where I learned though. I learned the most in 2023 from push benchmarks, meaning I shared how we work / what our problem is, and got guidance on things other people tried - the community approach.

The same learning framing affords me agency. Instead of "no training given", I can assess "learning I made space for, with support/roadblocks".

Framing changes how I feel about the same true experience. And how I feel about it changes what I can do.

Model-Based Testing in Python

2023-12-08T21:02:00.001+02:00

Some of the best things about being a tester but also a manager is that I have more direct control over some of the choices we make. A year ago I took part in deciding that my main team will have no testers, while my secondary team had two testers. Then I learned secondary team did some test automation not worth having around and fixed that by hiring Ru Cindrea, co-founder of Altom, to help the secondary team. She refactored all of the secondary team's automation (with great success I might add). She took over testing an embedded product the secondary team had ownership on (with great success again). And finally, I moved her to my primary team to contribute on a new concept I wanted since I joined Vaisala 3,5 years ago: applying model-based testing. This all has happened in a timeframe of one year.

Some weeks back, I asked her to create a presentation on what we have on model-based testing and what that it about, and she did that. There is nothing secret in this type of testing, and I was thinking readers of this blog might also benefit from a glimpse into the concept and experience.

The team has plenty of other automated tests, traditional hand-crafted scripts in various scopes of the test target. These tests though, they are in scope of the whole system. The whole system means two different products with a dependency. The programmatic interfaces include a web user interface, a remote control interface to a simulator tool, and an interface to the operating system services the system runs on. This basically means a few python libraries for purposes of this testing, for each interface. And a plan to add more interfaces for control and visibility at different points of observation of the system.

In explaining why of model-based testing, Ru used an example of a process of how we give control and support for people observing weather conditions with the product to create standard reports. In the context we work with, there are reports that someone is creating on a cadence, and these reports then get distributed globally to be part of a system of weather information. Since there's an elaborate network that relies on cadence, the product has features to remind about different things that need to happen.

Choosing some states as screenshots of what a user sees, showing why model-based testing became a go-to approach for us makes sense. Traditionally we would analyse to draw these arrows, and then hand-craft test scenarios that would go through each arrow, and maybe even combinations of a few of the most important ones. Often designing those scripts was tedious work, and not all of us were visually representing them. We may even have relied on luck on what ended up as the scripts, and noting some of the paths we were missing through the exploratory testing sessions where we let go of the preconceived idea of the paths.

The illustration of screenshots is just a lead in to the model. The model itself is a finite state model with nodes and vertices, and we used AltWalker (that uses GraphWalker) on creating the model.

The model is just a representation of paths, and while visual, it is also available as a structured text file. Tests are generated based on the model to cover paths through the model, or to walk through the model even indefinitely.

For the application to respond, we needed some code. For the webUI part, we use selenium, but instead of building tests, we build something akin to POM (page object model) with functions to map with the actions (arrows) and checks (boxes) in the model.

For use of Selenium, we have three main reasons. One is that Playwright that we have used otherwise on UI tests in this product does not play with AltWalker (for now, I am sure it is pending to change) due to some dependency conflicts. The other that I quite prefer, is that Selenium WebDriver is the only driver for real browsers to the level of what our users use. It gives us access to real Safari, real and specific versions of Chrome and Firefox, instead of an approximation that works nicely for most application-level testing stuff we do with playwright. The third reasons is that while Ru learned Playwright in some days to no longer find it new, she has decade on Selenium and writes this stuff by heart. The same days of effort suffices for the others on playwright to learn enough of Selenium to work on these tests.

We had other python libraries too, but obviously I am more comfortable sharing simple web page actions than showing details of how we do remote control of complex sensor networks and airport might have.

In the end of Ru's presentation, she included a video of a test run with the visual showing ramping up the coverage. We watched boxes Turing from grey to green, our application doing things as per the model, and I was thinking about how I keep doing exploratory testing sometimes by looking at the application run, and allowing myself to pick up cues from what I see. For purposes of this article, I did not include the video but a screenshot from it.

In the end of the short illustration, Ru summed up what we have gained with model-based testing.

It's been significantly useful in reliability testing we have needed to do, in pinpointing performance-oriented reliability issues. The reliability of the tests has been a non-issue.

A system like this runs for years without reboots, and we need a way of testing that includes production-like behaviors with sufficient randomness. A model can run until it's covered. Or a model can run indefinitely until it's stopped.

The obvious question from people is if we can generate the model from the code. Theoretically perhaps in a not distant future. But do we rather draw and think, or try to understand if a ready drawing really represents what we wanted it to represent. And if we don't create a different model, aren't we at risk of accepting all the mistakes we made and thus not serving the purpose we seek for testing it?

We hope the approach will enable us to get a second organization acceptance testing our systems to consume test automation with its visual approach, and avoid duplication in organizational handoff, and to increase trust without having to rebuild the testing knowledge to same levels as with an iteratively working software development team would have.

While the sample is one model, we already have two. We will have more. And we will have combinations of feature models creating even more paths we no longer need to manually design for.

I'm optimistic. Ru can tell what it is like to create this, and she still speaks at conferences while I do not. I'm happy to introduce her in case you can't find her directly.

GitHub Copilot Demos and Testing

2023-11-21T01:11:00.000+02:00

I watched a talk on using generative AI in development that got me thinking. The talk started with a tiny demo of GitHub Copilot writing a week number function in Javascript.

This was accepted as is. Later in the talk, the speaker even showed a couple of tests also created with the help of the tool. Yet the talk missed an important point: this code has quite significant bug(s).

In the era of generative AI, I could not resist throwing this at ChatGPT with prompt:

Doesn't this give me incorrect week numbers for early January and leap years?

While the "Yes, you're correct." does not make me particularly happy, what I learn in the process of asking that this is an incorrect implementation of something called ISO week number, and I never said I wanted ISO week number. So I prompt further:

What other week numbers are there than ISO week numbers?

I learn there's plenty - which I already knew having tested systems long enough.

The original demo and code came with the claim of saving 55% of time, and having 46% of code in pull requests already being AI generated. This code lead me down an evening rabbit hole with a continuous beeping from mastodon by sharing this.

Writing wrong code fast is not a positive trait. Accepting - and leading large audiences to accept this - is making me angry.

Start with better intent. Learning TDD-like thinking actually helps.

Writing wrong code fast. Writing useless documentation fast. Automating status updates in Jira based on other tools programmers use. Let's start with making sure that the thing we are automating is worth doing in the first place. Wrong code is't. Documentation needs readers. And management can learn to read pull requests.

Secret sauce to great testing is to change the managers

2023-11-16T21:17:00.003+02:00

I had many intentions going into 2023, but sorting out testing in my product team was not one of them. After all, I had been the designated testing specialist for all of 2022. We had a lovely developer-tester collaboration going on, and had made agreement on not hiring a tester for the team, at all. Little did I know back then.

The product was not one team, it was three teams. I was well versed on one, aware (and unhappy with) another, and oblivious to third. By end of this year, I see the three teams will succeed together, as they would have failed together not paying attention to the neighbours.

Looking at the three teams, I started with a setup where two teams had no testers, and one team had two testers. I was far from happy with the team with two testers. They created a lot of documentation, and read a lot of documentation, still not knowing the right things. They avoided collaboration. They built test automation by first writing detailed scenarios in jira tickets, and then creating a roadmap of those tickets for an entire year. You may not be surprised that I had them delete the entire roadmap after completing the first ticket from that queue.

So I made some emergent changes - after all, now I was the manager with the power.

I found a brilliant system expert in the team I had been unaware of, and repositioned them to the designated testing position I had held the previous year with assumption of complete career retraining. Best choice ever. The magic of seeing problems emerged without me having to do all of that alone.

I tried asking for better testing with the two testers I had, and watching the behaviors unfold I realized I had a manager problem. I removed two managers from between me and the tester that had potential, and have been delightedly observing how they know excel in what I call contemporary exploratory testing. They explore, they read (developer owned) automation, add to shared scope of automation what they learned should be in by exploring, and have conversations with the developers instead of writing them messages in Jira.

I brought in two people from a testing consultancy I have high respect for, Altom. I know they know what I know of good testing, and I can't say that of most test consultancies. First consultant I brought in helped me by refactoring the useless test automation created by those scenarios in jira tickets instead of good information targeting thinking. We created ways of remotely controlling the end to end system, and grew that from one team to two teams. We introduced model based testing for reliability purposes, and really hit something that enabled success through a lot of pain this year. A month ago, I brought in a second person from Altom, and I am already delighted to see them stepping up to cross team system test responsibilities clarification while hands-on testing.

So in total, I started off with idea of two traditional testers I'd rather remove, and ended up with four contemporary exploratory testers who test well, automate well on many scopes and levels, and collaborate well. In exchange, I let go a developer manager with bad ideas of testing (forcing idea of *manual testing with test cases* I had hard time getting away from) and a test lead following the advice others gave over advice I gave.

We get so much more information to react on, in so much more actionable way this way. I may be biased as I dislike the test case based approaches to the core, but I am very proud of the unique perspectives these four bring in.

Looks like our secret sauce to great testing was to change the managers. Your recipe may vary.

Across two teams, I facilitated a tester hat workshop. Tester is something we all are. But specialist testers often seek for the gaps the others leave behind. Wanting to think you are something does not yet mean you are skilled as that. Testing is many hats, and being a product and domain expert capturing that knowledge in executable format is something worth investing in to survive with products that stay around longer than we do.

Planning in Bubbles

2023-11-15T21:52:00.001+02:00

We have all experienced the "process" idea to work we are doing. Someone identifies work and gives it label, and uses some words to describe what this work is. That someone introduces the work they had in mind for the rest of us in software development team, and then we discuss what the work entails.

Asking for something small and clear, you probably get clearer responses. But asking for something small and clear, you are also asked for vision, and what is around those small and clear things that we should be aware of.

Some teams take in tasks, where work that wasn't asked for comes in as next task. Others carry more of a responsibility by taking in problems, discovering the problem and solution space, and carry somewhat more ownership.

For a long time, I have been working with managers who like to bring tasks not problems. And I have grown tired of being a tester (or developer) in a team where someone thinks they get to chew things for me.

A year ago - time before managering - I used John Cutler's Mandate Levels model as ways of describing the two teams I was interfacing then, and how essentially they had different mandates.

In a year of managering, team B now works on mandates A-C, and team A now works only on A-E, no matter how much I would like us to move all the way to F.

With the larger mandate level and success of hitting good forward taking steps in measuring cost per pull request (and qualitatively assessing direction of steps), team A has not needed or wanted Jira. We gave up on it 1,5 years ago, cut cost of change to a fifth, and did fine until we did not. Three months ago I had no other choice but to confess to myself that with a new role and somewhat less hands-on testing on my days, we had accrued quality debt - unfinished work - in scales we needed Jira to manage it. Managing about half of it in Jira resulted in 90 tickets, and we still continued the encouragement of not needing a ticket to do a right thing.

Three months later, we are a few items short of again having a baseline to work on.

With the troubles accrued, I have theories of what causes that, and I call that theory the modeling gap. The request of a problem coming in leaves so much open for interpretation, that without extra care, we leave tails of unfinished work. So I need a way of limiting Work in Progress, and a way of changing the terms of modeling to language of majority (the team) for language of minority (product owner).

I am now trying out an approach I call The Team Bubble Process. It is a way of visualising work ongoing with my large mandate team where instead of intaking tasks, we describe (and prioritise) tasks we have as a team for a planning period. We don't need the usual step by step process, but we need to show what work is in doing, and what is in preparing to do.

The first bubble is doing - coding, testing, documenting, something that will see the customer and change what the product does for the customer. All other work is in support of this work.

The second bubble is planning, but we like to do planning by spiking. Plan a little, try a little. Plan a little, try more. It's very concrete planning. Not the project plans, but more like something you do when you pair or ensemble.

The third bubble is preplanning, and this is planning that feeds doing the right things. We are heavily tilted towards planning too much too early, which is why we want to call this preplanning. It usually needs revisiting when we are really available to start thinking about it. It's usually the work where people in support of the team want to collaborate on getting their thoughts sorted, typically about roadmap, designs and architectures.

The fourth bubble is preparing. It's yet another form of planning, now the kind of planning that is so far from doing that done wrong can be detrimental. Done well, can be really helpful in supporting the team. Work of prepare often produces things that the team would consider outside the bubbles, just to be aware.

I have now two sets of two-weeks modelled.

For first we visualized 8 to Do, and completed 4, continue on 4; 4 to Plan, and continue on all yet never starting 1; 1 to Preplan we, and never started it; 11 to Prepare, completed 3, continue on 4 and never started 4.

For second we visualized 14 to Do, 14 to Plan, 6 to Preplan, and 9 to Prepare.

What we already learned is that we have people who hope their responsibility and collaboration with team on something they think they'd like to prepare (but too early) will do work outside the bubbles and not be keen on sharing the team agenda. It will be fascinating to see if this helps us, but at least it is better than pretending Jira "epics" and "stories" are useful way of modeling our work.

Anyone can test but...

2023-10-25T21:15:00.002+03:00

Some years ago, I was pairing with a developer. We often would do work on the code that needed creating, working on the developer home ground. So I invited us to try something different - work that was home ground to me.

There was a new feature the other developers had just called complete, that I had not had time to look at before. No better frame of cross-pollinating ideas, or so I thought.

Just minutes into our pairing session of exploring the feature, I asked out loud something about sizes of things we were looking at. And before I knew, the session went from learning what the feature might miss for the user to the developer showing me bunch of tools to measure pixels and centimeters, after all, I had sort of asked about sizes of things.

By end of pairing session, we did not do any of the work I would have done for the new feature. I was steamrolled and confused, even deflated. It took weeks or months and a conference talk to co-deliver on pairing before I got to express that we *never* paired on my work.

I am recounting this memory today because I had a conversation with a new tester in a team with one of testerkind, and many of developerkind. The same feelings of being the odd one out, being told (and in pairing directed) to do the work the developer way resonated. And the end result was an observation on how hard it it to figure out if you are doing good work as tester when everyone thinks they know testing, but you can clearly see the results gap showing their behaviors and results don't match their idea of knowing testing so well.

I spoke about being the odd one out in TestBash in 2015. I have grown up as tester with testing teams, learning from my peers, and only when I became the only one with 18 developers as my colleagues, I started to realize that I had team members and lovely colleagues, but not a single person who could direct my work. I could talk to my team members to find out what I did not need to do, but never what I need to do. My work turned to finding some of what others may have missed. I think I would have struggled without having had those past colleagues who had grown me into what I had become. And I had become someone who reported (and got fixed) 2261 bugs over 2,5 years before I stopped reporting bugs and started pairing on fixes to shift the lessons to where they matter the most.

For a better future, I hope two things:

Listen and make space more. Look for that true pairing where both contribute, and sit with the silent moments. Resist the temptation of showing off. You never know what you would learn.
Appreciate that testing is not one thing, and it is very likely that the testing you do as developer is not the same as what someone could add on what you did. Be a better colleague and discover the results gap, resisting the refining the newer person to a different center of focus.

I searched my peers and support in random people of the internet. I blogged and I paired like no other time when I was the odd one out. And random people of the internet showed up for me. They made me better, and they held me on growth path.

If I can be your random people of the internet to hold you on growth path, let me know. I'm ready to pay back what people's kindness has given me.

Anyone can test. Anyone can sing. But when you are paid to test as the core of your work week, coverage and results can be different to what they would be if you needed to squeeze it in.

How many test cases do we still have to execute?

2023-10-11T21:07:00.003+03:00

In a one on one conversation to understand where we are with our release and testing efforts, a manager asked me the infamous question that always makes me raise my eyebrows:

How many test cases do we still have to execute?

This question was irrelevant 20 years ago when it send me fuming and explaining, and it is still irrelevant today. But today it made me think about how I want to change the frame of the conversation, and how I manage to now almost routinely do it.

I have two framings:

The 'transform for intent while hearing the question' framing
The 'teach better ways of asking so that you get the info when not talking to me' framing

It would be lovely if I was big enough person to always go with the first framing. I know from years of having answered this question that the real question is one about schedule and how testing supports it. What they really ask me is twofold: a) How much work do we need to do to have testing done? b) Can we scale it to make it faster? This time the answer in the moment I gave was in the first framing. I explained that I would think since September 1st, we've made it to about 40% of what we need in time and effort. I illustrated actions we take as a team to ensure fixing which is the real problem supports testing, and how the critical knowledge of the domain specialist tester is made more available by team members pitching in to other work domain specialist could opt in. I reminded that my estimate of having to find and log 150 bugs based on a day of early sampling and years of profiling projects and that we are at 72 reported now, which indicates about half way through.

I recognised the question was really not about test cases. We sort of don't have them for the kind of testing the manager asked on. Or if I want to claim we have them for compliance purposes, they are high level ones and there are 21 of them, and only one of them is "passed".

The second framing is one that I choose occasionally, and today I chose it in hindsight. If I had my statistics in place and memory, I could have also said something akin to this:

We do two kinds of testing. Programmatic tests do probably 10% of our testing work and we have 1699 of them. They are executed 20...50 times per working day, and are green (passing) on merge.

These 1699 test cases miss out things the 21 high level test cases allow us to identify. And with those 21 high level test cases, we are about 40% into the work that is needed.

There are real risks to quality we have accrued in recent times, particularly on controlling our scope (even if they say scope does not creep but understanding grows, we have had a lot coming in as must fix that are improvements and grow scope for entire team), reliability and performance. Assessing and managing these risks is the work that remains.

Other people may have other ways of answering this question, have you paid attention what are your framings?

A NoJira Health Check

2023-08-01T23:01:00.001+03:00

One of my favorite transformation moves these days is around the habits we have gathered around Jira, and I call the move I pull NoJira experiment. We are in one, and have been for a while. What is fascinating though is that there is a lifecycle of it that I am observing, live, and wanted to report some observations on.

October 5th, 2022 we had a conversation on NoJira experiment. We had scheduled a training on Jira for the whole team a few days later, that with the team deciding on this one then promptly cancelled. The observation leading to the experiment was that I, then this team's tester, knew well what we worked on and what was about to be deployed with merging of pull requests, but the Jira tasks were never up to date. I found myself cleaning them up to reflect status and the number of clicks to move tasks through steps was driving me out of my mind.

Negotiating a change this major, and against the stream, I booked a conversation with product ownership, manager of the team and director responsible for the product area. I listened to concerns and wrote down what the group thought.

Uncontrolled, uncoordinated work

Slower progress, confusion with priorities and requirements

Business case value get forgotten

We have not been using Jira since. Not quite a year for this project, but approaching.

First things got better. Now they are kind of messy and not better. So what changed?

First of all, I stopped being the teams' tester and became the team's manager, allocating another tester. And I learned I have ideas of "common sense" that is not so common without having good conversations to change the well-trained ideas of "just assign me my ticket to work on".

The first idea that I did not know how to emphasize is that communication happens in 1:1 conversations. I have a learned habit of calling people to demo a problem and stretching to fix within the call. For quality topics, I held together a network of addressing these things, and writing a small note on "errors on console" was mostly enough for people to know who would fix it without assigning it further. It's not that I wrote bug reports on a confluence page instead of Jira. I did not just report the bugs, I collaborated on getting them fixed by knowing who to talk to and figuring out how we could talk less in big coordinating meetings.

The second idea that I did not know to emphasize enough is that we work with zero bugs policy and we stop the line when we collect a few bugs. So we finish the change including the unfinished work (my preferred term to bugs) and we rely on fix and forget - with test automation being improved while fixing identified problems.

The third idea I missed in elaborating is that information prioritisation is as important as discovery. If you find 10 things to fix, even with a zero bug policy, you have a gazillion heuristics of what conversations to have first over throwing a pile of information suffocating your poor developer colleagues. It matters what you get fixed and how much waste that creates.

The fourth idea I missed is that product owners expectations of scope need to be anchored. If we don't have a baseline of what was in scope and what not, we are up for disappointments on how little we got done compared to what wishes may be. We cannot neglect making the abstract backlog item on product owners external communications list into a more concretely scoped listing.

The fifth idea to conclude my listing in reflection is that you have to understand PULL. Pull is a principle, and if it is a principle you never really worked on, it is kind of hard. Tester can pull fixes towards a better quality. Before we pull new work, we must finish current work. I underestimated the amount of built habit in thinking pull means taking tasks someone else built for your imagined silo, over together figuring out how we move the work effectively through our system.

For first six months, we were doing ok on these ideas. But it is clear that doubling the size of the team without good and solid approach of rebuilding or even communicating the culture the past practice and success of it is built on does not provide the best of results. And I might be more comfortable applying my "common sense" when the results don't look right to me in the testing space than those coming after me.

The original concerns are still not true - the work is coordinated, and progresses. It is just that nothing is done and now we have some bugs to fix that we buried in confluence page we started to use like Jira - as assigning / conversing / reprosteps listing, over sticking with the ideas of truly working together, not just side by side.

Public Speaking Closure

2023-07-29T14:05:00.002+03:00

Very long time ago, like three decades ago, I applied for a job I wanted in the student union. Most of the work in that space is volunteer-based and not paid, but there are few key positions that hold the organizations together, and the position I aspired to have was one of those.

As part of the process of applying, I had to tell about myself on a large lecture hall, with some tens of people in the audience. I felt physically sick, remember nothing of what come out of my mouth, had my legs so shaky I had hard time standing and probably managed to just about say my name from that stage. It became obvious if I had not given it credit before: my fear of public speaking was going to be a blocker on things I might want to do.

I did not get the job, nor do I know whether it was how awful my public speaking was or if it was my absolute lack of ability to tell jokes intentionally or some of many other good reasons, but I can still remember how I felt on that stage on that day. I can also remember the spite that drove me on taking actions since then and over time gave sense of purpose for many actions to be educated and practice the skills related to public speaking.

Somewhere along the line, I got over the fear and had done talks and trainings in the hundreds. Then a EuroSTAR program chair of the year decided to reflect on why he chose no women to keynote quoting "No women of merit", and I found spite-driven development to become a woman of merit to keynote. I did my second EuroSTAR keynote this June.

Over the years of speaking, I learned that the motivations or personal benefits to a speaker getting on those stages are as diverse as the speakers. I was first motivated by a personal development need, getting over a fear that would impact my future. Then I was motivated by learning from people who would come and discuss topics I was speaking about, as I still suffer from a particular brand of social anxiety on small talk and opening conversations with strangers. But I collected status, I lost a lot of that value with people thinking they needed something really special to start a conversation with me. I travelled the world experiencing panic attacks and felt sometimes the loneliest in big crowds and conferences, with all those extroverts without my limitations. In recent years, I find I have been speaking because I know my lessons from doing this - not consulting on this - are relevant and speaking gives me a reflection mechanism of how things are changing as I learn. However, it has been a while since getting on that stage has felt like a worthwhile investment.

Recently, I have been speaking on habit and policy. I don't submit talks on call for proposals, and I have used that policy to illustrate that not having women speakers is conference organizers choice as many people like myself will say yes to an invitation. The wasteful practice of preparing talks when we really should be collaborating is something I feel strongly on, still today. The money from speaking from stages isn't relevant, the money from trainings is. In last three years with Vaisala, I have had all the teams in the whole company to consult at will and availability, and I have not even wanted to make the time for all of the other organizations even though I still have a side business permission. I just love the work I have there where I have effectively already had four different positions within one due to the flexibility they built for me. Being away to travel to a conference feels a lot more like a stretch and significant investment that is at odds with other things I want to do.

The final straw to change my policy I got from EuroSTAR feedback. In a group of my peers giving anonymous feedback, someone chose to give me feedback beyond the talk I delivered. I am ok with feedback that my talk today was shit. But the feedback did not stop there. It also made a point that all of my talks are shit. And that the feedback giver could not understand why people would let me on a stage. That has hateful and we call people like this trolls when they hide behind anonymity. However, this troll is a colleague, a fellow test conference goer.

Reflecting my boundaries and my aspirations, I decided: I retire from public speaking, delivering the talks I have already committed to but changing my default to invitations to No. I have done 9 No responses since the resolution, and I expect to be asked now less since I have announced unavailability.

It frees time for other things. And it tells you all that would have wanted to learn from me that you need to reign in the trolls sitting in you and amongst you. I sit in my power of choice, and quit with 527 sessions, adding only paid trainings on my list of delivered things for now.

I'm very lucky and privileged to be able to choose this as speaking has never been my work. It was always something I did because I felt there are people who would want to learn from what I had to offer from a stage. Now I have more time for 1:1 conversations where we both learn.

I will be doing more collaborative learning and benchmarking. More writing. Coding an app and maybe starting a business on it. Writing whenever I feel like it to give access to my experiences. Hanging out with people who are not trolls to me, working to have less trolls by removing structures that enable them. Swimming and dancing. Something Selenium related in the community. The list is long, and it's not my loss to not get on those stages - it's a loss for some people in the audiences. I am already a woman of merit, and there's plenty more of us to fill the keynote stages for me to be proud of.

The Documentation Conundrum

2023-05-18T13:53:00.003+03:00

Writing things down is not hard. My piles of notebooks are a testament to that - I write things down. The act of writing is important to me. It is how I recall things. It is how I turn something abstract like an idea into something I can see in my minds eye on a physical location. It frees my capacity from the idea of having to remember. Yet it is a complete illusion.

I don't read most of the stuff I write down. I rarely need to go back to read what I wrote. And if I read what I wrote, it probably would make less sense than I thought in the moment, and would incur a huge cost in reading effort.

Yet, the world of software is riddled with the idea of writing things down and expecting that people would read them. We put new hires through the wringer of throwing them at hundreds of pages of partially outdated text and expect this early investment into bad documentation to save us from having to explain the same things as new people join. Yet the reality is that most of the things we wrote down, we would be better off deleting.

We think that writing once means reading in scale, and it might be true for a blog post. To write once in a form that is worth reading in scale either takes a lot of effort from the writer or happens to touch something that is serendipitously useful. Technical documentation should not be serendipitously useful, but it should be worth reading, in the future that is not here yet. It should help much more than hinder. It should be concise and to the point.

Over the course of a year, I have been on an experiment on writing down acceptance criteria. I set the experiment with a few thoughts:

product owner should not write the acceptance criteria, they should review and accept the acceptance criteria - writing is more powerful clarification tool than reading, and we need most power to clearing mistakes where they end up in code
acceptance criteria is output of having developed and delivered - we start writing them as we have conversations, but they are ready when the feature is ready and I will hold together discipline of writing down the best knowledge as output
question format for accepting / rejecting feels powerful and is also something that annoys both the people above me in org charts who would choose "shall" requirements format and a product owner believed it was important to change format, thus we will
acceptance criteria exist on epic level that matches a feature, smallest thing we found worth delivering and mentioning - it's bigger than books recommend but what is possible today drives what we do today

So for a year, I was doing the moves. I tried coaching another tester into writing acceptance criteria, someone who was used to getting requirements ready and they escaped back to projects where they weren't expected to pay attention to discovering agreements towards acceptance but it was someone else's job. I tried getting developers to write some, but came to the notion of collecting from conversations being a less painful route. I learned that my best effort in writing acceptance criteria before starting feature was 80% of the acceptance criteria I would have discovered by being done with a feature, fairly consistently. And I noted that I felt very much like I was the only one, through my testing activities, who hit uncertainties of what our criteria had been and what it needed to be to deliver well. I used the epics as anchors for testing of new value, and left behind 139 green ticks.

Part of my practice was to also collect 'The NO list' of acceptance criteria that someone expected to be true that we left out of scope. With the practice, I learned that what was out of scope was more relevant to clarify, and would come back as questions much more than what was in scope. 18 of the items on 'The NO list' ended up being incrementally addressed, leaving 40 still as valid as ever on time of my one-year check.

For a whole year, no one cared for these lists. Yesterday, a new tester-in-training asked for documentation on scope and ended up first surprised that this documentation existed as apparently I was the only one fully aware of it it the team. They also appeared a little concerned on the epics incremental nature and possibility that it was not all valid anymore, so I took a few hours to put a year of them on one page.

The acceptance criteria and 'the NO list' created a document the tool estimates takes 12 minutes to read. I read them all, to note they were all still valid, 139 green ticks. 'The NO list' items, 58 of them had 31% to remove as we had since added those in scope.

The upcoming weeks and conversations show if the year's work on my part to be disciplined for this experiment is useful to anyone else. As artifact, it is not relevant to me - I remember every single thing now having written it down and using it for a year. But 12 minutes reading time could be worth the investment even for a reader.

On the other hand, I have my second project with 5000 traditional test cases written down, estimating 11 days of reading if I ever wanted to get through them, just to read once.

We need some serious revisiting on how we invest in our writing and reading, and some extra honesty and critical thinking in audiences we write to.

You yourself are a worthwhile audience too.

Stop Expecting Rules

2023-04-26T22:33:00.002+03:00

Imagine having a web application with a logout functionality. The logout button is on the top LEFT corner. That's a problem, even if your requirements did not state that users would look for logout functionality on the top RIGHT corner. This is a good example of requirements (being able to logout) and implicit requirements (we expect to find it on the right). There are a lot of things like this where we don't say all the rules out loud.

At work, we implemented a new feature on sftp. When sftp changes, I would have tested ftp. And it turns out that whoever was testing that did not have a *test case* that said to test ftp. Ftp was broken in a fairly visible way that was not really about changing sftp, but about recompiling C++ code to make a latent bug of 2009 now visible. Now we fixed ftp, and while whoever was testing that now had a test case saying to testing sftp and ftp as a pair, I was again unhappy. With the size of the change to ftp, I would not waste my time testing sftp. Instead, I had a list of exactly 22 places where we had the sonar tool identify problems exactly as this used-to-be-latent one, and I would have sampled some of those in components we had changed recently.

Search of really simple rules fails you. If you combine two things for your decision, the number of parameters is still small.

In the latter case, the rules are applied for a purpose of optimising opportunity cost. To understand how I make my decisions - essentially exploratory testing - would require balancing the cost of waiting another day for the release, the risks I perceive in the changes going with the release, the functional and structural connections in a few different layers. The structural connection in this case had both size and type of the change driving my decisions on how we would optimize opportunity cost.

I find myself often explaining people that there are no rules. In this case, it would hurt timelines for maybe a few hours to tests ftp even when those few hours would be better used elsewhere. The concern is not so much on the two hours wasted, but on not considering the options of how that two hours could be invested - on faster delivery or on better choice of testing that supports our end goal of having an acceptable delivery on the third try.

A similar vagueness of rules exist with greeting people. When two people meet, who says hello first? There are so many rules, and rules on what rules apply, that trying to model it all would be next to impossible. We can probably say that in a Finnish military context, the rank plays to the rules of who starts and punishment of not considering the rules during rookie season is teaching you rule following. Yet the amount of interpretations we can make on other's intentions when passing someone at the office without them (or us) saying hello - kind of interesting sometimes.

We're navigating a complex world. Stop expecting rules. And when you don't expect rules, your test cases you always run will make much less sense than you think they do.

On Test Cases

2023-04-15T23:00:00.003+03:00

25 years ago, I was a new tester working with localisation testing for Microsoft Office products. For reasons beyond my comprehension, my 1st ever employer in IT industry had won localisation project for four languages and Greek was one of them. The combination of my Greek language hobby and my Computer Science studies turned me into a tester.

The customer provided us test cases, essentially describing tours of functionalities that we needed to discover. I would have never known how certain functionalities of Access and Excel work without those test cases, and was extremely grateful for the step by step instructions on how to test.

Well, I was happy with them until I got burned. Microsoft had a testing QA routine where a more senior tester would take exactly the same test cases that I had, not follow the instructions but be inspired by the instructions and tell me how miserably I had failed at testing. The early feedback for new hires did the trick, and I haven't trusted test cases as actual step-by-step instructions since. Frankly, I think we would have been better off if we described those as feature manuals over test cases, I might have gotten the idea that those were just a platform a little sooner.

In years to come, I have created tons of all kinds of documentation, and I have been grateful for some of it. I have learned that instead of creating separate test case documentation, I can contribute to user manuals and benefit the end users - and still use those as inspiration and reminder of what there is in the system. Documentation can also be a distraction and reading it an effort away from the work that would provide us result, and there is always a balance.

When I started learning more on exploratory testing, I learned that a core group figuring that stuff out had decided to try to make space between test cases (and the steps those entail) by moving to use word charter, which as a word communicates the idea that it exists for a while, can repeat over time but isn't designed to be repeated, as the quest for information may require us to frame the journey in new ways.

10 years ago, I joined a product development organization, where the manager believed no one would enjoy testing and that test cases existed to keep us doing the work we hate. He pushed very clearly me to write test cases, and I very adamantly refused. He hoped for me to write those and tick at least ten each day to show I was working, and maybe if I couldn't do all the work alone, the developers could occasionally use the same test cases and do some of this work. I made a pact of writing down session notes for a week, tagging test ideas, notes, bugs, question and the sort. I used 80% of my work time on writing and I wrote a lot. With the window to how I thought and the results in the 20% time I had for testing, I got my message through. In the upcoming 2,5 years I would find 2261 bugs I would report that also got fixed, until I learned that pair fixing with developers was a way of not having to have those bugs created in the first place.

Today, I have a fairly solid approach to testing, grounded on exploratory testing approach. I collect claims, both implicit and explicit and probably have some sort of a listing of those claims. You would think of this as a feature list, and optimally it's not in test cases but in user facing documentation that helps us make sense of the story of the software we have been building. To keep things in check, I have programmatic tests, where I have written some, many have been written because I have shown ways systems fail, and they are now around to keep the things we have written down in check with changes.

I would sometimes take those tests, and make changes on data, running order, or timing to explore things that I can explore building on assets we have - only to throw most of those away after. Sometimes I would just use the APIs and GUIs and just think in various dimensions, to identify things inspired by application and change as my external imagination. I would explore alone without and with automation, but also with people. Customer support folks. Business representatives. Other testers and developers. Even customers. And since I would not have test cases I would be following, I would be finding surprising new information as I grow with with the product.

Test cases are step by step instructions on how to miss bugs. The sooner we embrace it, the sooner we can start thinking about what really helps our teams collaborate over time and changes of personnel. I can tell you: it is not test cases, especially in their non-automated format.

Tale of Two Teams

2023-04-15T14:36:00.000+03:00

Exploratory testing and creating organizational systems encouraging agency and learning in testing has been a center of my curiosity in testing space for a very long time. I embrace title of exploratory tester extraordinaire assigned to me by someone whose software I broke in like an hour and half. I run exploratory testing academy. And I research - empirically, in project settings at work - exploratory testing.

Over the years people have told me time and time again that *all testing is exploratory*. But looking at what I have at work, it is very evident this is not true.

Not all projects and teams - and particularly managers - encourage exploratory testing.

To encourage exploratory testing, your focus of guidance would be on managing performance of testing, not the artifact of testing. In the artifacts you would seek structures that invest more in artifacts when you know the most (after learning) and encourage very different artifacts early on to support that learning. The reason that the most known examples of managing exploratory testing focus on new style of artifacts: charters and session notes over test cases. The same idea of testing as performance (exploratory testing) and testing as artifact creation (traditional testing) repeats in both manual-centering and automation-centering conversations.

Right now I have two teams, one for each subsystems of the product I am managing. Since I have been managing development teams only since March, I did not build the teams, I inherited them. And my work is not to fix the testing in them, but to enable great software development in them. That is, I am not a test manager, I am engineering manager.

The engineering culture of the teams is very essentially different. Team 1 is what I would dub 'Modern Agile', with emergent architecture and design, no jira, pairing and ensembling. Team 2 is what I would dub as 'Industrial Agile', with extreme focus on Jira tasks and visibility, and separation of roles, focusing on definition of ready and definition of done.

Results from the whole team level is also very essentially different - both in quantity and quality. Team 1 has output increasing is scope, and team 2 struggles to deliver anything with quality in place. Some of the differences can be explained by the fact that team 1 works on new technology delivery and team 2 is on a legacy system.

Looking at the testing of the two teams, the value system in place is very different.

Team 1 lives with my dominant style of Contemporary Exploratory Testing. It has invested in baselining quality into thousands of programmatic tests run in hundreds of pipelines daily. Definition of test pass is binary green (or not). The programmatic tests running is 10% of effort, maintenance and growing of them is a relevant percentage more but in addition we spend time exploring with and without automation, documenting new insights again in programmatic tests on lowest possible levels. Team 1 had first me as testing specialist, then decided on no testers but due to unforeseeable circumstances have again a tester in training participating in the team work.

Team 2 lives in testing I don't think will ever work - Traditional Testing. They write plans, test cases and execute same test cases in manual fashion over and over again. When they apply exploratory testing it means they vanish from regular work for a few days, do something they don't understand or explain to learn a bit more on a product area, but they always return back to test cases after the exploratory testing task. Team 2 testing finds little to none of bugs, but gets releases returned as they miss bugs. With feedback of missing something, they add yet another test case to their lists. They have 5000 test cases, and run a set of 12 for releases, and by executing the same 12 minimise their chances of being useful.

It is clear I want a transformation from Traditional Testing to Contemporary Exploratory testing, or at least to Exploratory testing. And my next puzzle at hand is on how to do that transformation I have done *many* times over the years as the lead tester as a development manager.

At this point I am trying to figure out how to successfully explain the difference. But to solve this, I have a few experiments in mind:

Emphasize time on product with metrics. Spending your days in writing of documentation is time away from testing. I don't need all that documentation. Figure out how to spend time on the application, to learn it, to understand it, and to know when it does not work.
Ensemble testing. We'll learn how you look at an application in context of learning to use it with all the stakeholders by doing it together. I know how, and we'll figure out how.

Memoirs of Bad Testing

2023-03-22T22:31:00.004+02:00

I'm in a peculiar position right now. I am a manager of a manager who manages test manager managing themselves and a tester. Confused already - I am too.

The dynamic is clear through. I get to let go the tester for not thinking their work isn't worth paying for. And that is kind of what I am thinking. But it is not that straightforward. All testers, gather around, this needs a lot more than just throwing it at your face like this.

What Good Testing Looks Like

Having been a tester for such a long time and still being a tester on side of my other dark-side managerial duties. I know what good testing looks like and I would buy some in a heartbeat. Some examples of that testing I vouch for are embodied in people like Ru Cindrea (and pretty much anyone at Altom I have seen test, including past people from Altom like Alessandra Casapu, pure testing brilliance), Elizabeth Zagroba, Joep Schuurkes and James Thomas. And my idol of all times, Fiona Charles, who I attribute to a special blend of excellence in orchestrating testing of this kind.

If there is a phrase that good testing often comes to me with, it is the words *exploratory testing*. Not the exploratory testing that you put into a single task as a reason to say that you need extra time to think, but the kind that engulfs all your testing, makes you hyper focus on results and aware of cost of premature documentation as well as opportunity cost, and positively pushes you to think about risks and results as an unsplittable pair where there is not one without the other.

These kinds of testers have product research mindset, they are always on the lookout of how they invest their limited time the best in the context at hand and they regularly surprise their teams with information we did not have available regardless of how much automation we have already managed to accrue.

Seniority much?

While it may sound like I am describing very senior people, I have seen this in 1st year testers, and I have seen lack of this in 10th year testers. It is not about level of experience, it is about level of learning, and attitude to what it means to do testing.

What does it look like when the work may not be worth paying for?

While writing documentation and notes and checklist is good, writing test cases is bad. Generally speaking of course. So my first smell towards recognising I might come to a case of testing not worth paying for is test case obsession.

Test Case Obsession is counting test cases. Defining goals of test automation in perspective of manual test cases automated. Not understanding that good test automation means decomposing the same testing problems in a different way, not mimicking your movements manually.

Its listing test cases and using the very same test cases framing them into "regression suite", "minor functionality suite" and "major functionality suite" and avoiding thinking about why we test today - what is the change and risk - and hoping that following a predetermined pattern would be a catch all important with regards to results.

If this is what you are asked to do, you would do it. But if you are asked to follow the risk and adjust your ideas of what you test, and you still refuse, you are dangerously on the territory of testing not worth paying for. When that is enhanced with missing simple basic things that are breaking because you prioritise safety and comfort of repeating plans over finding the problems with lack of information your organization hired you to work on, you are squarely on my testing not worth paying for turf.

The Fixing Manoeuvres

I've done this so many times over the last 25 years that it almost scares me.

First you move the testing not worth paying for into a team of its own and if you can't stop doing that work, you send it somewhere where it gets even thinner on results by outsourcing it. And in 6-12 months, you stop having that team. In the process you have often replaced 10 testers with 1 tester and better results.

Manager (1) - Manager (2) - Manager (3) - Tester

Being the Manager to Testers, I can do the Fixing Manoeuvres. But add more layers, and you have a whole different dynamic.

First of all, being the manager (1), I can ask for Testing Worth Paying For (tm) while manager (2) is requiring testing not worth paying for, and manager (3) and tester being great and competent folks find themselves squeezed from two sides.

Do what your manager says and their manager will remove you.

Do what your managers manager say and your manager will nag you because they have been taught that test cases are testing, and know of nothing better.

Long story, and Morale

We have an epidemic of good managers with bad ideas about how to manage testing and this is why not all testing is not exploratory testing. Bad ideas remove a lot of the good from testing. Great testers do good testing even in presence of bad ideas, but it is so much harder.

We need to educate good managers with bad ideas and I have not done my part as I find myself in a Manager - Manager - Manager - Tester chain. I will try again.

Firing good people you could grow is the last resort while also a common manouvre. Expecting good testing (with mistakes and learning) should be what we grow people to.

The Browser Automation Newbie Experience

2023-02-17T23:30:00.004+02:00

Last week I called for Selenium newbies to learn to set up and run tests on their own computer, and create material I could edit a video out of. I have no idea if I ever manage to get to editing the video, but having now had a few sessions, writing about it seemed like a good use of an evening.

There are two essentially different things I mean by a newbie experience:

Newbies to it all - testing, programming, browser automation. I love teaching this group and observing how quickly people learn in context of doing.
Newbies to browser automation library. I love teaching this group too. Usually more about testing (and finding bugs with automation), and observing what comes in the way of taking a new tool into use.

I teach both groups in either pair or ensemble setting. In pairs, it's about *their* computer we pair on. In ensemble, it about one person's computer in the group we pair on.

The very first test experience is slightly different on the two and comprises three parts:

From empty project to tools installed
From tools installed to first test
Assessing the testing this enables

I have expressed a lot of love for Playwright (to the extent of calling myself a playwright girl) and with actions of using it in projects at work. But I have also expressed a lot of love for Selenium (to the extent that I volunteer for the Selenium Project Leadership Committee) and with encouraging using it in projects at work. I'm a firm believer that *our* world is better without competition and war, rather admiration and understanding and sustainability.

So with this said, let's compare Playwright and Selenium *Python* on each of these three.

From Empty Project To Tools Installed - Experience Over Time

Starting with listing dependencies in requirements.txt, I get the two projects set up side by side.

Playwright	Selenium
pytest playwright pytest-playwright	pytest selenium

I run pip install -r requirements.txt and ... I have some more to do for playwright: running playwright install from command line, and watching it download superset browsers and node dependencies.

If everything goes well, this is it. At time of updating the dependencies, I will repeat the same steps, and I find the playwright error message telling me to update in plain English quite helpful.

If we tend to lock our dependencies, need of update actions are on our control. On many of the projects, I find failing forward to the very latest to be a good option, and I take the unusual surprise of update over the pain of not having updated almost any day.

From Tools Installed to First Test

At this point, we have done the visible tool installation and we only see if everything is fine if we use them tools and write a test. Today we're sampling the same basic test on todo mvc app, the js vanilla flavour.

We have the Playwright-python code

With test run on chromium, headed:

== 1 passed in 2.71s ==

And we have the Selenium-python code

With test run on chrome Version 110.0.5481.100 (Official Build) (arm64)

== 1 passed in 6.48s ==

What I probably should stress in explaining this is that while 3 seconds is less than 7 seconds, more relevant for overall time is the design we go to from here on what we test while on the same browser instance. Also, I have run tests for both libraries on my machine earlier today which is kind of relevant for the time consideration.

That is, with Playwright we installed the superset browsers with the command line. With Selenium, we did not because we are using the real browsers and the browser versions on the computer we run the tests on. For that to work some of you may know that it requires a browser-specific driver. The most recent Selenium versions have built-in SeleniumDriver that is - if all things are as intended - is invisible.

I must mention that this is not what I got from reading Selenium documentation, and since I am sure this vanishes soon as in gets fixed, I am documenting my experience with a screenshot.

I know it says "No extra configuration needed" but since I had -- outside path - an older ChromeDriver, I ended up with an error message that made no sense with the info I had and reading the blog post, downloaded a command line tool I could not figure out how I would use. Coming to this with Playwright, I was keen to expected command line tool instead of automated built-in dependency maintenance feature and the blog post just built on the assumption.

Assessing the Testing This Enables

All too often we look at the lines of code, the commands to run, and the minutiae of maintaining the tool, but at least on this level, it's not like the difference is that huge. With python, both libraries on my scenarios rely on existence of pytest and much of the asserting stuff can be done the same. Playwright has been recently (ehh, over few years, time flies I have no clue anymore) been introducing more convenience methods for web style expects like expect(locator()).to_have_text() and web style locators like get_by_placeholder(), yet these feel like syntactic sugar to me. And sugar feels good but sometimes causes trouble for some people in the long run. Happy to admit I have both been one of those people, and observed those people. Growing in tech understanding isn't easier for newbie programmers when they get comfortable with avoiding layered learning for that tech understanding.

If it's not time or lines, what it it then?

People who have used Selenium are proficient with Selenium syntax and struggle with the domain vocabulary change. I am currently struggling the other way around with syntax, and with Selenium the ample sample sets of old versions of APIs that even GitHub copilot seems to suggest wrong syntax from isn't helping me. Longer history means many versions of learning what the API should look like showing in the samples.

That too is time and learning, and a temporary struggle. So not that either.

The real consideration can be summed with *Selenium automates real browsers*. The things your customers use your software on. Selenium runs on the webdriver protocol that is a standard the browser vendors use to allow programmatic control. There's other ways of controlling browsers too, but using remote debugging protocols (like CFP for Chromium) isn't running the full real browser. Packaging superset browsers isn't working with the real versions the users are using.

Does that matter? I have been thinking that for most of my use cases, not. Cross-browser testing the layers that sit on top of frontend framework or on top of a platform (like salesforce, odoo or Wordpress), there is only so much I want to test the framework or the platform. For projects creating those frameworks and platforms, please run on real browsers or I will have to as you are risking bugs my customers will suffer from.

And this, my friend, this is where experience over time matters.

We need multiple browser vendors so that one of them does not use vendor lock in to do something evil(tm). We need the choice of priorities different browser vendors make as our reason to choose one or another. We need to do something now and change our mind when the world changes. We want our options open.

Monoculture is bad for social media (twitter) with change. Monoculture is bad for browsers with change. And monoculture is bad for browser automation. Monoculture is bad for financing open source communities. We need variety.

It's browser automation, not test automation. We can do so much more than test with this stuff. And if it wasn't obvious from what I said earlier in this post, I think 18-year old selenium has many many good years ahead of it.

Finally, the tool matters less than what you do with it. Do good testing. With all of these tools, we can do much better on the resultful testing.