Thursday, June 27, 2024

Well, did AI improve writing test plans?

Over 27 years of testing, I've read my fair share of test plans, and contributed quite a chunk myself. I've created templates to support people in writing them slightly better, and I've provided examples. And I have confessed that since it's the performance of planning, not the writing of the plan, that matters, some of the formats that support me in doing my best work in planning, like the one-page master test plan, have produced some of the worst plans I have seen from others. 

Test plans aren't written, they are socialized. Part of socializing is owning the plan, and another part is bringing everyone along for the ride while we live by the plan. 

Some people need a list of the risks we are concerned about. I love the form of planning where we list risks, and map them to testing activities that mitigate those risks. And I love making documents so short that people may actually read them. 

In these few years of LLMs being around, I have obviously had to both try generating test plans and review plans others have brought to me, where an LLM has been more or less helpful.

Overall, we are bad at writing plans. We are bad at making plans. We are really bad at updating plans when things change. And they always change. 

I wanted to make a note on whether AI has changed this already.

  • People who were not great at writing plans before are still not great, just in a different way. When confronted with feedback, they now have Pinocchio to blame for it. Someone else did it - the LLM. If the plan matches the average of the world, how can I make the point of not being happy with it? And I can be even less happy with the responsibility avoidance. 
  • People who need to write plans that were not really even needed except for the process are now more efficient in producing the documentation. If they did not know to track TFIRPUSS (quality perspectives) in their plans, at least they are not missing that when they ask the tool. The difference still comes from the performance of continuous planning and organizing for the actions, rather than the act of writing the plan. 
  • Detailed test ideas, for those with the fewest ideas of their own, are already better in per-feature plans. Those who were not so great are doing slightly better with generated ideas, and those who were great become greater because they compete with Pinocchio. 
I am worried my future - like my past - is in reviewing subpar plans, but now with externalized responsibility. Maintaining ownership matters. Pinocchio is not a real boy, we're just at the start of the story. 


Monday, June 24, 2024

Testing gives me 3x4

You know how you sometimes try to explain testing, and you do some of your better work when explaining it as a response to something that is clearly not the thing. This happened to me, but it has been long enough since, and what time gives me is integration of ideas. So this post is about showing you the three lists that now help me explain testing.

Speaking of testing to developers, I go for the list of what developer-style test automation, especially done test-driven style, gives me. It gives me a specification that describes my intent as a developer. It gives me feedback on whether I am making progress toward my specification. It also gives me a concrete artifact to grow as I am learning, an intersection of specification and feedback. Since I can keep the tests around, their executable nature gives me protection against regression, helping me track my past intent. And when my tests fail, they give me granularity on the exact omission that is causing the failure. At least if I did those well. 
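To make those four words concrete for myself, here is a minimal sketch in Python with pytest. The function and test names are made up for illustration, not from any real codebase: the tests are the specification of my intent, running them is the feedback, keeping them around is the regression protection, and a single failing assertion is the granularity.

    # A minimal sketch of spec-feedback-regression-granularity, using a
    # hypothetical split_name() helper; names are illustrative only.

    def split_name(full_name: str) -> tuple[str, str]:
        """Split 'First Last' into (first, last); last is empty if absent."""
        first, _, last = full_name.partition(" ")
        return first, last


    def test_split_name_specifies_my_intent():
        # Specification: this is what I intend the function to do.
        assert split_name("Ada Lovelace") == ("Ada", "Lovelace")


    def test_split_name_single_word_has_empty_last_name():
        # Feedback while building, regression protection when kept around,
        # and granularity: a failure here names the exact behavior that broke.
        assert split_name("Ada") == ("Ada", "")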



The list helps me recall why I would want to make space for test-driven development, or for capturing past intent in unit / api level approval tests of behaviors someone else had intent for, even if I am not using energy to track back to those. 

When I look at this list, I think back to a Kevlin Henney classic reminder: 

"A majority of production failures (77%) can be reproduced by a unit test."  
The quote reminds me of two things: 
  1. While the majority can be reproduced, we get to these numbers in hindsight, looking at the escapes we did not catch with unit tests before production.
  2. The need for hindsight and the remaining significant portion indicate there is more. 
What we're missing is usually framed not in feedback, but in discovery. I tend to call that discovery exploratory testing; some others tend to call it learning. If I wanted a list for this style of testing, one that frames testing as a performance with the product and its components as external imagination, I again have four things it gives me. It gives me guidance, like a vector of possible directions, but one that still requires applying judgment. It gives me understanding, seeing how things connect and what that brings in as stakeholder experience. User, customer, support, troubleshooting, developer... All relevant stakeholders. It gives me models of what we have and how it's put together, what is easy and what is hard to get right. And it gives me serendipity, lucky accidents that make me exclaim: "I had no idea these things would not work in combination". 



The two lists are still as helpful to me as they were when I created them, but a third list has emerged for me since, from my work with contemporary exploratory testing. I find that the ideas of spec-feedback-regression-granularity miss out on things we could do with artifact creation in the frame of exploratory testing, so I have been cultivating and repeating a third list. 

In the frame of contemporary exploratory testing, programmatic tests (test automation) give me more. They give me documenting of the insights as I am learning them, and they give me that spec-feedback-regression-granularity loop. It's one of the models, one that I can keep around for further benefit. They give me extending of reach, be it testing things in scope I have already automated again and again, or taking me to repetition I would not do without programmatic tests for new unique scenarios, or even getting me to a starting place from which I can explore new routes, making space for that serendipity. They give me alerting to attend when a test fails, telling me that someone forgot to tell me something so that I can go and learn, or that there's a change that breaks something we care for. That keeps my mind free for deeper, wider, more, trusting the tingly red line over a past green. And finally, they give me guiding to detail. I know so much more about how things are implemented, how elements are loaded, when they emerge, and what prerequisites I could easily skip if I didn't have the magnifying element programmatic tests bring in. 
Not all of my tests have to stay around in pipelines. Especially when extending reach, I am likely to create programmatic tests that I discard. Ones that collect data for a full day, seeking a particular trend.
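As a sketch of what such a throwaway, reach-extending test might look like - assuming a hypothetical /health endpoint and the requests library - something like the script below polls for a day, logs response times to a CSV to spot a trend, and then gets deleted rather than added to a pipeline.

    # A throwaway reach-extending sketch: poll a service for a day and log
    # response times to look for a trend. The endpoint is hypothetical;
    # meant to be run once, read, and discarded - not kept in a pipeline.
    import csv
    import time
    from datetime import datetime, timedelta

    import requests

    URL = "https://example.test/health"  # hypothetical endpoint
    END = datetime.now() + timedelta(hours=24)

    with open("response_times.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "status", "elapsed_seconds"])
        while datetime.now() < END:
            try:
                response = requests.get(URL, timeout=10)
                writer.writerow([datetime.now().isoformat(),
                                 response.status_code,
                                 response.elapsed.total_seconds()])
            except requests.RequestException as error:
                writer.writerow([datetime.now().isoformat(), "error", str(error)])
            f.flush()
            time.sleep(60)  # one sample per minute for a full day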

Where I previously explained testing with two lists of four, I now need three lists of four. And like my lists were created as a response, you may choose to create your own lists as a response. Whatever helps you - or in this case me - explain that the idea that you don't need manual testing at all relies on framing the creation of your automation as a contemporary exploratory testing process. Writing those programmatic tests is still, even in the genAI era, an awfully manual process of learning how to frame our intent. 

Friday, June 21, 2024

Memory Lane 2005: ISTQB Foundation Syllabus, Principles of Testing

Some people create great visuals. One that I appreciated today, by Islem Srih, turned the 7 principles of testing into a path around the number 7. As my eyes were tracking through the image, a sense of familiarity hit me. This is the list of 7 I curated for the ISTQB Foundation Syllabus 1st edition in 2005. The one I still joke about holding copyright to, since I never signed off my rights. ISTQB did ask, I did not respond. 

Also, it is a list that points to other people's work. Back then, as a researcher, I was collating, not reimagining. The things I consider worth my time have changed since, and I would not contribute to ISTQB syllabi anymore.

Taking the trip down memory lane, I had to look at what I started with to build the 7 principles. I recognize that editions since have changed labels - the thing I called *pesticide paradox* to honor Boris Beizer's work is now "tests wear out". I'm pretty sure I would have credited Dijkstra for the absence-not-presence principle, as he is the originator of that idea, and I find it impossible to believe I would have penciled in [Kaner 2011] for something that is very obviously [Kaner 2001]. It is incorrect in the latest published syllabus. Back then I knew who said what, and had nothing to say myself, as I imagined it was all clear already. Little did I know... 

The 7 principles that made the cut were ones where we did not have much of an argument. 



I have the 2005 originals, which I was no longer able to find online easily. I can confirm I did not mess up the references, because I did not put in the references. Now that I look at this, the agreement was not to put in references unless they were recommended reading to complete the syllabus.
I also went back to look at the principles I started working from to get to these 7. The older set of principles is from ISEB Foundation. 

Comparing the two

ISTQB:

  • Presence not absence
  • Exhaustive is impossible
  • Early testing
  • Defects cluster
  • Pesticide paradox
  • Context-driven
  • Absence of errors fallacy
ISEB

  • Exhaustive testing is usually impractical
  • Testing is risk-based and risk must be used to allocate time
  • Removal of faults potentially improves reliability
  • Testing is measurement of quality
  • Requirements may determine the testing performed
  • Difficult to know how much is enough
It was clear we did not agree on the latter. While these may be principles that are much better than what X normally sees, as per one of the comments on the post that inspired me to take this trip down memory lane, I am no longer convinced they are truly the principles worth sharing. But they do make the rounds. I am also not convinced the additions and explanations have improved them, at least in their generalized nature. 

In hindsight, I can see what I did there. I grounded the syllabus a bit more in research. Some of which I don't think we've properly done to this day (early testing, for example - the jury is still out; clustering of defects feels more like folklore than a principle; the pesticide paradox is more of an encouragement to explore than tests actually becoming ineffective).

Perhaps revisiting what is truly true would be good now. Now that I no longer think my greatest contribution to testing is knowing what everyone else says and not saying anything myself. 

Two Years with Selenium Project Leadership Committee

Today I am learning that I am not great at - even uncomfortable with - using the voice of an organization. And for two years, I have held a small part of the voice of an organization, being a member of the Selenium Project Leadership Committee. I have been holding space for and facilitating volunteer organizations for over three decades, and just from my choice of words you can note that I don't say I have been leading or running or managing, even though those would probably be fair words for other people to assign when watching the work I do.

Volunteer organizations are special. It is far from obvious that they stay around and active for two decades, like the Selenium project has. It is far from obvious that they survive changes of leadership, yet for an organization to outlive people's attention span, that is necessary. It is also far from obvious that they manage to navigate the complex landscape of individual and corporate users, their hopes and expectations, and other collaborators solving similar problems, and create just enough but not too much governance to have that voice of an organization. 

If I were comfortable using the voice of an organization, I would not be nervous speaking at Selenium Conference 2024 today with Diego Molina and Puja Jagani on the 'State of the Union' keynote. Similarly, I might find it in me to write a post on the official Selenium Blog, since I even have admin access to Selenium repos on GitHub. But it's just so much easier to avoid all the collaboration and write in my own voice, avoiding the voice of an organization that I always hold in the back of my mind - or my heart. 

At Selenium Conference India in 2018 I was invited to keynote, and I chose to speak about the Intersection of Automation and Exploratory Testing. Back then I did not know I would end up combining the paths under the flag of Contemporary Exploratory Testing, and I most definitely did not know that the organizers inviting me to keynote would bring me to an intersection where my path aligned with the Selenium project. 

In 2018 I keynoted. A little later, I volunteered with the program committee reviewing proposals, again invited by the people already in. And in 2022, I ended up on the Selenium Project Leadership Committee (PLC), again on an invite. 20 years of Selenium now in 2024 marks 2 years in the PLC for me. 

Today I am wondering what I have to show for it. And like I always do when I start that line of thinking, I write it down. With my voice. 


Making sense of what the PLC (project leadership committee) is in relation to the TLC (technical leadership committee) wasn't easy to figure out. There's the code, packaging that into releases, and the related documentation and messaging, and all of that is led by the TLC. If there is no software, there is no software project. Software that does not change is dead, and in two years I have reinforced the understanding that Selenium is not dead, or dying. It's alive, well, and taking its next steps in the fairly complex world of cross-browser, standards-based agreements, implementing something that everyone uses, not just the project's own WebDriver implementations. The Selenium project is a pioneer and collaborator in building up the modern web. 

The first thing I have to show for two years of Selenium PLC is understanding what Selenium is. And that is many things. It is the multiple components of this entire ecosystem. It's working on standards to unite browsers. It's a community of people who find this interesting enough to do as a hobby, volunteering, unpaid. It's where we grow and learn together. It's taking pride in inspiring other solutions in the browser automation space, whether they are built on top of Selenium or make other choices, while real browsers continue on their path of differentiation and can still be automated with a unified WebDriver library. 

What have I been doing in the project then, other than making sense of it? 

  • Realizing there is a lot of email people want to send to "Selenium". There are a lot of requests for privacy and security assessments that organizations using Selenium think they can just ask "the company" for - where there is no company. There are a lot of proposals for collaboration, especially in marketing. And there are donations / sponsoring - because even a volunteer project needs money to pay for its tools and services. 
  • Fighting against becoming the "project assistant". To get people together, someone needs to schedule it. Calendars as self-service aren't a thing everywhere in the busy tech world. After fighting enough personal demons, I did end up putting up a calendar invite and taking up the habit of posting on Slack after every conversation to share what it was about. It seems that bit is something I have enough routine to run with. 
  • Picking and choosing what/how I want to contribute: I wanted to move from face-to-face conferences and online broadcasts to online conversations, and set up the Selenium Open Space Conference for it. I loved the versatility of sessions, and particularly the one person who showed up with his own open source project and walked different people through it in various sessions. I also loved learning about how Pallavi Sharma ended up publishing three books on Selenium, and what goes on in the background of authoring books. 
  • Picking and choosing what/how I want to contribute: setting up a micro-sponsorship model and taking the first steps towards full financial transparency. I will need another two years to complete the full financial transparency, but that is something I believe in heavily. I want it to be normal to say that the Selenium project holds $500k at Software Freedom Conservancy. You can dig that info out of their annual reports. But I also want to say that, in addition to what is held there, it would all be visible on Open Collective. We have raised $1,684.49 since we started off with the Selenium community money, and there are many steps to take before the host of that money is Selenium/BrowserAutomation Inc, with every transaction transparently visible. 
  • Fiscal hosting ended up being something I specialize in. We have five people in the PLC, two of whom do double duty with the TLC, and every one of us holds a paying job with relevant levels of responsibility in different companies. Within the two years we also had two more people in the PLC, Bill McGee and Corina Pip, brilliant folks for whom we now hold space to become free from other engagements and rejoin our efforts, after some fiscal hosting / governance stuff is better sorted. Knowing where you are is the start of sorting it out. 
  • Setting up Selenium on Mastodon. We could really use someone taking care of Selenium's social media overall, but I did create a placeholder for content, even if my content production in the voice of the organization is sporadic at best. 
  • Introducing a new class of sponsors, where companies sponsor by allocating people to contribute significant work time to the project and are recognized for it. Not that I did this alone, but working in a user organization rather than a vendor organization, I just happen to be able to hold space for something like this by positioning. 
  • Dealing with people. Some of that is overwhelming, but at the same time necessary. People have ideas, and sometimes the ideas need aligning. Mostly I listen, but I will also step in to mediate. And I have been known to do so. 
If there is a future I have an impact on for Selenium,  it will be: 
  1. Diverse Community Centric and focused on Collaboration. 
  2. Fiscally transparent and easy to access for taking stuff forward. 
  3. Worthwhile for users and contributors - individual and corporate. 
Some days - most days - I feel completely insufficient with the amount of time I can volunteer. But it adds up. And it will add up more. The scale at which it matters never ceases to amaze me. 2.5M unique monthly users. Real polyglot support with bindings for Python, Java, .NET, JavaScript, Ruby, PHP... 16M Python downloads. 98.5M Java downloads. 

It takes a village, and I am showing up with you all, inviting you along. 

Thursday, June 20, 2024

Good Enough Quality Is Taking a Dip

A colleague in the community was reporting on her experience teaching tech to a group of kids, expressing how tech sometimes makes it hard to love tech, and her post pushed me to think about it. This is an experience a lot of us have. 

We want to show a newbie group how to run with a lovely new tool, only to realize that to run the new tool, we first have to get around whatever troubles come our way. Some of the troubles come from the differences in our environments. Some from the fact that no one updates and maintains some computers. Some from the choices of features (ads in particular) that get in the way. It's not enough that the thing we're building works; the operating system underneath and even the antivirus running in the background are equally part of the user's experience. 

Those of us teaching in tech have found workarounds for environmental differences, my favorite being to work either on my computer through ensemble programming or, if I want to teach newbies for real, on their computers, making time to solve the real-life problems they will face because things are different. 

Quality, while good enough, is currently not great. And it has not been getting only better in recent years, even if some people speak of AI as being today the worst it will ever be. 

Looking at AI-enabled products, very few of them are really making things better. 

I was purchasing a lot of books from Amazon during my vacation time, and seeing the AI-generated summaries on top of the reviews wasn't helping me with my selections. I went for #booktok instead, as AI renders the first recommendation pretty much useless to me - I don't want a summary, I want to find someone like me. 

Trying out the new AI-powered search from Google wasn't producing better results either, and I am pretty sure they too know that the AI-powered one is doing worse than the classic one. 

Apple just announced - finally - introducing some AI-powered features with ChatGPT integration, looking like a placeholder to grow worthwhile features in, but other than that giving the appearance of adding little usefulness to the product. 

Everyone seeks uses of AI. Experiments with AI. I do too. I would still report that in the last two years of actively using AI tools, they have not added to my productivity, but they have sometimes added to my quality (I'm really good at not being happy with AI's results), and they have added to my fun and enjoyment. 

Current uses of AI are driving down the experience we have of good enough quality. To include a placeholder for AI to grow into - assuming it is only getting better - we make quite significant compromises in the value our products and services provide. The goal of AI placeholders is not to make things better today. It is to make space to make things better soon, and meanwhile we may take a dip in the experience of relevance for users. 

For years, I have come to note that truly seeking good quality is not common. Our words to explain the level we are targeting or achieving are hidden behind the fuzzy term "quality". Even bad quality is quality. Making things fun and entertaining is quality. 

Our real target may be usefulness and productivity. Being able to do things with tech we weren't able to do before, and being able to do things we really need to do. While we find our way there with GenAI, we're going to experience a dip in good enough quality to allow for large-scale experimentation on us.  

 

Thursday, June 6, 2024

GenAI Pair Testing

This week, I got to revisit my talk on Social Software Testing Approaches. The storyline is pretty much this: 

  • I was feeling lonely as the only tester amongst 20 developers at a job I had 10 years ago. 
  • I had a team of developers who tested but could only produce results expected from testing if they looked at my face sitting next to them, even if I said nothing. I learned about holding space. 
  • I wanted to learn to work better on the same tasks. Ensemble programming became my gateway experience to extensive pair testing beyond people I vibe with, and learning to do strong style pairing was transformative. 
  • People are intimidating, you feel vulnerable, but all of the learning that comes out is worth so much.
As I was going through the preparations for the talk, I realized something essential has changed since I last delivered it. We've been given an excuse to talk to a faceless, anonymous robot in the form of generative AI chatbots. The success factor I had described as essential to strong-style pairing (expressing intent!) was now the key differentiator in who got more out of the tooling that was widely available.

I created a single slide to start the conversation on something we had already been doing: pairing with the application, sometimes finding it hard not to humanize a thing that is a stochastic parrot, even if a continuously improving predictive text generator. Realizing that when pairing with genAI, I was always the navigator (brains) of the pair, and the tool was the driver (hands). Both of us need to attend to our responsibilities, but the principle "an idea from my head must go through someone else's hands" is a helpful framing.


I made a few notes on things I would at least need to remember to tell people about this particular style of pairing. 

External imagination. Like with testing and applications - we do better when we look at the application while we think about what more we would need to do with it - genAI acts as external imagination. We are better at criticizing something when we see it. With a testing mindset, we expect it to be average, less than perfect, and our job is to help it elevate. We are not seeking boring, long, clear sentences; we are seeking the core of the message. We search boundaries, arguing with the tool from different stances and assumptions. We recognize insufficiency and fix it with feedback, because the average of the existing base of texts is not *our* goal. We feel free to criticize, as it's not a person with feelings who might take offense when we deliver the message that the baby is ugly. We dare to ask things in ways we wouldn't dare to ask a colleague. We're comfortable wasting our own time, but uncomfortable taking up space in others' days. 

Confidentiality. We need to remember that it does not hold our secrets. Anything and everything we ask is telemetry, and we are not in control of what conclusions someone else will draw from our inputs. When we have things to say or share that we can't share outside our immediate circle, we really can't post all that to these genAI pairs. But for things that aren't confidential, it listens to more than your colleague would in making its summaries and conclusions. And there is the option of not using the cloud-based services but hosting a model of your own from Hugging Face, where whatever you say never leaves your own environment. Just be aware. 
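If hosting your own model sounds abstract, a minimal sketch with the Hugging Face transformers library looks roughly like this - the model name is only an example, and you'd pick whatever your hardware and policies allow; the point is that the prompt never leaves your machine.

    # A minimal sketch of pairing with a locally hosted model so prompts stay
    # on your own machine. Requires the transformers and torch packages; the
    # model name is just an example, not a recommendation.
    from transformers import pipeline

    generator = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")

    prompt = (
        "Act as my testing pair. List risks to explore when a login form "
        "adds 'remember me' functionality."
    )
    result = generator(prompt, max_new_tokens=200, do_sample=True)
    print(result[0]["generated_text"])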

Ethical compensations. Using tools like this changes things. The code generation models being trained on open source code change the essential baseline of attribution that enables many people to find the things that allow them the space to contribute to open source. These tools strip away the names of people and the sense of community around the solutions. I strongly believe we need to compensate. At my previous place we made three agreements on compensation: using our work time on open source; using our company money to support open source; and contributing to the body of research on the change these tools bring about. Another ethical dilemma to compensate for is the energy consumption of training models - we need to reuse, not recreate, as one round of training is said to take up the energy equivalent of a car trip to the moon and back. While calling for reuse, I am more inclined to build forward the community-based reuse models such as Hugging Face over centralizing information with large commercial bodies whose service conditions make promises about what they will do with our data. And being part of an underrepresented group in tech, there are most definitely compensations needed for the bias embedded in the data to create a world we could be happy with. 

Intent and productivity. Social software testing approaches have given me ample experience of using tools like this with other people, and of seeing them in action with genAI pairing. People who pair well with people pair better with the tools. People who express intent well and clearly in test-driven development get better code generated from these tools. The world may talk of prompt engineering, but it seems to be about expressing intent. Another note is on the reality of looking at productivity enhancements. People insist they are more productive, but a lot of the deeper investigations show there is creative bookkeeping involved. Like fixing a bug takes 10 minutes AFTER you know exactly which line to change, and your pipelines just work without attending to them. You just happen to spend a day finding out the line to change, and another caring for the pipeline after whatever random thing happened on it while it was on your watch. 

These tools don't help me write test cases, write test automation, or write bug reports. They help me write something of my intent and choosing, which is rarely a test case or automation; they help me summarize, understand, recall from notes, ideate actions and oracles, scope to today / tomorrow - this environment / the other - the dev / the tester, prioritize and filter, and inspire. They help me more when I think of testing activities not as planning - test case design - execution - reporting, but as intake - survey - setup - analysis - closure. Writing is rarely the problem. Avoiding writing, and discovering what information exists or doesn't, is more of the problem. Things written are a liability; someone needs to read them for writing them to be useful. 

Testing has always been about sampling, and it has always been risk-based. When we generate to support our microdecisions, we sample to control the quality of the outputs. And when things are too important to be left to non-experts, we learn to refer them to experts. If and when we get observability in place, we can explore patterns after the fact, to correct and adjust. 

I still worry about the long term impacts of not having to think, but I believe criticizing and thinking is now more important than ever. The natural tendency of agreeing and going with the flow has existed before too. 

And yes, I too would prefer AI to take care of housekeeping and laundry, and let me do the arts, crafts and creative writing. Meanwhile, artificial intelligence is assisting intelligence. We could use some assisting day to day.