Friday, January 6, 2023

Contemporary Regression Testing

One of the first things I remember learning about testing is the repeating nature of it. Test results are like milk and stay fresh only a limited time, so we keep replenishing our tests. We write code and it stays and does the same (even if wrong) thing until it is changed, but testing repeats. It's not just the code changing that breaks systems, it's also the dependencies and platform changing, people and expectations changing.

An illustration of kawaii box of milk from my time of practicing sketchnoting

There's corrective, adaptive, perfective and preventive maintenance. There's the project and then there's "maintenance". And maintenance is 80% of a product's lifecycle costs, since maintenance starts the first time you put the system in production.

  • Corrective maintenance is when we had problems and need to fix them.
  • Adaptive maintenance is when we will have problems if we allow the world around us to change, and we really can't stop it. We emphasize that everything was FINE before the law changed, the new operating system emerged, or that 3rd party vendor figured out they had a security bug we have to react to because of a dependency we have.
  • Perfective maintenance is when we add new features while maintaining, because customers learn what they really need when they use systems. 
  • Preventive maintenance is when we foresee adaptive maintenance and change our structures so that we wouldn't always be needing to adapt individually. 

It's all change, and in a lot of cases it matters that only the first one is about defects, implying work you complete without invoicing for it.

The thing about change is that it is small development work, and large testing work. This can be true considering the traditional expectations of projects:

  1. Code, components and architecture are spaghetti
  2. Systems are designed, delivered and updated as integrated end-to-end tested monoliths
  3. Infrastructure and dependencies are not version controlled
With all this, the *repeating nature* becomes central, and we have devised terminology for it. There is re-testing (verifying a fix indeed fixed the problem) and regression testing (verifying that things that used to work still work), and we have made these central concepts in how we discuss testing.

For some people, it feels like regression testing is all the testing they think of. When this is true, it almost makes sense to talk about doing this manually or automated. After all, we are only talking of the part of testing that we are replenishing results for.

Looking at the traditional expectations, we arrive at two ways to think about regression testing. One takes a literal interpretation of "used to work", as in we clicked through exactly this and it worked, and I would call this test-case based regression testing. The other takes a liberal interpretation of "used to work", remembering that with risk-based testing we never looked at it all working, but some of it worked even when we did not test it. Continuing with the risk-based perspective, the new changes drive entirely new tests. I would call this exploratory regression testing. This discrepancy of thinking is a source of a lot of conversation in the automation space, because the latter would need to actively choose which tests to leave behind as output that we consider worthwhile repeating - and it is absolutely not all the tests we currently are leaving behind.

So far, we have talked about traditional expectations. What is the contemporary expectation then?

The things we believe are true of projects are sometimes changing:
  1. Code is clean, components are microservices and architecture is clearly domain-driven, so that tech and business concepts meet
  2. Systems are designed, delivered and updated incrementally, and also on a per-service basis
  3. Infrastructure and dependencies are code
This leads to thinking many things are different. Things mostly break only when we break them with a change. We can see changes. We can review the change as code. We can test the change from a working baseline, instead of a ball of change spaghetti described in vague promises of tickets. 

Contemporary regression testing can more easily rely on exploratory regression testing with improved change control. Risk-based thinking helps us uncover really surprising side effects of our changes without using major efforts. But also, contemporary exploratory testing relies on teams doing programmatic test-case based regression testing whenever it is hard for developers to hold their past intent in their heads. Which is a lot, with people changing and us needing safety nets. 

Where with traditional regression testing we could choose one or the other, with contemporary regression testing we can't.   

Monday, January 2, 2023

The Three Cultures

Over the last 25 years, I have been dropped into a lot of projects and organizations. While I gave up on consulting early on and deemed it unsuited for my aspirations, I have been a tester with an entrepreneurial attitude - a consultant / mentor / coach even within the team I deliver as part of.

Being dropped into a lot of projects and organizations, I have come to accept that two are rarely the same. Sometimes the drop feels like time travel to the past. Rarely it feels like time travel to the future. I find myself often brought in to help with some sort of trouble, or if there was no trouble, I can surely create some, like with a past employer where we experimented with no product owner. There was trouble, we just did not recognise it without breaking away from some of our strongly held assumptions.

I have come to categorize the culture, the essential belief systems around testing, into three stages:

  1. Manual testing is the label I use for organizations predominantly stuck in test case creation. They may even automate some of those test cases, usually with the idea of speeding up regression testing, but the majority of what they do relies on the idea that testing is predominantly without automation, for various reasons. Exploratory testing is something done on top of everything else.
  2. Automated testing is the label I use for organizations predominantly stuck in separating manual and automated testing. Automated testing is protected from manual testing (because it includes so much of its own kind of manual testing), and the groups doing automation are usually specialists in the test automation space. The core of automated testing is user interfaces and mostly integrated systems, something a user would use. Exploratory testing is something for the manual testers.
  3. Programmatic tests is the label I use for whole team test efforts that center automation as a way of capturing developer intent, user intent and past intent. Exploratory testing is what drives the understanding of intent. 
The way we talk, and our foundational beliefs in these three different cultures just don't align. 

These cultures don't map just to testing, but the overall ideas of how we organize for software development. For the first, we test because we can't trust. For the middle, we test because we are supposed to. For the last, we test because not testing threatens value and developer happiness. 

Just like testing shifts, other things shift too. The kind of problems we solve. The power of business decisions. Testing (in the large) as part of business decisions. The labels we use for our processes in explaining those to the world. 

This weekend I watched an old talk from Agile India, by Fred George on 'Programmer Anarchy'. I would not be comfortable taking things to anarchy, but there is a definite shift in where the decision power is held, with everyone caring for business success in programmer-centric ways of working.

The gaps are where we need essentially new cultures and beliefs accepted. Working right now with the rightmost cultural gap, the ideas of emergent design are harder to achieve than programmed tests. 

Documentation is an output and should be created at times we know the best. Programmed tests are a great way of doing living documentation that, used responsibly, gives us a green on our past intent in the scope we care to document it. 

Friday, December 30, 2022

Baselining 2022

I had a busy year at work. While working mostly with one product (two teams), I was also figuring out how to work with other teams without either taking too much to become a bottleneck or to take so little I would be of no use. 

I got to experience my biggest test environment to date in one of the projects - a weather radar.

I got to work with a team that was replacing and renewing, plus willing and able to move from test automation (where tests are still isolated and centered around a tester) to programmatic tests, where tests are a whole team asset indistinguishable from other code. We ended up this year finalising the first customer version and adding 275 integrated tests and 991 isolated tests to best support our work - and getting them to run green in blocking pipelines. The release process throughput time went from 7 hours from last commit to release to 19 minutes from last commit to release, and the fixing stage went from 27 days to 4 days.

I became the owner of the testing process for a lot of projects, and equally frustrated with the idea of so many owners of slices that coordination work is more than the value work.

I volunteered with Tivia (ICT in Finland) as a board member for the whole year, and joined Selenium Open Source Project Leadership Committee early autumn. Formal community roles are a stretch, definitely. 

I got my fair share of positive reinforcement on doing good things for individuals' salaries and career progression, learning testing and organising software development in smart ways some might consider modern agile.

I spoke at conferences and delivered trainings, total 39 sessions out of which 5 were keynotes. I added a new country on my list of appearances, totalling now at 28 countries I have done talks in. I showed up at 7 podcasts as guest. 

I thought I did not write much into my blog, and yet this post is the 42nd one of this year on this blog, and I have one #TalksTurnedArticles on and one article in IT Insider. My blog now has 840 222 hits, which is 56 655 more than a year ago.

I celebrated my Silver Jubilee (25 years of ICT career) and started out with a group mentoring experiment of #TestingDozen. 

I spent monthly time reflecting and benchmarking with Joep Schuurkes, did regular on-demand reflections with Irja Strauss, ensemble programmed tests regularly with Alex Schladebeck and Elizabeth Zagroba, and met so many serendipitous acquaintances from social media that I can't even count them.

I said goodbye to twitter, started public note taking at mastodon and irregularly regular posting on LinkedIn. I am there (here) to learn together. 

Wednesday, December 28, 2022

A No Jira Experiment

If there is something I look back to from this year, it is doing changes that are impossible. I've been going against the dominant structures, making small changes and enabling a team where continuous change, always for the better, is taking root.

Saying it has taken root would be overpromising. Because the journey is only in the beginning. But we have done something good. We have moved to more frequent releases. We have established programmatic tests (over test automation), and we have a nice feedback cycle that captures insights of things we miss into those tests. The journey has not been easy.

When I wanted to drop scrum as the dominant agile framework of how management around us wanted things planned early this year, the conversations took some weeks up until a point of inviting trust. I made sure we were worthy of that trust, collecting metrics and helping the team hit targets. I spent time making sure the progress was visible, and that there was progress. 

Yet, I felt constrained. 

In early October this year, I scheduled a meeting to drive through yet another change I knew was right for the team. I made notes of what was said.

This would result in:

  • uncontrolled, uncoordinated work
  • slower progress, confusion with priorities and requirements
  • business case and business value forgotten 

Again inviting that trust, I got to do something unusual for the context at hand: I got to stop using Jira for planning and tracking of work. The first temporary yes was four weeks, and those were four of our clearest, most value delivering weeks. What started off as temporary, turned to *for now*. 

Calling it No Jira Experiment is a bit of an overpromise. Our product owner uses Jira on epic level tickets and creates this light illusion of visibility to our work in that. Unlike before, now with just a few things he is moving around, the statuses are correctly represented. The epics have documented acceptance criteria by the time they are accepted. Documentation is the output, not the input.

While there are no tickets for the details, the high level is better.

At the same time, we have just completed our most complicated, everyone-involved feature. We've created light lists of unfinished work, and are just about to delete all the evidence of the 50+ things we needed to list. Because our future is better when we see the documentation we wanted to leave behind, not the documentation that presents what emerged while the work was being discovered. The commit messages were more meaningful, representing the work now done, not the work planned some weeks before.

It was not uncontrolled or uncoordinated. It was faster, with clarity of priorities and requirements. And we did not forget business case or business value. 

It is far from perfect, or even good enough for the longer term, and we have a lot to work on. But it is so much better than following a ticket-centric process, faking that we know the shape of the work, and forcing an ineffective shape because that seems to be expected and asked for.

Tuesday, December 6, 2022

There is such a thing as testing that is not exploratory

The team had been on a core practice of clarifying with tests for a while, and they invited an outsider to join their usual meeting routine.

Looking around, there were 8 people in a call. The one in charge of calling the meeting shared their screen, for what was about to be their routine of test design sessions. He copied the user story they had been assigned to work on into the Jira ticket he had open, and called the group for ideas of tests.

People started mentioning things that could be tried, and the one in charge wrote things down as the rest of the group watched. For a high level idea, he called for the detailed steps. Every idea that came about was written down and detailed. 

After the meeting, the group would split the work to automate it all. 

And two weeks later, they had all these tests passing, and a feature that just did not work.

The magic glue they miss is what I call exploratory testing. The sense of caring for results in testing by focusing on learning, and recognising that the time when most people create test cases like above is the time when they know the least.

You can add exploratory testing on top of this.

You can lighten up the details you want to write to start with to leave more room for exploratory testing, where output of your testing is the documentation. 

You can seek to optimize for learning, being aware of the time used.

The team that follows the planning pattern did not do exploratory testing. You could argue the next team using their results did exploratory testing through trying to use it, to tell them it is failing. 

Working with teams like this is real. It is embarrassing but it is real. And we don't change it by playing with words, but by making the results we expect clear and permissive to excellence. 

This story, unfortunately, was brought to you by the "tell no stories that did not happen" and "only stories from last two years allowed" rules. This is real. This is what people still make of testing, and some folks meekly follow the agreed practice.  

Monday, December 5, 2022

Jokes and Poetry from a Stochastic Parrot

Tell a joke about a programmer and a tester attempting to enter Heaven and having to confess their sins.

Two programmers were on their way to Heaven and they came to a fork in the road. One path led to Heaven, the other to Hell. The programmer who had been a tester in life said, "I'm not taking any chances. I'm going to test both paths to see which one is safe." The programmer who had been a developer in life said, "You fool! We don't have time for that. We need to take one path and trust that it leads to Heaven." The tester replied, "Trust, but verify."


A programmer and a tester were standing at the Gates of Heaven, waiting to confess their sins. The programmer went first and said, "I wrote code that I knew was buggy and released it anyway." The tester said, "I found the bugs in that code and didn't report them." The Gates of Heaven opened, and the programmer was welcomed inside. The tester was shocked and asked, "Why was he allowed in and not me?" The gatekeeper replied, "We already have plenty of bugs in Heaven. What we need are more testers."


A programmer and a tester were arguing about whether it was better to be a programmer or a tester. The programmer said, "Programmers are the ones who create the software that makes the world go round. Without us, there would be no progress or innovation." The tester replied, "That may be true, but without testers, all of your buggy code would never be caught and fixed. Testers are the ones who ensure that the software works properly and is of high quality." The programmer thought for a moment and then said, "I suppose you have a point. I guess it's a good thing that programmers and testers need each other to make the world a better place."

Make it more funny.

A programmer and a tester were on their way to Heaven, and they came across a sign that said "Heaven: 10,000 miles." The programmer looked at the sign and said, "I don't believe it. They've made a mistake. Heaven can't be that far away." The tester said, "Of course it is. They've measured the distance with a ruler, and it's accurate to within plus or minus an inch." The programmer rolled his eyes and said, "That's why I always use double-precision floating-point numbers. They're much more precise." The tester replied, "Yes, but they can't measure the distance to Heaven. Only faith can do that."

Ok, just one more.  

A programmer and a tester were standing in line at the Gates of Heaven, waiting to confess their sins. The programmer went first and said, "I wrote code that was full of bugs and never tested it." The tester said, "I tested the code and found the bugs, but I never reported them." The Gates of Heaven opened, and the programmer was allowed inside. The tester was shocked and asked, "Why was he allowed in and not me?" The gatekeeper replied, "We have plenty of bugs in Heaven. What we need are more testers to help us find and fix them."

You already told that, I want a new one.

I'm sorry, I don't have access to new information at this time. I am a large language model trained by OpenAI, and my knowledge is limited to what I was trained on. I don't have the ability to browse the internet or access new information, so I can only provide responses based on what I already know. Is there something else I can help you with?

One more joke about testers and programmers, please. 

A programmer and a tester were at a restaurant, and the waiter brought them a menu. The programmer looked at the menu and said, "I don't see any bugs on this menu. This must be a high-quality restaurant." The tester replied, "That's what you think. I've already found three bugs on the menu: the prices are incorrect, the descriptions are misleading, and the options are limited. I'm not sure I want to eat here." The programmer laughed and said, "I guess that's why we need both programmers and testers. Without us, the world would be full of bugs and errors."

 Give me a poem about exploratory testing.  

Thank you, OpenAI ChatGPT.  The tools we use are changing the world in ways we don't yet understand. Whose jokes and poems are these, or are these indeed "AI originals"? 

On a more serious note, you should see what it does with Advent of Code day 5 on day 5. A separate leaderboard for AI and humans does not quite seem like the solution.

Friday, November 18, 2022

1-2-3 model to test coverage

This afternoon I jumped on a call with a colleague from the community at large. This one had sent me a LinkedIn message asking to talk about test coverage, and our previous correspondence was limited. And like I sometimes do, I said yes to a discussion. After the call, I am grateful. For realizing there is a 1-2-3 model of how I explain test coverage, but also for the conversation channel that helps me steer to understanding, starting from wherever the other person is.

The 1-2-3 model suggests there is one true measure of test coverage. Since that is unattainable, we have two we commonly use as starting point. And since the two are so bad, we need to remember three more to be able to explain further to people who may not understand the dimensions of testing. 

The One

There is really one true measure of coverage, and it is that of risk/results coverage. Imagine a list of all relevant and currently true information about the product that we should have a conversation on, listed on a paper - that is what you are seeking to cover. The trouble is, the paper when given to you is empty. There is no good way of creating a listing of all the relevant risks and results. But we should be having a conversation on this coverage, and here is how.

If you are lucky and work in a team where developers truly test and care for quality, the level of coverage in this perspective is around the middle line in the illustration below. That is a level of quality information produced by a Good Team (tm). The measure determining if we indeed are with a Good Team (tm) is sending someone Really Good at testing after them. That Really Good could be a tester, but I find that most testers find themselves out of jobs with good teams - the challenge level is that much higher. Or that Really Good could be all your users combined over time, with an unfortunate delay in feedback and higher risk of the feedback being lost in translation. 

I call the difference between the output for a Good Team (tm) and the quality where our stakeholders are really happy, even delighted the primary Results Gap. There are plenty of organizations who are not seeking to do anything with this results gap themselves but leave it to their users. That is possible, since the nature of the problems people find within the primary results gap is a surprise. 

I recognise if I am working with a team in this space by being surprised with problems. Sometimes I even exclaim: "this bug is so interesting that no one could have created this on purpose!". Consider yourself lucky if you get to work with a team like this that remains this way over time. After all, location on this map is dynamic with regards to consistently doing good work across different kinds of changes.

There is a secondary Results Gap too. Sometimes the level which teams of developers get to is Less than Good Team's Output. We usually see this level with organizations where managers hire testers to do testing, even when they place the tester in the same team. Testing is too important to be left just for testers, and should be shared variably between different team members. Sometimes working as a tester in these teams feels like your job is to point out that there are pizza boxes in the middle of the living room floor and remind that we should pick them up. Personally, when I recognise the secondary results gap, I find the best solution is to take away the tester and reorganize quality responsibilities among the remaining developers. The job of a tester in a team like this is to move the team to the primary results gap, not to deal with the pizza boxes, except temporarily as protection of the reputation of the organization.

A long explanation on the one true measure of coverage - risks/results. Everything else is an approximation subject to this. It helps to understand if we are operating with a team on the secondary results gap or with a team on the primary results gap, and the lower we start, the less likely we are ever to get to address all of the gap. 

The Two

The two measures of coverage we commonly use and thus everyone needs to understand are code coverage and requirements/spec coverage. These are both test coverage, but very different by their nature. 

Code coverage can only give us information of what is in the code and whether the tests we have touch it. If we have functionality we promised to implement, that users expect to be included but we are missing out on, that perspective will not emerge with code coverage. Code coverage focuses on the chances of seeing what is there. 

Cem Kaner has an older article of 101 different criteria in the space of code coverage, so let's remember it is not one thing. There are many ways we can look at the code and discuss having seen it in action. Touching each line is one, taking every direction at every crossroad is one, and paying attention to the complex criteria of the crossroads is one. Tools are only capable of the simpler ways of assessing code coverage. 

Seeing a high percentage does not mean "well tested". It means "well touched". Whether we looked at the right things, and verified the right details is another question. Driving up code coverage does not usually mean good testing. Whereas being code coverage aware, not wanting code coverage to go down from where it has been even when adding new functionality, and taking time for thoughtful testing based on code coverage seem to support good teams in being good. 
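To make "well touched" versus "well tested" concrete, here is a minimal sketch in Python. The function `apply_discount` and both checks are hypothetical, made up for illustration only. Both checks execute every line and both branches, so a coverage tool would report the same number for each, yet only one of them can notice the bug.

```python
def apply_discount(price, percent):
    """Hypothetical function under test, with a deliberate bug."""
    if percent > 0:
        # Bug: subtracts the percentage points instead of a percentage of the price.
        return price - percent
    return price


def touching_check():
    """Runs both branches but compares nothing: full coverage, no verification."""
    apply_discount(200, 10)
    apply_discount(200, 0)
    return True


def verifying_check():
    """Runs the same calls and compares the results to the intended values."""
    return apply_discount(200, 10) == 180 and apply_discount(200, 0) == 200


print("touching check passes:", touching_check())
print("verifying check passes:", verifying_check())
```

The touching check stays green while the verifying check fails, even though their code coverage is identical.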

Requirement/spec coverage is about covering claims in authoritative documents. Sometimes requirements need to be rewritten as claims, sometimes we go about spending time with each claim we find, and sometimes we diligently link each requirement to one or more tests, but some form of this tends to exist. 

With requirements/spec coverage, we need to be aware that there are things the spec won't say and we still need to test for. We can never believe any material alone is authoritative, because testing is also about discovering omissions. Omissions can be code that the spec promises, or details the spec fails at promising but users and customers would consider particularly problematic.

Having one test for a claim is rarely sufficient. There is no set number of tests we need for each claim. So I prefer thinking in none / one / enough. Enough is about risk/results. And it changes from project to project, and requires us to be aware of what we are testing to do a good job testing. 

The Three

By this time, you may be a little exasperated with the One and the Two, and there is still the Three. These three are dimensions of coverage I find I need to explain again and again to help address the risks. 

Environment coverage starts with the idea that users' environments are different and testing in one may not represent them all. We could talk for hours on what makes environments essentially different, but for purposes of coverage, take my word for it: sometimes they are and sometimes they are not essentially different. So for 10 functionalities to cover with one test for each functionality, if we have three environments, we could have 30 tests to run.

An easy example is browsers. Firefox on Linux is separate from Firefox on Mac and Firefox on Windows. Safari on Mac or Edge on Windows is only available there. Chrome is available on Mac, Windows and Linux. That small listing alone gives us 8 environments. The amount of testing - should we want to do it regularly - could easily explode. We may address this with various strategies, from having different people on different environments, to changing environments in a round-robin fashion, to cross-browser automation. Whether we care to depends on risks, and risks depend on the nature of the thing we are building.

Data coverage starts with the idea that each functionality processing data is covered with one piece of data, but that may be far from sufficient. Working with embedded devices over the last three years, I find it surprising how often covering such a simple thing as positive and negative temperature is necessary with the registry manipulation technologies. For this coverage, we would heavily rely on sampling, and it is part of every requirement test, making it flexible to consider what percentage we are getting. Well, at least enough to note that percentages are generally useless measures in the space of coverage.
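As an illustration of why one piece of data is not enough, here is a minimal sketch in Python. Everything in it is an assumption for the sake of the example: a hypothetical 16-bit register holding temperature in tenths of a degree as two's complement, and two made-up decoders. It is not taken from any real device.

```python
def decode_unsigned(raw):
    """Hypothetical buggy decoder: treats the 16-bit register value as unsigned."""
    return raw / 10


def decode_signed(raw):
    """Hypothetical correct decoder: interprets the 16 bits as two's complement."""
    if raw >= 0x8000:
        raw -= 0x10000
    return raw / 10


# One sample per data class: (raw register value, expected degrees).
SAMPLES = {
    "positive": (0x00FA, 25.0),
    "zero": (0x0000, 0.0),
    "negative": (0xFF9C, -10.0),
}


def failing_samples(decoder):
    """Names of the data classes where the decoder gives the wrong temperature."""
    return [name for name, (raw, expected) in SAMPLES.items()
            if decoder(raw) != expected]


print(failing_samples(decode_unsigned))
print(failing_samples(decode_signed))
```

A single positive sample would pass with both decoders. Only the negative sample tells them apart, which is the point of sampling across data classes rather than counting tests.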

Parafunctional coverage is a reminder of dimensions other than positive outputs. Security is about having functionality that does something that can be used in the wrong hands for bad. Performance is about considerations of being fast and resource effective, particularly now in the era of green code considerations. Reliability is about running the same things over a longer period of time. And so on.

Plus One

Today's call concluded with us then discussing automation coverage. Usually what we end up putting in our automation is a subset of all the things we do, a subset we want to keep on repeating. Great automation isn't created from listing test cases and implementing them. For good automation, we tend to decompose the feedback needed differently, so that the sum of the whole is similar.

Automation coverage is the ratio of what we have automated to something we care for. Some people care for documented test cases but I don't. If and when I care about this, I talk about automation coverage in terms of plans of growing it, and I avoid the conversation a lot.

In one project we did test automation coverage by assigning zero, one, enough values for requirements by tagging all automation with requirements identifiers. A lot of work, some good communications included on planning for what more we need (first), but the percentage was very much the same as I could estimate off the cuff.
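A minimal sketch of that kind of tagging in Python follows. It is not the project's actual tooling: the requirement identifiers, the test names, and the idea of recording "enough" as a per-requirement number agreed by the team are all assumptions for illustration.

```python
from collections import Counter


def coverage_levels(requirements, test_tags, enough):
    """Classify each requirement as none / one / enough.

    test_tags maps a test name to the requirement identifiers it is tagged with.
    enough maps a requirement to the number of tests the team judged sufficient.
    """
    counts = Counter(tag for tags in test_tags.values() for tag in tags)
    levels = {}
    for req in requirements:
        if counts[req] == 0:
            levels[req] = "none"
        elif counts[req] >= enough[req]:
            levels[req] = "enough"
        else:
            levels[req] = "one"
    return levels


# Hypothetical data for illustration only.
requirements = ["REQ-1", "REQ-2", "REQ-3"]
test_tags = {
    "test_login": ["REQ-1"],
    "test_login_locked_account": ["REQ-1"],
    "test_export": ["REQ-2"],
}
enough = {"REQ-1": 2, "REQ-2": 3, "REQ-3": 1}

levels = coverage_levels(requirements, test_tags, enough)
for req in requirements:
    print(req, levels[req])
```

The judgment sits in the `enough` numbers, which come from risk conversations and not from the code. The script only keeps the bookkeeping honest.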

You may not have the half an hour it took for us to discuss 1-2-3 on the call, but knowing how to ground conversations of coverage is an invaluable skill. If you spend time with testing, you are likely to get as many chances of practicing this conversation as I have had by now.