Posts tagged ‘Approval Testing’

By Emily Bache

There’s a frank discussion going on in the software industry at the moment about the words we use and the history behind them. Perhaps now is a good time to reconsider some of our terminology. For example, I’ve noticed we have several terms that describe essentially the same kind of testing:

  • Golden Master
  • Snapshot
  • Characterization
  • Approval

I think it’s time to completely drop the first one of these. In addition, if we could all agree on just one term it could make communication easier. My preferred choice is ‘Approval Testing’. As an industry, as a community of software professionals, can we agree to change the words we use?

What kind of testing are we referring to?

The common mechanism for ‘Golden Master’, ‘Snapshot’, ‘Characterization’ and ‘Approval’ testing is that you run the software, gather the output and store it. The combination of (a) exactly how you set up and ran the software and (b) the stored output, forms the basis of a test case. 

When you subsequently run the software with the same setup, you again gather the output. You then compare it against the version you previously stored in the test case. Any difference fails the test.
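As a minimal sketch of that mechanism, assuming plain text output stored in files: the helper names `verify` and `approve`, the file naming, and the `greet` function standing in for "the software" are all hypothetical, and this is not the API of any of the tools mentioned in this post.

```python
import tempfile
from pathlib import Path


def verify(name: str, received: str, directory: Path) -> bool:
    """Compare freshly gathered output with the stored, approved output."""
    approved_file = directory / f"{name}.approved.txt"
    received_file = directory / f"{name}.received.txt"
    if approved_file.exists() and approved_file.read_text() == received:
        received_file.unlink(missing_ok=True)  # needs Python 3.8+
        return True
    # Missing or different: keep what we got so a person can inspect it.
    received_file.write_text(received)
    return False


def approve(name: str, directory: Path) -> None:
    """A person has judged the received output good enough: store it."""
    received_file = directory / f"{name}.received.txt"
    received_file.replace(directory / f"{name}.approved.txt")


def greet(who: str) -> str:
    """Hypothetical stand-in for the software under test."""
    return f"Hello, {who}!\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        print(verify("greeting", greet("world"), folder))  # nothing approved yet
        approve("greeting", folder)
        print(verify("greeting", greet("world"), folder))  # same setup again
        print(verify("greeting", greet("World"), folder))  # output has changed
```

The test case is the pair of things described above: the way `greet` is called, and the approved file it is compared against.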

There are a number of testing frameworks that support this style of testing. Some open source examples:

Full disclosure: I am a contributor to both Approvals and TextTest.

Reasons for choosing the term ‘Approval Testing’

Test cases are designed by people. You decide how to run the software and what output is good enough to store and compare against later. That step where you ‘approve’ the output is crucial to the success of the test case later on. If you make a poor judgement the test might not contain all the essential aspects you want to check for, or it might contain irrelevant details. In the former situation, it might continue to pass even when the software is broken. In the latter situation, the test might fail frequently for no good reason, causing you to mistrust or even ignore it. 
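One common way to keep irrelevant details out of the stored output is to filter them before comparing. The sketch below assumes the irrelevant detail is a timestamp in a fixed format; the `scrub` helper and the sample log line are hypothetical, not taken from any particular tool.

```python
import re

# Assumed timestamp layout: "2020-06-30 09:15:02"
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def scrub(output: str) -> str:
    """Replace volatile timestamps with a fixed placeholder."""
    return TIMESTAMP.sub("<timestamp>", output)


if __name__ == "__main__":
    monday = "Order 17 created at 2020-06-29 09:15:02\n"
    tuesday = "Order 17 created at 2020-06-30 14:01:47\n"
    # After scrubbing, the two runs compare equal.
    print(scrub(monday) == scrub(tuesday))
```

Deciding what to scrub is itself a design decision: filter too much and the test stops checking something essential.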

I like to describe this style of testing with a term that puts human design decisions front and center.

Comments on the alternative terms


Snapshot

This term draws your attention to the fact that the output you have gathered and stored for later comparison in the test is transient. It’s correct today, but it may not be correct tomorrow. That’s pretty agile – we expect the behaviour of our system to change and we want our tests to be able to keep up.

The problem with this term is that it doesn’t imply any duty of care towards the contents of the snapshot. If a test fails unexpectedly I might just assume nothing is wrong – my snapshot is simply out of date. I can replace it with the newer one. After all, I expect a snapshot to change frequently. Did I just miss finding a bug though?

I prefer to use a word that emphasizes the human judgement involved in deciding what to keep in that snapshot.


Characterization

This is a better term because it draws your attention to the content of the output you store: that it should characterize the program behaviour. You want to ensure that all the essential aspects are included, so your test will check for them. This is clearly an important part of designing the test case.

On the other hand, this term primarily describes tests written after the system is already working and finished. It doesn’t invite you to consider what the system should do or what you or others would like it to do. Approval testing is a much more iterative process where you approve what’s good enough today and expect to approve something better in the future.

Golden Master

This term comes from the record industry where the original audio for a song or album was stored on a golden disk in a special archive. All the copies in the shops were derived from it. The term implies that once you’ve decided on the correct output, and stored it in a test, it should never change. It’s so precious we should store it in a special ‘golden’ archive. It has been likened to ‘pouring concrete on your software’. That is the complete opposite of agile!

In my experience, what is correct program behaviour today will not necessarily be correct program behaviour tomorrow, and we need to update our understanding and our tests. We need to be able to ‘approve’ a new version of the output and see that as a normal part of our work.

This seems to me to be a strong enough argument for dropping the term ‘Golden Master’. If you’ve been following the recent announcement from GitHub around renaming the default branch to ‘main’, you’ll also be aware there are further objections to the term ‘master’. I would like to be able to communicate with all kinds of people in a respectful and friendly manner. If a particular word is problematic and a good alternative exists, I think it’s a good idea to switch.

In conclusion

Our job is literally about writing words in code and imbuing them with meaning. Using the same words to describe the same thing helps everyone to communicate better. Will you please join me in using the words ‘Approval Testing’ as an umbrella term referring to a particular style of testing? Words matter. We should choose them carefully. 

By Emily Bache

Or: is Given-When-Then Compulsory?

In BDD you discover what software you should build through a collaborative process involving both software developers and business people. BDD also involves a lot of test automation and tools like Cucumber and SpecFlow. But what would happen if you used an Approval testing tool instead? Would that still be BDD?

Double-loop TDD diagram. Failing scenario -> passing scenario -> refactor and inner loop red->green->refactor

Figure 4 from “Discovery – Explore behaviour using examples” by Gaspar Nagy and Seb Rose

I’m a big fan of Behaviour Driven Development. I think it’s an excellent way for teams to gain a good understanding of what the end-user wants and how they will use the software. I like the emphasis on whole team collaboration and building shared understanding through examples. These examples can be turned into executable scenarios, also known as acceptance tests. They then become ‘living documentation’ that stays in sync with the system and helps everyone to collaborate over the lifetime of the software. 

I wrote an article about Double-Loop TDD a while back, and I was thinking about BDD again recently in the context of Approval testing. Are they compatible? The usual tools for automating scenarios as tests are SpecFlow and Cucumber which both use the Gherkin syntax. Test cases comprise ‘Given-When-Then’ steps written in natural language and backed up by automation code. My question is – could you use an Approval testing tool instead? 

I recently read a couple of books by Nagy and Rose. They are about BDD and specifically how to discover good examples and then formulate them into test cases. I thought the books did a good job of clearly explaining these aspects in a way that made them accessible to everyone, not just programmers. 

Nagy and Rose are planning a third book in the series which will be more technical and go into more detail on how to implement the automation. They say that you can use other test frameworks, but in their books they deal exclusively with the Gherkin format and Cucumber family of tools. What would happen if you used an Approval testing tool? Would it still be BDD or would we be doing something else? Let’s go into a little more detail about the key aspects of BDD: discovery, formulation, and automation.


Discovery

The discovery part of BDD is all about developers talking with business stakeholders about what software to build. Through a structured conversation you identify rules, examples and unanswered questions. You can use an ‘example mapping’ workshop for that discussion, as outlined in this blog post by Cucumber co-founder Matt Wynne.


Formulation

The formulation part of BDD is about turning those rules and examples of system behaviour into descriptive scenarios. Each scenario is made as intelligible as possible for business people, consistent with the other scenarios, and unambiguous about system behaviour. There’s a lot of skill involved in doing this!


Automation

The automation part of BDD is where you turn formulated scenarios into executable test cases. Even though the automation is done in a programming language, the focus is still on collaboration with the business stakeholders. Everyone is expected to be able to read and understand these executable scenarios even if they can’t read a programming language.

Double-Loop TDD

The picture shown at the start of the article from Nagy and Rose’s Discovery BDD book emphasizes the double loop nature of the BDD automation cycle. The outer loop is about building the supporting code needed to make a formulated scenario executable. Test-Driven Development fits within it as the inner loop for implementing the system that fulfills the scenarios. In my experience the inner loop of unit tests goes round within minutes, whereas the outer loop can take hours or even days.  

Later in the book they have a more detailed diagram showing an example BDD process:

Figure 16 from “Discovery – Explore behaviour using examples” by Gaspar Nagy and Seb Rose

This diagram is more complex, so I’m not going to explain it in depth here (for a deep dive take a look at this blog post by Seb Rose, or of course read the book itself!). What I want to point out is that the ‘Develop’ and ‘Implement’ parts of this diagram are showing double-loop TDD again, with slightly more detail than before. For the purpose of comparing a BDD process, with and without Approval testing, I’ve redrawn the diagram to emphasize those parts:

How you formulate, automate, and implement with TDD will all be affected by an approval testing approach. I recently wrote an article “How to develop new features with Approval Testing, Illustrated with the Lift Kata”. That article goes through a couple of scenarios, how I formulate them as sketches, then automate them with an approval testing tool. Based on the process described in that article I could draw it like this:

What’s different?

  • “Formulate” is called “Sketch” since the method of formulation is visual rather than ‘Given-When-Then’. The purpose is the same though.
  • “Automate” includes writing a Printer as well as the usual kind of ‘glue’ code to access functionality in your application. A Printer can print the state of the software system in a format that matches the Sketch. The printer code will also evolve as you work on the implementation.
  • “Implement” is a slightly modified TDD cycle. With approval tests you still work test-driven and you still refactor frequently, but other aspects may differ. You may improve the Printer and approve the output many times before being ready to show the golden master to others for review.
  • “Review” – this activity is supposed to ensure the executable scenario is suitable to use as living documentation, and that business people can read it. The difference here is that the artifact being reviewed is the Approved Golden Master output, not the sketch you made in the “Formulate” activity. It’s particularly important to make sure business people are involved here because the living documentation that will be kept is a different artifact from the scenario they co-created in the ‘discover’ activities.
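To make the idea of a Printer more concrete, here is a small sketch for a lift system. Everything in it is an assumption for illustration: the `Lift` class, its fields, and the text layout are hypothetical and are not the code from the Lift Kata article.

```python
from dataclasses import dataclass, field


@dataclass
class Lift:
    """Hypothetical lift state, just enough to have something to print."""
    name: str
    floor: int
    doors_open: bool = False
    requests: set = field(default_factory=set)


def print_building(lifts, floors):
    """Render one line per floor, top floor first, one column per lift."""
    lines = []
    for floor in reversed(range(floors)):
        cells = []
        for lift in lifts:
            if lift.floor == floor:
                # [A] means doors open, ]A[ means doors closed
                cell = f"[{lift.name}]" if lift.doors_open else f"]{lift.name}["
            elif floor in lift.requests:
                cell = " * "  # this lift has been asked to stop here
            else:
                cell = "   "
            cells.append(cell)
        lines.append((f"{floor} " + " ".join(cells)).rstrip())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(print_building([Lift("A", floor=0, requests={2})], floors=3))
```

The printed text is what gets approved and later reviewed, so its layout is worth refining until it looks like the Sketch that was agreed on.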

But is this still BDD?

I’m happy to report that, yes, this is still BDD! I hope you can see the activities are not that different. Just as importantly, the BDD community is open and welcoming of diversity of practice. This article describes BDD practitioners as forming a ‘centered’ community rather than a bounded community. That means people are open to you varying the exact practices and processes of BDD so long as you uphold some common values. The really central part of BDD is the collaborative discovery process.

In this article I hope I’ve shown that using an approval testing approach upholds that collaborative discovery process. It modifies the way you do formulation, automation, and development, but in a way that retains the iterative, collaborative heart of BDD. For some kinds of system, sketches and golden masters might prove to be easier for business people to understand than the more mainstream ‘Given-When-Then’ Gherkin format. In that case an approval testing tool might enable a better collaborative discovery process and propel you closer to the centre of BDD.


BDD is about a lot more than test automation, and Gherkin is not the only syntax you can use for that part. Approval testing is perfectly compatible with BDD. I’m happy I can both claim to be a member of the BDD community and continue to choose a testing tool that fits the context I’m working in. 🙂

If you’d like to learn more about Approval testing, check out this video of me pair programming with Adrian Bolboaca.

When you inherit difficult code it can take weeks to become productive. Having the right tools for the job and knowing how to use them makes a huge difference. These videos show you how.

Note: this post originally appeared here

Sometimes you don’t know what a piece of code is supposed to do, but you do know that people are using it in production, and that it in some sense ‘works’. One approach I often use in this situation is Approval testing. It can get you test coverage quickly without having to understand the code.

Since you don’t know what the code is supposed to do, you can’t define in advance what results you expect. But, what you can do is run the code, accept whatever it does as ‘correct’, then invent scenarios that will exercise all the code branches. I’ve made a video of me doing just this on some rather hairy legacy code – The Gilded Rose Refactoring Kata. With the right tools the tests fall into place relatively easily.
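The idea of inventing scenarios to exercise every branch can be sketched by generating all combinations of a few chosen input values and recording whatever the code does. This is an illustration under assumptions: `legacy_update` is a made-up stand-in loosely modelled on the kata, and `record_all_combinations` is a hand-rolled helper, not the API of any approval testing tool.

```python
from itertools import product


def legacy_update(name, sell_in, quality):
    """Hypothetical stand-in for legacy code we do not fully understand."""
    if name != "Aged Brie":
        if quality > 0:
            quality -= 1
    else:
        if quality < 50:
            quality += 1
    sell_in -= 1
    if sell_in < 0 and name != "Aged Brie" and quality > 0:
        quality -= 1
    return sell_in, quality


def record_all_combinations(func, *parameter_values):
    """Call func with every combination of inputs; record input => output."""
    lines = []
    for args in product(*parameter_values):
        lines.append(f"{args} => {func(*args)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    names = ["foo", "Aged Brie"]
    sell_ins = [0, 5]
    qualities = [0, 49, 50]
    # This text is what you would store as the approved output.
    print(record_all_combinations(legacy_update, names, sell_ins, qualities))
```

Adding one more value to a list multiplies the number of recorded cases, which is how coverage can grow quickly without writing each expectation by hand. A coverage tool can then show which branches are still unexercised.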

I’d like to credit Llewellyn Falco who showed me this way to solve this exercise.

I recorded a screencast in three parts. This is the first part.

Part 1: Introducing the Gilded Rose kata and writing test cases using Approval Tests

About the Gilded Rose code

One of the exercises I’ve used for years to help programmers improve their skills is the Gilded Rose Kata. It’s a refactoring kata – the code needs cleaning up and tests adding so you can build a new feature. That is a realistic scenario that programmers often face in everyday work, but this exercise adds a fantasy twist. The code you have to work with keeps track of various magical items stocked at the Gilded Rose establishment. The new feature concerns support for “Conjured” items that have slightly different magical properties from the other items. The scenario is just weird enough to be fun and just realistic enough to be a useful exercise.

I didn’t design the kata originally; that was Terry Hughes. I spruced up the code a little to make it a better exercise and added some extra instructions to get you going. I also translated the starting code into a few different programming languages and put it up on GitHub. In the 5 years since then I have been delighted to see how popular it’s become. I’ve had over 50 contributors chip in with various translations and improvements, and at least 800 people have forked the project and presumably had a go at the refactoring.

I think the appeal of the exercise is partly the wacky scenario it throws you into, and partly how utterly terrible the code is at the start. If you do the refactoring well it actually looks really neat at the end, which is very satisfying.

Lift-Up Conditional

When you inherit difficult code it can take weeks to become productive. I’d like to show you the difference it can make when you have the right tools for the job and know how to use them.

Once you’ve got good tests in place you can refactor much more confidently. In my previous post I showed how to get good tests using Approval Testing. I’m pretty confident in these tests, so I’ve made a second video showing some initial refactorings I’d do to get this code cleaned up a little.

One of the techniques I’m using is called ‘Lift-Up Conditional’. It’s a manipulation of a long complex conditional statement that will let you group together all the statements related to one particular conditional. I haven’t seen this particular refactoring described in the literature before – it was Llewellyn Falco who showed it to me originally. It’s perfect for the Gilded Rose code which basically comprises one big complex conditional.
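A rough before-and-after sketch of the idea follows. It rests on my reading of the technique: pick one condition, wrap a copy of the whole conditional in each branch of a new outer `if` on that condition, then simplify each copy knowing the condition is true or false. The functions and rules are hypothetical, simplified from the kata.

```python
def update_before(name, quality):
    """Logic for each item type is tangled through one conditional."""
    if name != "Aged Brie" and name != "Backstage pass":
        if quality > 0:
            quality -= 1
    else:
        if quality < 50:
            quality += 1
            if name == "Backstage pass":
                if quality < 50:
                    quality += 1
    return quality


def update_after(name, quality):
    """The check for "Aged Brie" has been lifted to the top level."""
    if name == "Aged Brie":
        # Copy simplified with name == "Aged Brie" known to be true
        if quality < 50:
            quality += 1
    else:
        # Copy simplified with name == "Aged Brie" known to be false
        if name != "Backstage pass":
            if quality > 0:
                quality -= 1
        else:
            if quality < 50:
                quality += 1
                if quality < 50:
                    quality += 1
    return quality


if __name__ == "__main__":
    names = ["foo", "Aged Brie", "Backstage pass"]
    same = all(
        update_before(n, q) == update_after(n, q)
        for n in names
        for q in range(0, 52)
    )
    print(same)
```

All the statements that apply to "Aged Brie" now sit together in one branch, which is what makes later restructuring easier. The equivalence check at the end plays the role the approval tests play in the video.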

The other star of this show is IntelliJ. It has a lot of automated refactorings that come together to make ‘Lift-Up Conditional’ easy and it makes really short work of cleaning up this code.

This is the second screencast in the series. My aim is to show that with the right tools and refactoring know-how you can quickly become productive with the code, even without fully understanding the byzantine business rules.

Part 2: Refactoring item logic using ‘lift up conditional’

Replace Conditional with Polymorphism

When you inherit difficult code it can take weeks to become productive. I’d like to show you the difference it can make when you have the right tools for the job and know how to use them.

Once you’ve got the code cleaned up to the point where you can see the parts of the logic that belong together, you can start to create a better class structure. A classic refactoring for this is ‘Replace Conditional with Polymorphism’ which was first described in Martin Fowler’s book ‘Refactoring’.

The basic idea is that you create subclasses to encapsulate the logic concerned with each logical case. Your design becomes much more flexible if you need to add new types that are variations on types that are already there – as in this case.
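As an illustrative sketch only: the class names, the simplified rules, and the `create_item` factory below are assumptions made for this example, not the code from the screencast.

```python
class Item:
    """Ordinary item: quality drops by one per update, never below zero."""

    def __init__(self, name, quality):
        self.name = name
        self.quality = quality

    def update(self):
        self.quality = max(0, self.quality - 1)


class AgedBrie(Item):
    """Improves with age, up to a ceiling of 50."""

    def update(self):
        self.quality = min(50, self.quality + 1)


class Conjured(Item):
    """Assumed rule for the new feature: degrades twice as fast."""

    def update(self):
        self.quality = max(0, self.quality - 2)


# Names that get a special subclass; everything else is an ordinary Item.
SPECIAL_ITEMS = {"Aged Brie": AgedBrie, "Conjured": Conjured}


def create_item(name, quality):
    """Pick the subclass that encapsulates the rules for this item."""
    return SPECIAL_ITEMS.get(name, Item)(name, quality)


if __name__ == "__main__":
    for item in [create_item("foo", 10), create_item("Conjured", 10)]:
        item.update()
        print(item.name, item.quality)
```

In this sketch, supporting a new kind of item means adding one subclass and one dictionary entry, without touching the logic for the existing types.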

This is the third screencast in the series. My aim is to show that with the right tools and refactoring know-how you can quickly become productive with this code, even without fully understanding the byzantine business rules.

Part 3: Replace Conditional with Polymorphism

I was in Finland recently, at the European Testing Conference. I both attended the conference and presented a workshop about “Approval testing with TextTest”. I won’t say any more about that, since Ben Linders did a brilliant write-up already that was published on InfoQ. There were several other highlights, and I wanted to just share a paragraph or so about each.

Mob Testing is what happens when your development team decides to work together on testing tasks as a Mob. I took part in a workshop where Maaret Pyhäjärvi facilitated two different mobbing exercises, one where we automated some UI tests using Selenium, and one where we practiced Test-Driven Development on the FizzBuzz kata. I have already done some Mob Programming and this felt very similar, except the focus was on developing tests rather than production code. It seems to have similar benefits – you have access to all the knowledge of everyone in the team, and you can learn things you didn’t even know to ask about. It makes pairing seem like a slow way to share good working practices.

JUnit 5 is on the horizon, and has several useful improvements over the previous version. Generally the syntax clutter is reduced, and the way you create parameterized tests has been overhauled. The most significant change, though (especially for people like me who work on developing other testing tools), seems to be that they’re designing the test-running engine to be separate, so you can re-use it to run other kinds of tests. Any infrastructure that works with JUnit will then be able to run these other tests as well. In principle it opens up JUnit’s success as a platform to be re-used by other test frameworks. Thanks to Nicolai Parlog for this useful summary of the next generation of one of the most widely-used tools in the Java world.

Joel Hynoski has worked at many of the tech giants in our industry, including Google, Twitter, Apple, and now Lyft. He spoke about some of the engineering challenges they had overcome, specifically in the area of testing. One thing I liked was their tool that detects flaky tests, and puts them in ‘jail’. (A flaky test is one that sometimes passes and sometimes fails, when run against the same code. They are a pain and can be a huge waste of time.) When a test is in ‘jail’, that means it’s no longer run in the build pipeline, so it doesn’t block new releases. It instead gets flagged as needing maintenance. They then have an SLA that says how long a test is allowed to remain in jail before an engineer needs to look at it and fix the flakiness – a day or two I think.

I can feel a little in awe of someone who has worked in those kinds of famous engineering organizations, working at web-scale with some of the best developers in our industry. What I found most encouraging about talking to Joel was that he was very down to earth about the problems these organizations face. They still battle with legacy code, despite it often only being a few years old. They have trouble creating reliable automated tests. The developers don’t always trust the test automation. They still have production bugs and hotfixes…

Alex Schladebeck spent the first ten minutes of her presentation giving a splendid rant about the bad reputation of UI testing. To summarize (criticisms she hears about UI tests -> her responses):

  • UI tests give slow feedback -> it’s valuable feedback, and it doesn’t have to come after every build
  • they need more infrastructure/machines -> yes, deal with it
  • they’re the top of the test pyramid -> they are in the pyramid! You can’t ignore them. They find different stuff than unit tests. Consider your context.
  • they’re flaky -> they’re not as bad as they used to be! Could it be that your app isn’t designed for testability? Could it be that your test design is poor?
  • they cause lots of work when there are small changes in your app -> that happens in development work too! It also happens more if you design them badly.

She then went on to give some excellent advice about how to design your UI tests. It was mostly about layering your test code in different levels of abstraction, and getting a good collaboration going between developers and testing specialists.

Conferences are about meeting people and the organizers of this conference had very deliberately scheduled sessions to encourage this. We had a ‘speed dating’ session where you talk to about 8 random people for five minutes each. We had a ‘lean coffee’ session, where all the speakers were each asked to facilitate a discussion table. I thought this worked particularly well as a way to find people with similar interests, and get them to talk about their experiences. The hands-on workshops were all at the same time, so you had to go to one and not just attend talks all the time. There was also an open space scheduled when it would not clash with any other kinds of sessions. I thought all this together made for a pretty welcoming conference where you were bound to get to know new people.

Overall I had a really good time at this conference and I’d recommend it to both testers and developers with a strong quality focus.

I’ve been favouring an Approval Testing approach for many years now, since I find it pretty useful in many situations, particularly for acceptance tests. Not many people I meet know the term though, and even fewer know how to use the technique. Recently I’ve put together some small exercises – code katas – to help people to learn about it. I’ll be going through them at a couple of upcoming conference workshops, but for all you people who won’t be there in person, I’m publishing them on GitHub as well.

I’ve got three katas set up now: Minesweeper, Yatzy and GildedRose. If you’ve done any of these katas before, you’ll probably have been using ordinary unit testing techniques. Hopefully by doing them again, with Approval Testing, you’ll learn a little about what’s different about this technique, and how it could be useful.

Before you can do the katas, you’ll need to install an approval testing tool. I’m one of the developers of TextTest, so that’s the tool I’ve set up right now. Below are some useful commands for installing it on a Debian/Ubuntu machine.

I’m still developing these exercises, and would like feedback about what you think of them. For example I have Python versions for all three, but only one has a Java version as yet. Do people want more translations? Do let me know how you get on, and what you think!

Installation instructions

You will need to have Python 2, and TextTest. (Unfortunately TextTest uses a GUI library that doesn’t support Python 3). For example:

$ sudo apt-get install python-pip
$ sudo pip install texttest

For more detailed instructions, and for other platforms see the texttest installation docs. For more general documentation, see the texttest website.

You need to have an editor and a diff tool configured for texttest to use. I recommend Sublime Text and Meld. Install them like this:

$ sudo add-apt-repository ppa:webupd8team/sublime-text-3
$ sudo apt-get update
$ sudo apt-get install sublime-text-installer
$ sudo apt-get install meld

Then you need to configure texttest to use them:

$ cd
$ mkdir .texttest
$ touch .texttest/config
$ subl .texttest/config

Enter the following in that file, and save:


For convenience, I also like to create an alias ‘tt’ for starting TextTest for these exercises. Change directory to one of the exercise repositories, then a ‘tt’ command should start the TextTest GUI and show the tests for that exercise. Define such an alias like this:

alias tt='texttest -d python -c .'

Two of the exercises start with a small test suite for you to build on. There should be instructions in the README file of each respective exercise, to help you to get going. If you really can’t work out what to do, have a look at the sample solutions and see if that helps. These are also on GitHub: Minesweeper-sample-solution, Yatzy-sample-solution, GildedRose-sample-solution