Bob Martin has just written a post on his blog telling the story of a test manager who has 80 000 manual tests and wishes they were automated instead. Bob writes:

“One common strategy to get your tests automated is to outsource the problem. You hire some team of test writers to transform your manual tests into automated tests using some automation tool. These folks execute the manual test plan while setting up the automation tool to record their actions. Then the tool can simply play the actions back for each new release of the system; and make sure the screens don’t change.”

Bob then goes on to explain why this is such a terrible idea – and blames it all on coupling. The tests and the GUI are coupled to such an extent that when you change the GUI, loads of tests break. Whereas humans can handle a fair amount of GUI change and still correctly determine whether a manual test should pass or fail, machines fall over all too easily and just fail as soon as something unexpected happens. So you end up re-recording the tests, which can cost as much as just doing them manually in the first place.

These problems are of course bigger or smaller depending on the GUI automation tool you choose. Anything that records pixel positions will fall over when you simply change the screen resolution, let alone when you add new buttons and features in your GUI. More modern tools record the names or ids of the widgets, so they don’t break if the widget simply moves to another part of the screen. In other words, you reduce your coupling.

Geoff has been working on PyUseCase, which takes this to another level. Instead of coupling the tests to widget names, you couple them to “domain actions”. This makes your tests even more robust in the face of GUI changes. A drop-down list can turn into a set of radio buttons and your tests won’t mind, since they just say something like “select airport SFO”. This doesn’t isolate you from the big changes, like reordering the screens in a wizard, but since the tests are written in plain text, in a language any domain expert can read, they are relatively cheap to update.
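To give a flavour of what that reads like, a domain-level test can be little more than a short list of user actions in plain text. Something like the following made-up illustration (my own invented wording, not PyUseCase’s actual file format):

    select airport SFO
    choose flight at 09:45
    add passenger Ada Lovelace
    confirm booking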

There is another respect in which machines under-perform compared to manual testers. An intelligent human will usually do a certain amount of exploration beyond the scripted test steps they have in front of them. They try to understand the purpose of the test, click around a bit, and ask questions when parts of the system peripheral to the test at hand start to look odd. Machines don’t do any exploration, and in fact often don’t even notice errors in parts of the screen they haven’t been told to look at.

Geoff’s PyUseCase can partly address this kind of problem. Used together with TextTest, it will continually scan the log the System Under Test produces, and fail the test if, for example, any stack traces appear. PyUseCase also automatically produces a low-fidelity, ASCII-art-esque log of how the current screen looks, and can compare it against what it looked like the last time the test ran. Changes are flagged as test failures, which will bring to your attention the change in an unrelated corner of the screen which says “32nd December” instead of “1st January”.

I know that sounds like we just introduced a huge amount of coupling between the tests and the way the GUI looks, and yes, we have. The difference is that this coupling is very easy to manage. If 1000 tests all fail saying “expected: 1st January, found: January 1st”, TextTest handily groups the failures and lets you accept or reject the change en masse. So it is very little work to update a lot of tests when the GUI just looks different in a way you don’t care about.

There is still the problem, though, that the machine will not explore outside the scripted steps you tell it to perform. So you will have to do some manual exploratory testing too; not everything can be automated.

So a simplistic let’s-just-automate-our-manual-tests approach is a bad idea, because machines can’t handle GUI changes as well as humans can, and because machines don’t look around and explore. Potentially your automated tests will cost more than your manual tests, and find fewer bugs.

So should we stick with our manual test suite then? No, of course not. The value of automated tests is not simply that you can run them more cheaply than manual tests, it is that you can run them more often – at every build, constantly supplying developers with valuable feedback rather than just at the end of the release cycle. It is this kind of feedback that enables refactoring, and lets developers build quality code from the start. That is their real gain over manual tests.

Bob Martin’s suggestion is that you shouldn’t rely on expensive GUI tests for this kind of feedback – only perhaps 15% of your tests should be GUI-reliant. The rest run against some kind of API, which is less volatile and hence cheaper to maintain. With the kinds of tools I suspect Bob has been using for GUI testing, I’m not surprised he says this. I just think that with tools like PyUseCase and TextTest the costs are much reduced, which calls for reconsidering this ratio. Looking at Geoff’s self-tests for TextTest (a GUI-intensive tool), around half are testing through the GUI, using PyUseCase. Basically I don’t think GUI tests have to be as bad and expensive as Bob makes out.

A little while ago we had the first meeting of our new coding dojo here in Göteborg. We are focussing on learning Test Driven Development using Java and Eclipse. I was very encouraged that two of my colleagues, Fredrik and Martin, volunteered to help organize the group. There was actually quite a lot of interest generally, and we filled all 12 places and even have a (small) waiting list. I didn’t want the group to grow too big, since the dojo style of learning should be quite participatory, and the time slot is only 2 hours. Everyone should get a chance to be heard, and to take the keyboard.

At the meeting I introduced the dojo concept with a set of slides I have used before. At dojo meetings our focus should be on deliberate practice, acquiring good coding habits, mutual encouragement and feedback.

We then took on KataFizzBuzz, which went very smoothly. I started by introducing the Kata, using a picture of the “teacher” pointing at you, asking you to say the next number in the FizzBuzz sequence. She is sufficiently scary-looking that you definitely need to write a program to print a FizzBuzz cheat sheet before the next lesson!
I also introduced something I haven’t done before at a dojo meeting – starting with some code rather than a blank editor. I had the acceptance test for the Kata already coded up and failing. When I practiced this Kata I realized the hardest part was writing the acceptance test, which captures the sequence that is written to System.out. I could have begun the meeting by writing it in front of the audience, but I really wanted to get them coding, not just watching me.
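In case it helps to picture it, here is a minimal sketch of how such an acceptance test might look in JUnit. It is not the exact code from the meeting, and the FizzBuzz class with its main method is an assumption on my part:

    import static org.junit.Assert.assertEquals;

    import java.io.ByteArrayOutputStream;
    import java.io.PrintStream;

    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class FizzBuzzAcceptanceTest {

        private final ByteArrayOutputStream captured = new ByteArrayOutputStream();
        private PrintStream originalOut;

        @Before
        public void captureSystemOut() {
            // Redirect System.out so the test can inspect what main() prints.
            originalOut = System.out;
            System.setOut(new PrintStream(captured));
        }

        @After
        public void restoreSystemOut() {
            System.setOut(originalOut);
        }

        @Test
        public void printsTheFizzBuzzSequenceUpToOneHundred() {
            FizzBuzz.main(new String[0]);  // assumed entry point

            String[] lines = captured.toString().trim().split("\\r?\\n");
            assertEquals(100, lines.length);
            assertEquals("1", lines[0]);
            assertEquals("Fizz", lines[2]);
            assertEquals("Buzz", lines[4]);
            assertEquals("FizzBuzz", lines[14]);
            assertEquals("Buzz", lines[99]);
        }
    }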

Martin wrote the first unit test, fizzbuzz(1) -> [1], and I noticed that his style is slightly different from mine. He fixed all the compiler errors as he went along, whereas I would tend to leave them all until I had finished the test, or at least until I wanted to run it. Maybe that is because he has worked in Java/Eclipse longer than I have, and that is the way Eclipse likes you to work. Anyway, I then implemented the code to make the test pass (fake it!) and wrote the next test, fizzbuzz(2) -> [1,2]. So then he had to write just enough code to make it pass (a simple loop).
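Very roughly, those first couple of ping-pong steps might have looked something like this. This is only a sketch: I am assuming the fizzbuzz method returned a list of strings, which may not match exactly what was typed at the meeting.

    // FizzBuzzTest.java - the first two unit tests
    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;

    import org.junit.Test;

    public class FizzBuzzTest {

        @Test
        public void sequenceUpToOne() {
            assertEquals(Arrays.asList("1"), FizzBuzz.fizzbuzz(1));
        }

        @Test
        public void sequenceUpToTwo() {
            assertEquals(Arrays.asList("1", "2"), FizzBuzz.fizzbuzz(2));
        }
    }

    // FizzBuzz.java - after "fake it" (just return Arrays.asList("1")),
    // the second test forces a simple loop:
    import java.util.ArrayList;
    import java.util.List;

    public class FizzBuzz {
        public static List<String> fizzbuzz(int upTo) {
            List<String> sequence = new ArrayList<String>();
            for (int i = 1; i <= upTo; i++) {
                sequence.add(String.valueOf(i));
            }
            return sequence;
        }
    }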

Then we handed the keyboard to two members of the audience, and Martin and I sat down in their chairs, and the Randori was really underway. We continued with this pair and two others using this ping-pong style until after about an hour we had completed the first part of the Kata – printing out the basic fizzbuzz sequence up to 100.

I suggested that the pair at the front try to run the acceptance test, which they did, and it failed. The reason was that the unit tests had been testing an internal method fizzbuzz(), while the acceptance test checked that when you call main() you get the right sequence written to System.out. It was at this point I wondered whether I had made the right decision in writing the acceptance test in advance, since it meant the guy at the keyboard clearly didn’t really understand what it was for. His first thought for making it pass was to change it to call fizzbuzz() instead of main(), until I stopped him – “No! Don’t change the test! Fix the code!”. I felt like I was rapping him over the knuckles with a ruler (something I am thankful my Maths teacher never did).

Towards the end of the meeting we held a 10-minute retrospective. People seemed cautiously positive towards TDD and the dojo in general, but I think they are maybe still getting used to the format and working out whether it is “ok” to be openly critical. I hope for more dissent, discussion and group learning next time.

Geoff has just put up a couple of new pages on the TextTest website, with some coverage statistics for his self-tests. He uses coverage.py to produce this report, which shows all the Python modules in TextTest and marks covered statements in green. I think it’s pretty impressive – he’s claiming over 98% statement coverage for the more than 17 000 lines of Python code in TextTest.

I had a poke around looking for some numbers to compare this to, and found on this page someone claiming FitNesse has 94% statement coverage from its unit tests, and that the Java Spring framework has 75% coverage. It’s probably unwise to compare figures for different programming languages directly, but it gives you an idea.

Geoff also publishes the results of his nightly run of self-tests here. It looks a bit complicated, but Geoff explained it to me. 🙂 He’s got nearly 2000 tests testing TextTest on Unix, and about 900 testing it on Windows. As you can see, the tests don’t always pass: some are annoying ones that fail sporadically, some are due to actual bugs, which then get fixed. So even though he rarely has a totally green build, the project looks healthy overall, with new tests and fixes being added all the time.

Out of those 3000-odd tests that get run every night, Geoff has a core of about 1000 that he will run before every significant check-in. Since they run in parallel on a grid, they usually take about 2 minutes to execute. (When he has to run them at home, in series on our fairly low-spec Linux laptop, they take about half an hour.)

Note that we aren’t talking about unit tests here; these are high-level acceptance tests, running the whole TextTest system. About half of them use PyUseCase to simulate user actions in the TextTest GUI; the rest interact with the command-line interface. Many of the tests use automatically generated test doubles to simulate interaction with third-party systems like version control, grid engines, diff programs, etc.

Pretty impressive, don’t you think? Well I’m impressed. But then I am married to him so I’m not entirely unbiased 🙂

I’ve been doing some work lately creating automated functional test suites using Selenium RC to simulate user interaction with a web GUI. I quickly discovered that the tests you record directly from Selenium are rather brittle and hard to read. In order to make the tests more robust and readable, I have been extracting reusable chunks of script that make sense from the user’s perspective into separate methods. For example, when testing a page for registering a new provider, you might have a ProviderPage domain class with a method “createNewProvider”. This method encapsulates all the Selenium calls that interact with the page, and lets your test be written in terms of the domain.
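As a rough sketch of the pattern (the URL, locators and field names here are made up for illustration; only ProviderPage and createNewProvider come from the description above), it looks something like this:

    import com.thoughtworks.selenium.Selenium;

    // Encapsulates all Selenium RC calls for the "register new provider" page,
    // so that tests can be written in domain terms instead of raw locators.
    public class ProviderPage {

        private final Selenium selenium;

        public ProviderPage(Selenium selenium) {
            this.selenium = selenium;
        }

        public ProviderPage open() {
            selenium.open("/providers/new");           // made-up URL
            selenium.waitForPageToLoad("30000");
            return this;
        }

        public ProviderPage createNewProvider(String name, String city) {
            selenium.type("id=provider_name", name);   // made-up locators
            selenium.type("id=provider_city", city);
            selenium.click("id=save_button");
            selenium.waitForPageToLoad("30000");
            return this;
        }

        public boolean showsProvider(String name) {
            return selenium.isTextPresent(name);
        }
    }

A test then reads at the level of the user’s intent: open the page, call createNewProvider, and assert on showsProvider, with no Selenium locators in sight.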

I just saw this article from Patrick Wilson Welsh basically saying the same thing, only his DSL has three layers of indirection instead of just two. As well as encapsulating page operations in a Page class, he encapsulates operations on widgets within a page. I hadn’t thought of doing that. It makes the code in the Page class more readable. I might try that, and see if it improves my code.
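If I understand the idea, that extra widget layer might look roughly like this (my own guess at the shape of it, not code from Patrick’s article):

    import com.thoughtworks.selenium.Selenium;

    // Wraps a single text field so that page classes talk to widgets,
    // not to raw Selenium locators.
    public class TextField {

        private final Selenium selenium;
        private final String locator;

        public TextField(Selenium selenium, String locator) {
            this.selenium = selenium;
            this.locator = locator;
        }

        public void setTo(String value) {
            selenium.type(locator, value);
        }

        public String value() {
            return selenium.getValue(locator);
        }
    }

ProviderPage would then hold one of these per input field, instead of spreading locator strings through its methods.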

Gathering ideas for my new dojo 🙂

Ivan Sanchez wrote about starting a coding dojo, and he reckons a Randori is best with 10 people or fewer. We will be more than 10 at JDojo@gbg. He suggests a prepared kata in that case. That might be possible. His favourite starting kata is KataMinesweeper.

Danilo Sato wrote about how to find suitable Katas, and suggests several for beginning dojos, including KataMinesweeper.

Gary Pollice wrote an article about what a coding dojo is, which is quite well explained, but doesn’t give any specific advice for new dojos.

The guys running the Finnish dojo have a similar article about what a coding dojo is, and some rules. They put a maximum of 15 participants on their Randori. They also introduce “iterations” of 30 minutes, and spend 5 minutes planning in between.

Lots of ideas to think about, anyway.