I wrote a blog post about these tools on my company blog, in Swedish.
Bob Martin has just written a post in his blog where he tells the story of a test manager who has 80 000 manual tests, and wishes they were automated instead. Bob writes:
“One common strategy to get your tests automated is to outsource the problem. You hire some team of test writers to transform your manual tests into automated tests using some automation tool. These folks execute the manual test plan while setting up the automation tool to record their actions. Then the tool can simply play the actions back for each new release of the system; and make sure the screens don’t change.”
Bob then goes on to explain why this is such a terrible idea – and blames it all on coupling: the tests and the GUI are coupled to the extent that when you change the GUI, loads of tests break. Whereas humans can handle a fair amount of GUI change and still correctly determine whether a manual test should pass or fail, machines fall over all too easily and fail as soon as something unexpected happens. So you end up re-recording the tests, which can cost as much as just running them manually in the first place.
These problems are of course bigger or smaller depending on the GUI automation tool you choose. Anything that records pixel positions will fall over when you simply change the screen resolution, let alone when you add new buttons and features in your GUI. More modern tools record the names or ids of the widgets, so they don’t break if the widget simply moves to another part of the screen. In other words, you reduce your coupling.
Geoff has been working on PyUseCase, which takes this to another level. Instead of coupling the tests to widget names, you couple them to “domain actions”. This makes your tests even more robust in the face of GUI changes. A drop-down list can turn into a set of radio buttons and your tests won’t mind, since they just say something like “select airport SFO”. This doesn’t isolate you from the big changes, like reordering the screens in a wizard, but since the tests are written in plain text, in a language any domain expert can read, they are relatively cheap to update.
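To make the idea of domain actions concrete, here is a minimal sketch in Python. It is not PyUseCase's real API or file format: `FakeGui`, `UI_MAP`, `replay` and the widget names are all hypothetical, made up for illustration. The point is only that the test script mentions "select airport" and nothing about widgets, so swapping the widget means editing one mapping, not the tests.

```python
# Hypothetical sketch of "domain action" indirection.
# None of these names come from PyUseCase itself.

class FakeGui:
    """Stands in for a real GUI; records widget-level events."""

    def __init__(self):
        self.events = []

    def choose(self, widget_id, value):
        self.events.append((widget_id, value))


# The only place that knows which widget implements each domain action.
UI_MAP = {
    "select airport": "airport_dropdown",
}


def replay(script, gui, ui_map):
    """Replay a plain-text script with one 'action argument' per line."""
    for line in script.strip().splitlines():
        # Everything before the last space is the action, the rest is its argument.
        action, _, argument = line.strip().rpartition(" ")
        gui.choose(ui_map[action], argument)


script = """
select airport SFO
"""

gui = FakeGui()
replay(script, gui, UI_MAP)
print(gui.events)
```

If the drop-down later becomes a radio button group, only the mapping changes (say, to `"airport_radio_group"`); the script text stays exactly the same.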
There is another respect in which machines under-perform compared to manual testers. An intelligent human will usually do a certain amount of exploration beyond the scripted test steps they have in front of them. They try to understand the purpose of the test, click around a bit and ask questions when parts of the system peripheral to the test in hand start to look odd. Machines don’t do any exploration, and in fact often don’t even notice errors on parts of the screen they haven’t been told to look at.
Geoff’s PyUseCase can partly address this kind of problem. Used together with TextTest, it will continually scan the log the System Under Test produces, and fail the test if, for example, any stack traces appear. PyUseCase also automatically produces a low-fidelity ASCII-art-esque log of how the current screen looks, and can compare it against what it looked like last time the test ran. Changes are flagged as test failures, which will bring to your attention the change in an unrelated corner of the screen which says “32nd December” instead of “1st January”.
I know that sounds like we just introduced a huge amount of coupling between the tests and the way the GUI looks, and yes, we have. The difference is that this coupling is very easy to manage. If 1000 tests all fail saying “expected: 1st January, found: January 1st”, TextTest handily groups all the test failures and lets you accept or reject the change en masse. So it is very little work to update a lot of tests when the GUI just looks different in a way you don’t care about.
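The grouping idea can be sketched in a few lines of Python. This is not how TextTest implements it; it is just an illustration under the assumption that each test has an expected and an actual text log, and that tests whose logs changed in exactly the same way belong together. The function names and the sample logs are invented.

```python
import difflib
from collections import defaultdict


def diff_signature(expected, actual):
    """Return only the added/removed lines of a diff, as a hashable tuple."""
    diff = list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(), lineterm="", n=0))
    # The first two lines are the '---' / '+++' file headers; skip them.
    return tuple(line for line in diff[2:] if line.startswith(("+", "-")))


def group_failures(results):
    """Group failing tests by identical diff.

    results maps test name -> (expected_text, actual_text).
    """
    groups = defaultdict(list)
    for name, (expected, actual) in sorted(results.items()):
        signature = diff_signature(expected, actual)
        if signature:  # an empty signature means the test passed
            groups[signature].append(name)
    return dict(groups)


results = {
    "test_booking": ("Date: 1st January\nSeats: 2", "Date: January 1st\nSeats: 2"),
    "test_cancel": ("Date: 1st January\nRefund: yes", "Date: January 1st\nRefund: yes"),
    "test_search": ("Results: 3", "Results: 3"),
}

for signature, names in group_failures(results).items():
    print(names, signature)
```

Here the two failing tests land in one group because their diffs are identical, so a single accept or reject decision could cover both.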
There is still a problem, though: the machine will not explore outside of the scripted steps you tell it to perform. So you will have to do some manual exploratory testing too. Not everything can be automated.
So a simplistic let’s-just-automate-our-manual-tests approach is a bad idea because machines can’t handle GUI changes as well as humans can, and because machines don’t look around and explore. Potentially your automated tests will cost more than your manual tests, and find fewer bugs.
So should we stick with our manual test suite then? No, of course not. The value of automated tests is not simply that you can run them more cheaply than manual tests, it is that you can run them more often – at every build, constantly supplying developers with valuable feedback rather than just at the end of the release cycle. It is this kind of feedback that enables refactoring, and lets developers build quality code from the start. That is their real gain over manual tests.
Bob Martin’s suggestion is that you shouldn’t rely on expensive GUI tests for this kind of feedback – only perhaps 15% of your tests should be GUI reliant. The rest should run against some kind of API, which is less volatile and hence cheaper to maintain. Given the kinds of tools I suspect Bob has been using for GUI testing, I’m not surprised he says this. I just think that with tools like PyUseCase and TextTest the costs are much lower, which calls for reconsidering this ratio. Looking at Geoff’s self tests for TextTest (a GUI-intensive tool), around half test through the GUI, using PyUseCase. Basically, I don’t think GUI tests have to be as bad and expensive as Bob makes out.
Geoff has just put up a couple of new pages on the TextTest website, with some coverage statistics for his self tests. He uses coverage.py to produce this report, which shows all the Python modules in TextTest and marks covered statements in green. I think it’s pretty impressive – he’s claiming over 98% statement coverage for the more than 17 000 lines of Python code in TextTest.
I had a poke around looking for some numbers to compare this to, and found on this page someone claiming FitNesse has 94% statement coverage from its unit tests, and the Java Spring framework has 75% coverage. It’s probably unwise to compare figures for different programming languages directly, but it gives you an idea.
Geoff also publishes the results of his nightly run of self tests here. It looks a bit complicated, but Geoff explained it to me. 🙂 He’s got nearly 2000 tests testing TextTest on Unix, and about 900 testing it on Windows. As you can see, the tests don’t always pass: some are annoying ones that fail sporadically, and some are due to actual bugs, which then get fixed. So even though he rarely has a totally green build, the project looks healthy overall, with new tests and fixes being added all the time.
Out of those 3000-odd tests that get run every night, Geoff has a core of about 1000 that he will run before every significant check-in. Since they run in parallel on a grid, they usually take about 2 minutes to execute. (When he has to run them at home in series on our fairly low-spec Linux laptop they take about half an hour.)
Note that we aren’t talking about unit tests here. These are high-level acceptance tests, running the whole TextTest system. About half of them use PyUseCase to simulate user actions in the TextTest GUI; the rest interact with the command line interface. Many of the tests use automatically generated test doubles to simulate interaction with 3rd party systems like version control, grid engines, diff programs etc.
Pretty impressive, don’t you think? Well I’m impressed. But then I am married to him so I’m not entirely unbiased 🙂
I’ve been doing some work lately creating automated functional test suites using Selenium RC to simulate user interaction with a web GUI. I discovered quickly that the tests you record directly from Selenium are rather brittle, and hard to read. In order to make the tests more robust and readable, I have been extracting reusable chunks of script that make sense from the user perspective into separate methods. For example, when testing a page for registering a new provider, you might have a ProviderPage domain class with a method “createNewProvider”. This method encapsulates all the Selenium calls that interact with the page, and lets your test be written in terms of the domain.
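As a rough illustration of that structure, here is a minimal sketch in Python. It does not talk to Selenium at all: `RecordingBrowser` is a stub I made up so the example runs on its own, and the locators (`id=providerName`, `id=saveProvider`) are invented. Only the shape matters: the test calls one domain method, and the page class is the only place that knows about locators.

```python
class RecordingBrowser:
    """Stub standing in for a browser driver; records commands instead of
    driving a real page. Method names are an assumption for this sketch."""

    def __init__(self):
        self.commands = []

    def type(self, locator, value):
        self.commands.append(("type", locator, value))

    def click(self, locator):
        self.commands.append(("click", locator))


class ProviderPage:
    """Domain class for the 'register a new provider' page."""

    # Hypothetical locators; if the page changes, only these need updating.
    NAME_FIELD = "id=providerName"
    SAVE_BUTTON = "id=saveProvider"

    def __init__(self, browser):
        self.browser = browser

    def create_new_provider(self, name):
        """Encapsulates every browser call needed to register a provider."""
        self.browser.type(self.NAME_FIELD, name)
        self.browser.click(self.SAVE_BUTTON)


# The test itself reads in terms of the domain, not the widgets.
browser = RecordingBrowser()
ProviderPage(browser).create_new_provider("Acme Health")
print(browser.commands)
```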
I just saw this article from Patrick Wilson Welsh basically saying the same thing, only his DSL has three layers of indirection instead of just two. As well as encapsulating page operations in a Page class, he encapsulates operations on widgets within a page. I hadn’t thought of doing that. It makes the code in the Page class more readable. I might try that, and see if it improves my code.
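My reading of the extra widget layer, sketched in Python, looks something like this. It is not Patrick's code: the `TextField` and `Button` classes, the stub browser and the locators are all hypothetical. The page class now talks to small widget objects, and only the widgets talk to the browser.

```python
class RecordingBrowser:
    """Stub browser that records commands; an assumption made for this sketch."""

    def __init__(self):
        self.commands = []

    def type(self, locator, value):
        self.commands.append(("type", locator, value))

    def click(self, locator):
        self.commands.append(("click", locator))


class TextField:
    """Widget layer: knows how to operate one text field."""

    def __init__(self, browser, locator):
        self.browser, self.locator = browser, locator

    def enter(self, text):
        self.browser.type(self.locator, text)


class Button:
    """Widget layer: knows how to operate one button."""

    def __init__(self, browser, locator):
        self.browser, self.locator = browser, locator

    def press(self):
        self.browser.click(self.locator)


class ProviderPage:
    """Page layer: expresses domain operations in terms of widgets."""

    def __init__(self, browser):
        # Hypothetical locators, invented for illustration.
        self.name = TextField(browser, "id=providerName")
        self.save = Button(browser, "id=saveProvider")

    def create_new_provider(self, name):
        self.name.enter(name)
        self.save.press()


browser = RecordingBrowser()
ProviderPage(browser).create_new_provider("Acme Health")
print(browser.commands)
```

The page method now reads almost like the manual test step it replaces, which is the readability gain described above.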
Gathering ideas for my new dojo 🙂
Ivan Sanchez wrote about starting a coding dojo, and he reckons a Randori is best with 10 people or fewer. We will be more than 10 at JDojo@gbg. He suggests a prepared kata in that case. That might be possible. His favourite starting kata is KataMinesweeper.
Gary Pollice wrote an article about what a coding dojo is, which is quite well explained, but doesn’t give any specific advice for new dojos.
The guys running the Finnish dojo have a similar article about what a coding dojo is, and some rules. They put a maximum of 15 participants on their randori. They also introduce “iterations” of 30 minutes, and spend 5 minutes planning in between.
Lots of ideas to think about, anyway.