Coding Is Like Cooking

Archive for 2012

Principles of Agile Test Automation

Please note – As of March 2013, I have rewritten this post in the light of further experience and discussions. The updated post is available here.

I feel like I’ve spent most of my career learning how to write good automated tests in an agile environment. When I downloaded JUnit in the year 2000 it didn’t take long before I was hooked – unit tests for everything in sight. That gratifying green bar is near-instant feedback that everything is as expected, my code does what I intended, and I can continue developing from a firm foundation.

Later, starting in about 2002, I began writing larger granularity tests, for whole subsystems; functional tests if you like. The feedback that my code does what I intended, and that it has working functionality, has given me confidence time and again to release updated versions to end-users.

Often, I’ve written functional tests as regression tests, after the functionality is supposed to work. In other situations, I’ve been able to write these kinds of tests in advance, as part of an ATDD, or BDD process. In either case, I’ve found the regression tests you end up with need to have certain properties if they’re going to be useful in an agile environment moving forward. I think the same properties are needed for good agile functional tests as for good unit tests, but it’s much harder. Your mistakes are amplified as the scope of the test increases.

I’d like to outline four principles of agile test automation that I’ve derived from my experience.

Coverage

If you have a test for a feature, and there is a bug in that feature, the test should fail. Note I’m talking about coverage of functionality, not code coverage, although these concepts are related. If your code coverage is poor, your functionality coverage is likely also to be poor.

If your tests have poor coverage, they will continue to pass even when your system is broken and functionality unusable. This can happen if you have missed out needed test cases, or when your test cases don’t check properly what the system actually did. The consequence of poor coverage is that you can’t refactor with confidence, and need to do additional (manual) testing before release.

The aim for automated regression tests is good Coverage: If you break something important and no tests fail, your test coverage is not good enough. All the other principles are in tension with this one – improving Coverage will often impair the others.
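A minimal sketch of the difference between a test case that exists and one that actually checks what the system did. The `apply_discount` function and both checks are hypothetical, invented only to illustrate the point; the function is buggy on purpose.

```python
def apply_discount(price, percent):
    """Hypothetical feature, buggy on purpose: adds the discount instead of subtracting it."""
    return price + price * percent / 100


def weak_check(result):
    # Only checks that a number came back, so the bug slips through.
    return isinstance(result, float)


def strong_check(result):
    # Checks what the system actually did: 10% off 200 should be 180.
    return result == 180.0


result = apply_discount(200.0, 10)
print("weak check passes:  ", weak_check(result))
print("strong check passes:", strong_check(result))
```

The weak check keeps passing while the feature is broken, which is the poor-Coverage situation described above; only the strong check fails.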

Readability

When you look at the test case, you can read it through and understand what the test is for. You can see what the expected behaviour is, and what aspects of it are covered by the test. When the test fails, you can quickly see what is broken.

If your test case is not readable, it will not be useful. When it fails you will have to dig through other sources outside of the test case to find out what is wrong. Quite likely you will not understand what is wrong and you will rewrite the test to check for something else, or simply delete it.

As you improve Coverage, you will likely add more and more test cases. Each one may be fairly readable on its own, but taken all together it can become hard to navigate and get an overview.

Robustness

When a test fails, it means the functionality it tests is broken, or at least is behaving significantly differently from before. You need to take action to correct the system or update the test to account for the new behaviour. Fragile tests are the opposite of Robust: they fail often for no good reason.

Aspects of Robustness you often run into are tests that are not isolated from one another, duplication between test cases, and flickering tests. If you run a test by itself and it passes, but fails in a suite together with other tests, then you have an isolation problem. If you have one broken feature and it causes a large number of test failures, you have duplication between test cases. If you have a test that fails in one test run, then passes in the next when nothing changed, you have a flickering test.
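The isolation problem can be sketched in a few lines. This is an illustration under assumptions, not taken from any real suite: the two checks and the tiny `run` helper are hypothetical, and the shared module-level list stands in for whatever state (a database, a file, a singleton) real tests might share.

```python
shared_basket = []  # module-level state shared by every check: the isolation problem


def check_add_item():
    shared_basket.append("apple")
    assert len(shared_basket) == 1


def check_basket_starts_empty():
    assert shared_basket == []


def run(checks):
    """Run the checks in the given order and return the names of those that fail."""
    failed = []
    for check in checks:
        try:
            check()
        except AssertionError:
            failed.append(check.__name__)
    return failed


# Run by itself, the check passes.
shared_basket.clear()
alone = run([check_basket_starts_empty])

# Run in a suite after another check, it fails because of leftover state.
shared_basket.clear()
in_suite = run([check_add_item, check_basket_starts_empty])

print("failures when run alone:   ", alone)
print("failures when run in suite:", in_suite)
```

One common remedy is to give each test its own fresh state in a setup step, rather than relying on the order in which the tests happen to run.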

If your tests often fail for no good reason, you will start to ignore them. Quite likely there will be real failures hiding amongst all the false ones, and the danger is you will not see them.

As you improve Coverage you’ll want to add more checks for details of your system. This will give your tests more and more reasons to fail.

Speed

As an agile developer you run the tests frequently: both (a) every time you build the system, and (b) before you check in changes. I recommend time limits of 2 minutes for (a) and 10 minutes for (b). This fast feedback gives you the best chance of actually being willing to run the tests, and to find defects when they’re cheapest to fix.

If your test suite is slow, it will not be used. When you’re feeling stressed, you’ll skip running the tests, and problem code will enter the system. In the worst case the test suite will never become green. You’ll fix the one or two problems in a given run and kick off a new test run, but in the meantime someone else has checked in other changes, and the new run is not green either. You’re developing all the while the tests are running, and they never quite catch up. This can become pretty demoralizing.

As you improve Coverage, you add more test cases, and this will naturally increase the execution time for the whole test suite.

How are these principles useful?

I find it useful to remember these principles when designing test cases. I may need to make tradeoffs between them, and it helps just to step back and assess how I’m doing on each principle from time to time as I develop.

I also find these principles useful when I’m trying to diagnose why a test suite is not being useful to a development team, especially if things have got so bad they have stopped maintaining it. I can often identify which principle(s) the team has missed, and advise how to refactor the test suite to compensate.

For example, if the problem is lack of Speed you have some options and tradeoffs to make:

Invest in hardware and run tests in parallel (costs $)
Use a profiler to optimize the tests for speed the same as you would production code (may affect Readability)
Push tests down to a lower level of granularity where they can execute faster (may reduce Coverage and/or increase Readability)
Identify key test cases for essential functionality and remove the other test cases (sacrifice Coverage to get Speed)

Explaining these principles can promote useful discussions with people new to agile, particularly testers. The test suite is a resource used by many agile team members (developers, analysts, managers, etc.) in its role as “Living Documentation” for the system (see Gojko Adzic’s writings on this). This emphasizes the need for both Readability and Coverage. Automated tests in agile are quite different from those in a traditional process, since they are run continually throughout the process, not just at the end. I’ve found many traditional automation approaches don’t lead to enough Speed and Robustness to support agile development.

I hope you will find these principles will help you to reason about the automated tests in your suite.

Posted by Emily Bache on 2012-08-06 at 12:31 under Coding Skills.
Tags: agile, ATDD, BDD, TDD, testing

Joseph Wilk on Acceptance Testing in a Startup

I’ve heard Joseph Wilk speak before, so I was delighted when he accepted our invitation to come to Scandinavian Developer Conference and share some ideas around “Acceptance testing in the land of the Startup”. In this post I aim to summarize my understanding of what Joseph said. I think what he is doing at Songkick is really interesting and innovative, and is applicable beyond that particular environment.

Introduction

Joseph began by introducing what he meant by “Startup” using Eric Ries’ definition:

“a human institution designed to create new products and services under conditions of extreme uncertainty”

“Acceptance testing” is usually about checking the product meets the previously agreed requirements – but if everything is changing all the time, that is not as simple as it sounds. Joseph’s acceptance tests reflect what “done” looks like in conditions of extreme uncertainty.

Joseph talked about the Cynefin model (by Dave Snowden) and how working in a startup feels like you’re in chaos a lot of the time, although parts of the business may be merely complicated or even simple. An effective strategy in the “chaos” part of the Cynefin model is “Act, Sense, Respond”, which he related to the lean startup cycle of “Build, Measure, Learn”.

The “Build” part of that cycle for their software product often takes more time than the other parts, and work should be coordinated between several people. During the “Build” phase you can use automated tests to help with communication and feedback. In the “Measure” phase, you’re primarily looking for feedback from the actual end users, rather than automated tests.

The rest of the talk was largely about how Songkick use automated tests during the “Build” phase, and how they do acceptance testing with end users in the “Measure” phase.

The startup

The software Joseph is working on is for the startup Songkick which is devoted to live music. They are about 20-30 people, and the majority are programmers. They have two dedicated UX specialists and two QA specialists. Most people are multi-skilled, care deeply about the quality of the product, and are not afraid to learn how to help outside of their speciality.

Songkick holds three principles dear:

Embrace Change
Usability
Testing

The rest of his talk was about the concrete practices they use around this last principle – Testing. Joseph stressed that he was reporting what they had found to work, and not holding up any “best practice” or industry gold standard.

Features

At Songkick, they use Behaviour Driven Development. They are continually building and improving a “ubiquitous language” to describe their system, by writing Features and Scenarios in a semi-formal syntax. The tool Cucumber is used to turn these features and scenarios into Executable Specifications (aka Automated Tests).

Joseph gave a short intro to how Cucumber works – take a look at this site if you’re not familiar with it. In short, Cucumber lets you define many “Features”, each of which defines numerous “Scenarios” which exemplify the desired feature behaviour. It requires you to use a particular syntax called “Gherkin”. The gist of it is that you use the following formats when writing features and scenarios:

Feature:
“In order to… I want… So that…”

Scenario:
“Given… When… Then…”

If you put in some extra work and also write “Step Definitions”, your scenarios become executable, and can automatically check the application behaves as expected when run.
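To make the idea of step definitions concrete, here is a minimal hand-rolled sketch of the mechanism: each line of a scenario is matched against a registered pattern, and the matching function is called. This is not Cucumber's real API; the `step` decorator, `run_scenario` and the signup scenario are all hypothetical, written only to show how plain-language steps can be bound to code.

```python
import re

STEPS = []  # (compiled pattern, function) pairs


def step(pattern):
    """Register a function as the definition for steps matching `pattern`."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


@step(r"a visitor who is not signed up")
def visitor(ctx):
    ctx["signed_up"] = False


@step(r"they sign up via (\w+)")
def sign_up(ctx, provider):
    ctx["signed_up"] = True
    ctx["provider"] = provider


@step(r"they have an account linked to (\w+)")
def has_account(ctx, provider):
    assert ctx["signed_up"] and ctx["provider"] == provider


def run_scenario(text):
    """Execute each Given/When/Then line against the registered steps."""
    ctx = {}
    for line in text.strip().splitlines():
        # Drop the leading keyword (Given/When/Then); only the rest is matched.
        _keyword, _, body = line.strip().partition(" ")
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                func(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {body!r}")
    return ctx


scenario = """
Given a visitor who is not signed up
When they sign up via Facebook
Then they have an account linked to Facebook
"""
print(run_scenario(scenario))
```

A scenario line with no matching definition raises an error here, which roughly mirrors the way a scenario only becomes executable once every one of its steps has been defined.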

The primary purpose of the Cucumber specifications is to promote communication within the team: the words are carefully chosen to have clear meanings related to the domain of the system you’re building.

Hypotheses

In Lean Startup you don’t talk so much about building a “Feature”, more about having a “Hypothesis”. As in “if we add a pink button here, do we get more traffic to the site?”

Joseph showed the following example of a Feature with a Hypothesis:

Facebook signup
In order to increase signup
I want visitors to sign up via Facebook
So that we see a 10% increase in signups

That last line is the measurable part of the hypothesis, which makes it more than just a normal Cucumber Feature. When you’ve delivered a system that has this feature, it should lead to 10% more signups. If it doesn’t, the system might be modified to remove this feature. Features written in this form (In order to… I want… So that…) are prioritized in the product backlog.
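The arithmetic behind such a hypothesis is simple enough to sketch. The function name and the signup counts below are made up for illustration, and a real evaluation would presumably also need to consider how noisy the numbers are; this only shows the relative-increase check itself.

```python
def hypothesis_validated(baseline_signups, new_signups, target_increase=0.10):
    """True if signups rose by at least `target_increase` relative to the baseline."""
    if baseline_signups <= 0:
        raise ValueError("baseline must be positive")
    increase = (new_signups - baseline_signups) / baseline_signups
    return increase >= target_increase


# Hypothetical numbers: signups per week before and after the feature went live.
print(hypothesis_validated(1000, 1120))  # 12% increase
print(hypothesis_validated(1000, 1050))  # 5% increase
```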

When a developer is ready to start working on a feature, she calls a meeting to discuss it. At the meeting there should be representatives from four different job roles: a developer, tester, business analyst and usability expert. Together they discuss what the feature is, the details of how it should work, and perhaps create some wireframes or UI mockups. They will also brainstorm perhaps 7-8 scenarios, such as:

successful signup via facebook
failed signup via facebook
facebook changes their API and no one can sign in any more

After the meeting, the developer starts working on the code to implement the feature, bearing in mind the discussion and the scenarios. The people in the other job roles will also get involved in development, and assessing whether the feature is ready for deployment. Some or all of the scenarios they have come up with will be automated as executable with Cucumber.

There is an overhead to automating scenarios with Cucumber: implementing “Step Definitions” takes time, and afterwards the scenarios are relatively slow to execute. Joseph said that they were increasingly automating only a few scenarios for each feature, often just a happy and a sad path. The other scenarios would perhaps be used in unit tests, or exploratory testing. The most important aspect of writing many scenarios was to discover issues with a feature before implementation.

All their Cucumber features and scenarios are published in a searchable web interface called “Relish” which he said encouraged developers to use more business-friendly language, since they knew non-programmers would look there. Joseph said features containing incomprehensible technobabble like “Given the asynchronous message buffer queue has been emptied” (!) do appear occasionally, but don’t last long. Their ubiquitous language is highly visible and in daily use.

Actual Acceptance Tests

Songkick uses a Continuous Delivery approach, where anyone in the company can deploy the latest code at the push of a button. They have invested a lot of effort to make it all automated, so that when anyone checks in code changes, Jenkins builds it, runs all the tests (including Cucumber specs), and allows anyone to deploy any successful build. He said deployment was so easy even a dog could do it (and had on one occasion!).

Joseph explained that many features would be built in such a way that they were “turned off” in production for a few days while they were being built. Only once the whole feature was implemented and present would they activate the code, and perhaps only then for a subset of all the visitors to the site. In this way you can build a feature that takes a few days to implement, and continue to release the new code every few hours. The half built feature is being deployed, it’s just switched off.
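One way such a switch could work is sketched below. This is an assumption about the general technique, not a description of Songkick's implementation: the `ROLLOUT` table, the feature names and `is_enabled` are all hypothetical. Each feature has a percentage of visitors who see it, and a hash of the visitor id decides which side of the line a given visitor falls on.

```python
import hashlib

# Hypothetical toggle table: feature name -> percentage of visitors who see it.
ROLLOUT = {
    "facebook_signup": 0,   # deployed but switched off
    "new_calendar": 25,     # active for roughly a quarter of visitors
}


def is_enabled(feature, visitor_id):
    """Return True if `feature` is switched on for this visitor."""
    percent = ROLLOUT.get(feature, 0)  # unknown features count as off
    # Hash feature + visitor so each visitor gets a stable answer per feature.
    digest = hashlib.sha256(f"{feature}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99
    return bucket < percent


print(is_enabled("facebook_signup", visitor_id=42))  # False while the rollout is 0
```

With a table like this, raising a feature's percentage changes only data, so the half-built code can ship in every release and be activated later without another deployment of new code paths.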

The reason for all this effort to get code out of the door is that a feature can’t be considered Done until its hypothesis has been validated. As Joshua Kerievsky said:

“A story isn’t done until it’s being used by real users in production and has been validated to be a useful part of a product”

By deploying often, you get that confirmation or rejection as soon as possible, so you can evaluate whether the feature was worth building. “Build, Measure, Learn”. The “Measure” part happens after an actual customer has got hold of the code and started using it. That is the real acceptance testing that’s going on, not the Cucumber scenarios.

So that’s the way they work today, but it hasn’t always been quite like that. Joseph also talked about why they have arrived at this process.

The build time issue

The Cucumber scenarios are run at every checkin, and Joseph said that at one point it took 4 hours to run them all. This was a serious problem, delaying feedback and slowing the continuous deployment process to a crawl.

Their first approach was to use their technical skills to divide up the tests and run them in parallel on the cloud. The build time went down from 4 hours to just 16 minutes, but at a cost of about $7000 per month in cloud servers. That proved too expensive to be sustainable, so they reconsidered.

The automated tests are there to give confidence that the system works before you deploy. They realized that some of the tests gave more confidence than others, so they removed something like 60% of them from the build that runs at every checkin. These less valuable tests were then run much less frequently. This took them off the critical path to deployment, and enabled more frequent releases. So far this strategy has led to only insignificant problems in production.
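Splitting a suite this way can be sketched as a simple partition. The scenario names, durations and the "critical" flag below are invented for illustration; how a team decides which tests give the most confidence is the hard part and is not shown here.

```python
# Hypothetical scenarios: (name, minutes to run, on the critical path?)
SCENARIOS = [
    ("signup happy path", 3, True),
    ("signup sad path", 2, True),
    ("legacy report layout", 6, False),
    ("rarely used export", 5, False),
]


def split_suite(scenarios):
    """Separate scenarios run at every checkin from those run on a slower schedule."""
    every_checkin = [s for s in scenarios if s[2]]
    scheduled = [s for s in scenarios if not s[2]]
    return every_checkin, scheduled


def total_minutes(scenarios):
    return sum(minutes for _, minutes, _ in scenarios)


every_checkin, scheduled = split_suite(SCENARIOS)
print("every checkin:", total_minutes(every_checkin), "minutes")
print("scheduled:    ", total_minutes(scheduled), "minutes")
```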

They also re-evaluated their policy for which scenarios to automate in Cucumber, and started putting much more emphasis on unit tests. When you have comprehensive Cucumber tests, it’s easy to have lots of confidence that the system works, and not bother with unit tests. Joseph said they were missing the design benefits of unit testing though – unit tests force your code to be decomposed into units! These days they write fewer Cucumber scenarios, and more unit tests.

Duplicating effort with QA

In addition to automated Cucumber scenarios, the QA people do manual testing before deployment. At one point the developers were surprised to learn that maybe 60-70% of these manual tests were covering functionality that was already well covered by Cucumber specs. They realized that the QA people had little confidence in the automated tests.

Once they had noticed the problem, they began to include the QA people more in the scenario writing process, so they could learn about what the tests did, and how they work. This helped give QA confidence to remove some of the manual tests, so now they only test the most crucial functionality manually before deployment.

Metrics as feedback

Joseph invented a tool called “Limited Red” that records metrics from every test run. It shows failure rates for each scenario, and correlates that with changes in the corresponding feature file, using the Git log. Using this data, he can plot graphs that show each feature, how many times it has failed, and how often it has been edited.

For example, he might find one of the features has failed 16 times, while the code in the feature was only changed 4 times. This could indicate the feature is testing an area of the code that is poorly implemented – the code is broken more often than the test is invalid. This gives developers feedback that they can act on to improve the quality of both the code and the Cucumber scenarios.
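The kind of comparison described here can be sketched as a failures-per-edit ratio. This is not how “Limited Red” is actually implemented; the feature file names and counts are hypothetical, and in practice the two lists would come from test run records and the Git log respectively.

```python
from collections import Counter

# Hypothetical data: one entry per failing run, one per commit touching the feature file.
failures = ["signup.feature"] * 16 + ["search.feature"] * 2
edits = ["signup.feature"] * 4 + ["search.feature"] * 5


def failures_per_edit(failures, edits):
    """Map each failing feature to its number of failures divided by its number of edits."""
    fail_counts = Counter(failures)
    edit_counts = Counter(edits)
    report = {}
    for feature in fail_counts:
        n_edits = edit_counts.get(feature, 0)
        # max(..., 1) avoids dividing by zero for features that were never edited.
        report[feature] = fail_counts[feature] / max(n_edits, 1)
    return report


report = failures_per_edit(failures, edits)
for feature, ratio in sorted(report.items(), key=lambda item: -item[1]):
    print(f"{feature}: {ratio:.1f} failures per edit")
```

A feature that fails far more often than it is edited, like the first one here, is a candidate for a closer look at the code it exercises.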

Joseph has a blog post with more detail about this tool and metrics approach.

My Concluding Thoughts

Most startups fail, and I have little insight into whether Songkick will be successful in the long run. I am fascinated though, by the way they are working. It seems to me they have a great process that promotes communication and feedback, coupled with enough introspection to adapt it as they learn more.

I’m interested to note that Songkick has extended the “Rule of Three” introduced by Lisa Crispin and Janet Gregory. In their book on “Agile Testing” they encourage you to include a tester in all discussions between a developer and customer. At Songkick they appear to have business analysts in the customer role, and additionally include a UX (usability) expert in the discussion.

Joseph didn’t mention it, but I suspect that his experiences automating fewer scenarios with Cucumber may be one of the reasons Liz Keogh wrote her blog post “Step away from the tools”. Or maybe the blog came first, I don’t know. The advice is the same, in any case – BDD is about communication first, test automation second.

The “Limited Red” tool is quite new, and I think Joseph is still learning how to act on the data he’s gathering. Having said that, I have high hopes that eventually this kind of measurement and feedback will find its way into more automated testing tools and Jenkins builds. It seems to me that finding out which of your tests are giving the best value for money and which are costing the most to maintain will be generally useful.

My next challenge is to work out how to apply these kinds of ideas to the situation I find myself in right now. It’s a large company not a startup, the customer seems totally inaccessible, the release cycle is years not hours, the build is slow (when it works at all), and UX experts are thin on the ground. Hmm. As Joseph said, his story is about what they’ve found to work, not some universal “best practices”. It’s given me some new ideas and encouragement – now I just have to work out what principles and practices I can reasonably introduce where I am.

Notes 2013-5-13: one of Joseph’s colleagues wrote a post “From 15 hours to 15 seconds” explaining how they got the build time so low, a few months after I wrote this post. I also got a note from Liz Keogh explaining that her blog post “Step away from the tools” was written well before the events at Songkick, in March 2011, and that she consulted with them briefly after that, discussing the build time problem amongst other issues.

Posted by Emily Bache on 2012-04-23 at 08:26 under Experience Report.