In my last post I discussed some exercises put together by Luca Minudel. He was using them as part of a study into how developers’ skill at removing SOLID violations was related to their skill at TDD.

I initially did the exercises in Java, then translated them into Python so we could look at them in our local Python User Group meeting last week. What follows are my own opinions, but I must thank all the pythonistas who were at the meeting, and Andrew Dalke who couldn’t be there but was kind enough to share his opinions and code anyway. What I write below owes a lot to their input. I’m so lucky to be in this community!

We found that in Python, some violations of the Open-Closed Principle are much easier to handle than in Java or C#. You can monkeypatch, and exploit the fact that data is only private by convention. The advantage of being able to get code under test without changing it is of course that you reduce the need for risky refactorings where you unintentionally break the code and get no failing tests to alert you to it.

So the TirePressure example has a violation of the Open-Closed principle where it’s hard to change the specific Sensor used without opening up the existing code. In Python it was dead easy to get under test without modifying the code, because although the sensor is marked as private in the Alarm class, nothing in the language stops you from assigning to it. (See this explanation of how Python considers private data.)

Here’s what (some of) the test code looks like, without making any modifications to the production code: (using the testing framework py.test)

from tire_pressure_monitoring import Alarm

class StubSensor(object):

    def __init__(self, pressures):
        self.pressures = pressures

    def pop_next_pressure_psi_value(self):
        return self.pressures.pop()

def test_pressure_in_expected_range_doesnt_trigger_alarm():
    alarm = Alarm()
    alarm._sensor = StubSensor([18])
    alarm.check()
    assert not alarm.is_alarm_on()

def test_pressure_below_expected_range_triggers_alarm():
    alarm = Alarm()
    alarm._sensor = StubSensor([15])
    alarm.check()
    assert alarm.is_alarm_on()

This means you can quickly get some tests in place. You probably shouldn’t leave the tests like that though, since they are relying on implementation details of the class. This could make them fragile in the face of refactoring; we’d rather the tests relied only on the public interface. These tests also use hard-coded numbers where it’s not obvious why those values were chosen, which is another sign the production code could be improved.
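One way to act on that feedback is to let the caller supply the sensor, so the tests no longer reach into a private attribute. What follows is a minimal sketch, not the kata’s actual production code: the constructor signature and the threshold names and values (17 and 21 psi) are assumptions made for illustration, chosen only so that 18 is in range and 15 is below it, as in the tests above.

```python
class Alarm(object):
    # Hypothetical thresholds, named so that tests can refer to them
    LOW_PRESSURE_THRESHOLD = 17
    HIGH_PRESSURE_THRESHOLD = 21

    def __init__(self, sensor):
        # The sensor is passed in, so a test can supply a stub
        # through the public interface
        self._sensor = sensor
        self._is_alarm_on = False

    def check(self):
        pressure = self._sensor.pop_next_pressure_psi_value()
        if (pressure < self.LOW_PRESSURE_THRESHOLD
                or pressure > self.HIGH_PRESSURE_THRESHOLD):
            self._is_alarm_on = True

    def is_alarm_on(self):
        return self._is_alarm_on


class StubSensor(object):
    def __init__(self, pressures):
        self.pressures = pressures

    def pop_next_pressure_psi_value(self):
        return self.pressures.pop()


def test_pressure_just_below_low_threshold_triggers_alarm():
    alarm = Alarm(StubSensor([Alarm.LOW_PRESSURE_THRESHOLD - 1]))
    alarm.check()
    assert alarm.is_alarm_on()


def test_pressure_at_low_threshold_doesnt_trigger_alarm():
    alarm = Alarm(StubSensor([Alarm.LOW_PRESSURE_THRESHOLD]))
    alarm.check()
    assert not alarm.is_alarm_on()
```

With the thresholds named, the tests can say why a value was chosen instead of relying on a bare 15 or 18.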

Similarly in the HTMLConverter example, there is a violation of the Open-Closed principle that makes it awkward to have the code read from a string instead of a file. One way to get it under test initially is to use monkeypatching. You just substitute a different implementation of the “open” function that doesn’t in fact open a file, but rather provides a file-like object constructed from a string. Since we have duck typing, the production code doesn’t notice the substitution. You do have to be careful to put the “open” function back to normal at the end of the test though, or the test will have side effects.

So for example:

# Python 2 import; on Python 3 use "from io import StringIO" instead
from cStringIO import StringIO

import unicode_to_html_converter
from unicode_to_html_converter import UnicodeFileToHtmlTextConverter

class StubOpen(object):
    def __init__(self, text):
        self.text = text

    def __call__(self, *args, **kwargs):
        return StringIO(self.text)

def test_convert_to_html():
    try:
        stub_open = StubOpen("text to convert <>&\"\n")
        unicode_to_html_converter.open = stub_open
        converter = UnicodeFileToHtmlTextConverter("a filename that will be ignored by StubOpen")
        # the special characters come back as HTML entities and the newline as <br />
        assert "text to convert &lt;&gt;&amp;&quot;<br />" == converter.convert_to_html()
    finally:
        unicode_to_html_converter.open = open
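If you are on Python 3, the standard library can take care of putting “open” back for you. This is a sketch using unittest.mock rather than a hand-written stub. Note that it patches the built-in open for all code while the with-block is active, which is broader than patching a single module. The function read_first_line is a hypothetical stand-in for production code that insists on opening a file by name.

```python
import io
from unittest import mock


def read_first_line(filename):
    # Hypothetical production code that opens the file itself
    with open(filename) as f:
        return f.readline().rstrip()


def test_read_first_line_from_string():
    fake_file = io.StringIO("first line\nsecond line\n")
    # Inside the with-block every call to open() returns fake_file;
    # mock.patch restores the real open on exit, even if the assert fails
    with mock.patch("builtins.open", return_value=fake_file):
        assert read_first_line("a filename that will be ignored") == "first line"
```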

Again you can quickly get some tests in place, at the cost of a test that is somewhat awkward to read. The other way to get the code under test without changing it is the same as for C# or Java – put the text in a temporary file that is deleted afterwards. This may be simpler to understand, but will be slower to execute. It’s a tradeoff.
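The temporary file approach might look something like this on Python 3. It is only a sketch: convert_file_to_html is a hypothetical stand-in for the converter (it escapes each line and ends it with a line break tag), and the file name is arbitrary.

```python
import html
import os
import tempfile


def convert_file_to_html(filename):
    # Hypothetical stand-in for the converter under test
    with open(filename) as f:
        return "".join(
            html.escape(line.rstrip(), quote=True) + "<br />" for line in f
        )


def test_convert_to_html_using_temporary_file():
    # TemporaryDirectory deletes the directory, and the file inside it,
    # when the with-block exits
    with tempfile.TemporaryDirectory() as directory:
        filename = os.path.join(directory, "input.txt")
        with open(filename, "w") as f:
            f.write("text to convert <>&\"\n")
        expected = "text to convert &lt;&gt;&amp;&quot;<br />"
        assert convert_file_to_html(filename) == expected
```

Nothing is patched here, so the test has no side effects to undo, but every run touches the disk.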

I guess the thing with these exercises is that you can get them under test in various ways without correcting the SOLID violations, or the other problems in the code. The idea is that skilled developers will not leave it at that. They will listen to the feedback the tests give them about the code being unnecessarily hard to test, and respond by improving the design.

Do you automatically get better design with TDD? Does an otherwise average software developer produce superior designs if they write the tests first rather than afterwards? Does it make a difference what style of TDD you use?

incident #1

I was at a session at XP2012 with J.B. Rainsberger called “Architecture without Trying”. He demonstrated how he could develop a software system for Point-of-Sale terminals using TDD, and how the design naturally tended towards an MVC pattern as he did it. He claimed that purely by doing TDD, and focussing on two things (removing duplication and improving names), a good design would naturally emerge.

incident #2

I heard a talk by Luca Minudel at Agile Testing Days 2011 called “TDD with Mock Objects: Design Principles and Emergent Properties”. He was talking about a study he had done where he got people with varying levels of experience at TDD to do four short exercises. He also got them to answer a questionnaire about their knowledge of SOLID principles, and TDD. He then evaluated how well the designs they came up with in the exercises adhered to SOLID principles, and tried to correlate that with their TDD skill. He found that the people skilled in TDD did better in the exercises than those who only knew the theory of SOLID principles. The practice of TDD seems to help people with design. Luca also found that those more experienced with the London School of TDD did even better than other TDDers.

incident #3

I was working at a client recently when I met a developer from a different department. He came to see me several times over a period of a couple of weeks, and asked for advice about TDD. On about his fourth visit he told me he had written some code and, now that it was basically working, he wanted to write tests for it. He said he was having difficulty since he’d written a lot of static “helper” methods. I advised him that static methods make code quite hard to test, and can often be a sign of a not very good object oriented design.

He suggested we should invest in a fancy mocking tool that would enable him to easily replace these static methods in the tests. I told him a better investment would be for him to learn to write the tests first, get better at OO design, and not use static methods in the first place. I was probably a bit blunt, and he was quite polite, all things considered. He protested that he shouldn’t have to change the production code in order to get it under test, then left. That was the last time he came to me for advice.

Discussion

So does doing TDD guarantee better design? Well it should certainly help. I’ve presented before about the way TDD gives you early feedback on your design and plenty of opportunities to refactor. It’s less help though if you don’t know what a good design looks like in the first place. I think J.B. goes too far in his claims – if you don’t know MVC or SOLID principles then I’d be surprised if they started turning up in your code with any consistency.

“No tool nor technique can survive inadequately trained developers”

(A quote attributed to Steve Freeman). I think you do need to invest in learning good design techniques independently of TDD. If you lack basic OO design skills you probably won’t be able to do TDD in the first place, London School or otherwise.

I’ve been learning and improving my practice of TDD, including the London School, for many years now, and I was intrigued by Luca’s claims that it led to better adherence to SOLID principles than classic TDD. The London School involves an outside-in approach to design that makes heavy use of mocks to check interactions between objects. This is in contrast to a more classic TDD style that prefers to verify the code works by checking the state of an object after an interaction. I wouldn’t claim to be an expert in the London School of TDD, but I think I understand the basics and can adopt this style when I feel the problem is appropriate for it.
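To make the contrast concrete, here is a small sketch of the same behaviour tested in both styles. The Checkout class, its gateway collaborator and the FakeGateway are hypothetical names invented for this illustration, and the mock comes from Python 3’s unittest.mock.

```python
from unittest import mock


class Checkout(object):
    def __init__(self, gateway):
        self.gateway = gateway
        self.paid = False

    def pay(self, amount):
        self.gateway.charge(amount)
        self.paid = True


def test_pay_london_style():
    # Interaction-based: a mock stands in for the collaborator,
    # and the test checks the message that was sent to it
    gateway = mock.Mock()
    Checkout(gateway).pay(30)
    gateway.charge.assert_called_once_with(30)


class FakeGateway(object):
    # A simple hand-written fake that records the charges it receives
    def __init__(self):
        self.charges = []

    def charge(self, amount):
        self.charges.append(amount)


def test_pay_classic_style():
    # State-based: run the code, then inspect the resulting state
    gateway = FakeGateway()
    checkout = Checkout(gateway)
    checkout.pay(30)
    assert checkout.paid
    assert gateway.charges == [30]
```

In the first test the collaborator’s interface is what gets designed; in the second the test only cares where the objects ended up.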

I tried out Luca’s four problems, (here on github) to see how I did. Luca very kindly gave me some feedback on my code, and I found I hadn’t done as well as I had hoped to in adhering to SOLID principles. I’d got the code under test, but in a few places I could have improved the design more. I also slightly misunderstood the requirements for two of the problems, which led me to fork the repo and improve the instructions 🙂

I think in the cases where I could have done better with the design, it’s possible using the London School of TDD would have led to the improvements. I’m feeling there might be something in Luca’s conjecture. On the other hand, these problems might be so small and abstract that I didn’t behave the same as I would in a real codebase. Certainly in one case I felt it wasn’t worth extracting an interface when there was only one implementation for it. In a real system maybe it would be more obvious that more implementations were likely, and that adding the interface would lead to a more decoupled design. Or then again maybe I’m just too used to Python where explicit interface classes don’t tend to be used. Or maybe I’m just making excuses! In any case, doing these exercises has made me more interested to improve my knowledge and practice of the London School TDD style.

I think these exercises are interesting little code katas in their own right, quite apart from Luca’s study on TDD. I think you can use them to learn about the SOLID principles, and practice some of the refactorings you often have to do to get badly designed code under test.

I’m working on a python translation of the exercises so we can try them out at the Gothenburg Python User Group meeting next week. Feel free to fork the repo and have a go at them yourself.

For a little while now I’ve been collecting Refactoring Kata exercises in a github repo, (you’re welcome to clone it and try them out). I’ve recently facilitated working on some of these katas at various coding dojo meetings, and participants seem to have enjoyed doing them. I usually give a short introduction about the aims of the dojo and the refactoring skills we’re focusing on, then we split into pairs and work on one of these Refactoring Katas for a fixed timebox. Afterwards we compare designs and discuss what we’ve learnt in a short retrospective. It’s satisfying to take a piece of ugly code and after only an hour or so make it into something much more readable, flexible, and significantly smaller.

Test Driven Development is a multifaceted skill, and one aspect is the ability to improve a design incrementally, applying simple refactorings one at a time until they add up to a significant design improvement. I’ve noticed in these dojo meetings that some pairs do better than others at refactoring in small steps. I can of course stand behind them and coach to some extent, but I was wondering if we could use a tool that would watch more consistently, and help pairs notice when they are refactoring poorly.

I spent an hour or so doing the GildedRose refactoring kata myself in Java, and while I was doing it I had two different monitoring tools running in the background, the “Codersdojo Client” from http://content.codersdojo.org/codersdojo_client/, and “Sessions Recorder” from Industrial Logic. (This second tool is commercial and licenses cost money, but I actually got given a free license so I could try it out and review it). I wanted to see if these tools could help me to improve my refactoring skills, and whether I could use them in a coding dojo setting to help a group.

Setting it up and recording your Kata
The Codersdojo Client is a ruby gem that you download and install. When you want to work on a kata, you have to fiddle about a bit on the command line getting some files in the right places, (like the junit jar), then modify and run a couple of scripts. It’s not difficult if you follow the instructions and know basically how to use the command line. You have a script running all the time you are coding, and it runs the tests every time you save a source file.

The Sessions Recorder is an Eclipse plugin that you download and install in the same way as other Eclipse plugins. It puts a “record” button on your toolbar. You press “record” before you start working on the kata.

Uploading the Kata for analysis
When you’ve finished the kata, you need to upload your recording for analysis. With the Codersdojo Client, when you stop the script, it gives you the option of doing the upload. When that’s completed it gives you a link to a webpage, where you can fill in some metadata about the Kata and who you are. Then it takes you to a page with the full final code listing and analysis.

The Sessions Recorder is similar. You press the button on the Eclipse toolbar to stop recording, and save the session in a file. Then you go to the Industrial Logic webpage, log into your account, and go to the page where you upload your recorded Session file. You don’t have to enter any metadata, since you have an account and it remembers who you are, (you did pay for this service after all!) It then takes you to a page of analysis.

Codersdojo Client Analysis
The codersdojo client creates a page that you can make public if you want – mine is available here. It gives you a graph like this:

[Screenshot: Codersdojo Client graph of moves, coloured red or green by test status]

It’s showing how long you spent between saving the file, (a “move”), and whether the tests were red or green when you saved. There are also some general statistics about how long you spent in each move, and how many modifications there were on average. It points out your three longest moves, and has links to them so you can see what you were doing in the code at those points.

I think this analysis is quite helpful. I can see that I’m going no more than two or three minutes between saving the file, and usually if the tests go red I fix them quickly. Since it’s a refactoring kata I spend quite a lot of moves at the start where it’s all green, as I build up tests to cover the functionality. In the middle there is a red patch, and this is a clear sign to me that I could have done that part of the kata better. Looking over my code I was doing a major redesign and I should have done it in a better way that would have kept the tests running in the meantime.

Towards the end of the kata I have another flurry of red moves, as I start adding new functionality for “Conjured” items. I tried to move into a more normal TDD red-green-refactor cycle at that point, but it actually doesn’t look like I succeeded very well from this graph. I think I rushed past the “green” step without running the tests, then did a big refactoring. It worked in the end but I think I could have done that better too.

Sessions Recorder Analysis
The Sessions Recorder produces a page which is personal to me, and I don’t think it allows me to share it publicly on the web. On the page is a graph that looks like this:

[Graph: Sessions Recorder timeline of passing tests, failing tests and compiler errors, with a score line]

As you can see it also shows how long I spend with passing and failing tests, in a slightly different way from the Codersdojo Client’s graph. It also distinguishes compiler errors from failing tests, (pink vs red).

This graph also clearly shows the areas I need to improve – the long pink patch in the middle where I do a major redesign in too large a step, and at the red bit at the end when I’m not doing TDD all that well.

The line on the graph is a “score” where it is awarding me points when I successfully perform the moves of TDD. Further down the page it gives me a list of the “events” this score is based on:

[Screenshot: Sessions Recorder list of scored events]

(This is just some of the events, to show you the kinds of things this picks up on.) “New Green Test” seems to score zero points, which is a bit disappointing, but adding a failing test gets a point, and so does making it pass. “Went green, but broke other tests” gets zero points. It’s clearly designed to help me successfully complete red-green-refactor cycles, not reward me for adding test coverage to existing code, then refactoring it.

There is another graph, more focused on the tests:

[Screenshot: Sessions Recorder graph of test results over time]

This graph has mouseover texts so when you hover over a red dot, it shows all the compilation errors you had at that point, and if you hover over a green dot it tells you which tests were passing. It also distinguishes “compiler errors” from a “compiler rash”. The difference is that a “compiler rash” is a more serious compilation problem that affects several files.

You can clearly see from this graph that the first part of the kata I was building up the test coverage, then just leaning on these tests and refactoring for the rest. It hasn’t noticed that I had two @Ignore’d tests until the last few minutes though. (I added failing tests for Conjured Items near the start then left them until I had the design suitably refactored near the end).

I actually found this graph quite hard to use to work out what I need to improve on. There seem to be three long gaps in the middle, full of compilation errors where I wasn’t running the tests. Unlike with the Codersdojo Client, there isn’t a link to the actual code changes I was making at those points. I’m having trouble working out just from the compiler errors what I should have been doing differently. I think one of these gaps is the same major redesign I could see clearly in the Codersdojo Client graph as a too big step, but I’m not so sure what the other two are.

There are further statistics and analysis as well. There is a section for “code smells” but it claims not to have found any. The code I started with should qualify as smelly, surely? Not sure why it hasn’t picked up on that.

Conclusions
I think both tools could help me to become better at Test Driven Development, and could be useful in a dojo setting. I can imagine pairs comparing their graphs after completing the kata, discussing how they handled the refactoring steps, and where the design ended up. Pairs could swap computers and look through someone else’s statistics to get some comparison with their own.

The Codersdojo Client is free to use, and works with a large number of programming languages, and any editor. You do have to be comfortable with the command line though. The Sessions Recorder tool only supports Java and C# via Eclipse. It has more detailed analysis, but for this Refactoring Kata I don’t think it was as helpful as it could have been.

The other big difference between the tools is about openness. The Sessions Recorder keeps your analysis private to you, and if you want to discuss your performance, it lets you do so with the designers of the tool via a “comment on this page” function. I haven’t tried that out yet so I’m not sure how it works, that is, whether you get feedback from a real person as well as the tool.

The Codersdojo Client also lets you keep your analysis private if you want, but in addition lets you publish your Kata performance for general review, as I have done. You can share your desire for feedback on twitter, g+ or facebook. People can go in and comment on specific lines of code and make suggestions. That wouldn’t be so needed during a dojo meeting, but might be useful if you were working alone.

Further comparison needed
On another occasion I tried out the Sessions Recorder on a normal TDD kata, and found the analysis much better. For example this graph of me doing the Tennis kata from scratch:

[Graph: Sessions Recorder timeline for the Tennis kata]

This shows a clear red-green pattern of small steps, and steadily increasing score rewarding me for doing TDD correctly. Unfortunately I didn’t do a Codersdojo Client session at the same time as this one, for comparison. A further blog post is clearly needed for this case… 🙂

Please note – As of March 2013, I have rewritten this post in the light of further experience and discussions. The updated post is available here.

I feel like I’ve spent most of my career learning how to write good automated tests in an agile environment. When I downloaded JUnit in the year 2000 it didn’t take long before I was hooked – unit tests for everything in sight. That gratifying green bar is near-instant feedback that everything is as expected, my code does what I intended, and I can continue developing from a firm foundation.

Later, starting in about 2002, I began writing larger granularity tests, for whole subsystems; functional tests if you like. The feedback that my code does what I intended, and that it has working functionality has given me confidence time and again to release updated versions to end-users.

Often, I’ve written functional tests as regression tests, after the functionality is supposed to work. In other situations, I’ve been able to write these kinds of tests in advance, as part of an ATDD, or BDD process. In either case, I’ve found the regression tests you end up with need to have certain properties if they’re going to be useful in an agile environment moving forward. I think the same properties are needed for good agile functional tests as for good unit tests, but it’s much harder. Your mistakes are amplified as the scope of the test increases.

I’d like to outline four principles of agile test automation that I’ve derived from my experience.

Coverage

If you have a test for a feature, and there is a bug in that feature, the test should fail. Note I’m talking about coverage of functionality, not code coverage, although these concepts are related. If your code coverage is poor, your functionality coverage is likely also to be poor.

If your tests have poor coverage, they will continue to pass even when your system is broken and functionality unusable. This can happen if you have missed out needed test cases, or when your test cases don’t check properly what the system actually did. The consequence of poor coverage is that you can’t refactor with confidence, and need to do additional (manual) testing before release.

The aim for automated regression tests is good Coverage: If you break something important and no tests fail, your test coverage is not good enough. All the other principles are in tension with this one – improving Coverage will often impair the others.

Readability

When you look at the test case, you can read it through and understand what the test is for. You can see what the expected behaviour is, and what aspects of it are covered by the test. When the test fails, you can quickly see what is broken.

If your test case is not readable, it will not be useful. When it fails you will have to dig through other sources outside of the test case to find out what is wrong. Quite likely you will not understand what is wrong and you will rewrite the test to check for something else, or simply delete it.

As you improve Coverage, you will likely add more and more test cases. Each one may be fairly readable on its own, but taken all together it can become hard to navigate and get an overview.

Robustness

When a test fails, it means the functionality it tests is broken, or at least is behaving significantly differently from before. You need to take action to correct the system or update the test to account for the new behaviour. Fragile tests are the opposite of Robust: they fail often for no good reason.

Aspects of Robustness you often run into are tests that are not isolated from one another, duplication between test cases, and flickering tests. If you run a test by itself and it passes, but fails in a suite together with other tests, then you have an isolation problem. If you have one broken feature and it causes a large number of test failures, you have duplication between test cases. If you have a test that fails in one test run, then passes in the next when nothing changed, you have a flickering test.

If your tests often fail for no good reason, you will start to ignore them. Quite likely there will be real failures hiding amongst all the false ones, and the danger is you will not see them.

As you improve Coverage you’ll want to add more checks for details of your system. This will give your tests more and more reasons to fail.

Speed

As an agile developer you run the tests frequently: both (a) every time you build the system, and (b) before you check in changes. I recommend time limits of 2 minutes for (a) and 10 minutes for (b). This fast feedback gives you the best chance of actually being willing to run the tests, and of finding defects when they’re cheapest to fix.

If your test suite is slow, it will not be used. When you’re feeling stressed, you’ll skip running them, and problem code will enter the system. In the worst case the test suite will never become green. You’ll fix the one or two problems in a given run and kick off a new test run, but in the meantime someone else has checked in other changes, and the new run is not green either. You’re developing all the while the tests are running, and they never quite catch up. This can become pretty demoralizing.

As you improve Coverage, you add more test cases, and this will naturally increase the execution time for the whole test suite.

How are these principles useful?

I find it useful to remember these principles when designing test cases. I may need to make tradeoffs between them, and it helps just to step back and assess how I’m doing on each principle from time to time as I develop.

I also find these principles useful when I’m trying to diagnose why a test suite is not being useful to a development team, especially if things have got so bad they have stopped maintaining it. I can often identify which principle(s) the team has missed, and advise how to refactor the test suite to compensate.

For example, if the problem is lack of Speed you have some options and tradeoffs to make:

  • Invest in hardware and run tests in parallel (costs $)
  • Use a profiler to optimize the tests for speed the same as you would production code (may affect Readability)
  • Push down tests to a lower level of granularity where they can execute faster (may reduce Coverage and/or increase Readability)
  • Identify key test cases for essential functionality and remove the other test cases. (sacrifice Coverage to get Speed)

Explaining these principles can promote useful discussions with people new to agile, particularly testers. The test suite is a resource used by many agile team members (developers, analysts, managers etc.) in its role as “Living Documentation” for the system, (See Gojko Adzic’s writings on this). This emphasizes the need for both Readability and Coverage. Automated tests in agile are quite different from in a traditional process, since they are run continually throughout the process, not just at the end. I’ve found many traditional automation approaches don’t lead to enough Speed and Robustness to support agile development.

I hope these principles will help you to reason about the automated tests in your suite.

I’ve heard Joseph Wilk speak before, so I was delighted when he accepted our invitation to come to Scandinavian Developer Conference and share some ideas around “Acceptance testing in the land of the Startup”. In this post I aim to summarize my understanding of what Joseph said. I think what he is doing at Songkick is really interesting and innovative, and is applicable beyond that particular environment.

Introduction

Joseph began by introducing what he meant by “Startup” using Eric Ries’ definition:

“a human institution designed to create new products and services under conditions of extreme uncertainty”

“Acceptance testing” is usually about checking the product meets the previously agreed requirements – but if everything is changing all the time, that is not as simple as it sounds. Joseph’s acceptance tests reflect what “done” looks like in conditions of extreme uncertainty.

Joseph talked about the Cynefin model (by Dave Snowden) and how working in a startup feels like you’re in chaos a lot of the time, although parts of the business may be merely complicated or even simple. An effective strategy in the “chaos” part of the Cynefin model is “Act, Sense, Respond”, which he related to the lean startup cycle of “Build, Measure, Learn”.

The “Build” part of that cycle for their software product often takes more time than the other parts, and work should be coordinated between several people. During the “Build” phase you can use automated tests to help with communication and feedback. In the “Measure” phase, you’re primarily looking for feedback from the actual end users, rather than automated tests.

The rest of the talk was largely about how Songkick use automated tests during the “Build” phase, and how they do acceptance testing with end users in the “Measure” phase.

The startup

The software Joseph is working on is for the startup Songkick which is devoted to live music. They are about 20-30 people, and the majority are programmers. They have two dedicated UX specialists and two QA specialists. Most people are multi-skilled, care deeply about the quality of the product, and are not afraid to learn how to help outside of their speciality.

Songkick holds three principles dear:

  • Embrace Change
  • Usability
  • Testing

The rest of his talk was about the concrete practices they use around this last principle – Testing. Joseph stressed that he was reporting what they had found to work, and not holding up any “best practice” or industry gold standard.

Features

At Songkick, they use Behaviour Driven Development. They are continually building and improving a “ubiquitous language” to describe their system, by writing Features and Scenarios in a semi-formal syntax. The tool Cucumber is used to turn these features and scenarios into Executable Specifications (aka Automated Tests).

Joseph gave a short intro to how Cucumber works; take a look at this site if you’re not familiar with it. In short, Cucumber lets you define many “Features”, each of which defines numerous “Scenarios” which exemplify the desired feature behaviour. It requires you to use a particular syntax called “Gherkin”. The gist of it is that you use the following formats when writing features and scenarios:

Feature:
“In order to… I want… So that…”

Scenario:
“Given… When… Then…”

If you put in some extra work and also write “Step Definitions”, your scenarios become executable, and can automatically check the application behaves as expected when run.
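To give a feel for what a Step Definition does, here is a toy sketch in Python. It is not Cucumber’s implementation or its API: the step decorator, the run_scenario function and the signup steps are all made up for illustration. The idea it shows is simply that each line of a scenario is matched against a pattern, and the function registered for that pattern is run.

```python
import re

STEP_DEFINITIONS = []


def step(pattern):
    # Register a function to run for scenario lines matching the pattern
    def decorator(func):
        STEP_DEFINITIONS.append((re.compile(pattern), func))
        return func
    return decorator


def run_scenario(text, context):
    for line in text.strip().splitlines():
        # Drop the leading keyword (Given/When/Then/And) before matching
        sentence = re.sub(r"^\s*(Given|When|Then|And)\s+", "", line)
        for pattern, func in STEP_DEFINITIONS:
            match = pattern.fullmatch(sentence)
            if match:
                func(context, *match.groups())
                break
        else:
            raise LookupError("No step definition for: " + line.strip())


@step(r"a visitor called (\w+)")
def given_visitor(context, name):
    context["visitor"] = name
    context["members"] = []


@step(r"they sign up")
def when_sign_up(context):
    context["members"].append(context["visitor"])


@step(r"(\w+) is a member")
def then_is_member(context, name):
    assert name in context["members"]


scenario = """
Given a visitor called Anna
When they sign up
Then Anna is a member
"""
run_scenario(scenario, {})
```

The scenario text stays readable to non-programmers, while the step functions are where the application actually gets driven and checked.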

The primary purpose of the Cucumber specifications is to promote communication within the team: the words are carefully chosen to have clear meanings related to the domain of the system you’re building.

Hypotheses

In Lean Startup you don’t talk so much about building a “Feature”, more about having a “Hypothesis”. As in “if we add a pink button here, do we get more traffic to the site?”

Joseph showed the following example of a Feature with a Hypothesis:

Facebook signup
In order to increase signup
I want visitors to sign up via Facebook
So that we see a 10% increase in signups

That last line is the measurable part of the hypothesis, that makes it more than just a normal Cucumber Feature. When you’ve delivered a system that has this feature, it should lead to 10% more signups. If it doesn’t, the system might be modified to remove this feature. Features written in this form (In order to… I want… So that…) are prioritized in the product backlog.

When a developer is ready to start working on a feature, she calls a meeting to discuss it. At the meeting there should be representatives from four different job roles: a developer, tester, business analyst and usability expert. Together they discuss what the feature is, the details of how it should work, and perhaps create some wireframes or UI mockups. They will also brainstorm perhaps 7-8 scenarios, such as:

  • successful signup via Facebook
  • failed signup via Facebook
  • Facebook changes their API and no one can sign in any more

After the meeting, the developer starts working on the code to implement the feature, bearing in mind the discussion and the scenarios. The people in the other job roles will also get involved in development, and assessing whether the feature is ready for deployment. Some or all of the scenarios they have come up with will be automated as executable with Cucumber.

There is an overhead to automating scenarios with Cucumber: implementing “Step Definitions” takes time, and afterwards the scenarios are relatively slow to execute. Joseph said that they were increasingly automating only a few scenarios for each feature, often just a happy and a sad path. The other scenarios would perhaps be used in unit tests, or exploratory testing. The most important aspect of writing many scenarios was to discover issues with a feature before implementation.

All their Cucumber features and scenarios are published in a searchable web interface called “Relish” which he said encouraged developers to use more business-friendly language, since they knew non-programmers would look there. Joseph said features containing incomprehensible technobabble like “Given the asynchronous message buffer queue has been emptied” (!) do appear occasionally, but don’t last long. Their ubiquitous language is highly visible and in daily use.

Actual Acceptance Tests

Songkick uses a Continuous Delivery approach, where anyone in the company can deploy the latest code at the push of a button. They have invested a lot of effort to make it all automated, so that when anyone checks in code changes, Jenkins builds it, runs all the tests (including Cucumber specs), and allows anyone to deploy any successful build. He said deployment was so easy even a dog could do it (and had on one occasion!).

Joseph explained that many features would be built in such a way that they were “turned off” in production for a few days while they were being built. Only once the whole feature was implemented and present would they activate the code, and perhaps even then only for a subset of all the visitors to the site. In this way you can build a feature that takes a few days to implement, and continue to release the new code every few hours. The half-built feature is being deployed, it’s just switched off.
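Joseph didn't go into how their switches are implemented, so what follows is only a sketch of one common way to do it. The `ROLLOUT_PERCENT` table, the `is_enabled` function and the feature name are hypothetical. Each visitor is hashed into a stable bucket from 0 to 99, and the feature is on for buckets below the configured percentage.

```python
import hashlib

# Hypothetical toggle table: feature name -> percentage of visitors who see it.
# 0 means the code is deployed but switched off for everyone.
ROLLOUT_PERCENT = {"facebook_signup": 0}


def is_enabled(feature, visitor_id, rollout=ROLLOUT_PERCENT):
    """Decide whether `feature` is switched on for this visitor."""
    percent = rollout.get(feature, 0)
    # Hash feature and visitor together so each visitor gets a stable
    # bucket in the range 0..99 for this particular feature.
    digest = hashlib.sha256(f"{feature}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Because the bucket is stable, raising the percentage only ever adds visitors; nobody who already has the feature loses it as the rollout widens.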

The reason for all this effort to get code out of the door is that a feature can’t be considered Done until its hypothesis has been validated. As Joshua Kerievsky said:

“A story isn’t done until it’s being used by real users in production and has been validated to be a useful part of a product”

By deploying often, you get that confirmation or rejection as soon as possible, so you can evaluate whether the feature was worth building. “Build, Measure, Learn”. The “Measure” part happens after an actual customer has got hold of the code and started using it. That is the real acceptance testing that’s going on, not the Cucumber scenarios.

So that’s the way they work today, but it hasn’t always been quite like that. Joseph also talked about why they have arrived at this process.

The build time issue

The Cucumber scenarios are run at every checkin, and Joseph said that at one point it took 4 hours to run them all. This was a serious problem, delaying feedback and slowing the continuous deployment process to a crawl.

Their first approach was to use their technical skills to divide up the tests and run them in parallel on the cloud. The build time went down from 4 hours to just 16 minutes, but at a cost of about $7000 per month in cloud servers. That proved too expensive to be sustainable, so they reconsidered.
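Going from 4 hours (240 minutes) to 16 minutes implies something like 15 workers even if the split were perfectly even. I don't know how Songkick actually divided up their tests, but a simple way to think about it is a greedy split: take the scenarios longest first, and always hand the next one to the least loaded worker. The function name and the durations below are made up for illustration.

```python
import heapq


def split_across_workers(durations, workers):
    """Greedy split: longest scenario first, onto the least-loaded worker.

    `durations` maps scenario name -> minutes. Returns the per-worker
    assignment and the resulting wall-clock time (the busiest worker).
    """
    heap = [(0, i) for i in range(workers)]  # (current load, worker index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(workers)]
    for name, minutes in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        assignment[i].append(name)
        heapq.heappush(heap, (load + minutes, i))
    wall_clock = max(load for load, _ in heap)
    return assignment, wall_clock


# Hypothetical scenario durations in minutes.
assignment, wall_clock = split_across_workers(
    {"a": 4, "b": 3, "c": 2, "d": 1}, workers=2
)
```

Run serially these four scenarios take 10 minutes; split over two workers the build is as slow as the busiest worker, which is where the cost of adding ever more machines comes from.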

The automated tests are there to give confidence that the system works before you deploy. They realized that some of the tests gave more confidence than others, so they removed something like 60% of them from the build that runs at every checkin. These less valuable tests were then run much less frequently, which took them off the critical path to deployment and enabled more frequent releases. So far this strategy has led to only insignificant problems in production.

They also re-evaluated their policy for which scenarios to automate in Cucumber, and started putting much more emphasis on unit tests. When you have comprehensive Cucumber tests, it’s easy to have lots of confidence that the system works, and not bother with unit tests. Joseph said they were missing the design benefits of unit testing though – unit tests force your code to be decomposed into units! These days they write fewer Cucumber scenarios, and more unit tests.

Duplicating effort with QA

In addition to automated Cucumber scenarios, the QA people do manual testing before deployment. At one point the developers were surprised to learn that maybe 60-70% of these manual tests were covering functionality that was already well covered by Cucumber specs. They realized that the QA people had little confidence in the automated tests.

Once they had noticed the problem, they began to include the QA people more in the scenario writing process, so they could learn about what the tests did, and how they work. This helped give QA confidence to remove some of the manual tests, so now they only test the most crucial functionality manually before deployment.

Metrics as feedback

Joseph invented a tool called “Limited Red” that records metrics from every test run. It shows failure rates for each scenario, and correlates that with changes in the corresponding feature file, using the Git log. Using this data, he can plot graphs that show each feature, how many times it has failed, and how often it has been edited.

For example, he might find one of the features has failed 16 times, while the code in the feature was only changed 4 times. This could indicate the feature is testing an area of the code that is poorly implemented – the code is broken more often than the test is invalid. This gives developers feedback that they can act on to improve the quality of both the code and the Cucumber scenarios.
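This is not how “Limited Red” itself is implemented, which I haven't seen, but the comparison can be sketched in a few lines. I'm assuming two inputs: a list with one entry per failed run naming the feature file, and a list with one entry per commit that touched a feature file (the kind of thing you could extract from the Git log). The function and file names are hypothetical.

```python
from collections import Counter


def failure_edit_report(failed_runs, edited_files):
    """Rank feature files by failures per edit, highest first."""
    failures = Counter(failed_runs)
    edits = Counter(edited_files)
    report = []
    for feature in sorted(set(failures) | set(edits)):
        f, e = failures[feature], edits[feature]
        # A feature that fails without ever being edited ranks highest.
        ratio = f / e if e else float("inf")
        report.append((feature, f, e, ratio))
    report.sort(key=lambda row: row[3], reverse=True)
    return report


# Hypothetical data matching the example above: 16 failures, 4 edits.
failed_runs = ["signup.feature"] * 16 + ["search.feature"] * 2
edited_files = ["signup.feature"] * 4 + ["search.feature"] * 4
report = failure_edit_report(failed_runs, edited_files)
```

A high ratio suggests the failures are coming from the production code rather than from churn in the scenarios, which is the signal described above.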

Joseph has a blog post with more detail about this tool and metrics approach.

My Concluding Thoughts

Most startups fail, and I have little insight into whether Songkick will be successful in the long run. I am fascinated though, by the way they are working. It seems to me they have a great process that promotes communication and feedback, coupled with enough introspection to adapt it as they learn more.

I’m interested to note that Songkick has extended the “Rule of Three” introduced by Lisa Crispin and Janet Gregory. In their book on “Agile Testing” they encourage you to include a tester in all discussions between a developer and customer. At Songkick they appear to have business analysts in the customer role, and additionally include a UX (usability) expert in the discussion.

Joseph didn’t mention it, but I suspect that his experiences automating fewer scenarios with Cucumber may be one of the reasons Liz Keogh wrote her blog post “Step away from the tools”. Or maybe the blog came first, I don’t know. The advice is the same, in any case – BDD is about communication first, test automation second.

The “Limited Red” tool is quite new, and I think Joseph is still learning how to act on the data he’s gathering. Having said that, I have high hopes that eventually this kind of measurement and feedback will find its way into more automated testing tools and Jenkins builds. It seems to me that finding out which of your tests are giving the best value for money and which are costing the most to maintain will be generally useful.

My next challenge is to work out how to apply these kinds of ideas to the situation I find myself in right now. It’s a large company not a startup, the customer seems totally inaccessible, the release cycle is years not hours, the build is slow (when it works at all), and UX experts are thin on the ground. Hmm. As Joseph said, his story is about what they’ve found to work, not some universal “best practices”. It’s given me some new ideas and encouragement – now I just have to work out what principles and practices I can reasonably introduce where I am.

Notes 2013-5-13: one of Joseph’s colleagues wrote a post “From 15 hours to 15 seconds” explaining how they got the build time so low, a few months after I wrote this post. I also got a note from Liz Keogh explaining that her blog post “Step away from the tools” was written well before the events at Songkick, in March 2011, and that she consulted with them briefly after that, discussing the build time problem amongst other issues.