Archive for the ‘Experience Report’ Category

Last week I met Woody Zuill when he came to Göteborg to give a workshop about Mob Programming.  At first glance mobbing seems really innefficient. You have a whole team of maybe 6-7 people sitting together all day, every day, programming at one computer. How could that possibly be a productive way to work?

I’m pretty intrigued by the idea. It reminds me of the reaction people had to eXtreme Programming when they first heard about it back in like 2000. Is it just an off-putting name for something that could actually be quite brilliant? There are certainly some interesting people who I respect, talking warmly about it. The thing is, when it comes to working together with others, programming at one computer, I’ve had some mixed results. Sometimes good, sometimes less good.

I’ve done some pair programming, and found it worked well with some people and not others. It’s generally worked much better when I’ve paired with someone who has a lot of useful knowledge that I’ve lacked. Either about the language and frameworks we’re using, or about how the software will be used – ie the problem domain. I’ve found it’s worked a lot less well in other situations, with other people. I find it all too easy to hog the keyboard, basically. So I do pair, but not that often.

With ‘Randori’ style coding dojos, the idea is that you have a pair at the front who code, and switch one person every 5-7 minutes, or every test case. I’ve facilitated a lot of these sessions, and I find them especially useful for quickly getting a group of people new to TDD all up and running and pointed in the same direction. Recently I’ve been doing it only for the first session or two, instead having everyone working in pairs most of the time. As a facilitator, this is far easier to handle – much less stressful. Managing the interactions in a bigger group is difficult, both to keep the discussions on track, answer questions about the exercise, and to maintain the pair switching. I also find the person at the keyboard easily gets stressed and intimidated by having everyone watching them, and often writes worse code than they are capable of. So I do facilitate whole-group Randori sessions, but not that often.

So I wanted to find out if mob programming had similar strengths and weaknesses. In what situations does it excel, and when are you better off pairing or working alone? Would I find it stressful, like a Randori? Would I want to drive most of the time, as in pair programming?

Woody turns out to be a really gentle person, about as far away from a ‘hard sell’ as you can get. He facilitaed the session masterfully, mixing theory and practice, telling us stories about what he’s found to work and why. I am confident he knows a lot about software development in general, mob programming in particular, and he is very humble about it.

The most important insight I gained from the session, was that I need to get good at ‘strong-style pairing’. That seems to me to be at the heart of what makes Mob programming work, and not be stressful like the Randori sessions I’ve been doing. I think it will also help me to get pair programming to work well in a wider variety of situations.

I have heard about ‘strong-style pairing’ before, from Llewellyn Falco, who invented it, but I havn’t really experienced it very much before, or understood how important it is. Do go read his blog post about it, for a fuller explanation of what I’m talking about.

The basic idea is “For an idea to go from your head into the computer it MUST go through someone else’s hands”. That forces you to express your ideas really clearly, in words, first. That is actually pretty difficult when you havn’t done it much before. The thing is, that if you do that, then you open up your programming ideas for discussion, critique, and improvement, in a way that doesn’t happen if they go straight from your head through your own hands into the computer. I think if I get better at ‘strong-style pairing’ it will help me not only with Mob programming, but also with pairing and facilitating Randori dojo sessions. Probably also with programming generally.

Pairing has worked best for me when I’ve been driving, and my navigator is good at expressing ideas for me to understand and then type. I think I need to get good at that navigator role for the times when I’m the one with more ideas. I need to learn that when I think ‘I have an idea about to solve this problem!’ I should hand over the keyboard, not grab it. I need to learn to express my coding ideas verbally. Then I will be able to pair productively with a wider range of people.

Randori sessions are much less stressful if the driver has less to do. If the responsibility is shared more evenly with the Navigator, then I think everyone will write better code. As a facilitator, I have less group dynamics to worry about if the designated navigator is in control, and everyone else talks less. (Woody advised that, at least at first, you should ban anyone else in the mob from giving the Driver ideas about what to type, so the Navigator learns the role.)

So thanks, Woody, for taking the time to come to Göteborg, sharing your experiences and facilitating a great workshop. I learnt a lot, and I think Mob Programming and Strong-style pairing could quite possibly be some of those brilliant ideas that change the way I write code, for the better.

I was recently at the Software Craftsmanship Conference at Bletchley Park in the UK. This is a one-day conference for software developers, attended by around 150 programmers. All proceeds from the event go to support Bletchley Park, which is of historical interest to programmers in particular – the site where Alan Turing and others cracked the enigma code in the 2nd world war. It was the fifth time this conference has been run, and the first time I attended. This is a short experience report.

In the morning I ran a workshop titled “Outside-In, with or without Mocks?“. We were about 50 people in the Ballroom in the Mansion, a very grand room, and it was really great to see so many people working in pairs at laptops, puzzling over some code and tests and how to do Test Driven Development. We were looking at a code kata I’ve designed called “Train Reservation“. It’s in no way a beginner exercise, and the crowd at Bletchley seemed to get on with it rather well on the whole. I’m just sorry I didn’t get round to talk to each pair very often, with 24 pairs I only had a couple of conversations with each during the 2 hour session!

I set up the exercise more or less to force people to use some kind of mock, fake or stub to replace the Booking Reference Service and the Train Data Service, because I am interested in how different people use these. I’ve observed that some programmers avoid using test doubles whenever possible, while others use them frequently. I’ve also observed that some people prefer to work outside-in, starting with a guiding test, while others prefer to start with the business rules at the heart of the problem and work outwards from there. At this particular workshop, there were all sorts of approaches being used. Some started with the guiding test and stubbed the services. Others started with the business logic around the seat selection rules. Different approaches, as I had hoped! Overall I feel encouraged that this exercise is a useful one, and people seemed to get on better with it than the last time I ran it, at XP2013. It’s till rather too big of a problem to tackle in a half day workshop though. I’ll be updating it some more before I run it again, although I don’t have any fixed plans for when that will be yet.

In the afternoon, I went to a session by Ivan Moore and Mike Hill, “Inheritance to Composition“. They gave us a demo of this particular refactoring using a very simple codebase, before launching us into a much more complex one – Fitnesse (starting from the branch “revised-ResponderFactory”). The idea was to take some classes that were using Inheritance – specifically the Template Method pattern – and convert them to instead use Composition – specifically the Strategy pattern. They also helpfully provided us with a sheet of instructions – 6 steps to complete the refactoring with minimal risk and code breakage.

My pair and I got on fairly well with the refactoring, and by the end of the session we were on step 5 with the goal in sight. The experience was of using Eclipse’s refactoring tools extensively, and relying a great deal on the compiler. The tests we had to lean on took a minute and a half to run, and actually, the tests for the classes we were working on were more mini-integration tests than unit tests as such. It meant there were relatively few updates to the tests as we did the refactoring, but the feedback loop was slow. I thought that was really interesting, and was wondering how the experience of the refactoring would change in a language like Python. There you don’t have a compiler, or very much help from refactoring tools.

So after the workshop, I set about trying to construct a similar problem in Python. Perhaps understandably, I didn’t want to translate the whole of Fitnesse to Python, (!), so I tried to re-write only the elements of it essential to this exercise. You can have a look at what I’ve come up with in my new repo “WikiSearchKata“. I’m still working on preparing this properly as an exercise, (the instructions are still rather thin), but I plan to try it out at a GothPy meeting sometime soon.

After the conference sessions had ended, we were treated to a guided tour of the National Museum of Computing which was for me, the highlight of the day! Our enthusiastic guide showed us all sorts of ancient computers and storage devices and punch cards… a few I recognized from my childhood. My dad used to bring home old punch cards and my mum used to write her shopping lists on them when she went to the supermarket. They had a 48K ZX spectrum with rubber keys – just the same as the one I wrote my first program on! They had a CRAY supercomputer similar to the one I remember seeing once when I visited my dad’s work as a child. It’s a similar size to (the outside view of) a Tardis, with a big red button on the front. I don’t think we found out what the red button does, but the guide did say we probably have more computing power in the smartphone in our pocket! I found the changes in storage capacity actually even more impressive. They had these washing-machine sized boxes and dinner-plate sized metal disks that together made a hard drive. I think it held something like 4K.

The highlight of the tour was the WITCH computer – the oldest working computer in the world. It was brilliant! You could actually see what it was doing while it read in a paper tape punched with holes – the program – and loaded values into registries and did calculations. It made this fantastic whirring noise as it ran, and has all these little whizzy flashing lights. It works in decimal rather than binary, so each number is represented by a little “dekatron” – a glass tube with a red light inside, that moves between positions 0-9 in a circle. So you can read which number is in the registry by looking at the position of each light in the array. They also had this little button you could press to make it step through the program one instruction at a time. I got to press it, and single-step a computer from 1951!

Compared with other conferences I’ve been too, this one was rather short, just one day, and with rather long sessions – half or whole day. It was hard work coding and facilitating all day, but in general very interesting people and coding exercises. A second day would have made it more worthwhile my making the trip. In any case, my thanks to Jon Dickinson for organizing it.

Last week I was in Oxford at “Iverson College”, which is a conference on the topic of Array Language Programming. There were about 25 programmers there, most of whom are expert in one or more of APL, J, K, or Q. It’s not my usual comfort zone, put it that way! I’m fairly competent with a number of programming languages, notably Python and Java, but nothing I know is really much like these array languages. It’s been a huge culture shock, but in a good way, I think.

My main discoveries are that Array Programming is different again from Object Oriented Programming and Functional Programming, (although it has a lot in common with functional programming), and that this community contains some exceptional programmers. The total number of array language programmers is however extremely small and their work seems to be pretty much unknown to the wider programming community.

Array Programming Languages
I mentioned before four languages, APL, J, K and Q. They are similar to each other, kind of like Ruby and Python are similar to each other. I’ve gone through an introductory training in each language this week, largely given by the language designers themselves. I’d like to relate a little of what I’ve discovered about them.

APL
This is the oldest of the array languages, invented by Ken Iverson in the 1960s. It’s notorious for using an alphabet of funny-looking symbols to represent the built-in functions. You can try it out at http://tryapl.org – an interactive REPL (Read-Evaluate-Print-Loop) where you can put in snippets of code and see what the symbols do.

I thought at first that APL looked really intimidating and unnecessarily weird. Now having got to know it a little, I can see the benefits to the little symbols. They make the code really concise and unambiguous, and it doesn’t take long to learn their names. Once you can pronounce each symbol in your head as you read the code, it’s not much different from writing out the names in full in the editor.

The variant of APL that most of the conference attendees use is produced by the company Dyalog. I first met the CTO, Morten Kromberg, at an XP conference in 2006. He’s shown me some APL before, but this time I really got a chance to sit down with him and look at how he writes code. Dyalog APL has a powerful IDE including a REPL, where Morten showed me how he plays around with data and code, in order to come up with some useful APL expressions. When he’d got something working, he transfers code from the REPL into a file, to make it re-usable and shareable. It’s a familiar way of working to me, many Python programmers code this way, flipping between the REPL and a script file. It was a real pleasure to code with Morten – he is an extremely skilled programmer. Dyalog APL looks nice too, it has a fully-fledged IDE, and interfaces with .Net, Excel spreadsheets, ASP.net and more. It would fit nicely into the technology stack of many IT departments basically.

J
This was Ken Iverson’s next language, created together with Roger Hui, who now continues development of it. J is similar to APL in many ways, but is open source, and uses only ASCII characters. They’ve made an effort to make it open and less intimidating to newcomers, and probably for that reason, it’s the one I chose to download and try to learn before the conference.

I met Roger at breakfast on the first day of the conference, knowing nothing about who he was, he just said he was a programmer. I confessed that I’d downloaded J and made some joke about hoping I’d get on better with it than Ron Jeffries. (Ron wrote articles in his blog, about his efforts to learn J, and later gave up, finding it too hard!). Roger genuinely didn’t know who Ron Jeffries is, although he did know of the agile manifesto. He was very kind and concerned to help me to understand J though, (and Ron, if he wants!)

Despite my head start with J, by the end of the conference I found APL code easier to grasp – J seems more extreme to me. Roger calls J “executable mathematical notation”, and I’ve always been a bit more of an engineer than a mathematician.

K
K was invented by Arthur Whitney, who was also at the conference. I didn’t really get a feel for how the language works, more than that it’s extremely terse.

Arthur gave a talk at the conference, about his new project, KOS. He and two other guys are writing an operating system pretty much from scratch, using K, C, and bits of the linux kernel, (although they’d like to remove those). He showed us how you write applications for this new OS in K, by demonstrating building a text editor. He began from the alpha version of the OS with just a window manager, and a plain new window canvas that didn’t respond to any keyboard or mouse events.

Arthur added a line of K code to let you enter text into the window – a listener to key presses. Then a line of code to move the caret around with the arrow keys. Then a line of code for changing the font size. Then scrolling. Code to handle Ctrl-C and Ctrl-V to copy and paste text…. in the space of less than half an hour, he had an equivalent to notepad working. No compilation, no reloading. And the code was…. phenomenal. You can see a version of it here. I can’t read it really, it looks mostly like line noise to me. All his variable and function names are one or two characters, and K just seems pathologically terse.

I raised my hand and asked Arthur if he thought his code would be more readable if he used longer variable names? He thought for a moment, looking surprised and a little bewildered by the question. Then shook his head and said slowly “No. no. I don’t think I need that. I want to see all my code on the screen at once”. Needless to say, that was a big culture shock moment for me!

The size of the codebase is something all the array language programmers seem really concerned about, even if Arthur’s code is considered extreme even in that community. One thing I did later that week was take a piece of code that is in Robert C. Martin’s book “Clean Code”  (Args.java), as an example of clean Java code, and showed it to the group. There were general exclamations of “aargghh! that hurts my eyes!” but after a little while as I explained the structure of the code they seemed to appreciate it a little better. What they did say that intrigued me though, was that they automatically scanned the page looking for the symbols – the >, !, = signs – the parts that do something, as they put it. The other text they said obscures the structure, it distracts the eye. Yes, that’s right. Having names for the functions and variables makes the code less readable.

KDB+ and Q
KDB+ is a very small and fast commercial database largely used by financial institutions, also originally created by Arthur Whitney. Q is a kind of domain specific language built on top of K, that you use to query the data in a KDB+ instance.

I sat down with Attila Vrabecz, an experienced Q and KDB+ programmer, and we coded together for a couple of hours. We tackled a problem I’d previously coded with Morten in APL, to help me see what was different. There were many similarities – the workflow was the same for example – experimenting in the REPL before transferring the code into a more permanent, reusable form. I noticed Q has many more English words in it, fewer strange symbols, and Attila made more use of library functions than Morten did. It seems Q is designed to be approachable for a former SQL programmer, although once you scratch the surface, it’s much more like APL than SQL.

Test-Driven Development
I gave a talk at the conference about TDD. My aim was to provoke discussion, and argued that writing automated tests using TDD is the best approach. I was definitely successful at sparking a discussion! Actually, it didn’t seem the idea that programmers should write automated tests for their code was all that controversial, especially amongst the more seasoned developers present. We got way more hung up on how large a chunk of code counts as a “unit”, for your unit test, and what clean code looks like in an array language. To my eye, their units are large and their clean code is terse.

A challenge for the future
Dave Thomas, former lead developer for the Eclipse project, and general software visionary, is also an APL and K programmer. He flew in for just one day of the conference, and his talk functioned as the keynote address for the week – it was a clear challenge to the Array Languages community.

Dave painted a vision for the future where people will be living in a sea of big data they don’t understand, and lack adequate tools to query. He sees a great opportunity for array languages, which are generally very good at handling large amounts of data.

He ended his talk with an ambitious challenge to this community to get its act together, start being seen as a credible alternative, and grow. I could only applaud and agree – I found his advice insightful, and I hope the array languages community will do as he suggests.

I spent one very pleasant evening chatting with a woman who is about to embark on her PhD in atmospheric science. She’s hoping to use array languages to help her create software models that will execute quickly on huge arrays of multi-dimensional climate data. Her work sounds fascinating, and I hope it’s a sign of array languages starting to be used beyond their traditional niche in finance.

So I’m leaving the conference carrying a huge tome entitled “A complete introduction to Dyalog APL”, some pieces of code I’ve written, and good intentions to study further. I do find it fascinating that even with the little I know of it, APL allows me to think about and solve a problem differently than I do in Python. I anticipate I’ll find plenty of people in the Array Languages community willing to help me if I do continue to try to learn it. They’re an opinionated, quirky, mature, gentle, yet small bunch of extremely skilled programmers, and I’m glad to have met and coded with them.

On Saturday I was up in Stockholm facilitating my fourth code retreat for Valtech, and my second Global Day of Code Retreat. It seemed to go very well. I tried out a few new elements, which seemed to make it go even better than the previous ones, so I thought I’d talk about them here in my blog.

The first thing we changed was that we had quotas for men and women, and we were aiming for a 50/50 gender balance. About a month before the code retreat, we were fully booked, with 20 places each for men and women taken. In the end several people dropped out at the last minute, and there were slightly more men. Still, it was a much better balance than last year. I think it made for a healthier atmosphere during the day, and we encouraged more women to be active in the community generally.

The second thing we changed was I didn’t require people to do Game of Life, I suggested some other code katas as alternatives. Game of Life is an excellent kata for practicing skills like writing clean code, simple design, and TDD generally. It’s also really fun to be doing an excercise that thousands of other programmers are also working on. However, I realized that around 1/3 of the people who’d signed up for the event had already been to a previous code retreat, and might be getting bored with Game of Life. We’re there to have fun, after all!

As I’ve been working on my book, and running various coding dojos recently, I’ve been thinking hard about TDD and how to learn it. It’s a multifaceted skill, and some katas allow you to focus on particular aspects of it, which can make practice more rewarding. For a complete beginner to TDD, Game of Life is actually quite hard to get started with, I’ve seen a lot of people struggling with it.  So in my introduction I highlighted four other katas people could choose from, as well as Game of Life:

  • String Calculator – good for the TDD newbie, since it really leads you by the hand
  • Tennis – good for practicing refactoring
  • GildedRose – good for practicing writing really good tests (and refactoring)
  • TyrePressure – good for understanding SOLID principles

People seemed to generally appreciate having a choice. Some pairs chose a String Calculator for their first couple of sessions, to get used to TDD, then went on to Game of Life for the others. Some pairs tried out Tennis, Gilded Rose and Tyre Pressure in the afternoon, instead of one of the other challenges. Some pairs who were already good at TDD tried out String Calculator in Clojure, and found it let them concentrate on the learning the language not solving the problem. One person, (who has been to 3 code retreats previously), didn’t do Game of Life all day, and said he really enjoyed himself and learnt loads!

The third thing we changed was that I suggested people try cyber-dojo as a coding environment. This is a tool designed by Jon Jagger as an aid to practicing your coding skills. It provides a basic editor/testing environment, and records the state of the code every time you run the tests. You can review all your changes in the session retrospective, and get a picture of how well your TDD was going.

Before the code retreat, I had set up cyber dojos for Game of Life, and Tennis. I put up the cyber-dojo ids on the whiteboard, and through the day most of the pairs tried it out for one or more sessions.

As a facilitator, I found it really helpful when going up to a pair, to be able to see the little row of traffic light symbols at the top of their screen. I could quickly get an overview of how their session was going. For example, if I see a few red-green-yellow sequences then I can infer they are probably doing ok at TDD. If I can see nothing bug a long string of yellow, then I know they’re in trouble!

For the pairs, some liked cyber-dojo because it meant they didn’t waste any time setting up their coding environment at the start of the session. For those doing Tennis, if they made a refactoring mistake, they could very easily just click on the most recent green traffic light to revert back to a known good state, and practice doing the refactoring again. Some pairs found it helpful in the retrospective to review their session in the tool.

Other pairs didn’t get on so well with cyber-dojo. Some couldn’t live without their usual editor commands. Some got confused by the lack of syntax highlighting. A couple found a bug in cyber-dojo that it lost touch with the server and stopped updating their test results, whatever they did to the code. (The workaround is to open the same url in a new window, btw).

I know Jon is aware of this bug, and I’m sure he’ll track it down, but actually the lack of syntax highlighting and editor commands is deliberate. It’s the same idea as the other challenges, like mute pairing etc – to throw you out of your comfort zone and make you really concentrate on the way you code.

So that was the three things we changed – gender balance, other katas, cyber-dojo. What we didn’t change was the overall format or aims of the day. We still paired for 5 sessions of 45 minutes, and threw away the code afterwards. We still had coding challenges like “Mute pairing with find the loophole” and “Maximum 4 line methods”, and held retrospectives after every session. I think deliberate practice in a group is at the core of what makes Code Retreat (and coding dojos!) fun and valuable. The changes we introduced were all designed to enhance the learning experience, and were based on experience and reflection after previous events.

Overall I was very pleased with all the changes we introduced, and I’d like to do it like that again. If you’re facilitating or organizing a code retreat, perhaps you’d like to introduce some of these elements too.

I’ve heard Joseph Wilk speak before, so I was delighted when he accepted our invitation to come to Scandinavian Developer Conference and share some ideas around “Acceptance testing in the land of the Startup”. In this post I aim to summarize my understanding of what Joseph said. I think what he is doing at Songkick is really interesting and innovative, and is applicable beyond that particular environment.

Introduction

Joseph began by introducing what he meant by “Startup” using Eric Ries‘ definition:

“a human institution designed to create new products and services under conditions of extreme uncertainty”

“Acceptance testing” is usually about checking the product meets the previously agreed requirements – but if everything is changing all the time, that is not as simple as it sounds. Joseph’s acceptance tests reflect what “done” looks like in conditions of extreme uncertainty.

Joseph talked about the Cynefin model (by Dave Snowden) and how working in a startup feels like you’re in chaos a lot of the time, although parts of the business may be merely complicated or even simple. An effective strategy in the “chaos” part of the Cynefin model is “Act, Sense, Respond”, which he related to the lean startup cycle of “Build, Measure, Learn”.

The “Build” part of that cycle for their software product often takes more time than the other parts, and work should be coordinated between several people. During the “Build” phase you can use automated tests to help with communication and feedback. In the “Measure” phase, you’re primarily looking for feedback from the actual end users, rather than automated tests.

The rest of the talk was largely about how Songkick use automated tests during the “Build” phase, and how they do acceptance testing with end users in the “Measure” phase.

The startup

The software Joseph is working on is for the startup Songkick which is devoted to live music. They are about 20-30 people, and the majority are programmers. They have two dedicated UX specialists and two QA specialists. Most people are multi-skilled, care deeply about the quality of the product, and are not afraid to learn how to help outside of their speciality.

Songkick holds three principles dear:

  • Embrace Change
  • Usability
  • Testing

The rest of his talk was about the concrete practices they use around this last principle – Testing. Joseph stressed that he was reporting what they had found to work, and not holding up any “best practice” or industry gold standard.

Features

At Songkick, they use Behaviour Driven Development. They are continually building and improving a “ubiquitous language” to describe their system, by writing Features and Scenarios in a semi-formal syntax. The tool Cucumber is used to turn these features and scenarios into Executable Specifications (aka Automated Tests).

Joseph gave a short intro to how Cucumber works – take a look at this site if you’re not familiar with it. In short, Cucumber lets you define many “Features”, each of which define numerous “Scenarios” which exemplify the desired feature behaviour. It requires you use a particular syntax called “Gherkin”. The gist of it is that you use the following formats when writing features and scenarios:

Feature:
“In order to… I want… So that…”

Scenario:
“Given… When… Then…”

If you put in some extra work and also write “Step Definitions”, your scenarios become executable, and can automatically check the application behaves as expected when run.

The primary purpose of the Cucumber specifications is to promote communication within the team: the words are carefully chosen to have clear meanings related to the domain of the system you’re building.

Hypotheses

In Lean Startup you don’t talk so much about building a “Feature”, more about having a “Hypothesis”. As in “if we add a pink button here, do we get more traffic to the site?”

Joseph showed the following example of a Feature with a Hypothesis:

Facebook signup
In order to increase signup
I want visitors to sign up via Facebook
So that we see a 10% increase in signups

That last line is the measurable part of the hypothesis, that makes it more than just a normal Cucumber Feature. When you’ve delivered a system that has this feature, it should lead to 10% more signups. If it doesn’t, the system might be modified to remove this feature. Features written in this form (In order to… I want… So that…) are prioritized in the product backlog.

When a developer is ready to start working on a feature, she calls a meeting to discuss it. At the meeting there should be representatives from four different job roles: a developer, tester, business analyst and usability expert. Together they discuss what the feature is, the details of how it should work, and perhaps create some wireframes or UI mockups. They will also brainstorm perhaps 7-8 scenarios, such as:

  • successful signup via facebook
  • failed signup via facebook
  • facebook changes their API and no one can sign in any more

After the meeting, the developer starts working on the code to implement the feature, bearing in mind the discussion and the scenarios. The people in the other job roles will also get involved in development, and assessing whether the feature is ready for deployment. Some or all of the scenarios they have come up with will be automated as executable with Cucumber.

There is an overhead to automating scenarios with Cucumber: implementing “Step Definitions” takes time, and afterwards the scenarios are relatively slow to execute. Joseph said that they were more and more only automating a few scenarios for each feature, often just a happy and a sad path. The other scenarios would perhaps be used in unit tests, or exploratory testing. The most important aspect of writing many scenarios was to discover issues with a feature before implementation.

All their Cucumber features and scenarios are published in a searchable web interface called “Relish” which he said encouraged developers to use more business-friendly language, since they knew non-programmers would look there. Joseph said features containing incomprehensible technobabble like “Given the asynchronous message buffer queue has been emptied” (!) do appear occasionally, but don’t last long. Their ubiquitous language is highly visible and in daily use.

Actual Acceptance Tests

Songkick uses a Continuous Delivery approach, where anyone in the company can deploy the latest code, at the push of a button. They have invested a lot of effort to make it all automated, so that when anyone checks in code changes, Jenkins builds it, runs all the tests, (including Cucumber specs), and allows anyone to deploy any successful build. He said deployment was so easy even a dog could do it, (and had on one occasion!).

Joseph explained that many features would be built in such a way that they were “turned off” in production for a few days while they were being built. Only once the whole feature was implemented and present would they activate the code, and perhaps only then for a subset of all the visitors to the site. In this way you can build a feature that takes a few days to implement, and continue to release the new code every few hours. The half built feature is being deployed, it’s just switched off.

The reason for all this effort to get code out of the door is that a feature can’t be considered Done until its hypothesis has been validated. As Joshua Kerievsky said:

“A story isn’t done until it’s being used by real users in production and has been validated to be a useful part of a product”

By deploying often, you get that confirmation or rejection as soon as possible, so you can evaluate whether the feature was worth building. “Build, Measure, Learn”. The “Measure” part happens after an acual customer has got hold of the code and started using it. That is the real acceptance testing that’s going on, not the Cucumber scenarios.

So that’s the way they work today, but it hasn’t always been quite like that. Joseph also talked about why they have arrived at this process.

The build time issue

The Cucumber scenarios are run at every checkin, and Joseph said that at one point it took 4 hours to run them all. This was a serious problem, delaying feedback and slowing the continuous deployment process to a crawl.

Their first approach was to use their technical skills to divide up the tests and run them in parallel on the cloud. The build time went down from 4 hours to just 16 minutes, but at a cost of about $7000 per month in cloud servers. That proved too expensive to be sustainable, so they reconsidered.

The automated tests are there to give confidence that the system works before you deploy. They realized that some of the tests gave more confidence than others, so they removed something like 60% of them. These less valuable tests were then run much less frequently. This took them off the critical path to deployment, and enabled more frequent releases. So far this strategy has led to only insignificant problems in production.

They also re-evaluated their policy for which scenarios to automate in Cucumber, and started putting much more emphasis on unit tests. When you have comprehensive Cucumber tests, it’s easy to have lots of confidence that the system works, and not bother with unit tests. Joseph said they were missing the design benefits of unit testing though – unit tests force your code to be decomposed into units! These days they write fewer Cucumber scenarios, and more unit tests.

Duplicating effort with QA

In addition to automated Cucumber scenarios, the QA people do manual testing before deployment. At one point the developers were surprised to learn than maybe 60-70% of these manual tests were covering functionality that was already well covered by Cucumber specs. They realized that the QA people had little confidence in the automated tests.

Once they had noticed the problem, they began to include the QA people more in the scenario writing process, so they could learn about what the tests did, and how they work. This helped give QA confidence to remove some of the manual tests, so now they only test the most crucial functionality manually before deployment.

Metrics as feedback

Joseph invented a tool called “Limited Red” that records metrics from every test run. It shows failure rates for each scenario, and correlates that with changes in the corresponding feature file, using the Git log. Using this data, he can plot graphs that show each feature, how many times it has failed, and how often it has been edited.

For example, he might find one of the features has failed 16 times, while the code in the feature was only changed 4 times. This could indicate the feature is testing an area of the code that is poorly implemented – the code is broken more often than the test is invalid. This gives developers feedback that they can act on to improve the quality of both the code and the Cucumber scenarios.

Joseph has a blog post with more detail about this tool and metrics approach.

My Concluding Thoughts

Most startups fail, and I have little insight into whether Songkick will be successful in the long run. I am fascinated though, by the way they are working. It seems to me they have a great process that promotes communication and feedback, coupled with enough introspection to adapt it as they learn more.

I’m interested to note that Songkick has extended the “Rule of Three” introduced by Lisa Crispin and Janet Gregory. In their book on “Agile Testing” they encourage you to include a tester in all discussions between a developer and customer. At Songkick they appear to have business analysts in the customer role, and additionaly include a UX (usability) expert in the discussion.

Joseph didn’t mention it, but I suspect that his experiences automating fewer scenarios with Cucumber may be one of the reasons Liz Keogh wrote her blog post “Step away from the tools“. Or maybe the blog came first, I don’t know. The advice is the same, in any case – BDD is about communication first, test automation second.

The “Limited Red” tool is quite new, and I think Joseph is still learning how to act on the data he’s gathering. Having said that, I have high hopes that eventually this kind of measurement and feedback will find its way into more automated testing tools and Jenkins builds. It seems to me that finding out which of your tests are giving the best value for money and which are costing the most to maintain will be generally useful.

My next challenge is to work out how to apply these kinds of ideas to the situation I find myself in right now. It’s a large company not a startup, the customer seems totally inaccessible, the release cycle is years not hours, the build is slow (when it works at all), and UX experts are thin on the ground. Hmm. As Joseph said, his story is about what they’ve found to work, not some universal “best practices”. It’s given me some new ideas and encouragement – now I just have to work out what principles and practices I can reasonably introduce where I am.

Notes 2013-5-13: one of Joseph’s colleagues wrote a post “From 15 hours to 15 seconds” explaining how they got the build time so low, a few months after I wrote this post. I also got a note from Liz Keogh explaining that her blog post “Step away from the tools” was written well before the events at Songkick, in March 2011, and that she consulted with them briefly after that, discussing the build time problem amongst other issues.