Spock: Data Driven Testing

In the last two articles on Spock I’ve covered mocking and stubbing. And I was pretty sold on Spock just based on that. But for a database driver, there’s a killer feature:  Data Driven Testing.

All developers have a tendency to think of and test the happy path. Not least of all because that’s usually the path in the User Story - “As a customer I want to withdraw money and have the correct amount in my hand”. We tend not to ask “what happens if they ask to withdraw money when the cash machine has no cash?” or “what happens when their account balance is zero?".

With any luck you’ll have a test suite covering your happy paths, and probably at least twice as many grumpy paths. If you’re like me, and you like one test to test one thing (and who doesn’t?), sometimes your test classes can get quite long as you test various edge cases. Or, much worse (and I’ve done this too), you use a calculation remarkably like the one you’re testing to generate test data. You run your test in a loop with the calculation and lo! The test passes. Woohoo?

Not that long ago I went through a process of re-writing a lot of unit tests that I had written a year or two before - we were about to do a big refactor of the code that generated some important numbers, and we wanted our tests to tell us we hadn’t broken anything with the refactor. The only problem was, the tests used a calculation rather similar to the production calculation, and borrowed some constants to create the expected number.  I ended up running the tests to find the numbers the test was generating as expected values, and hardcoding those values into the test. It felt dirty, but it was necessary - I wanted to make sure the refactoring didn’t change the expected numbers as well as the ones generated by the real code.  This is not a process I want to go through ever again.

When you’re testing these sorts of things, you try and think of a few representative cases, code them into your tests, and hope that you’ve covered the main areas. What would be far nicer is if you could shove a whole load of different data into your system-under-test and make sure the results look sane.

An example from the Java driver: we had tests checking the parsing of the connection URI - you can initialise your MongoDB settings simply by passing in a String containing the URI.

The old tests looked like:
(See MongoClientURITest)

Using Spock’s data driven testing, we changed this to:

(See MongoClientURISpecification)
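
The full listing isn’t reproduced here, but it was roughly this shape (a sketch - the URIs in the table and the exact property names on MongoClientURI are illustrative rather than copied from the real spec):

    import spock.lang.Specification

    // MongoClientURI comes from the driver; the test data below is illustrative
    class MongoClientURISpecification extends Specification {

        def 'should parse the URI into the expected hosts, database and collection'() {
            when:
            def clientURI = new MongoClientURI(uri)

            then:
            clientURI.hosts == hosts
            clientURI.database == database
            clientURI.collection == collection

            where:
            uri                                   | hosts              | database | collection
            'mongodb://db.example.com'            | ['db.example.com'] | null     | null
            'mongodb://db.example.com/test'       | ['db.example.com'] | 'test'   | null
            'mongodb://db.example.com/test.users' | ['db.example.com'] | 'test'   | 'users'
            'mongodb://host1,host2/test'          | ['host1', 'host2'] | 'test'   | null
        }
    }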

Instead of having a separate test for every type of URI that needs parsing, you have a single test, and each line in the where: section is a new combination of input URI and expected outputs. Each one of those lines used to be a separate test. In fact, some of them probably never had a test at all, because the ugliness and overhead of adding yet another copy-paste test seemed like overkill. But here, in Spock, it’s just a case of adding one more line with a new input and set of outputs.

The major benefit here, to me, is that it’s dead easy to add another test for a “what if?” that occurs to the developer. You don’t have to add yet another test method that leaves someone else wondering “what the hell are we testing this for?”. You just add another line, which documents another set of expected outputs for the new input.

It’s easy, it’s neat, it’s succinct.

One of the major benefits of this to our team is that we don’t argue any more about whether a single test is testing too much. In the past, we had tests like:
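
The original listing isn’t shown here, but it had this general shape (a sketch, in Groovy-flavoured near-Java - the asDocument() method, the constants and the expected documents are illustrative rather than the real code):

    import org.junit.Test

    import static org.junit.Assert.assertEquals

    class WriteConcernTest {

        // One test method, many asserts - each line checks a different WriteConcern
        @Test
        void testGetLastErrorCommandDocuments() {
            assertEquals(new Document('getlasterror', 0), WriteConcern.UNACKNOWLEDGED.asDocument())
            assertEquals(new Document('getlasterror', 1), WriteConcern.ACKNOWLEDGED.asDocument())
            assertEquals(new Document('getlasterror', 1).append('w', 2), WriteConcern.REPLICA_ACKNOWLEDGED.asDocument())
            assertEquals(new Document('getlasterror', 1).append('fsync', true), WriteConcern.FSYNCED.asDocument())
            assertEquals(new Document('getlasterror', 1).append('j', true), WriteConcern.JOURNALED.asDocument())
            // ...and so on for the rest of the write concerns
        }
    }
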
And I can see why we have all those assertions in the same test, because technically these are all the same concept - make sure that each type of WriteConcern creates the correct command document. I believe these should be one test per line - because each line in the test is testing a different input and output, and I would want to document that in the test name (“fsync write concern should have fsync flag in getLastError command”, “journalled write concern should set j flag to true in getLastError command” etc). Also don’t forget that in JUnit, if the first assert fails, the rest of the test is not run. Therefore you have no idea if this is a failure that affects all write concerns, or just the first one. You lose the coverage provided by the later asserts.

But the argument against my viewpoint is that then we’d have seven different one-line tests. What a waste of space.

You could argue for days about the best way to do it, or that this test is a sign of some other smell that needs addressing. But if you’re in a real world project and your aim is to both improve your test coverage and improve the tests themselves, these arguments are getting in the way of progress. The nice thing about Spock is that you can take these tests that test too much, and turn them into something a bit prettier:
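
Something along these lines (again a sketch - the class name and asDocument() are assumptions, but the where: table and the unrolled feature name show the idea):

    import spock.lang.Specification
    import spock.lang.Unroll

    class WriteConcernCommandSpecification extends Specification {

        @Unroll
        def '#wc should return getlasterror document #commandDocument'() {
            expect:
            wc.asDocument() == commandDocument

            where:
            wc                                | commandDocument
            WriteConcern.UNACKNOWLEDGED       | new Document('getlasterror', 0)
            WriteConcern.ACKNOWLEDGED         | new Document('getlasterror', 1)
            WriteConcern.REPLICA_ACKNOWLEDGED | new Document('getlasterror', 1).append('w', 2)
            WriteConcern.FSYNCED              | new Document('getlasterror', 1).append('fsync', true)
            WriteConcern.JOURNALED            | new Document('getlasterror', 1).append('j', true)
        }
    }
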
You might be thinking, what’s the advantage over the JUnit way? Isn’t that the same thing, but Groovier? But there’s one important difference - every line under where: gets run, regardless of whether the lines before it pass or fail. This is basically seven different tests, but it takes up the same space as one.

That’s great, but if just one of these lines fails, how do you know which one it was if all seven tests are masquerading as one? That’s where the awesome @Unroll annotation comes in. This reports the passing or failing of each line as if it were a separate test. By default, when you run an unrolled test, each row gets reported with a generic name that doesn’t tell you which values were being used.

But in the test above we put some magic keywords into the test name: '#wc should return getlasterror document #commandDocument' - note that these values with # in front are the same as the column headings in the where: section. They get replaced by the values being used in the current run of the test.

Yeah, it can be a bit of a mouthful if the toString is hefty, but it does give you an idea of what was being tested, and it’s prettier if the inputs have nice, succinct string values.

This, combined with Spock’s awesome power assert, makes it dead simple to see what went wrong when one of these tests fails.  Let’s take the example of (somehow) the incorrect host being returned for one of the input URIs: the failure report prints the condition with the actual values filled in underneath it, so you can see at a glance which URI was being parsed and what the wrong host value was.

Data driven testing might lead one to over-test the simple things, but the cost of adding another “what if?” is so low - just another line - and the additional safety you get from trying a different input is rather nice.  We’ve been using it for parsers and simple generators, where you want to throw a bunch of inputs at a single method and see what you get out.

I’m totally sold on this feature, particularly for our type of application (the Java driver does a lot of taking stuff in one shape and turning it into something else).  Just in case you want one more, here’s a final example.

The old way:
…and in Spock:

Spock passes the next test – Painless Stubbing

In the last post I talked about our need for some improved testing tools, our choice of Spock as something to spike, and how mocking looks in Spock.

As that blog got rather long, I saved the next installment for a separate post.

Today I want to look at stubbing.

Stubbing
Mocking is great for checking outputs - in the example in the last post, we’re checking that the process of encoding an array calls the right things on the way out, if you like - that the right stuff gets poked onto the bsonWriter.

Stubbing is great for faking your inputs (I don’t know why this difference never occurred to me before, but Colin’s talk at Devoxx UK made this really clear to me).

One of the things we need to do in the compatibility layer of the new driver is to wrap all the new-style Exceptions that can be thrown by the new architecture layer and turn them into old-style Exceptions, for backwards compatibility purposes.  Sometimes testing the exceptional cases is… challenging.  So I opted to do this with Spock.
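
The test isn’t reproduced in full here, but it went something like this (a sketch - the driver imports are omitted, and the DB constructor, Session methods and exception message are illustrative):

    import spock.lang.Specification

    class DBExceptionWrappingSpecification extends Specification {

        def 'should wrap org.mongodb.MongoException into com.mongodb.MongoException when rename fails'() {
            setup:
            Session session = Mock()
            Mongo mongo = Mock()
            mongo.getSession() >> session
            // whatever gets passed to execute, throw the new-style exception
            session.execute(_) >> { throw new org.mongodb.MongoException('oops') }

            DB database = new DB(mongo, 'myDatabase')
            DBCollection collection = database.getCollection('myCollection')

            when:
            collection.rename('newCollectionName')

            then:
            thrown(com.mongodb.MongoException)
        }
    }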

So here we can use a real DB class, but with a mock Mongo that will return us a “mock” Session.  It’s not actually a mock though, it’s more of a stub, because we want to tell it how to behave when it’s called - in this test, we want to force it to throw an org.mongodb.MongoException whenever execute is called.  It doesn’t matter to us what gets passed in to the execute method (that’s what the underscore means) - what matters is that when it gets called it throws the correct type of Exception.

Like before, the when: section shows the bit we’re actually trying to test. In this case, we want to call rename.

Then finally the then: section asserts that we received the correct sort of Exception.  It’s not enormously clear, although I’ve kept the full namespace in to try and clarify, but the aim is that any org.mongodb.MongoException that gets thrown by the new architecture gets turned into the appropriate com.mongodb.MongoException.  We’re sort of “lucky” because the old code is in the wrong package structure, and in the new architecture we’ve got a chance to fix this and put stuff into the right place.

Once I’d tracked down all the places Exceptions can escape and started writing these sorts of tests to exercise those code paths, not only did I feel more secure that we wouldn’t break backwards compatibility by leaking the wrong Exceptions, but we also found our test coverage went up - and more importantly, in the unhappy paths, which are often harder to test.

I mentioned in the last post that we already did some simple stubbing to help us test the driver. Why not just keep using that approach?

Well, these stubs end up looking like this:
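
They looked something like this (sketched here in Groovy for consistency with the other listings - the real ones were plain Java, and the Session method signatures are guesses for illustration):

    // A hand-rolled stub: a class full of do-nothing implementations...
    class StubSession implements Session {
        @Override
        CommandResult execute(Operation operation) { null }

        @Override
        void close() { }

        @Override
        boolean isClosed() { false }
    }

    // ...which then gets extended every time you need one method to misbehave
    class ExceptionThrowingSession extends StubSession {
        @Override
        CommandResult execute(Operation operation) {
            throw new org.mongodb.MongoException('forced failure')
        }
    }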

Ick.

And you end up extending them so you can just override the method you’re interested in (particularly in the case of forcing a method to throw an exception).  Most irritatingly to me, these stubs live away from the actual tests, so you can’t easily see what the expected behaviour is.  In the Spock test, the expected stubbed behaviour, the call that will provoke it, and the code that checks the expectation are all just a few lines apart.  It’s all within even the smallest monitor’s window.

So stubbing in Spock is painless.  Next:

Spock is awesome! Seriously Simplified Mocking

We’re constantly fighting a battle when developing the new MongoDB Java driver between using tools that will do heavy lifting for us and minimising the dependencies a user has to download in order to use our driver.  Ideally, we want the number of dependencies to be zero.

This is not going to be the case when it comes to testing, however.  At the very least, we’re going to use JUnit or TestNG (we used TestNG in the previous version; we’ve switched to JUnit for 3.0).  Up until recently, we worked hard to eliminate the need for a mocking framework - the driver is not a large application with interacting services, and most stuff can be tested either as an integration test or with very simple stubs.

Recently I was working on the serialisation layer - we’re making quite big changes to the model for encoding and decoding between BSON and Java, which we’re hoping will simplify our lives but also make things a lot easier for the ODMs (Object-Document Mappers) and third-party libraries.  At this level, it makes a lot of sense to introduce mocks - I want to ensure particular methods are called on the writer, for example, rather than checking actual byte values, which wouldn’t be very helpful as documentation (although there is a level at which that is a sensible thing to do).

We started with JMock - it’s what I’ve been using for a while, and it gave us what we wanted: a simple mocking framework (I tried Mockito too, but I’m not used to its failure messages, so I found it really hard to figure out what was wrong when a test failed).

I knew from my spies at LMAX that there’s some Groovy test framework called Spock that is awesome, apparently, but I immediately discarded it - I feel very strongly that tests are documentation, and since the users of the Java driver are largely Java developers, I felt like introducing tests in a different language was an added complexity we didn’t need.

Then I went to GeeCON, and my ex-colleague Israel forced me to go to the talk on Spock.  And I realised just how wrong I had been.  Far from adding complexity, here was a lovely, descriptive way of writing tests.  It’s flexible, and yet structured enough to get you thinking in a way that should create good tests.

Since we’re already using Gradle, whose build scripts are Groovy as well, we decided it was worth a spike to see if Spock would give us any benefits.

During the spike I converted a selection of our tests to Spock tests to see what it looks like on a real codebase.  I had very specific things I wanted to try out:

  • Mocking
  • Stubbing
  • Data driven testing

In the talk I also saw useful annotations like @Requires, which I’m pretty sure we’re going to use, but I don’t think it’s made it into a build yet.
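
For the record, the idea is that @Requires takes a precondition closure, and the feature only runs when the condition holds - a hypothetical example (the system property is made up):

    import spock.lang.Requires
    import spock.lang.Specification

    class ReplicaSetSpecification extends Specification {

        // only run this feature when the (made-up) system property is set
        @Requires({ System.getProperty('org.mongodb.test.replicaSet') != null })
        def 'should only run against a replica set'() {
            expect:
            System.getProperty('org.mongodb.test.replicaSet').length() > 0
        }
    }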

So, get this, I’m going to write a blog post with Actual Code in.  Yeah, I know, you all thought I was just a poncy evangelist these days and didn’t do any real coding any more.

First up, Mocking
So, as I said, I have a number of tests checking that encoding of Java objects works the way we expect.   The easiest way to test this is to mock our BSONWriter class to ensure that the right interactions are happening against it.  This is a nice way to check that when you give an encoder a particular set of data, it gets serialised in the way BSON expects. These tests ended up looking something like this:

(The test data was in Spanish - yeah, I’m still learning Spanish.)

So that’s quite nice: my test checks that, given a List of Strings, they get serialised correctly.  What’s not great is the setup overhead JMock needs to make it all work.

Obviously some of that setup is going to ring some people’s alarm bells, but let’s assume for a minute that all decisions were taken carefully and that the pros and cons were weighed accordingly.

So:

  • Mocking concrete classes is not pretty in JMock - the setUp method needs a ClassImposteriser just to let you mock a class rather than an interface.
  • We’re using the JUnitRuleMockery, which appears to be Best Practice (and means you’re less likely to forget the @RunWith(JMock.class) annotation), but checkstyle hates it - Public Fields Are Bad as we all know.

But it’s fine - a small amount of boilerplate in all our tests that involve mocking is an OK price to pay for some nice tests.

I converted this test to a Spock test.  Groovy purists will notice that it’s still very Java-y, and that’s intentional - I want these tests, at least at this stage while we’re getting used to it, to be familiar to Java programmers, our main audience.
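
The converted test isn’t reproduced in full here, but it looked something like this (a sketch - the ListCodec class, its constructor and the Spanish test data are illustrative):

    import org.bson.BSONWriter
    import spock.lang.Specification
    import spock.lang.Subject

    class ListCodecSpecification extends Specification {

        // ListCodec and its constructor are illustrative stand-ins for the real encoder
        @Subject
        private final ListCodec listCodec = new ListCodec()

        def 'should encode a list of Strings'() {
            setup:
            BSONWriter bsonWriter = Mock()
            List<String> stringList = ['uno', 'dos', 'tres']

            when:
            listCodec.encode(bsonWriter, stringList)

            then:
            1 * bsonWriter.writeStartArray()
            1 * bsonWriter.writeString('uno')
            1 * bsonWriter.writeString('dos')
            1 * bsonWriter.writeString('tres')
            1 * bsonWriter.writeEndArray()
        }
    }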

Some initial observations:

  • It’s a really simple thing, but I like having the @Subject annotation on the thing you’re testing.  In theory it should be obvious which of your fields or variables is the subject under test, but in practice that’s not always true.
  • Although it freaks me out as someone who’s been doing Java for the last 15 years, I really like using a String for the method name - although in this case it’s the same as the JMock/JUnit equivalent, it gives a lot more flexibility for describing the purpose of the test.
  • Mocking is painless, with a simple call to Mock(), even though we’re still mocking concrete classes (this is done simply by adding cglib and objenesis to the test dependencies - there’s a sketch of the build change after this list).
  • I love that the phases of Spock (setup: when: then:) document the different parts of the test while also being the useful magic keywords which tell Spock how to run the test.  I know other frameworks provide this, but we’ve been working with JUnit and I’ve been in the habit of commenting my steps with //given //when //then.
  • Thanks to Groovy, creating the list of test data takes less boilerplate.  Not a big deal, but it just makes the test easier to read.
  • I’ve got very used to the way expectations are set up in JMock, but I have to say that 1 * bsonWriter.blahblahblah() is much more readable.  
  • I love that everything after then: is an assertion - I think it makes it really clear what you expect to happen after you invoke the thing you’re testing.
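
As promised above, the build change for mocking concrete classes is tiny - a sketch of the test dependencies, with illustrative versions and the Gradle configuration names of the time:

    // build.gradle - the test dependencies involved (versions are illustrative)
    dependencies {
        testCompile 'org.spockframework:spock-core:0.7-groovy-2.0'
        testRuntime 'cglib:cglib-nodep:2.2.2'       // lets Spock mock classes as well as interfaces
        testRuntime 'org.objenesis:objenesis:1.3'   // lets those mocks be created without calling a constructor
    }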

So mocking is awesome.  What’s next?

Christmas decorations teach me a lesson about troubleshooting

And now, after an absence of several weeks, you get to see how long it takes me to write some of these posts.

I was putting up the Christmas decorations one Saturday when my worst fear was realised[1] - one of my three strings of lights was not working.

The first two went up fine.  The third lit up when I plugged it in, and in less than a second went out.  Curses.  This is not what I wanted - this was supposed to be a short exercise in making my tiny little flat look festive.

So I set about the tedious task of starting from the end closest to the plug and replacing every bulb, one by one, with a spare to see if the string magically lit up again.  When it doesn’t, you take the spare back out and replace it with the original bulb.  I remember my parents going through this ritual every Christmas - the tediousness of the activity is more memorable than the fleeting joy of shinies.

While I was doing this, my mind was back on the job I’d been doing at work the previous week - battling an Internet Explorer 7 performance problem.  We have automated performance tests which give us an indication of the load time for our application in Chrome and IE, and some time in the previous couple of weeks our IE performance had significantly degraded in the development code.  Due to a number of too-boring-to-explain-here circumstances, the last known good revision was four days and nearly 250 revisions earlier than the first revision that showed the performance problem.

Since we couldn’t see anything to indicate it was an environmental problem, the logical next step was to pinpoint the revision which caused the problem, so we could either fix it or get performance gains from somewhere else in the system.

The most obvious way to do this, given there were no obvious suspects, is with a binary search of the revisions.  Our last known good revision was 081; our first poor-performing one was 240.  So the thing to do is to check revision 160 and see if it falls on the good or the poor performance side.

If 160 proves to be a poor performer, check revision 120….

…if 160 is fine, test revision 200…

…and keep splitting the revisions by half until you find the suspect.
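
In code, the bisection is a tiny loop - a sketch, assuming a hypothetical testsBad check that builds a given revision and runs the performance test against it:

    // Narrow the (lastGood, firstBad) range until the two are adjacent;
    // firstBad is then the revision that introduced the problem.
    // testsBad is a hypothetical closure that returns true when the given
    // revision shows the performance problem.
    int findFirstBadRevision(int lastGood, int firstBad, Closure<Boolean> testsBad) {
        while (firstBad - lastGood > 1) {
            int middle = (lastGood + firstBad).intdiv(2)
            if (testsBad(middle)) {
                firstBad = middle    // introduced at or before the middle revision
            } else {
                lastGood = middle    // introduced after the middle revision
            }
        }
        return firstBad
    }

    // e.g. findFirstBadRevision(81, 240) { revision -> performanceTestIsSlowAt(revision) }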

So of course that’s what I want to do with my stupid Christmas lights.  I do not want to check each light bulb sequentially - that has a worst-case number-of-bulbs-tried = n, where n is the number of bulbs (probably a couple of hundred, although it felt like several thousand).  So, in computer speak, O(n).  The binary search algorithm is O(log n).  At university, this Big Oh stuff had no context for me.  But when you’ve taken 10 minutes to get a quarter of the way through your Christmas lights, and you’ve spent days (yes, days) diagnosing your IE performance problem, the point is clear: a binary search for the missing bulb would definitely have been a Good Thing.

I know you’re dying to know if I tracked down the problem in Internet Explorer - I did.  What’s the most annoying case when you’re doing a binary search?  It’s when the thing you’re looking for turns out to be veeeery close to either your start point or your end point - the search takes just as many steps as ever, but you’re left feeling you could almost have guessed it.

The revision number I was after was number 237.  Sigh.

And my Christmas tree lights?  Well, through the boredom I remembered that modern lights are wired in sections, so they have a sort of built-in binary search - only a limited segment will go dark if a single bulb is out - which allows you to narrow down your problem area.  Since the whole string was out, I figured something else was probably wrong.

Turned out the plug had come out of the socket.

So:
Lesson 1: Theoretical computer science does have a place when you care about how long something takes.  When it’s you feeling the pain, you’ll do anything to make it stop.

Lesson 2: When diagnosing a problem you will always be biased towards what you think it is, even in the face of actual evidence.  I was afraid I would have to search the whole set of lights for a blown bulb, so that’s the problem I was looking for when the lights failed.  In actual fact it was a problem with a much simpler solution.

[1] OK, “worst fear” in this very limited context only - it’s not like I lie awake at night in July afraid that one of the bulbs on my Christmas lights has blown.