Using Groovy to import XML into MongoDB

This year I’ve been demonstrating how easy it is to create modern web apps using AngularJS, Java and MongoDB. I also use Groovy during this demo to do the sorts of things Groovy is really good at - writing descriptive tests, and creating scripts.

Due to the time pressures in the demo, I never really get a chance to go into the details of the script I use, so the aim of this long-overdue blog post is to go over this Groovy script in a bit more detail.

Firstly I want to clarify that this is not my original work - I ~~stole~~ borrowed most of the ideas for the demo from my colleague Ross Lawley. In his blog post he goes into detail about how he built up an application that finds the most popular pub names in the UK. There’s a section in there where he talks about downloading the open street map data and using Python to convert the XML into something more MongoDB-friendly - it’s this process that I basically stole, re-worked for coffee shops, and re-wrote for the JVM.

I’m assuming that if you’ve worked with Java for any length of time, there has come a moment when you needed to use it to parse XML - and it’s rarely fun. Since my demo is supposed to be all about how easy it is to work with Java, I didn’t want to spend it wrestling with XML parsing. When I wrote the demo I wasn’t really all that familiar with Groovy, but what I did know was that it has built-in support for parsing and manipulating XML, which is exactly what I wanted to do. In addition, creating Maps (the data structures, not the geographical ones) with Groovy is really easy, and a Map is effectively what we need to insert into MongoDB.

Goal of the Script

  • Parse an XML file containing open street map data of all coffee shops.
  • Extract latitude and longitude XML attributes and transform into MongoDB GeoJSON.
  • Perform some basic validation on the coffee shop data from the XML.
  • Insert into MongoDB.
  • Make sure MongoDB knows this contains query-able geolocation data.

The script is PopulateDatabase.groovy - that link will take you to the version I presented at JavaOne.

"PopulateDatabase.groovy"

Firstly, we need data

I used the same service Ross used in his blog post to obtain the XML file containing “all” coffee shops around the world. Now, the open street map data is somewhat… raw and unstructured (which is why MongoDB is such a great tool for storing it), so I’m not sure I really have all the coffee shops, but I obtained enough data for an interesting demo using

http://www.overpass-api.de/api/xapi?*[amenity=cafe][cuisine=coffee_shop] 

The resulting XML file is in the GitHub project, but if you try this yourself you might (in fact, probably will) get different results.

Each XML record looks something like:

<node id="178821166" lat="40.4167226" lon="-3.7069112">     <tag k="amenity" v="cafe"/>     <tag k="cuisine" v="coffee_shop"/>     <tag k="name" v="Chocolatería San Ginés"/>     <tag k="wheelchair" v="limited"/>     <tag k="wikipedia" v="es:Chocolatería San Ginés"/> </node> 

Each coffee shop has a unique identifier and a latitude and longitude as attributes of a node element. Within this node is a series of tag elements, all with k and v attributes. Each coffee shop has a varying number of these attributes, and they are not consistent from shop to shop (other than amenity and cuisine which we used to select this data).

Initialisation

<img src="https://trishagee.github.io/static/images/GroovyScript1.png&#34; alt="Script Initialisation" title="Script Initialisation">

Before doing anything else we want to prepare the database. The assumption of this script is that the collection we want to store the coffee shops in is either empty or full of stale data. So we’re going to use the [MongoDB Java Driver](http://docs.mongodb.org/ecosystem/drivers/java/) to get the collection that we’re interested in, and then drop it.

There are two interesting things to note here:

  • This Groovy script is simply using the basic Java driver. Groovy can talk quite happily to vanilla Java, it doesn’t need to use a Groovy library. There are Groovy-specific libraries for talking to MongoDB (e.g. the MongoDB GORM Plugin), but the Java driver works perfectly well.
  • You don’t need to create databases or collections (collections are a bit like tables, but less structured) explicitly in MongoDB. You simply use the database and collection you’re interested in, and if they don’t already exist, the server will create them for you.

In this example, we’re just using the default constructor for the MongoClient, the class that represents the connection to the database server(s). This default is localhost:27017, which is where I happen to be running the database. However you can specify your own address and port - for more details on this see Getting Started With MongoDB and Java.
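
As a rough sketch (the database and collection names here are illustrative, not necessarily the ones the real script uses), that initialisation looks something like this:

import com.mongodb.MongoClient

// The default constructor connects to localhost:27017
MongoClient mongoClient = new MongoClient()

// Databases and collections don't need to be created up front - just ask for them by name
def database = mongoClient.getDB('coffeeDemo')
def collection = database.getCollection('coffeeShops')

// Throw away any stale data from a previous run
collection.drop()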

Turn the XML into something MongoDB-shaped

<img src="https://trishagee.github.io/static/images/GroovyScript2.png&#34; alt="Parse & Transform XML" title="Parse & Transform XML">

So next we’re going to use Groovy’s XmlSlurper to read the open street map XML data that we talked about earlier. To iterate over every node we use: xmlSlurper.node.each. For those of you who are new to Groovy or new to Java 8, you might notice this is using a closure to define the behaviour to apply for every “node” element in the XML.
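
In outline (the file name and the shop variable name are just placeholders for this sketch), that looks something like:

// Parse the OpenStreetMap XML and iterate over every <node> element
def xmlSlurper = new XmlSlurper().parse(new File('coffee-shops.xml'))

xmlSlurper.node.each { shop ->
    // build a Map for this coffee shop and insert it into MongoDB (see the following sections)
}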

Create GeoJSON

<img src="https://trishagee.github.io/static/images/GroovyScript3.png&#34; alt="Create GeoJSON" title="Create GeoJSON"> Since MongoDB documents are effectively just maps of key-value pairs, we’re going to create a Map coffeeShop that contains the document structure that represents the coffee shop that we want to save into the database. Firstly, we initialise this map with the attributes of the node. Remember these attributes are something like:

<node id="18464077" lat="-33.8911183" lon="151.1958773"> 

We’re going to save the ID as a value for a new field called openStreetMapId. We need to do something a bit more complicated with the latitude and longitude, since we need to store them as GeoJSON, which looks something like:

{ 'location' : { 'coordinates': [<longitude>, <latitude>],
                 'type'       : 'Point' } }

In lines 12-14 you can see that we create a Map that looks like the GeoJSON, pulling the lat and lon attributes into the appropriate places.
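
Inside the closure from the previous section, that part of the map-building looks roughly like this (a sketch rather than the exact lines from the script):

def coffeeShop = [:]
coffeeShop.openStreetMapId = shop.@id.toString()

// GeoJSON wants longitude first, then latitude, as numbers
coffeeShop.location = [coordinates: [shop.@lon.toDouble(), shop.@lat.toDouble()],
                       type       : 'Point']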

Insert Remaining Fields

<img src="https://trishagee.github.io/static/images/GroovyScript4.png&#34; alt="Insert Remaining Fields" title="Insert Remaining Fields"> <img src="https://trishagee.github.io/static/images/GroovyScript5.png&#34; alt="Validate Field Name" title="Validate Field Name"> Now for every tag element in the XML, we get the k attribute and check if it’s a valid field name for MongoDB (it won’t let us insert fields with a dot in, and we don’t want to override our carefully constructed location field). If so we simply add this key as the field and its the matching v attribute as the value into the map. This effectively copies the OpenStreetMap key/value data into key/value pairs in the MongoDB document so we don’t lose any data, but we also don’t do anything particularly interesting to transform it.

Save Into MongoDB

<img src="https://trishagee.github.io/static/images/GroovyScript6.png&#34; alt="Save Into MongoDB" title="Save Into MongoDB"> Finally, once we’ve created a simple coffeeShop Map representing the document we want to save into MongoDB, we insert it into MongoDB if the map has a field called name. We could have checked this when we were reading the XML and putting it into the map, but it’s actually much easier just to use the pretty Groovy syntax to check for a key called name in coffeeShop.

When we want to insert the Map we need to turn this into a BasicDBObject, the Java Driver’s document type, but this is easily done by calling the constructor that takes a Map. Alternatively, there’s a Groovy syntax which would effectively do the same thing, which you might prefer:

collection.insert(coffeeShop as BasicDBObject) 
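
For comparison, the constructor version mentioned above, guarded by the check for a name key, would look something like:

// Only save coffee shops that actually have a name
if (coffeeShop.name) {
    collection.insert(new BasicDBObject(coffeeShop))
}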

Tell MongoDB that we want to perform Geo queries on this data

<img src="https://trishagee.github.io/static/images/GroovyScript7.png&#34; alt="Add Geo Index" title="Add Geo Index"> Because we’re going to do a nearSphere query on this data, we need to add a “2dsphere” index on our location field. We created the location field as GeoJSON, so all we need to do is call createIndex for this field.

Conclusion

So that’s it! Groovy is a nice tool for this sort of script-y thing - not only is it a scripting language, but its built-in support for XML, really nice Map syntax and support for closures make it the perfect tool for iterating over XML data and transforming it into something that can be inserted into a MongoDB collection.

Converting Blogger to Markdown

I’ve been using Blogger happily for three years or so, since I migrated the blog from LiveJournal and decided to actually invest some time writing. I’m happy with it because I just type stuff into Blogger and It Just Works. I’m happy because I can use my Google credentials to sign in. I’m happy because now I can pretend my two Google+ accounts exist for a purpose, by getting Blogger to automatically share my content there.

A couple of things have been problematic for the whole time I’ve been using it though:

  1. Code looks like crap, no matter what you do.
  2. Pictures are awkwardly jammed in to the prose like a geek mingling at a Marketing event.

The first problem I’ve tried to solve a number of ways, with custom CSS at a blog- and a post- level. I was super happy when I discovered gist - it gave me lovely code highlighting without all the nasty CSS. It’s still not ideal in a Blogger world though, as the gist doesn’t appear in your WYSIWYG editor, leading you to all sorts of tricks to try not to accidentally delete it. Also I was too lazy to migrate old code over, so now my blog is a mish-mash of code styles, particularly where I changed the global CSS multiple times, leaving old code in a big fat mess. There’s a lesson to be learned there somewhere.

The second problem, photos, I just gave up on. I decided I would end up wasting too much time trying to make the thing look pretty, and I’d never get around to posting anything. So my photos are always dropped randomly into the blogs - it’s better than a whole wall of prose (probably).

But I’ve been happy overall, the main reason being I don’t have to maintain anything, I don’t have to worry about my web server going down, I don’t have versions of a blog platform to maintain, patch, upgrade; I can Just Write.

But last week my boss and my colleague were both on at me to try Hugo, a site generator created by my boss. I was resistant because I do not want to maintain my own blog platform, but then Christian explained how I can write my posts in markdown, use Hugo to generate the content, and then host it on GitHub Pages. It sounded relatively painless.

I’ve been considering a move to something that supports markdown for a while, for the following reasons:

  1. These days I write at least half of my posts on the plane, so I use TextEdit to write the content, and later paste this into blogger and add formatting. It would be better if I could write markdown to begin with.
  2. Although I’ve always disliked wiki-type syntax for documentation, markdown is actually not despicable, and lets me add simple formatting easily without getting in my way or breaking my flow.

So I spent a few days playing with Hugo to see what it was, how it worked, and whether it was going to help me. I’ve come up with a few observations:

Hugo really is lightning fast. If I add a .md file in the appropriate place while the Hugo server is running on my local machine, it turns this into real HTML in (almost) less time than it takes for me to refresh the browser on the second monitor. Edits to existing files appear almost instantly, so I can write a post and preview it really easily. It beats the hell out of Blogger’s Preview feature, which I always need to use if I’m doing anything other than posting simple prose.

It’s awesome to type my blog in IntelliJ. Do you find yourself trying to use IntelliJ shortcuts in other editors? The two I miss the most when I’m not in IntelliJ are Cmd+Y to delete a line, and Ctrl+Shift+J to bring the next line up. Writing markdown in IntelliJ with my usual shortcuts (and the markdown plugin) is really easy and productive. Plus, of course, you get IntelliJ’s ability to paste from any item in the clipboard history. And I don’t have to worry about those random intervals when blogger tells me it hasn’t saved my content, and I have no idea if I will just lose hours of work.

I now own my own content. It never really occurred to me before that all the effort I’ve put into three years of regular blogging is out there, on some Google servers somewhere, and I don’t have a copy of that material. That’s dumb, and it doesn’t reflect how seriously I take my writing. Now I have that content here, on my laptop, and it’s also backed up on GitHub, both as raw markdown and as generated HTML, and versioned. Massive massive win.

I have more control over how things are rendered, and I can customise the display much more. This has drawbacks though too, as it’s exactly this freedom-to-play that I worry will distract me from actual writing.

As with every project that’s worth trying, it wasn’t completely without pain. I followed the (surprisingly excellent) documentation, as well as these guidelines, but I did run into some fiddly bits:

  1. I couldn’t quite get my head around the difference between my Hugo project code and my actual site content to begin with: how to put them into source control and how to get my site onto GitHub Pages. I’ve ended up with two projects on GitHub, even though the generated code is technically a subtree of the Hugo project. I think I’m happy with that.
  2. I’m not really sure about the difference between tags, keywords, and topics, if I’m honest. Maybe this is something I’ll grow into.
  3. I really need to spend some time on the layout and design, I don’t want to simply rip off Steve’s original layout. Plus there are things I would like to have on the main page which are missing.
  4. I needed to convert my old content to the new format.
  5. I needed to do the final migration from the old blog to the new one (this is still incomplete).

To address the last point first, I’m not sure yet if I will take the plunge and do full redirection from Blogger to the new GitHub Pages site (and redirect my domains too); for a while I’m going to run both in parallel and see how I feel.

As for the fourth point, I didn’t find a tool for migrating Blogger blogs into markdown that didn’t require me to install some other tool or language, and there was nothing that was specifically Hugo-shaped, so I surprised myself and did what every programmer would do - I wrote my own. Surprising because I’m not normally that sort of person - I like to use tools that other people have written, I like things that Just Work, and I spend all my time coding for my job so I can’t be bothered to devote extra time to it. But my recent experiences with Groovy had convinced me that I could write a simple Groovy parser that would take my exported blog (in Atom XML format) and turn it into a series of markdown files. And I was right, I could. So I’ve created a new GitHub project, atom-to-hugo. It’s very rough, but a) it works and b) it even has tests. And documentation.

I don’t know what’s come over me lately, I’ve been a creative, coding machine.

In summary, I’m pretty happy with the new way of working, but it’s going to take me a while to get used to it and decide if it’s the way I want to go. At the very least, I now have my Blogger content as something markdown-ish.

But there are a couple of things I miss about Blogger:

  1. I actually like the way it shows the blog archive on the right hand side, split into months and years. I use that to motivate me to blog more if a month looks kinda empty.
  2. While Google Analytics is definitely more powerful than the simple Blogger analytics, I find Blogger’s an easier way to get a quick insight into whether people are reading the blog, and which paths they take to find it.

I don’t think either of these are showstoppers, I should be able to work around both of them.

Spock: Data Driven Testing

In the last two articles on Spock I've covered mocking and stubbing. And I was pretty sold on Spock just based on that. But for a database driver, there's a killer feature: Data Driven Testing.

All developers have a tendency to think of and test the happy path. Not least of all because that's usually the path in the User Story - "As a customer I want to withdraw money and have the correct amount in my hand". We tend not to ask "what happens if they ask to withdraw money when the cash machine has no cash?" or "what happens when their account balance is zero?".

With any luck you'll have a test suite covering your happy paths, and probably at least twice as many grumpy paths. If you're like me, and you like one test to test one thing (and who doesn't?), sometimes your test classes can get quite long as you test various edge cases. Or, much worse (and I've done this too) you use a calculation remarkably like the one you're testing to generate test data. You run your test in a loop with the calculation and lo! The test passes. Woohoo?

Not that long ago I went through a process of re-writing a lot of unit tests that I had written a year or two before - we were about to do a big refactor of the code that generated some important numbers, and we wanted our tests to tell us we hadn't broken anything with the refactor. The only problem was, the tests used a calculation rather similar to the production calculation, and borrowed some constants to create the expected number. I ended up running the tests to find the numbers the test was generating as expected values, and hardcoding those values into the test. It felt dirty, but it was necessary - I wanted to make sure the refactoring didn't change the expected numbers as well as the ones generated by the real code. This is not a process I want to go through ever again.

When you're testing these sorts of things, you try and think of a few representative cases, code them into your tests, and hope that you've covered the main areas. What would be far nicer is if you could shove a whole load of different data into your system-under-test and make sure the results look sane.

An example from the Java driver is that we had tests that were checking the parsing of the URI - you can initialise your MongoDB settings simply using a String containing the URI.

The old tests looked like:

@Test()
public void testSingleServer() {
    MongoClientURI u = new MongoClientURI("mongodb://db.example.com");
    assertEquals(1, u.getHosts().size());
    assertEquals("db.example.com", u.getHosts().get(0));
    assertNull(u.getDatabase());
    assertNull(u.getCollection());
    assertNull( u.getUsername());
    assertEquals(null, u.getPassword());
}

@Test()
public void testWithDatabase() {
    MongoClientURI u = new MongoClientURI("mongodb://foo/bar");
    assertEquals(1, u.getHosts().size());
    assertEquals("foo", u.getHosts().get(0));
    assertEquals("bar", u.getDatabase());
    assertEquals(null, u.getCollection());
    assertEquals(null, u.getUsername());
    assertEquals(null, u.getPassword());
}

@Test()
public void testWithCollection() {
    MongoClientURI u = new MongoClientURI("mongodb://localhost/test.my.coll");
    assertEquals("test", u.getDatabase());
    assertEquals("my.coll", u.getCollection());
}

@Test()
public void testBasic2() {
    MongoClientURI u = new MongoClientURI("mongodb://foo/bar.goo");
    assertEquals(1, u.getHosts().size());
    assertEquals("foo", u.getHosts().get(0));
    assertEquals("bar", u.getDatabase());
    assertEquals("goo", u.getCollection());
}

(view gist, and see the code in its original home: MongoClientURITest)

Using Spock's data driven testing, we changed this to:

@Unroll
def 'should parse #uri into correct components'() {
    expect:
    uri.getHosts().size() == num;
    uri.getHosts() == hosts;
    uri.getDatabase() == database;
    uri.getCollection() == collection;
    uri.getUsername() == username;
    uri.getPassword() == password;

    where:
    uri                                            | num | hosts              | database | collection | username | password
    new MongoClientURI('mongodb://db.example.com') | 1   | ['db.example.com'] | null     | null       | null     | null
    new MongoClientURI('mongodb://foo/bar')        | 1   | ['foo']            | 'bar'    | null       | null     | null
    new MongoClientURI('mongodb://localhost/' +
                               'test.my.coll')     | 1   | ['localhost']      | 'test'   | 'my.coll'  | null     | null
    new MongoClientURI('mongodb://foo/bar.goo')    | 1   | ['foo']            | 'bar'    | 'goo'      | null     | null
    new MongoClientURI('mongodb://user:pass@' +
                               'host/bar')         | 1   | ['host']           | 'bar'    | null       | 'user'   | 'pass' as char[]
    new MongoClientURI('mongodb://user:pass@' +
                               'host:27011/bar')   | 1   | ['host:27011']     | 'bar'    | null       | 'user'   | 'pass' as char[]
    new MongoClientURI('mongodb://user:pass@' +
                               'host:7,' +
                               'host2:8,' +
                               'host3:9/bar')      | 3   | ['host:7',
                                                            'host2:8',
                                                            'host3:9']        | 'bar'    | null       | 'user'   | 'pass' as char[]
}

(view gist, and see the code in its original home: MongoClientURISpecification)

Instead of having a separate test for every type of URL that needs parsing, you have a single test and each line in the where: section is a new combination of input URL and expected outputs. Each one of those lines used to be a test. In fact, some of them probably weren't tests as the ugliness and overhead of adding another copy-paste test seemed like overkill. But here, in Spock, it's just a case of adding one more line with a new input and set of outputs.

The major benefit here, to me, is that it's dead easy to add another test for a "what if?" that occurs to the developer. You don't have to add yet another test method that leaves someone else wondering "what the hell are we testing this for?". You just add another line which documents another set of expected outputs given the new input.

It's easy, it's neat, it's succinct.

One of the major benefits of this to our team is that we don't argue any more about whether a single test is testing too much. In the past, we had tests like:

@Test
public void testGetLastErrorCommand() {
    assertEquals(new BasicDBObject("getlasterror", 1), WriteConcern.UNACKNOWLEDGED.getCommand());
    assertEquals(new BasicDBObject("getlasterror", 1), WriteConcern.ACKNOWLEDGED.getCommand());
    assertEquals(new BasicDBObject("getlasterror", 1).append("w", 2), WriteConcern.REPLICA_ACKNOWLEDGED.getCommand());
    assertEquals(new BasicDBObject("getlasterror", 1).append("j", true), WriteConcern.JOURNALED.getCommand());
    assertEquals(new BasicDBObject("getlasterror", 1).append("fsync", true), WriteConcern.FSYNCED.getCommand());
    assertEquals(new BasicDBObject("getlasterror", 1).append("w", "majority"), new WriteConcern("majority").getCommand());
    assertEquals(new BasicDBObject("getlasterror", 1).append("wtimeout", 100), new WriteConcern(1, 100).getCommand());
}

(view gist)

And I can see why we have all those assertions in the same test, because technically these are all the same concept - make sure that each type of WriteConcern creates the correct command document. I believe these should be one test per line - because each line in the test is testing a different input and output, and I would want to document that in the test name ("fsync write concern should have fsync flag in getLastError command", "journalled write concern should set j flag to true in getLastError command" etc). Also don't forget that in JUnit, if the first assert fails, the rest of the test is not run. Therefore you have no idea if this is a failure that affects all write concerns, or just the first one. You lose the coverage provided by the later asserts.

But the argument against my viewpoint is then we'd have seven different one-line tests. What a waste of space.

You could argue for days about the best way to do it, or that this test is a sign of some other smell that needs addressing. But if you're in a real world project and your aim is to both improve your test coverage and improve the tests themselves, these arguments are getting in the way of progress. The nice thing about Spock is that you can take these tests that test too much, and turn them into something a bit prettier:

@Unroll
def '#wc should return getlasterror document #commandDocument'() {
    expect:
    wc.asDocument() == commandDocument;

    where:
    wc                                | commandDocument
    WriteConcern.UNACKNOWLEDGED       | ['getlasterror': 0]
    WriteConcern.ACKNOWLEDGED         | ['getlasterror': 1]
    WriteConcern.REPLICA_ACKNOWLEDGED | ['getlasterror': 1, 'w': 2]
    WriteConcern.JOURNALED            | ['getlasterror': 1, 'j': true]
    WriteConcern.FSYNCED              | ['getlasterror': 1, 'fsync': true]
    new WriteConcern('majority')      | ['getlasterror': 1, 'w': 'majority']
    new WriteConcern(1, 100)          | ['getlasterror': 1, 'wtimeout': 100]
}

(view gist)

You might be thinking, what's the advantage over the JUnit way? Isn't that the same thing but Groovier? But there's one important difference - every line under where: gets run, regardless of whether the lines before it pass or fail. This basically is seven different tests, but takes up the same space as one.

That's great, but if just one of these lines fails, how do you know which one it was when all seven tests are masquerading as one? That's where the awesome @Unroll annotation comes in. This reports the passing or failing of each line as if it were a separate test, so each row in the where: block shows up as its own result in the test report.

But in the test above we put some magic keywords into the test name: #wc should return getlasterror document #commandDocument. Note that the values with # in front are the same headings from the where: section - they get replaced by the values being used in the current iteration, so the reported name of each unrolled test tells you exactly which input and expected output it covered.

Yeah, it can be a bit of a mouthful if the toString is hefty, but it does give you an idea of what was being tested, and it's prettier if the inputs have nice succinct string values.

This, combined with Spock's awesome power assert, makes it dead simple to see what went wrong when one of these tests fails. Take the example of (somehow) the incorrect host being returned for one of the input URIs: the power assert prints the value of each part of the failing condition, so you can see at a glance which row, and which field in that row, didn't match.

Data driven testing might lead one to over-test the simple things, but the cost of adding another "what if?" is so low - just another line - and the additional safety you get from trying a different input is rather nice. We've been using it for parsers and simple generators, where you want to throw a bunch of inputs at a single method and see what comes out.

I'm totally sold on this feature, particularly for our type of application (the Java driver does a lot of taking stuff in one shape and turning it into something else). Just in case you want one more example, here it is.

The old way:

@Test
public void shouldGenerateIndexNameForSimpleKey() {
    final Index index = new Index("x");
    assertEquals("x_1", index.getName());
}

@Test
public void shouldGenerateIndexNameForKeyOrderedAscending() {
    final Index index = new Index("x", OrderBy.ASC);
    assertEquals("x_1", index.getName());
}

@Test
public void shouldGenerateIndexNameForKeyOrderedDescending() {
    final Index index = new Index("x", OrderBy.DESC);
    assertEquals("x_-1", index.getName());
}

@Test
public void shouldGenerateGeoIndexName() {
    final Index index = new Index(new Index.GeoKey("x"));
    assertEquals("x_2d", index.getName());
}

@Test
public void shouldCompoundIndexName() {
    final Index index = new Index(new Index.OrderedKey("x", OrderBy.ASC),
                                  new Index.OrderedKey("y", OrderBy.ASC),
                                  new Index.OrderedKey("a", OrderBy.ASC));
    assertEquals("x_1_y_1_a_1", index.getName());
}

@Test
public void shouldGenerateGeoAndSortedCompoundIndexName() {
    final Index index = new Index(new Index.GeoKey("x"),
                                  new Index.OrderedKey("y", OrderBy.DESC));
    assertEquals("x_2d_y_-1", index.getName());
}

(view gist)

...and in Spock:

@Unroll
def 'should generate index name #indexName for #index'() {
    expect:
    index.getName() == indexName;

    where:
    index                                              | indexName
    new Index('x')                                     | 'x_1'
    new Index('x', OrderBy.ASC)                        | 'x_1'
    new Index('x', OrderBy.DESC)                       | 'x_-1'
    new Index(new Index.GeoKey('x'))                   | 'x_2d'
    new Index(new Index.OrderedKey('x', OrderBy.ASC),
              new Index.OrderedKey('y', OrderBy.ASC),
              new Index.OrderedKey('a', OrderBy.ASC))  | 'x_1_y_1_a_1'
    new Index(new Index.GeoKey('x'),
              new Index.OrderedKey('y', OrderBy.DESC)) | 'x_2d_y_-1'
}

(view gist)

Spock passes the next test – Painless Stubbing

In the last post I talked about our need for some improved testing tools, our choice of Spock as something to spike, and how mocking looks in Spock.

As that blog got rather long, I saved the next installment for a separate post.

Today I want to look at stubbing.

Stubbing

Mocking is great for checking outputs - in the example in the last post, we're checking that the process of encoding an array calls the right things on the way out, if you like - that the right stuff gets poked onto the bsonWriter.

Stubbing is great for faking your inputs (I don't know why this difference never occurred to me before, but Colin's talk at Devoxx UK made this really clear to me).

One of the things we need to do in the compatibility layer of the new driver is to wrap all the new style Exceptions that can be thrown by the new architecture layer and turn them into old-style Exceptions, for backwards compatibility purposes. Sometimes testing the exceptional cases is... challenging. So I opted to do this with Spock.

class DBCollectionSpecification extends Specification {
    private final Mongo mongo = Mock()
    private final ServerSelectingSession session = Mock()

    private final DB database = new DB(mongo, 'myDatabase', new DocumentCodec())
    
    @Subject
    private final DBCollection collection = new DBCollection('collectionName', database, new DocumentCodec())

    def setup() {
        mongo.getSession() >> { session }
    }

    def 'should throw com.mongodb.MongoException if rename fails'() {
        setup:
        session.execute(_) >> { throw new org.mongodb.MongoException('The error from the new Java layer') }

        when:
        collection.rename('newCollectionName');

        then:
        thrown(com.mongodb.MongoException)
    }
}

(view gist)

So here we can use a real DB class, but with a mock Mongo that will return us a "mock" Session. It's not actually a mock though, it's more of a stub, because we want to tell it how to behave when it's called - in this test, we want to force it to throw an org.mongodb.MongoException whenever execute is called. It doesn't matter to us what gets passed in to the execute method (that's what the underscore means on line 16); what matters is that when it gets called it throws the correct type of Exception.

Like before, the when: section shows the bit we're actually trying to test. In this case, we want to call rename.

Then finally the then: section asserts that we received the correct sort of Exception. It's not enormously clear, even though I've kept the fully qualified names in to try and clarify it, but the aim is that any org.mongodb.MongoException that gets thrown by the new architecture gets turned into the appropriate com.mongodb.MongoException. We're sort of "lucky" because the old code is in the wrong package structure, and in the new architecture we've got a chance to fix this and put stuff into the right place.

Once I'd tracked down all the places Exceptions can escape and started writing these sorts of tests to exercise those code paths, not only did I feel more secure that we wouldn't break backwards compatibility by leaking the wrong Exceptions, but we also found our test coverage went up - and more importantly, in the unhappy paths, which are often harder to test.

I mentioned in the last post that we already did some simple stubbing to help us test the driver. Why not just keep using that approach?

Well, these stubs end up looking like this:

private static class TestAsyncConnectionFactory implements AsyncConnectionFactory {
    @Override
    public AsyncConnection create(final ServerAddress serverAddress) {
        return new AsyncConnection() {
            @Override
            public void sendMessage(final List<ByteBuf> byteBuffers, final SingleResultCallback<Void> callback) {
                throw new UnsupportedOperationException();
            }

            @Override
            public void receiveMessage(final ResponseSettings responseSettings, final SingleResultCallback<ResponseBuffers> callback) {
                throw new UnsupportedOperationException();
            }

            @Override
            public void close() {
            }

            @Override
            public boolean isClosed() {
                throw new UnsupportedOperationException();
            }

            @Override
            public ServerAddress getServerAddress() {
                throw new UnsupportedOperationException();
            }
        };
    }
}

(view gist)
Ick.

And you end up extending them so you can just override the method you're interested in (particularly in the case of forcing a method to throw an exception). Most irritatingly to me, these stubs live away from the actual tests, so you can't easily see what the expected behaviour is. In the Spock test, the expected stubbed behaviour is defined on line 16, the call that will provoke it is on line 19 and the code that checks the expectation is on line 22. It's all within even the smallest monitor's window.

So stubbing in Spock is painless. Next:

Spock is awesome! Seriously Simplified Mocking

We're constantly fighting a battle when developing the new MongoDB Java driver between using tools that will do heavy lifting for us and minimising the dependencies a user has to download in order to use our driver. Ideally, we want the number of dependencies to be zero.

This is not going to be the case when it comes to testing, however. At the very least, we're going to use JUnit or TestNG (we used TestNG in the previous version; we've switched to JUnit for 3.0). Up until recently, we worked hard to eliminate the need for a mocking framework - the driver is not a large application with interacting services, and most stuff can be tested either as an integration test or with very simple stubs.

Recently I was working on the serialisation layer - we're making quite big changes to the model for encoding and decoding between BSON and Java, which we hope will simplify our lives but also make things a lot easier for the ODMs (Object-Document Mappers) and third party libraries. At this level it makes a lot of sense to introduce mocks - I want to ensure particular methods are called on the writer, for example. I don't want to check actual byte values; that's not going to be very helpful for documentation (although there is a level where that is a sensible thing to do).

We started with JMock - it's what I've been using for a while, and it gave us what we wanted: a simple mocking framework. (I tried Mockito too, but I'm not so used to its failure messages, so I found it really hard to figure out what was wrong when a test failed.)

I knew from my spies at LMAX that there's some Groovy test framework called Spock that is awesome, apparently, but I immediately discarded it - I feel very strongly that tests are documentation, and since the users of the Java driver are largely Java developers, I felt like introducing tests in a different language was an added complexity we didn't need.

Then I went to GeeCON, and my ex-colleague Israel forced me to go to the talk on Spock. And I realised just how wrong I had been. Far from adding complexity, here was a lovely, descriptive way of writing tests. It's flexible, and yet structured enough to get you thinking in a way that should create good tests.

Since we're already using Gradle, whose build scripts are Groovy as well, we decided it was worth a spike to see if Spock would give us any benefits.

During the spike I converted a selection of our tests to Spock tests to see what it looks like on a real codebase. I had very specific things I wanted to try out:

  • Mocking
  • Stubbing
  • Data driven testing

In the talk I also saw useful annotations like @Requires, which I'm pretty sure we're going to use, but I don't think they've made it into a build yet.

So, get this, I'm going to write a blog post with Actual Code in. Yeah, I know, you all thought I was just a poncy evangelist these days and didn't do any real coding any more.

First up, Mocking

So, as I said, I have a number of tests checking that encoding of Java objects works the way we expect. The easiest way to test this is to mock our BSONWriter class to ensure that the right interactions are happening against it. This is a nice way to check that when you give an encoder a particular set of data, it gets serialised in the way BSON expects. These tests ended up looking something like this:

@Test
public void shouldEncodeListOfStrings() {
    final List<String> stringList = asList("Uno", "Dos", "Tres");

    context.checking(new Expectations() {{
        oneOf(bsonWriter).writeStartArray();
        oneOf(bsonWriter).writeString("Uno");
        oneOf(bsonWriter).writeString("Dos");
        oneOf(bsonWriter).writeString("Tres");
        oneOf(bsonWriter).writeEndArray();
    }});

    iterableCodec.encode(bsonWriter, stringList);
}

(view gist)

(Yeah, I'm still learning Spanish).

So that's quite nice, my test checks that given a List of Strings, they get serialised correctly. What's not great is some of the setup overhead:

public class IterableCodecTest {

    //CHECKSTYLE:OFF
    @Rule
    public final JUnitRuleMockery context = new JUnitRuleMockery();
    //CHECKSTYLE:ON
    // Mocks
    private BSONWriter bsonWriter;
    // Under test
    private final IterableCodec iterableCodec = new IterableCodec(Codecs.createDefault());

    @Before
    public void setUp() {
        context.setImposteriser(ClassImposteriser.INSTANCE);
        context.setThreadingPolicy(new Synchroniser());
        bsonWriter = context.mock(BSONWriter.class);
    }

    @Test
    public void shouldEncodeListOfStrings() {
        final List<String> stringList = asList("Uno", "Dos", "Tres");

        context.checking(new Expectations() {{
            oneOf(bsonWriter).writeStartArray();
            oneOf(bsonWriter).writeString("Uno");
            oneOf(bsonWriter).writeString("Dos");
            oneOf(bsonWriter).writeString("Tres");
            oneOf(bsonWriter).writeEndArray();
        }});

        iterableCodec.encode(bsonWriter, stringList);
    }
}

(view gist)

Obviously some of the things there are going to be ringing some people's alarm bells, but let's assume for a minute that all decisions were taken carefully and that pros and cons were weighed accordingly.

So:

  • Mocking concrete classes is not pretty in JMock, just look at that setUp method.
  • We're using the JUnitRuleMockery, which appears to be Best Practice (and means you're less likely to forget the @RunWith(JMock.class) annotation), but checkstyle hates it - Public Fields Are Bad as we all know.

But it's fine, a small amount of boilerplate for all our tests that involve mocking is an OK price to pay to have some nice tests.

I converted this test to a Spock test. Groovy purists will notice that it's still very Java-y, and that's intentional - I want these tests, at least at this stage while we're getting used to it, to be familiar to Java programmers, our main audience.

class IterableCodecSpecification extends Specification {
    private BSONWriter bsonWriter = Mock();

    @Subject
    private final IterableCodec iterableCodec = new IterableCodec(Codecs.createDefault());

    public void 'should encode list of strings'() {
        setup:
        List<String> stringList = ['Uno', 'Dos', 'Tres'];

        when:
        iterableCodec.encode(bsonWriter, stringList);

        then:
        1 * bsonWriter.writeStartArray();
        1 * bsonWriter.writeString('Uno');
        1 * bsonWriter.writeString('Dos');
        1 * bsonWriter.writeString('Tres');
        1 * bsonWriter.writeEndArray();
    }
}

(view gist)

Some initial observations:

  • It's a really simple thing, but I like having the @Subject annotation on the thing you're testing. In theory it should be obvious which of your fields or variables is the subject under test, but in practice that's not always true.
  • Although it freaks me out as someone who's been doing Java for the last 15 years, I really like the String for method name - although in this case it's the same as the JMock/JUnit equivalent, it gives a lot more flexibility for describing the purpose of this test.
  • Mocking is painless, with a simple call to Mock(), even though we're still mocking concrete classes (this is done simply by adding cglib and objenesis to the dependencies).
  • I love that the phases of Spock (setup: when: then:) document the different parts of the test while also being the useful magic keywords which tell Spock how to run the test. I know other frameworks provide this, but we've been working with JUnit and I've been in the habit of commenting my steps with //given //when //then.
  • Thanks to Groovy, creation of lists is less boilerplate (line 9). Not a big deal, but it just makes it easier to read.
  • I've got very used to the way expectations are set up in JMock, but I have to say that 1 * bsonWriter.blahblahblah() is much more readable.
  • I love that everything after then: is an assertion, I think it makes it really clear what you expect to happen after you invoke the thing you're testing.

So mocking is awesome. What's next?