Using Groovy to import XML into MongoDB

This year I've been demonstrating how easy it is to create modern web apps using AngularJS, Java and MongoDB. I also use Groovy during this demo to do the sorts of things Groovy is really good at - writing descriptive tests, and creating scripts.

Due to the time pressures in the demo, I never really get a chance to go into the details of the script I use, so the aim of this long-overdue blog post is to go over this Groovy script in a bit more detail.

Firstly I want to clarify that this is not my original work - I stole, er, borrowed most of the ideas for the demo from my colleague Ross Lawley. In this blog post he goes into detail of how he built up an application that finds the most popular pub names in the UK. There's a section in there where he talks about downloading the OpenStreetMap data and using Python to convert the XML into something more MongoDB-friendly - it's this process that I basically stole, re-worked for coffee shops, and re-wrote for the JVM.

I'm assuming if you've worked with Java for any period of time, there has come a moment where you needed to use it to parse XML. Since my demo is supposed to be all about how easy it is to work with Java, I did not want to do this. When I wrote the demo I wasn't really all that familiar with Groovy, but what I did know was that it has built-in support for parsing and manipulating XML, which is exactly what I wanted to do. In addition, creating Maps (the data structures, not the geographical ones) with Groovy is really easy, and this is effectively what we need to insert into MongoDB.

Goal of the Script

  • Parse an XML file containing open street map data of all coffee shops.
  • Extract latitude and longitude XML attributes and transform into MongoDB GeoJSON.
  • Perform some basic validation on the coffee shop data from the XML.
  • Insert into MongoDB.
  • Make sure MongoDB knows this contains query-able geolocation data.

The script is PopulateDatabase.groovy; that link will take you to the version I presented at JavaOne.


Firstly, we need data

I used the same service Ross used in his blog post to obtain the XML file containing "all" coffee shops around the world. Now, the OpenStreetMap data is somewhat... raw and unstructured (which is why MongoDB is such a great tool for storing it), so I'm not sure I really have all the coffee shops, but I obtained enough data for an interesting demo using the query *[amenity=cafe][cuisine=coffee_shop].

The resulting XML file is in the GitHub project, but if you try this yourself you might (in fact, probably will) get different results.

Each XML record looks something like:

<node id="178821166" lat="40.4167226" lon="-3.7069112">
    <tag k="amenity" v="cafe"/>
    <tag k="cuisine" v="coffee_shop"/>
    <tag k="name" v="Chocolatería San Ginés"/>
    <tag k="wheelchair" v="limited"/>
    <tag k="wikipedia" v="es:Chocolatería San Ginés"/>
</node>

Each coffee shop has a unique identifier and a latitude and longitude as attributes of a node element. Within this node is a series of tag elements, all with k and v attributes. Each coffee shop has a varying number of these tags, and they are not consistent from shop to shop (other than amenity and cuisine, which we used to select this data).


Script Initialisation

Before doing anything else we want to prepare the database. The assumption of this script is that either the collection we want to store the coffee shops in is empty, or full of stale data. So we're going to use the MongoDB Java Driver to get the collection that we're interested in, and then drop it.

There are two interesting things to note here:

  • This Groovy script is simply using the basic Java driver. Groovy can talk quite happily to vanilla Java, it doesn't need to use a Groovy library. There are Groovy-specific libraries for talking to MongoDB (e.g. the MongoDB GORM Plugin), but the Java driver works perfectly well.
  • You don't need to create databases or collections (collections are a bit like tables, but less structured) explicitly in MongoDB. You simply use the database and collection you're interested in, and if they don't already exist, the server will create them for you.

In this example, we're just using the default constructor for the MongoClient, the class that represents the connection to the database server(s). This default is localhost:27017, which is where I happen to be running the database. However you can specify your own address and port - for more details on this see Getting Started With MongoDB and Java.

Turn the XML into something MongoDB-shaped

Parse & Transform XML

So next we're going to use Groovy's XmlSlurper to read the open street map XML data that we talked about earlier. To iterate over every node we use: xmlSlurper.node.each. For those of you who are new to Groovy or new to Java 8, you might notice this is using a closure to define the behaviour to apply for every "node" element in the XML.
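To make that iteration concrete, here is a minimal sketch of slurping some XML and looping over the node elements. It is not the script from the demo: the inline sample (including the osm root element and the "Example Cafe" name) is made up so the snippet runs on its own, and the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x the class lives in groovy.util

// Made-up sample standing in for the real OpenStreetMap export
def sampleXml = '''
<osm>
  <node id="178821166" lat="40.4167226" lon="-3.7069112">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
  <node id="18464077" lat="-33.8911183" lon="151.1958773">
    <tag k="amenity" v="cafe"/>
  </node>
</osm>
'''.trim()

def osm = new XmlSlurper().parseText(sampleXml)

def ids = []
// The closure runs once for every <node> child of the root element
osm.node.each { node ->
    // Attributes are read with the @ prefix, child elements by name
    ids << node.@id.text()
    println "${node.@id}: ${node.tag.size()} tags"
}
```

The real script reads from a file rather than a String, but the each-with-a-closure shape is the same.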

Create GeoJSON

Since MongoDB documents are effectively just maps of key-value pairs, we're going to create a Map coffeeShop that contains the document structure that represents the coffee shop that we want to save into the database. Firstly, we initialise this map with the attributes of the node. Remember these attributes are something like:

<node id="18464077" lat="-33.8911183" lon="151.1958773">

We're going to save the ID as a value for a new field called openStreetMapId. We need to do something a bit more complicated with the latitude and longitude, since we need to store them as GeoJSON, which looks something like:

{ 'location' : { 'coordinates': [<longitude>, <latitude>],
                 'type'       : 'Point' } }

In lines 12-14 of the script you can see that we create a Map that looks like the GeoJSON, pulling the lat and lon attributes into the appropriate places.
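As a rough illustration of that step, the sketch below builds the same shape of Map from a single node element. The variable names and the choice to convert the coordinates with toDouble() are my assumptions here, not necessarily what the demo script does; the import again assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x the class lives in groovy.util

def node = new XmlSlurper().parseText(
        '<node id="18464077" lat="-33.8911183" lon="151.1958773"/>')

// Nested Map literals mirror the nested document we want in MongoDB.
// Note the GeoJSON order: longitude first, then latitude.
def coffeeShop = [
    openStreetMapId: node.@id.text(),
    location       : [
        coordinates: [node.@lon.toDouble(), node.@lat.toDouble()],
        type       : 'Point'
    ]
]

println coffeeShop
```

Getting the coordinate order the wrong way round is an easy mistake to make, and it only shows up later when the geo queries return odd results.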

Insert Remaining Fields

Validate Field Name

Now for every tag element in the XML, we get the k attribute and check if it's a valid field name for MongoDB (it won't let us insert fields with a dot in, and we don't want to override our carefully constructed location field). If so we simply add this key as the field and its matching v attribute as the value into the map. This effectively copies the OpenStreetMap key/value data into key/value pairs in the MongoDB document so we don't lose any data, but we also don't do anything particularly interesting to transform it.
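Here is one way that check-and-copy step could look. The isValidFieldName closure is a hypothetical helper written for this sketch, and the sample tags (including the addr.street and location keys) are invented to exercise both rejection rules.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x the class lives in groovy.util

// Hypothetical helper: no dots, and never overwrite our GeoJSON field
def isValidFieldName = { String name -> !name.contains('.') && name != 'location' }

def node = new XmlSlurper().parseText('''<node id="1" lat="0.0" lon="0.0">
  <tag k="name" v="Example Cafe"/>
  <tag k="addr.street" v="Main Street"/>
  <tag k="location" v="upstairs"/>
  <tag k="wheelchair" v="limited"/>
</node>''')

def coffeeShop = [openStreetMapId: node.@id.text()]

node.tag.each { tag ->
    def fieldName = tag.@k.text()
    if (isValidFieldName(fieldName)) {
        // Copy the OpenStreetMap key/value pair straight into the document
        coffeeShop[fieldName] = tag.@v.text()
    }
}

println coffeeShop
```

Tags that fail the check are simply dropped in this sketch; renaming them (for example, swapping the dot for an underscore) would be an alternative if you wanted to keep that data.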

Save Into MongoDB

Finally, once we've created a simple coffeeShop Map representing the document we want to save into MongoDB, we insert it into MongoDB if the map has a field called name. We could have checked this when we were reading the XML and putting it into the map, but it's actually much easier just to use the pretty Groovy syntax to check for a key called name in coffeeShop.
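A small sketch of that check, using two made-up maps rather than real data. Property-style access on a Groovy Map (coffeeShop.name) looks the key up and returns null when it is missing, which is what makes the test so short.

```groovy
// Two invented documents: one with a name, one without
def withName    = [openStreetMapId: '1', name: 'Example Cafe']
def withoutName = [openStreetMapId: '2']

// Keep only the maps that actually have a name field
def toInsert = [withName, withoutName].findAll { coffeeShop -> coffeeShop.name != null }

println toInsert
```

Writing the condition as just coffeeShop.name would also skip shops whose name is an empty string, since Groovy treats empty strings as false; whether that is desirable depends on the data.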

When we want to insert the Map we need to turn this into a BasicDBObject, the Java Driver's document type, but this is easily done by calling the constructor that takes a Map. Alternatively, there's a Groovy syntax which would effectively do the same thing, which you might prefer:

collection.insert(coffeeShop as BasicDBObject)

Tell MongoDB that we want to perform Geo queries on this data

Add Geo Index

Because we're going to do a nearSphere query on this data, we need to add a "2dsphere" index on our location field. We created the location field as GeoJSON, so all we need to do is call createIndex for this field.


So that's it! Groovy is a nice tool for this sort of script-y thing - not only is it a scripting language, but its built-in support for XML, really nice Map syntax and support for closures make it the perfect tool for iterating over XML data and transforming it into something that can be inserted into a MongoDB collection.

What could possibly go wrong? (GOTO Chicago)

At GOTO Chicago, I was given the chance to chat a bit about the presentation I was giving, which happens to be the same one I’m giving at a number of conferences this year (although of course I’m evolving it as I go along).

The presentation leaves very little time for anything other than coding, as it’s quite challenging to create a full app in 50 minutes, so it was great to have the chance to talk about the motivations for the demo.

The video of the actual talk is also available now.

At the beginning it doesn’t clearly show the screen, but it does improve. You can see an earlier version from the Joy of Coding as well, so if something’s not clear on one of the videos, hopefully it’s better in the other.

The code for the Chicago version is on GitHub, and if you look through the history you can see how it builds up, the same as it does in the demo.

QCon London 2014

Wow. My 4th QCon London. That’s not bad. And every time, it’s a different experience (if you must, see my blogs for 2013, 2012, and even 2007 (part 1 & part 2 - how cute was I? "agile seems like a jolly good idea; automated testing appears to be important")).

I can’t even tell you what I did on the first day, I was mostly panicking about my presentation - I was inspired after my trip to New York last month to change my talk at the last (responsible?) minute and do a live coding session, something much more technical than my recent talks. I’ll leave the details for a separate blog post though, when the video comes out.

The thing that stands out for me from Wednesday though was Damian Conway programming Conway's Game of Life in Klingon. Yeah. Just find the video and watch it, the man is a genius.

Damian Conway, Life, The Universe, and Everything

The Thursday keynote was inspiring too from a totally different point of view - Tim Lister of Peopleware fame shared stories from his career, and I came away from that really happy I work as a technologist, but with an increased desire to learn off other amazing people.

Tim Lister @QCon London 2014

Not Only Java track - I’m on the programme committee for QCon, and this year we wanted to cover leading edge technologies (as always) but we didn’t want to slice things into strict technology silos (interruption: argh! the person in front of me nearly destroyed my laptop by suddenly moving their chair back! Why do people bother in economy on a morning flight?). So I wanted the Java track to be more representative of what today’s Java programmers care about - for the programmers of course, but also because I know there are architects and team leads at QCon who might not realise how things have moved on with the language, and how much polyglot programming we do these days.

Martin kicked it off with a great history lesson on the progress (or occasionally, lack of it) in Java. He begged us to study and understand Set Theory, to use async design, to think of the users of our APIs, and, most of all, to design nice, clean code.

Next up, Eva took us through the fundamentals of Garbage Collection - this might not seem like a cutting edge subject these days, but it’s one of the most misunderstood subjects for Java programmers. Eva gave us a really great, understandable view of the different types of garbage collectors, how they work, and their pros and cons. She left us with a call to arms to not simply let other people try and solve this problem, but to get stuck in and contribute ourselves, via the OpenJDK.

Trisha Gee @QCon London 2014

After lunch was my nerve-wracking live coding session, putting together a full stack end-to-end web app using AngularJS, HTML5, Java and MongoDB. It only went wrong twice, and people seemed to like it. I’ll post the video in another blog post as soon as it's publicly available. Code is available on GitHub.

We’ve been playing with the open spaces idea at QCon. The Java one only had a few people in it, but that gave everyone a chance to speak at least. We covered Java 8 (the Good and the Bad); other JVM languages; and UIs for Java (JavaScript or GWT?). And I plugged the work the LJC does in London, of course.

After this Bodil blew me away creating a My Little Pony game using RX in the browser. ‘Nuff said.

Finally, Simon Ritter gave us a view of the Java 8 features most likely to impact the way Java developers think about software design - lambdas and streams. I thought this was a really great introduction to the concepts if you haven’t seen them before, and with concrete examples that showed how we should be using them. If you're not already looking at lambdas and streams, you should be - even if you're not going to be using Java 8 yet, it's worth getting a heads up on how it's going to impact our programming style.

I’m very pleased with the way the Java track turned out on the day - every speaker was first class, a wide range of topics was covered, and I, for one, learnt something in every presentation.

To finish off the day, Emma Langman gave an awesome keynote about how people are the messy bit of your system, and how they’re never rational and you shouldn’t expect them to be. I also highly recommend this talk, especially if you’re a techy and you’ve found yourself in some sort of management or team lead position.

Sadly I couldn't stay for day three of the conference, I had to fly off to the Joy of Coding to re-give the live coding presentation there. Because, if you’re going to do something terrifying and doomed to failure, you might as well do it twice.

QCon is an expensive conference, especially compared to the developer-friendly prices of something like DevoxxUK, but for getting a big picture of where the industry is, of things you might be missing, for learning hard core technical skills and understanding the importance of the fluffy-people-stuff, and finally for meeting a wide range of people from developers to CTOs, I think you'd be hard pushed to find something better in London. IMHO (and remember, I did disclose I'm on the programme committee).

And although it's really hard work putting together the programme for a track like this, and although both times I've said I'm Never Doing It Again, when it goes this well it makes you want to do it all over again. After a break. A looong break.

In my day…

Web development has changed a lot.

I was aware that there have been many changes in the last few years, and I’ve seen maturity come to web platforms in the form of standardisation and common reusable libraries and frameworks - and I don’t mean reusable in the way we used to “reuse” stuff by nicking it off other people’s websites when we saw something cool.

I used to be a web developer. Sort of. Sometimes I’ve been on the bleeding edge, and others... I remember using JavaScript to call back-end services with an XML payload before people were using the term AJAX, but I also remember working on an enterprise um... “classic”... JSP application only “recently” - in fact that was probably the last job where I did anything that looked like web development.

So this blog post is going to chart the progress of web development through my own experience. Of course, this doesn’t by any means cover the whole spectrum, but I think my experience has been not unusual for a Java programmer working through the noughties.

Over the course of my career I moved further away from the UI, because certainly early on the money and status was in “back end”, whatever that means, and not “front end”. Which is ridiculous, really, especially as back then you couldn’t really follow best practices and clean code and test first and all that awesome stuff when doing front end development because none of the browsers played by the rules and frankly if you got it working at all you were a bloody genius. And that’s not even considering the fact that as a “front end” developer you should be thinking about actual real human beings who use your product, and actual real human beings are messy things and understanding them is not (we’re told) traditionally a domain that we developers are naturally proficient in.

Anyway, I digress. This was supposed to be a history lesson. Or a nostalgia trip. Or possibly Ranty Trish waving her walking stick in the air and shouting “You kids don’t know how good you’ve got it these days”. If nothing else, I hope that it makes other “back end” developers like myself appreciate how much things have moved on.

Let’s go back to the olden days, before I’d even graduated: picture a time before smart phones - before phones were even common (I was horribly mocked at university for being poncy enough to have a mobile), before we knew if all this work we were doing to combat the millennium bug was going to stop the end of the world. I was doing my first summer internship at Ford, and a contractor from Logica (who don't seem to exist any more??) told me that if I was messing around with web pages and HTML (my friends and I had geocities-and-equivalent sites) I should look at this JavaScript thing to make my pages “dynamic”. I didn’t have to just use GIFs to bring my page to life, I could move stuff around on the page. I think I wrote a “you are in a crowded room”-type adventure game, because my background was BASIC and that’s what you do.

Actually I haven’t even mentioned that we were creating these websites to stay in touch with each other. We’d discovered guest books, and used them to write comments and share stories since we’d all moved out of our home town to go to different universities. Man, why didn’t I invent Facebook back then? That’s what we needed.


A year later, I was back at Ford doing my sandwich year-in-industry. The first project I worked on during this time was a web-based reporting tool that needed to dynamically display hierarchical data. We chose JavaScript trees to render this data - my year of messing around with my website paid off, and I was able to use my “cutting edge” JavaScript skills in a real production environment. Yay? The back end was CGI - I think I was writing in Perl, but don’t tell anyone that. I was learning Java at university, but this was a new language and I don’t think Ford was using it yet.

The next project was a very ambitious one - be the first car manufacturer to sell new cars on the web. Ford was well ahead of their time - the millennium bug had not killed us all, but people were barely buying books online, never mind spending tens of thousands of pounds on a car they’d never driven. But it wasn’t just ahead of its time from a business point of view, technically it was very advanced too - we used lots of “DHTML” (as we were now calling it), a new-fangled technology called ASP, and we were writing modular, reusable COMponents. We used XSLT to parse the XML from the COM objects, and the ASP figured out whether you were Netscape or Internet Explorer (Firefox wasn’t even a gleam in the inventor’s eye, and forget Chrome, I think we were using AltaVista (whaaaat? AltaVista got bought by Yahoo??) not some new-fangled search engine beginning with G) so it could use the right XSLT to turn the XML into HTML that was readable by the browser you were using. My job was to get the DHTML pages rendering and animating correctly in both IE4 and Netscape 4. That was a lot of fun for me, but also very challenging. And imagine my shock when a few months later I tested the site from the university UNIX machines to find that Netscape rendered it completely differently under UNIX. I learnt a lesson about how important it was to test on different platforms.

We had some smart Microsoft people helping us out with this project, and, because it was 2000 and the dot com crash hadn’t happened just yet, we also had a lot of young, overpaid, overconfident contractors who believed anything was possible. I learnt a lot during this time, not just about the technology, but also about different approaches to shaping your IT career. And about how much you could earn before you were 25. I was definitely going to be a programmer when I left university the next year.

Yeah, so... I graduated in 2001. If you were around then, you’ll remember that getting a job was a bit more difficult than I had anticipated, especially as these young, overpaid contractors were now desperately grabbing anything they could find. But that’s a story for another day.

I didn’t go back to Ford straight away, I’d “been there and done that”. I worked on the website for Common Purpose. On the first day, they sat me down with a book on JSP and Servlets, and that was my reading material for the next few weeks. If I’d been fresh out of university where we’d been doing Applets, and where I’d written a Swing app on the side for my Dad’s school, this would have been a big mindset change for me. But having worked on the ASPs it wasn’t such a big shift. I did, however, like how JSPs and servlets made the separation between the view and all-of-the-other-logic-stuff a bit clearer - back in ASP-land we’d settled on a convention of dealing with the form data from the previous page in the first part of the ASP, and rendering the new page in the second part. To this day I still don’t know what we should have been doing instead. But in JSP-land it only took me... I dunno, about 6 months I think, to get the website up and running. The most difficult section was registrations. And yes, I was a graduate, and yes, I was new, but that was a good turnaround for a web application “in those days”.

In my spare time I used what I’d learnt on the blews website. I even had a section where people could log in and comment on photos - we had whole conversations on this website. It was a way for me and my friends to stay in touch. If I’d cracked the photo-uploading instead of it being a manual process for me, I would have invented Facebook. If only I’d known....

The work dried up and there was nothing else for a graduate in the early noughties, so I went back to Ford. My first role back I picked the same technologies we’d been using before - XML, XSLT, only this time we were using JSPs instead of ASP. Our project had a very tight budget and we’d worked out that using open source Java technologies and running the application on one of the many UNIX machines lying around the place was a lot cheaper than the Microsoft solution. I think we were the first team in Ford Europe to pick Java at a time when the recommended approach was Microsoft. We delivered on time and under budget, and Java was the way forward for the department from then on. But on this project I met a guy who would impact my career probably more than he even realises, a guy I’d work with again later. He told me that in Java we no longer used Vector by default, but ArrayList (whaaat? What’s an ArrayList? I had no idea what the differences were between Java 1.1, which we’d learnt at university, and Java 1.2, which was now standard). And questioned my choice of XML/XSL. Although I’d been learning new technologies and growing, he was the one who made it clear to me that I needed to keep myself ahead of the curve with the technologies I was using, or planned to use, if I wanted to stay relevant and make my life easier.

On the next project I worked with a genius guy who was definitely keeping ahead of the curve - he was using JavaScript to send small XML payloads to the server (which was coded in Java), and rendering the response in place on the page instead of reloading the whole thing. Mind. Blown. I didn’t even hear the term Ajax until a year or more later. We were fortunate in that this was once again an internal application, so we controlled the browser. This was back in the days when you wanted your users to be on IE5, as this was the only browser that supported this functionality.

The next few projects/jobs I worked on were all more pedestrian variations on the JSP theme - first I learnt Struts, which at least made us realise there was a model, a view, and a controller. Then at Touch Clarity I learnt about Spring MVC, which actually put the validation errors next to the boxes which caused the error - by default, without you having to mess around. Spring was a revelation too, a framework that really tried not to get in your way. It was also frustrating because you needed to understand its lifecycle, but it did so much heavy lifting for you, it sped up standard CRUD-app web development enormously.

A couple of years passed, during which time I was still working on a web application (for an investment bank) but I can’t for the life of me remember what technologies we used (other than Java). I know it was hard to test and I know the tricky stuff was “back end” not “front end”.

In the next project where I had any control of the technology, I picked Spring since I’d had such a good experience previously. It took 4 developers a couple of months or so to develop an admin application for a trading app. Given the previous timescales I’d worked with, this seemed pretty good. Until, a few months later, two other guys on the project produced an admin app for our bank users in a matter of weeks. I can’t remember what they used, maybe Grails? But it was another demonstration of how I really should have been researching the field instead of simply sticking with what I knew, especially when I knew my knowledge was a couple of years out of date.

Fast forward to LMAX, and we were using GWT, pre-2.0 - I think this probably feels natural if you’ve been a Swing or AWT developer, but I’m still not convinced it’s a sound web platform (although I know it has improved). It was great because cross-browser was no longer an issue, but it was bad because it separates you from the underlying HTML, which means you can seriously mess up without realising. It’s also hard to use CSS correctly when you don’t have access to all the HTML components.

So we come to more-or-less the present day, as it should be fairly obvious that during the time I’ve been working on the MongoDB Java Driver I haven’t done a lot of GUI development. I’m lucky because attending lots of conferences means I see a lot more of the current-trending technologies, but up until a couple of weeks ago I hadn’t had a chance to play with any of them.

So now I’ve been trying Angular.js, Bootstrap, and UI Bootstrap. My goodness. It’s a whole 'nother world. I’m seeing at conferences and user groups that developers are increasingly polyglot, so maybe there’s no such thing as “just” a Java developer any more, but if you are “just” a Java developer, I think it could be... interesting... to get your head around some of the techniques. Since we don’t have closures, our callbacks are ugly and we tend not to program that way. Async is not something that comes naturally in a Java environment, I believe, although after working that way at LMAX I’m personally sold on it. Old-world JavaScript developers like I am/was might also find it hard to understand you can have clean, testable JavaScript code which Just Works. It didn’t even occur to me to worry about browser compatibility, and my app not only worked on my phone as well as my laptop, but looked really phone-ish and awesome with very minimal effort.

I’m currently on a plane on the way to QCon London where I’m going to demo this Brave New World of web development (together with a nice Java back end to prove how awesome Java is to work with and, of course, a MongoDB database). So it is not my intention in this post to explore what this new world looks like. But I have seen the Present, and it’s a lot better than the Past. Kids These Days don’t know how good they’ve got it - they’ve never had to struggle, to fight the browser, to hand-craft their JavaScript like we have, or had to work with raw, low-level JSPs and Servlets.

Now things are easier. There are standards, there are libraries, there are best practices and YouTube videos showing you how to create apps in 60 minutes (back in My Day I had to borrow someone else’s browser to use the Internet, and I debated for years the value of spending my own actual money on a JavaScript actual paper actual book, which I could not afford). Now, you can get something quite pretty and functionally interesting, working in a lot less time than I realised. But that doesn’t mean the Kids These Days have it easier - it means there is so much more potential. Instead of beating your head against trying to get a specific version of IE to do what you want, instead of having to write separate pages for different browsers (although maybe that still goes on), you can be exploring so much further into the possible, try things that no-one else has done yet. It opens up so many interesting possibilities for apps on all platforms.

Exciting times.

So next time someone asks me “What is the de facto front-end framework for Java?” I’m going to say HTML5, CSS and JavaScript.