NoSQL is a Stupid Name

So, I’ve finished my first full week in the new job and I’ve learnt lots of new stuff. Which is great, because that’s usually why you change jobs.

I’m learning a lot about these new-fangled NoSQL database thingies. The LMAX architecture was based on keeping everything in memory and reducing the waits for IO - messages were journalled to disk, and reads and writes to the MySQL database were off the critical path. Therefore doing anything radical to the storage side of the architecture was just not high on the list of priorities.

Everything I knew about NoSQL I learnt from the various conferences I’ve been going to in the last year, and even then that’s limited - without a business reason to pursue knowledge I know it’ll just leak out of my brain, so I avoid sessions with no immediate applicability to me.

Let’s summarise what I knew about NoSQL databases before last week:

  • They don’t use SQL. Who knew? 
  • There are different flavours.  There’s a graphy one and key-value things and… others…
  • They’re “scalable” (yes, yes, it’s web scale). 
  • Some/many/all(?) embrace the idea of eventual consistency 

I was suspicious of the hype surrounding NoSQL, partly because it’s associated with the meaningless marketing term “Big Data” and partly because I’m a cynic that sneers at things that get too popular. Here’s what I think when I hear the following terms:

  • Cloud - Fire your systems people and ditch your comms room!
  • Big Data - Parse Twitter in order to learn how to read your customers’ minds!
  • NoSQL - Stop paying Oracle!
  • Functional - We couldn’t get good enough at mainstream programming languages so we switched to something more difficult!

I don’t know if it’s healthy to be this cynical, but I’m too old to jump on every bandwagon that comes along.

Anyway. Back to the people who now pay my bills.

It’s unfortunate that the lack of SQL is the thing that captured the imagination, rather than the lack of tables and a relational structure. SQL was never (in my mind) a particularly evil thing; it’s a pretty good language for saying “I want this stuff from this place that fits these criteria”, and that’s something we’re going to have to do at some point whatever the technology.

It’s rather more important that it’s the structure of the data that’s different in NoSQL databases.
In a traditional relational database you have tables, and relationships between those tables are achieved with foreign keys. I’m starting to think of these as something kind of grid-shaped with links between them:

Series of database tables and their relationships.  Honest.
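
To make the grid-with-links picture concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The person and address tables, their columns and the sample rows are all made up for illustration; the point is only that the relationship lives in a foreign key and the question is asked in SQL.

```python
import sqlite3

# Hypothetical schema, purely for illustration: two tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE address ("
    " id INTEGER PRIMARY KEY,"
    " person_id INTEGER NOT NULL REFERENCES person(id),"
    " city TEXT NOT NULL)"
)
conn.execute("INSERT INTO person (id, name) VALUES (1, 'Alice')")
conn.execute("INSERT INTO address (person_id, city) VALUES (1, 'London')")

# "I want this stuff from this place that fits these criteria"
rows = conn.execute(
    "SELECT person.name, address.city FROM person"
    " JOIN address ON address.person_id = person.id"
    " WHERE address.city = ?",
    ("London",),
).fetchall()
print(rows)
```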

(Yes, I’m experimenting again. This time with my shiny new iPad, a stylus and Penultimate. It’s good for ad-hoc drawings, but lacks the precision of the graphics tablet and flexibility of GIMP).

At the very high level, it seems like there are four (ish) types of NoSQL databases:

  1. Column Family 
  2. Key/Value 
  3. Graph 
  4. Document 

Column Family
Column family databases feel to me, as a newbie to the field, similar to key/value, which I’ll come on to. I’ve mostly heard Cassandra used as an example of this type of NoSQL database. I guess the way I think of this, and of course I could be wrong/over-simplifying, is a unique key linked to a set of key/values:

Which I’m translating into groups of key/value pairs, with the ID as a sort of header:

Key/value pairs grouped by ID

You need the key in order to look up all the details about me. The way I hear it, it’s great for writing data, but it’s less flexible for ad-hoc queries.
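
As a rough sketch of that mental model (and only the mental model, since real products such as Cassandra are considerably richer), a column family can be pictured as a dictionary of dictionaries. The row keys, column names and helper functions below are hypothetical.

```python
# Toy in-memory picture of a column family: row key -> {column name: value}.
# All keys, columns and values are invented for illustration.
column_family = {
    "user:1": {"name": "Alice", "city": "London", "employer": "ExampleCorp"},
    "user:2": {"name": "Bob", "city": "Leeds"},  # rows need not share columns
}


def get_row(row_key):
    """Look up every column stored under one row key (the cheap path)."""
    return column_family.get(row_key, {})


def find_by_column(column, value):
    """Ad-hoc query without the key: this toy version has to scan every row."""
    return [key for key, cols in column_family.items() if cols.get(column) == value]


print(get_row("user:1"))
print(find_by_column("city", "Leeds"))
```

The lookup by key is a single dictionary access, while the query by column value walks the whole structure, which is one way to picture why ad-hoc queries are the less comfortable case.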

Key/Value
These types of NoSQL database (e.g. Riak) are pretty much as schema-less as you get - just dump key-value pairs into them. To be honest, the best description I found was on, so I’m not going to re-write that with my (at this point) limited understanding.

Never ending lists of key/values
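
A minimal sketch of the idea, assuming nothing about how Riak or any other product actually works: the store knows keys and treats values as opaque. The KeyValueStore class and the sample keys are hypothetical.

```python
class KeyValueStore:
    """Toy in-memory key/value store: values are opaque, lookup is by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


kv = KeyValueStore()
kv.put("session:42", b'{"user": "alice"}')  # the store never looks inside the value
kv.put("counter:hits", 17)                  # no schema: any value under any key
print(kv.get("counter:hits"))
```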

From what I’ve heard so far, both Key/Value and Column Family databases embrace eventual consistency. I don’t know how much of that is a function of their data model and how much is decided by the individual products. For some people eventual consistency is a deal-breaker, but in many cases it seems to me that it’s just a matter of getting your head around this and designing your application appropriately.
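
To illustrate what “getting your head around this” means, here is a deliberately simplified sketch of eventual consistency: a write lands on one replica and is copied to the others later, so a read from another replica can be stale in between. The classes and the explicit replicate() step are assumptions made for the example; real products handle replication, ordering and conflicts in their own ways.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: writes hit replica 0 first and reach the others later."""

    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def replicate(self):
        """Apply pending writes, in order, to every replica except the first."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


ec_store = EventuallyConsistentStore()
ec_store.write("balance", 100)
stale = ec_store.read("balance", replica_index=1)  # replica 1 has not caught up yet
ec_store.replicate()
fresh = ec_store.read("balance", replica_index=1)  # now it has
print(stale, fresh)
```

An application designed for this has to decide what it does when it sees the stale value: retry, read from the replica it wrote to, or simply tolerate it.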

Graph
I came across graph databases when I stumbled across Neo4j, chatting to some of the very smart guys there. A graph database lets you model your data as a series of nodes and relationships. And if I think about it, this is not a massive step from either relational models or object models. It doesn’t just apply well to the social networking domain (where it’s very easy to think in terms of users and their relationships); in actual fact lots of things we design could be modelled this way. Not having used it, I’m not sure just how much of a mental leap you need to take to start thinking that way, but it seems like it might be a good fit for many problems.

Graph of nodes with annotated relationships
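
A small sketch of nodes and annotated relationships, written in plain Python rather than against Neo4j or any real graph database API. The Graph class, node names and relationship labels are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes with properties, edges labelled with a relationship."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # node id -> [(relationship, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def related(self, source, relationship):
        """Follow every edge of one relationship type out of a node."""
        return [target for rel, target in self.edges[source] if rel == relationship]


g = Graph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_node("london", kind="city")
g.relate("alice", "KNOWS", "bob")
g.relate("alice", "LIVES_IN", "london")
g.relate("bob", "LIVES_IN", "london")

# People alice knows who live in the same city as she does
alice_cities = set(g.related("alice", "LIVES_IN"))
same_city = [
    friend
    for friend in g.related("alice", "KNOWS")
    if alice_cities & set(g.related(friend, "LIVES_IN"))
]
print(same_city)
```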

I’d be interested in what the architectural trade-offs in using this model are.

Document
Now MongoDB falls into category four, the document database. And as a NoSQL n00b, this is now the product and area I know most about, and am clearly going to be more excited about since 10gen are indoctrinating me in the MongoDB way.

Documents are a familiar structure for developers, especially if they’ve been working with JSON. So, a document might be:

To me, this looks like it maps onto my domain-shaped Object Model more easily than a relational database, which always needs some sort of O-R mapping (whether you do this with Hibernate or use Spring to do it yourself, you’re still mapping tables into objects and vice versa). What I like about the document format is the nested sub-documents for data that belongs together. In relational databases you often end up denormalising for performance anyway, so why not just accept that up front and have it as part of the thing you’re storing?

A document with sub-documents.    Think XML/JSON.
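
As an illustration of a document with nested sub-documents, here is a hypothetical person record written as a Python dictionary and printed as JSON. Every field name and value is invented for the example; it is not taken from any real schema.

```python
import json

# Hypothetical person document: the address and cars are nested inside it
# rather than living in separate tables.
person = {
    "_id": 1,
    "name": {"first": "Alice", "last": "Green"},
    "address": {"street": "1 Example Street", "city": "London"},
    "cars": [{"make": "ExampleMotors", "colour": "blue"}],
}

print(json.dumps(person, indent=2))

# The nested data is reached by walking the structure, with no join involved.
city = person["address"]["city"]
```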

This does have a cost, of course - nothing is without trade-offs. Every time you request this document, you get the whole lot. You can’t have the person without the address. So, you do need to understand the relationships (still) and whether you’re usually going to want to get all that data at the same time or whether you might want to make two separate calls.

Which brings me on to another thing which is familiar from relational days - foreign keys. A field in your document can be the ID of another document, so you can follow the links through and retrieve other documents associated with the starting one. Again, there are trade-offs here - each link you follow is a different request to the database. These database requests can be very quick, but if you wanted this data every time, you’d probably want it embedded in your first document to save the additional call. I guess it’s a latency vs throughput question really - a single query which returns a chunky document, or multiple queries that return smaller ones.

Documents can link to other documents.
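
The trade-off between linking and embedding can be sketched with two in-memory “collections” and a counter standing in for round trips to the database. The collections, the employer_id field and the find_by_id helper are hypothetical and are not a real driver API.

```python
# Toy collections: document id -> document. All names and fields are invented.
people = {1: {"_id": 1, "name": "Alice", "employer_id": 100}}
companies = {100: {"_id": 100, "name": "ExampleCorp", "city": "London"}}

request_log = []  # one entry per simulated round trip to the database


def find_by_id(collection_name, collection, doc_id):
    """Stand-in for a single request to the database."""
    request_log.append((collection_name, doc_id))
    return collection.get(doc_id)


# Linked version: following the reference costs a second request.
linked_person = find_by_id("people", people, 1)
employer = find_by_id("companies", companies, linked_person["employer_id"])
linked_requests = len(request_log)

# Embedded version: one request, but a chunkier document every time.
people_embedded = {
    1: {"_id": 1, "name": "Alice", "employer": {"name": "ExampleCorp", "city": "London"}}
}
request_log.clear()
embedded_person = find_by_id("people", people_embedded, 1)
embedded_requests = len(request_log)

print(linked_requests, embedded_requests)
```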

So schema design is still important in document databases even if you don’t have a relational schema. No new technology is an excuse to stop thinking about the problem you’re trying to solve and understanding the tradeoffs in design.

One of the advantages, it seems, of something like MongoDB over some of the key/value databases is the ability to write ad-hoc queries and to tune for those queries. The data is structured (it’s in a document) and it doesn’t have to be in the same structure every time - not every document relating to a person needs all the fields that another person might have. But you can still query for people who have blue cars or people who live in London, or people whose surnames begin with G. If you find yourself doing the same query a number of times, you can add indexes to MongoDB the same way you would a relational database.
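
Those three example queries can be sketched over a list of differently-shaped documents. This is plain Python, not the MongoDB query language or driver, and the documents, the find helper and the hand-rolled index are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical person documents; note that not every one has a "car" field.
people_docs = [
    {"name": {"first": "Alice", "last": "Green"}, "city": "London", "car": {"colour": "blue"}},
    {"name": {"first": "Bob", "last": "Hall"}, "city": "Leeds"},
    {"name": {"first": "Carol", "last": "Gray"}, "city": "London", "car": {"colour": "red"}},
]


def find(docs, predicate):
    """Ad-hoc query: scan every document and keep the ones that match."""
    return [doc for doc in docs if predicate(doc)]


blue_cars = find(people_docs, lambda d: d.get("car", {}).get("colour") == "blue")
londoners = find(people_docs, lambda d: d.get("city") == "London")
g_surnames = find(people_docs, lambda d: d["name"]["last"].startswith("G"))

# A toy "index" on city: built once, so repeated city lookups avoid the full scan.
city_index = defaultdict(list)
for doc in people_docs:
    city_index[doc.get("city")].append(doc)

print([d["name"]["first"] for d in city_index["London"]])
```

The missing car field on the second document is handled by the query rather than by the schema, which is the flexibility the paragraph above describes.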

Seems like I’m getting into more of the nitty-gritty MongoDB details, so I’ll stop there and leave that for another time.

In Summary
Classing a whole swathe of products as “NoSQL” is misleading and confusing.  The only thing they all share in common is that they are not traditional relational databases.  Other than that, some of them are as different from each other as they are from relational databases.  I haven’t even mentioned caching technologies - these products have functionality which overlaps with NoSQL databases as well.  But even then, the purposes are somewhat different, and not even mutually exclusive.

As with anything, it’s really important to understand the strengths and weaknesses of a technology, and the demands of your domain.  These different ways of organising data, and different products, are going to perform really well in certain circumstances, and pretty poorly when used in others.  Getting an understanding of what those strengths and weaknesses are is going to be important in making the correct product/architecture/design decisions.

None of this information is new; there’s a lot of material on the web about the different types of NoSQL databases. I’m writing it more for my own benefit than anything else, as my memory is notoriously shocking.  For more in-depth (and probably more accurate) reading there’s:

  • Martin Fowler’s NoSQL Distilled
  • …and his introduction to the subject
  • Tim Berglund (@tlberglund) did a great overview of three types at JAX London last week.  There’s a video of the same content (different conference) here.
  • appears to list all the products that fall under the massive umbrella, but isn’t the most usable of sites.
  • And yes, I used Wikipedia.  Which is probably where I went wrong…

JAX London 2012

Seemed like a quiet conference this year.  Not really sure why, maybe it was the layout of the massive (and extremely dark) main room; maybe it was the awkward L-shape of the communal space; or maybe this year people were more interested in listening to the (really very good) sessions rather than participating or meeting other people.  Whatever the reason, it felt quiet and almost low-key.

Performance seemed pretty high on the agenda, as you’d expect from a London conference, with a number of things on offer:

  • A great keynote from Kirk Pepperdine and Martijn Verburg, covering a massive range of things to care about when thinking about performance on the first night
  • A high-level talk about Java Performance from yours truly (which I may run again for the LJC if there’s interest, but it’s more likely to be a one-off)
  • A deep dive into writing lock-free code by Mike Barker
  • And a talk from Kirk exploring your GC logs.

It was great to see a number of LJC regulars presenting, especially as my own schedule has been so crazy I haven’t seen many of them for a long time.  So I missed sessions from Bruce, John, Sandro, Russell, James & Richard, but I heard good things about the sessions and was really pleased to chat to all of them.

The highlight of the conference for me though was Brian Goetz’s keynote and subsequent session on lambdas.  I’ve been looking into lambdas because I think they’re a really interesting addition to the language and I’ve heard a lot of noise about them.  What I thought was most interesting about Brian’s talks though was less the information on what they were and how to use them, and more the challenges that face language designers when they have a language which is used by 10 million developers and has been going for nearly 20 years.  Ouch.  It’s amazing they get anything done, let alone something like lambdas which the language was never designed to support.

In keeping with the new job, I went to a few sessions on the Big Data Con - frankly an unfortunate name I feel.  Brendan’s Mongo & JVM talk was useful, especially given that I might actually be presenting that at some point.  What I’d love to see though is a more interesting story around the Java driver.  It seems people believe the Java driver needs a little love.

The other interesting NoSQL talk was Tim Berglund’s NoSQL Smackdown, which was a really great way of highlighting that the NoSQL databases are not all solving the same types of problems.  The room was packed and the questions were intelligent, so it seems there’s still a lot of interest in this kind of introduction to the technology.

Lessons learned:

  1. Commuting through Victoria Station sucks.  I knew this last year but it’s just got worse.
  2. The iPad + stylus combo is not as precise as the graphics tablet, so I’m probably going back to that for illustrations.  But I’d still love to do free-drawing with the iPad on the projector at some point.
  3. Not everyone can follow the deep-dive tech talks, but they still prefer them to introductory talks, maybe because they feel like they’re learning something (well, that’s my opinion).

I took practically no photos because I kept forgetting I had my camera.  I think it’s the weird subterranean effect of the hotel basement.  Either that or I’ve turned into a conference zombie - not an unlikely suggestion.  And I’ve still got Devoxx round the corner…

And for my next trick….

The time has come, and I’m moving on from LMAX. I’ve had an incredible (nearly) four years working for one of the most radical finance firms in the world, during which time I feel I’ve learnt more than the rest of my work experience put together, and had the pleasure to work with some of the smartest and most interesting people I’ve ever met.

I’ve been invited to join 10gen and their MongoDB driver team, a challenge I am really looking forward to. After years in finance and in the IT departments of other organisations, I’m finally working for a product firm, and an open source one. I expect it will be very different from anything else I’ve been involved in.

I hope this means I will be blogging even more, and that I’ll have opportunities to abuse my graphics tablet producing more ridiculous scrawlings. I also hope this will give me an opportunity to meet more people as I travel around. So, as if this were a goodbye e-mail to the company or an out-of-office reply, I should finish with: any further enquiries about the Disruptor should be addressed to the Google Groups list - there are people on there waaay smarter than me anyway.

Lots of shiny new goodies