So, I’ve finished my first full week in the new job and I’ve learnt lots of new stuff. Which is great, because that’s usually why you change jobs.
I’m learning a lot about these new-fangled NoSQL database thingies. The LMAX architecture was based on keeping everything in memory and reducing the waits for IO - messages were journalled to disk, and reads and writes to the MySQL database were off the critical path. Therefore doing anything radical to the storage side of the architecture was just not high on the list of priorities.
Everything I knew about NoSQL I learnt from the various conferences I’ve been going to in the last year, and even then that’s limited - without a business reason to pursue knowledge I know it’ll just leak out of my brain, so I avoid sessions with no immediate applicability to me.
Let’s summarise what I knew about NoSQL databases before last week:
- They don’t use SQL. Who knew?
- There are different flavours. There’s a graphy one and key-value things and… others…
- They’re “scalable” (yes, yes, it’s web scale).
- Some/many/all(?) embrace the idea of eventual consistency
I was suspicious of the hype surrounding NoSQL, partly because it’s associated with the meaningless marketing term “Big Data” and partly because I’m a cynic who sneers at things that get too popular. Here’s what I think when I hear the following terms:
- Cloud - Fire your systems people and ditch your comms room!
- Big Data - Parse Twitter in order to learn how to read your customers’ minds!
- NoSQL - Stop paying Oracle!
- Functional - We couldn’t get good enough at mainstream programming languages so we switched to something more difficult!
I don’t know if it’s healthy to be this cynical, but I’m too old to jump on every bandwagon that comes along.
Anyway. Back to the people who now pay my bills.
It’s unfortunate that the lack of SQL is the thing that captured the imagination, rather than the lack of tables and a relational structure. SQL was never (in my mind) a particularly evil thing; it’s a pretty good language for saying “I want this stuff from this place that fits these criteria”, and that’s something we’re going to have to do at some point whatever the technology.
It’s rather more important that it’s the structure of the data that’s different in NoSQL databases.
In a traditional relational database you have tables, and relationships between those tables are achieved with foreign keys. I’m starting to think of these as something kind of grid-shaped with links between them:
|Series of database tables and their relationships. Honest.
(Yes, I’m experimenting again. This time with my shiny new iPad, a stylus and Penultimate. It’s good for ad-hoc drawings, but lacks the precision of the graphics tablet and flexibility of GIMP).
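For comparison with what follows, the grid-with-links picture can be sketched in a few lines using Python's built-in sqlite3 module. The table names, column names and data are all invented for illustration; this is just one minimal way to show two tables joined by a foreign key.

```python
import sqlite3

# In-memory database; every table, column and value here is made up.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE address ("
    " id INTEGER PRIMARY KEY,"
    " person_id INTEGER NOT NULL REFERENCES person(id),"
    " city TEXT NOT NULL)"
)
conn.execute("INSERT INTO person (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO address (person_id, city) VALUES (1, 'London')")

# The link between the two grids is followed with a join on the foreign key.
rows = conn.execute(
    "SELECT person.name, address.city"
    " FROM person JOIN address ON address.person_id = person.id"
).fetchall()
print(rows)
```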
At the very high level, it seems like there are four (ish) types of NoSQL databases:
- Column Family
- Key/Value
- Graph
- Document
Column family databases feel to me, as a newbie to the field, similar to key/value, which I’ll come on to. I’ve mostly heard Cassandra used as an example of this type of NoSQL database. I guess the way I think of this, and of course I could be wrong/over-simplifying, is a unique key linked to a set of key/values:
Which I’m translating into groups of key/value pairs, with the ID as a sort of header:
|Key/value pairs grouped by ID
You need the key in order to look up all the details about me. The way I hear it, it’s great for writing data, but it’s less flexible for ad-hoc queries.
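The "unique key linked to a set of key/values" idea might be sketched like this. Plain Python dicts stand in for the database, the row keys and columns are invented, and a real column family store such as Cassandra is of course far more than this toy; the sketch only shows why lookups by key are easy and ad-hoc queries are not.

```python
# A toy "column family": each row key maps to its own set of column name/value pairs.
people = {
    "person:1": {"first_name": "Ada", "city": "London", "car_colour": "blue"},
    "person:2": {"first_name": "Grace", "city": "New York"},  # rows need not share columns
}

def get_row(family, row_key):
    """Look up every column stored under one row key."""
    return family[row_key]

def get_column(family, row_key, column, default=None):
    """Look up a single column within a row."""
    return family[row_key].get(column, default)

def scan(family, column, value):
    """Without the row key, this toy model has to visit every row."""
    return [key for key, columns in family.items() if columns.get(column) == value]

print(get_row(people, "person:1"))
print(get_column(people, "person:2", "car_colour"))
print(scan(people, "city", "London"))
```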
Key/value databases (e.g. Riak) are pretty much as schema-less as you get - just dump key-value pairs into them. To be honest, the best description I found was on dba.stackexchange.com, so I’m not going to re-write that with my (at this point) limited understanding.
|Never ending lists of key/values
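As a rough sketch of what "just dump key-value pairs into them" means, here is a toy store built on a Python dict. The put and get helpers and the key names are hypothetical and are not the API of Riak or any other product; the point is only that the store treats values as opaque and the key is the sole way in.

```python
import json

# A dict stands in for the store: keys map to opaque bytes the store does not interpret.
store = {}

def put(key, value):
    """Serialise the value and file it under the key."""
    store[key] = json.dumps(value).encode("utf-8")

def get(key):
    """Fetch by key and deserialise the stored bytes."""
    return json.loads(store[key].decode("utf-8"))

# Values can be any shape; the store neither knows nor cares.
put("person:1", {"name": "Ada", "city": "London"})
put("session:abc", ["any", "shape", "of", "value"])

print(get("person:1")["city"])
```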
From what I’ve heard so far, both Key/Value and Column Family databases embrace eventual consistency. I don’t know how much of that is a function of their data model and how much is decided by the individual products. For some people eventual consistency is a deal-breaker, but in many cases it seems to me that it’s just a matter of getting your head around this and designing your application appropriately.
|Graph of nodes with annotated relationships
I’d be interested to know what the architectural trade-offs of using the graph model are.
Now MongoDB falls into category four, the document database. And as a NoSQL n00b, this is now the product and area I know most about, and am clearly going to be more excited about since 10gen are indoctrinating me in the MongoDB way.
Documents are a familiar structure for developers, especially if they’ve been working with JSON. So, a document might be:
To me, this looks like it maps onto my domain-shaped Object Model more easily than a relational database, which always needs some sort of O-R mapping (whether you do this with Hibernate or use Spring to do it yourself, you’re still mapping tables into objects and vice versa). What I like about the document format is the nested sub-documents for data that belongs together. In relational databases you often end up denormalising for performance anyway, so why not just accept that up front and have it as part of the thing you’re storing?
|A document with sub-documents. Think XML/JSON.
This does have a cost, of course - nothing is without trade-offs. By default, every time you request this document, you get the whole lot: the person comes with the address unless you explicitly ask for a subset of fields. So, you do need to understand the relationships (still) and whether you’re usually going to want to get all that data at the same time or whether you might want to make two separate calls.
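As a rough illustration of a person document with nested sub-documents, here is a sketch using a plain Python dict printed as JSON. The field names and values are invented, and nothing here talks to a real database.

```python
import json

# One document: the address and cars live inside the person rather than in other tables.
person = {
    "_id": 1,
    "name": {"first": "Ada", "last": "Green"},
    "address": {"street": "1 Example Road", "city": "London"},
    "cars": [{"make": "Mini", "colour": "blue"}],
}

# Holding the person means holding the address and cars too.
print(json.dumps(person, indent=2))
print(person["address"]["city"])
```

The shape mirrors an object graph fairly directly, which is the appeal described above.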
Which brings me on to another thing which is familiar from relational days - foreign keys. A field in your document can be the ID of another document, so you can follow the links through and retrieve other documents associated with the starting one. Again, there are trade-offs here - each link you follow is a different request to the database. These database requests can be very quick, but if you wanted this data every time, you’d probably want it embedded in your first document to save the additional call. I guess it’s a latency vs throughput question really - a single query which returns a chunky document, or multiple queries that return smaller ones.
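The embed-or-link trade-off can be sketched with a toy in-memory stand-in for the database. The ToyDatabase class, the collection names and the fields are all hypothetical; the only point is to count how many requests each design needs.

```python
class ToyDatabase:
    """Hypothetical stand-in for a document database that counts requests."""

    def __init__(self, collections):
        self.collections = collections
        self.requests = 0

    def find_by_id(self, collection, doc_id):
        self.requests += 1  # each lookup models one round trip to the database
        return self.collections[collection][doc_id]


# Design 1: the person document holds the ID of a separate employer document.
linked = ToyDatabase({
    "people": {1: {"_id": 1, "name": "Ada", "employer_id": 10}},
    "companies": {10: {"_id": 10, "name": "Example Ltd"}},
})
person = linked.find_by_id("people", 1)
employer = linked.find_by_id("companies", person["employer_id"])

# Design 2: the employer is embedded as a sub-document.
embedded = ToyDatabase({
    "people": {1: {"_id": 1, "name": "Ada", "employer": {"name": "Example Ltd"}}},
})
person_with_employer = embedded.find_by_id("people", 1)

print(linked.requests, embedded.requests)  # prints "2 1"
```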
|Documents can link to other documents.
So schema design is still important in document databases even if you don’t have a relational schema. No new technology is an excuse to stop thinking about the problem you’re trying to solve and understanding the tradeoffs in design.
One of the advantages, it seems, of something like MongoDB over some of the key/value databases is the ability to write ad-hoc queries and to tune for those queries. The data is structured (it’s in a document) and it doesn’t have to be in the same structure every time - not every document relating to a person needs all the fields that another person might have. But you can still query for people who have blue cars or people who live in London, or people whose surnames begin with G. If you find yourself doing the same query a number of times, you can add indexes to MongoDB the same way you would a relational database.
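To make the idea concrete, here is a small pure-Python sketch of ad-hoc queries over documents that don't all share the same fields, plus a crude index for a repeated query. The find, get_path and build_index helpers and the sample people are all hypothetical; this is not how MongoDB is implemented, and a real database index is a more sophisticated structure than this dict.

```python
from collections import defaultdict

# Documents about people; not every document has every field.
people = [
    {"name": {"first": "Ada", "last": "Green"}, "address": {"city": "London"}, "car": {"colour": "blue"}},
    {"name": {"first": "Bob", "last": "Gray"}, "address": {"city": "Leeds"}},
    {"name": {"first": "Cy", "last": "Hall"}, "car": {"colour": "blue"}},
]

def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; return None if any part is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def find(docs, path, predicate):
    """Ad-hoc query: scan every document and keep those whose field passes the predicate."""
    matches = []
    for doc in docs:
        value = get_path(doc, path)
        if value is not None and predicate(value):
            matches.append(doc)
    return matches

def build_index(docs, path):
    """Precompute value -> documents for one path, so repeated equality lookups skip the scan."""
    index = defaultdict(list)
    for doc in docs:
        value = get_path(doc, path)
        if value is not None:
            index[value].append(doc)
    return index

blue_cars = find(people, "car.colour", lambda colour: colour == "blue")
londoners = find(people, "address.city", lambda city: city == "London")
g_surnames = find(people, "name.last", lambda last: last.startswith("G"))
city_index = build_index(people, "address.city")

print([p["name"]["first"] for p in blue_cars])
print([p["name"]["first"] for p in city_index["London"]])
```

Documents without the queried field simply don't match, which is what lets differently shaped documents live side by side.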
Seems like I’m getting into more of the nitty-gritty MongoDB details, so I’ll stop there and leave that for another time.
Classing a whole swathe of products as “NoSQL” is misleading and confusing. The only thing they all have in common is that they are not traditional relational databases. Other than that, some of them are as different from each other as they are from relational databases. I haven’t even mentioned caching technologies - these products have functionality which overlaps with NoSQL databases as well. But even then, the purposes are somewhat different, and not even mutually exclusive.
As with anything, it’s really important to understand the strengths and weaknesses of a technology, and the demands of your domain. These different ways of organising data, and different products, are going to perform really well in certain circumstances, and pretty poorly when used in others. Getting an understanding of what those strengths and weaknesses are is going to be important in making the correct product/architecture/design decisions.
None of this information is new; there’s a lot of material on the web about the different types of NoSQL databases. I’m writing it more for my own benefit than anything else, as my memory is notoriously shocking. For more in-depth (and probably more accurate) reading there’s:
- Martin Fowler’s NoSQL Distilled
- …and his introduction to the subject
- Tim Berglund (@tlberglund) did a great overview of three types at JAX London last week. There’s a video of the same content (different conference) here.
- http://nosql-database.org/ appears to list all the products that fall under the massive umbrella, but isn’t the most usable of sites.
- And yes, I used Wikipedia. Which is probably where I went wrong…