16 January 2013
Nosql - this is a term I have been hearing for some years now, and I pretty much ignored all the fuss. I've spent quite a large part of the last decade becoming good at building applications around Oracle databases, and I assumed "nosql" was something people fell back on when they failed to scale MySQL or Oracle properly.
However, more recently I encountered a colleague who was also well versed in all things Oracle, but sings the praises of nosql databases too.
So I asked him to give me an overview to see what it was all about, and then spent some time reading about nosql databases.
One of the first things I learned was that the term nosql does not stand for No SQL (or at least it shouldn't). A better definition is Not Only SQL, and that is interesting. It suggests that maybe there is room in you organization (or application) for more than 1 database, and they don't all have to be queried by SQL. So for me nosql refers to databases that store data in a non-relational way.
In the relational database space, you have the open source options, MySQL and Postgres and then the commercial options, Oracle, SQL Server or Sybase. Arguably, each of these databases is as good as the other, and largely the same techniques can be used to design a successful application around any of them.
In the nosql space, there are quite a few options and some of them are good at different things.
Some are good at serving data out of memory. Redis is a good example, along with Memcached. Redis can persist the in memory dataset to disk, but it ever reads from it - you need to have enough memory to hold the entire dataset. Memcached on the other hand is just a memory store - perfect for caching, but if you shut the server down the data is gone forever.
Typically these key value stores are in memory hash tables - everything is accessed by a key, and there are no other indexes.
After the key value stores come the Big Data solutions. Hadoop is the first thing that comes to mind. The premise of Big Data databases is that it is much cheaper to store many terrabytes of data than it is with a relational database. At first I struggled to understand how this could be the case, surely disk is disk, but it turns out it isn't really.
For a relational database, data generally lives on SAN which is never expected to fail. This is expensive. In Hadoop, data lives on commodity disks in cheap commodity servers - these disks and servers are expected to fail, so the data is duplicated across many nodes of the cluster, typically 3. The drawback is that it often isn't very efficient to query small amounts of data in Hadoop in real time, so it doesn't lend itself to OLTP applications. It is good for analysing log files to generate reports and many other things.
There are other databases that have elements of big data, key value and relational. Cassandra, HBase, Riak and CouchBase may fall into this category. Designed to answer queries in real time to suit OLTP applications, but designed to handle node failures gracefully. Many of these claim to be eventually consistent, use consistent hashing and some are better at different parts of the CAP theorem than others. I still have a lot to learn here I think.
Neo4J is a graph database. It stores data in graphs instead of tables, and is good for applications that lend themselves to graphs - notice how vague my description is. There is more to learn here too.
Some places I have found useful in helping me to learn about nosql: