Not all NoSQL are created equal


The term “NoSQL” has been enjoying a lot of attention lately. Basically, “NoSQL” is an unfortunate catch-all phrase used to describe a large number of new database technologies that are gaining in popularity. But not all NoSQL technologies are the same. The main characteristic they all share is that they are not based on relational algebra and do not use SQL as a query language. But that’s about it. If you write an application and use the filesystem as a persistence mechanism, you’re “NoSQL”. If you use a graph database, you’re “NoSQL”. If you use CouchDB, Cassandra, or MongoDB, you’re “NoSQL”. So “NoSQL” is a bit of a no-meaning phrase, or at least a phrase of little meaning. To meaningfully understand the new set of database technologies making their way onto the scene, we need to move past the “NoSQL” phrase and look at these technologies a little more deeply. When should you use these technologies, if ever? Are any of these technologies a real replacement for SQL?

Why NoSQL?

But first, a quick point on why there has been so much development of non-relational database technologies in the last few years. Basically, the first decade of modern web application programming taught us a lot of lessons, and one of them was that most programmers have a need/hate relationship with relational database systems. Why? Traditionally there are two main reasons:

  • The classic object / relational “impedance mismatch”
  • Scalability limits of the traditional RDBMS design

Why not NoSQL?

Most NoSQL technologies give something up to gain the properties they have. There is no universal reason not to use a particular NoSQL technology, but each has trade-offs. For instance, most NoSQL databases relax ACID guarantees in some way, and most do not support complex transactions like SQL databases do. Many scale by relying on “eventual consistency” semantics within a database cluster. No NoSQL solution that I know of lets you join two data sets (conceptually, a filtered cartesian product of the two). It is important to note that there are many use cases where a traditional RDBMS is exactly the right tool and the alternative NoSQL solutions would not be a fit. But you need to know you are in one of those cases, and be explicit about your choice. If you need joins (and I mean really need joins, not just following pointers) or need multi-operation transactions (insert, insert, delete, update, … whoops, roll all that back…) then you probably want a standard RDBMS.
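To make the “whoops, roll all that back” case concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a standard RDBMS. The accounts table, the names in it, and the simulated failure are all made up for illustration.

```python
import sqlite3


def transfer_with_rollback():
    """Run a multi-operation transaction that fails halfway through."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
    conn.commit()

    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if the block raises.
        with conn:
            conn.execute("INSERT INTO accounts VALUES ('bob', 0)")
            conn.execute(
                "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'"
            )
            raise RuntimeError("whoops")  # simulated failure mid-transaction
    except RuntimeError:
        pass

    rows = conn.execute(
        "SELECT name, balance FROM accounts ORDER BY name"
    ).fetchall()
    conn.close()
    return rows


rows = transfer_with_rollback()
```

After the failure, neither the insert of 'bob' nor the update to 'alice' is visible: the table is back to its state before the transaction started. This all-or-nothing behavior across several statements is what most of the NoSQL systems discussed here give up.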

That said, most modern applications do not need a relational database.

An oversimplified (but mostly accurate) lay of the land

Basically I would partition the NoSQL landscape into three buckets:

The key-value stores

Key-value storage models can be thought of conceptually as enormous hash-maps or key/value mapping tables. Although they differ in some ways, BigTable, Cassandra, …, can all be thought of as belonging to this category. For the most part the values stored in these systems are somewhat opaque, often just blobs of binary data.
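As a rough illustration of the “enormous hash-map” mental model (and only the mental model, not how any real system is implemented), here is a toy sketch in Python. The KeyValueStore class and the "user:1234" key format are hypothetical.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store: keys map to opaque byte blobs."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


store = KeyValueStore()

# The application decides how to serialize; the store only sees bytes.
store.put("user:1234", json.dumps({"email": "foo@bar.com"}).encode("utf-8"))

blob = store.get("user:1234")            # an opaque blob as far as the store knows
user = json.loads(blob.decode("utf-8"))  # only the application can interpret it
```

The store can hand back the blob for a key, but it has no idea that there is an email field inside it.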

The document databases

When I first heard the phrase “document database” I imagined a database for storing .pdf and .xls files. You know, a database for storing “documents”. That is not what a document database is. A “document” in this context is actually more like an object or data structure, and in the case of the two main document database players, CouchDB and MongoDB, documents are JSON objects.

Behold the blog post:

{
    id: 1234,
    author: { name: "Bob Jones", email: "b@b.com" },
    post: "In these troubled times I like to …",
    date: { $date: "2010-07-12 13:23UTC" },
    location: [ -121.2322, 42.1223222 ],
    rating: 2.2,
    comments: [
       { user: "jgs32@hotmail.com",
         upVotes: 22,
         downVotes: 14,
         text: "Great point! I agree" },
       { user: "holly.davidson@gmail.com",
         upVotes: 421,
         downVotes: 22,
         text: "You are a moron" }
    ],
    tags: [ "politics", "Virginia" ]
}

From an expressiveness standpoint document databases are clearly higher-level than flat key-value stores. The recursive structure of JSON is more general than the key-value model and can represent anything a key-value model can represent and more.

The rest

The rest of the pack consists of more esoteric, less generically applicable technologies. Graph databases like neo4j and RAM caches like Redis come to mind. Many of these are fantastic technologies, but they are more specialized in their purpose and less likely to emerge as a general-purpose replacement for your SQL database.

Document Databases vs. Key-value stores

Key-value systems like Cassandra are fantastic distributed content stores and scale really well, but as a wise colleague of mine observed, they are technologies you might build a database on top of; they are not general-purpose databases themselves.

Key-value systems like Cassandra, which store opaque byte arrays as values, don’t know enough about the data stored therein to be able to provide much of a query language. You can index the blobs and retrieve them via those indexes, but that is much less expressive than what most people need from a general-purpose database. Document databases, by contrast, have a rich set of value types that can be stored (numbers, strings, dates, arrays, nested objects, references, …), and the database system has an understanding of the structured data elements stored inside.
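To illustrate what “understanding the structured data elements” buys you, here is a small sketch in plain Python. It is not any real database’s API: get_path and find are hypothetical helpers, and the dotted-path criteria format is an assumption made for the example.

```python
from typing import Any, Dict, List


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'address.city' into a nested document."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs: List[Dict[str, Any]], criteria: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the documents whose fields equal every value in criteria."""
    return [
        doc for doc in docs
        if all(get_path(doc, path) == value for path, value in criteria.items())
    ]


users = [
    {"id": 1234, "email": "foo@bar.com",
     "address": {"city": "San Francisco", "state": "CA"}},
    {"id": 5678, "email": "baz@bar.com",
     "address": {"city": "Portland", "state": "OR"}},
]

# Query on a nested field; this only works because the structure is visible.
matches = find(users, {"address.city": "San Francisco"})
```

A store that only sees byte arrays cannot answer this kind of question without the application decoding every blob itself.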

Modeling data in document databases is also a lot easier to think about. JSON is a very natural way to build data abstractions since it is so similar to the object-oriented structures programmers build every day in Java, Python, Ruby, C#, etc. By contrast, I find that most straight key-value or columnar systems make you contort the way you think about your data in a way that is unnatural.

Here is what I think is one of the best articles out there on describing basic Cassandra data modeling concepts:

http://schabby.de/cassandra-getting-started/

Talk about overcomplicating something (click the link to see what I mean). Do NOT use this kind of kit for everyday stuff unless you like pain.

I would much rather model a user like this:

{
   id : 1234,
   email : "foo@bar.com",
   address : {
      street : "123 ABC Way",
      city : "San Francisco",
      state : "CA",
      zip : 94114
   }
}

and do so without sacrificing horizontal scalability (i.e. by using CouchDB or MongoDB).

CouchDB vs. MongoDB

CouchDB has some cool niche use cases in which it shines. It can be embedded into mobile devices, and has done a really good job with master-master replication. But for general-purpose use as a “MySQL replacement”, MongoDB is really what you want.

The two main comparison points (although there are more):

  • The only programmatic interface to CouchDB is its REST API. MongoDB provides a driver in each client programming language that implements an efficient binary wire protocol. Result: Couch is MUCH slower than MongoDB (which blazes).
  • MongoDB has a general-purpose query language like you have with a traditional SQL database (tuned via indexes). With CouchDB you define “views” specified by map-reduce functions. If you want to query by ‘username’ you need to tell CouchDB ahead of time (a reader pointed out that you can create temporary views in CouchDB, which is more like an ad-hoc query, but you are still writing a mini-program to define a view, and the whole thing feels like an afterthought).
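The difference between a predefined view and an ad-hoc query can be sketched in plain Python. This is a conceptual illustration only, not CouchDB’s or MongoDB’s actual API: build_view, by_username, and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_view(docs, map_fn):
    """Run map_fn over every document ahead of time and index what it emits."""
    view = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            view[key].append(value)
    return dict(view)


def by_username(doc):
    """The 'mini-program' that defines the view: emit (username, id) pairs."""
    if "username" in doc:
        yield doc["username"], doc["id"]


docs = [
    {"id": 1, "username": "bob", "karma": 10},
    {"id": 2, "username": "holly", "karma": 42},
    {"id": 3, "username": "bob", "karma": 7},
]

# View style: the question "which docs belong to this username?" has to be
# anticipated, and the index is built before anyone asks it.
username_view = build_view(docs, by_username)
bob_ids = username_view.get("bob", [])

# Ad-hoc style: ask a question nobody planned for, with no view defined.
high_karma_ids = [doc["id"] for doc in docs if doc["karma"] > 20]
```

With the view, lookups by username are cheap, but a query on karma needs another view to be written first. The ad-hoc style lets you ask either question on the spot, with indexes as an optional tuning step.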

Those two differences right there create a gulf that puts MongoDB in a category apart from CouchDB as the most complete document DB solution out there right now.

So what should I use MongoDB for?

I really think that if MongoDB had existed ten years ago, it (or something very like it) would store the majority of the types of data backing traditional “web apps”, both consumer and enterprise. What is it great for?

  • Accounts / Users
  • Access control rules
  • CMS systems (i.e. content trees / graphs)
  • Web form data
  • Product catalogs
  • Blogs
  • System and application configuration
  • Session state
  • Logging

I’m sure there are other great uses.

Takeaway

There are many use cases where a traditional RDBMS is the right tool, but yours is probably not one of them. Unlike ten years ago, there is a myriad of alternative database technologies available, which presents both opportunity and confusion. Know what you are trying to do, and really think about whether you need a low-level key-value store (are you implementing Youtube.com?). For most cases, you want a JSON document store, and MongoDB is probably the best out there for what you are building.