Not all NoSQL databases are created equal


The term “NoSQL” has been enjoying a lot of attention lately. Basically “NoSQL” is an unfortunate catch-all phrase used to describe a large number of new database technologies that are gaining in popularity. But not all NoSQL technologies are the same. The main characteristic they all share is that they are not based on relational algebra and do not use SQL as a query language. But that’s about it. If you write an application and use the filesystem as a persistence mechanism you’re “NoSQL”. If you use a graph database you’re “NoSQL”. If you use CouchDB, Cassandra, or MongoDB you’re “NoSQL”. So “NoSQL” is a bit of a no-meaning phrase, or at least a phrase of little meaning. In order to meaningfully understand the new set of database technologies making their way onto the scene, we need to move past the “NoSQL” phrase and look at these technologies a little more deeply. When should you use them, if ever? Are any of them a real replacement for SQL?

Why NoSQL?

But first, a quick point on why there has been so much development of non-relational database technologies in the last few years. Basically the first decade of modern web application programming taught us a lot of lessons, and one of them was that most programmers have a need/hate relationship with relational database systems. Why? Traditionally there are two main reasons:

  • The classic object / relational “impedance mismatch”
  • Scalability limits of the traditional RDBMS design

Why not NoSQL?

Most NoSQL technologies give something up to gain the properties they have. There is no universal reason not to use a particular NoSQL technology, but each has trade-offs. For instance, most NoSQL databases violate ACID in some way and most do not support complex transactions the way SQL databases do. Many scale by relying on “eventual consistency” semantics within a database cluster. No NoSQL solution that I know of lets you combine two data sets in a single query (i.e. joins). It is important to note that there are many use cases where a traditional RDBMS is exactly the right tool and an alternate NoSQL solution would not be a fit. But you need to know you are in one of those cases, and be explicit about your choice. If you need joins (and I mean really need joins, not just following pointers) or need multi-operation transactions (insert, insert, delete, update, … whoops, roll all that back…) then you probably want a standard RDBMS.
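
To make the multi-operation transaction point concrete, here is a minimal Java/JDBC sketch of the all-or-nothing write sequence an RDBMS gives you for free (the connection details and table names are hypothetical, for illustration only):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MultiOpTransaction {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details, for illustration only.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "admin", "admin");
        conn.setAutoCommit(false);
        try {
            Statement stmt = conn.createStatement();
            stmt.executeUpdate("INSERT INTO accounts (id, balance) VALUES (1, 100)");
            stmt.executeUpdate("UPDATE accounts SET balance = balance - 50 WHERE id = 2");
            stmt.executeUpdate("DELETE FROM audit_log WHERE account_id = 1");
            conn.commit();   // all of the operations take effect, or none do
        } catch (Exception e) {
            conn.rollback(); // whoops, roll all that back
            throw e;
        } finally {
            conn.close();
        }
    }
}

Most NoSQL stores give you atomicity only at the level of a single document or key, which is exactly the trade-off described above.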

That said, most modern applications do not need a relational database.

An oversimplified (but mostly accurate) lay of the land

Basically I would partition the NoSQL landscape into three buckets:

The key-value stores

Key-value storage models can be thought of conceptually as enormous hash-maps or key/value mapping tables. Although they differ in some ways, BigTable, Cassandra, …, can all be thought of as belonging to this category. For the most part the values stored in these systems are somewhat opaque, often just blobs of binary data.

The document databases

When I first heard the phrase “document database” I imagined a database for storing .pdf and .xls files. You know, a database for storing “documents”. That is not what a document database is. A “document” in this context is actually more like an object or data structure, and in the case of the two main document database players, CouchDB and MongoDB, documents are JSON objects.

Behold the blog post:

{
    id: 1234,
    author: { name: "Bob Jones", email: "b@b.com" },
    post: "In these troubled times I like to …",
    date: { $date: "2010-07-12 13:23UTC" },
    location: [ -121.2322, 42.1223222 ],
    rating: 2.2,
    comments: [
       { user: "jgs32@hotmail.com",
         upVotes: 22,
         downVotes: 14,
         text: "Great point! I agree" },
       { user: "holly.davidson@gmail.com",
         upVotes: 421,
         downVotes: 22,
         text: "You are a moron" }
    ],
    tags: [ "politics", "Virginia" ]
}

From an expressiveness standpoint document databases are clearly higher-level than flat key-value stores. The recursive structure of JSON is more general than the key-value model and can represent anything a key-value model can represent and more.

The rest

The rest of the pack consists of more esoteric, less generically applicable technologies. Graph databases like neo4j, and RAM caches like Redis, come to mind. Many of these are fantastic technologies, but they are more specialized in their purpose and less likely to emerge as a general-purpose replacement for your SQL database.

Document Databases vs. Key-value stores

Key-value systems like Cassandra are fantastic distributed content stores and scale really well, but as a wise colleague of mine observed, they are technologies you might build a database on top of, but they are not general purpose databases.

A key-value system like Cassandra, which stores opaque byte arrays as values, doesn’t know enough about the data stored therein to be able to provide much of a query language. You can index the blobs, and retrieve them via those indexes, but that is much less expressive than what most need from a general-purpose database. Document databases, by contrast, have a rich set of value types that can be stored (numbers, strings, dates, arrays, nested objects, references, …), and the database system has an understanding of the structured data elements stored inside.
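
As an illustration, here is a minimal sketch of a field-level query using the MongoDB Java driver (this assumes the 2.x-era driver, a local mongod, and a “posts” collection shaped like the blog post document above):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class QueryExample {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("blog");
        DBCollection posts = db.getCollection("posts");

        // The database understands nested fields and arrays directly --
        // no view definitions, no deserializing blobs on the client.
        DBObject query = new BasicDBObject("author.name", "Bob Jones")
                .append("tags", "politics");
        DBCursor cursor = posts.find(query);
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        mongo.close();
    }
}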

Modeling data in document databases is also a lot easier to think about. JSON is a very natural way to build data abstractions since it is so similar to the object-oriented structures programmers build every day in Java, Python, Ruby, C#, etc… By contrast, I find that most straight key-value or columnar systems make you contort the way you think about your data in a way that is unnatural.

Here is what I think is one of the best articles out there describing basic Cassandra data modeling concepts:

http://schabby.de/cassandra-getting-started/

Talk about overcomplicating something (click the link to see what I mean). Do NOT use this kind of kit for everyday stuff unless you like pain.

I would much rather model a user like this:

{
   id: 1234,
   email: "foo@bar.com",
   address: {
      street: "123 ABC Way",
      city: "San Francisco",
      state: "CA",
      zip: 94114
   }
}

and do so without sacrificing horizontal scalability (i.e. by using Couch or MongoDB).
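
For what it’s worth, persisting that user with the MongoDB Java driver is about as direct as the JSON suggests. A minimal sketch, again assuming the 2.x-era driver and a local mongod:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;

public class SaveUser {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DBCollection users = mongo.getDB("app").getCollection("users");

        // The nested structure maps one-to-one to the JSON above.
        BasicDBObject address = new BasicDBObject("street", "123 ABC Way")
                .append("city", "San Francisco")
                .append("state", "CA")
                .append("zip", 94114);
        BasicDBObject user = new BasicDBObject("id", 1234)
                .append("email", "foo@bar.com")
                .append("address", address);

        users.insert(user);
        mongo.close();
    }
}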

CouchDB vs. MongoDB

CouchDB has some cool niche use cases in which it shines. It can be embedded into mobile devices, and it has done a really good job with master-master replication. But for general-purpose use as a “MySQL replacement”, MongoDB is really what you want.

The two main comparison points (although there are more):

  • The only programmatic interface to CouchDB is its REST API. MongoDB provides a driver in each client programming language that implements an efficient binary wire protocol. Result: Couch is MUCH slower than MongoDB (which blazes).
  • MongoDB has a general-purpose query language like you have with a traditional SQL database (tuned via indexes). With CouchDB you define “views” specified by map-reduce functions. If you want to query by ‘username’ you need to tell CouchDB ahead of time. (A reader pointed out that you can create temporary views in CouchDB, which is more like an ad-hoc query, but you are still writing a mini-program to define a view, and the whole thing feels like an afterthought.)

Those two differences right there create a gulf that puts MongoDB in a category apart from CouchDB as the most complete document DB solution out there right now.

So what should I use MongoDB for?

I really think that if MongoDB had existed ten years ago, it (or something very like it) would store the majority of the types of data backing traditional “web apps”, both consumer and enterprise. What is it great for?

  • Accounts / Users
  • Access control rules
  • CMS systems (i.e. content trees / graphs)
  • Web form data
  • Product catalogs
  • Blogs
  • System and application configuration
  • Session state
  • Logging

I’m sure there are other great uses.

Takeaway

There are many use cases where a traditional RDBMS is the right tool, but yours is probably not one of them. Unlike ten years ago, there is now a myriad of alternate database technologies available, which presents both opportunity and confusion. Know what you are trying to do, and really think about whether you need a low-level key-value store (are you implementing Youtube.com?). For most cases you want a JSON document store, and MongoDB is probably the best one out there for what you are building.

The reasons behind the half-REST design pattern


Those of you plugged into the REST world have most likely seen at least one of Roy T. Fielding’s rants on “REST” implementations out in the wild (one of the best known being this one). While there are a number of reasons why an API might not technically fit the definition of REST, I have come to observe one incredibly commonplace “REST” design-pattern that is far and away the most common way in which APIs are only sort-of RESTful. I have also come to observe the reasons why (and they are fascinating… read on).

The vast majority of REST-like APIs I see follow a distinct pattern of being RESTful on reads (GETs) and not so RESTful with all other operations. Often you will see the following:

  • The API supports the HTTP GET method for lookups and searches supporting a variety of output formats such as XML, JSON, YAML, etc… Pretty RESTful so far.
  • The API utilizes the HTTP POST method for creating and updating resources using POST name/value pairs (vs. via data in the request body). Not so RESTful.
  • The API does not support the HTTP PUT method. Typical PUT operations are implemented with POST as described above. Not so RESTful.
  • The API may or may not support the HTTP DELETE method. When it does not, delete operations are usually implemented with a specific “delete” URL, perhaps with query parameters, and the HTTP GET method.

In essence, the API is written such that reads (lookups and searches) are done in a REST style but writes (inserts and updates) are done in more of an RPC style.

Instead of a POST or PUT with something like the following in the request body:

 <contact>
   <first-name>John</first-name>
   <last-name>Smith</last-name>
   .
   .
   .
</contact>

we more often see a POST (or sometimes even a GET (ouch!)) that looks something like this:


http://.../add-contact?first-name=John&last-name=Smith&...
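
For contrast, here is a minimal Java sketch of what the RESTful version of that write looks like, assuming a hypothetical contacts endpoint that accepts XML in the request body:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestfulPut {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; the contact id identifies the resource itself.
        URL url = new URL("http://api.example.com/contacts/1234");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/xml");
        conn.setDoOutput(true);

        String body = "<contact>"
                    + "<first-name>John</first-name>"
                    + "<last-name>Smith</last-name>"
                    + "</contact>";
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();

        System.out.println("Response: " + conn.getResponseCode());
    }
}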

So the big question is why. Why do such an overwhelming number of developers naturally drift towards API designs like the ones we just described? Are all those developers really lame? Probably not, and if we examine the problem more carefully we can see at least two very compelling reasons why:

  1. The current architecture of HTML forms greatly encourages an RPC-style API. Very often you would like to interact with an API via an HTML form. What’s easier: creating a form that submits a set of name/value pairs via POST (i.e. the standard way we use HTML forms), or writing Javascript code that bypasses the standard form submission behavior and instead packages all form contents into a JSON object (or XML document) and POST/PUTs it to the server? No brainer. The second path has too much friction, and is much more work. Someone really should create a Javascript library to do just this. Maybe I will break down and write it eventually (and post it for y’all). I believe that this alone would help tremendously in the adoption of more RESTful API designs across the land. (If any reader knows of the existence of such a library please do let me know.) Of course, all of this is in addition to the lack of support many browsers have for the HTTP PUT and DELETE methods (but this is changing).
  2. It is currently much easier to support multiple payload formats (XML, JSON, YAML, etc…) using the RPC style. Implementing a real REST API would require server code to parse data from POSTs and PUTs in each supported format (XML, JSON, YAML, etc…). Too much work for most. If you write an API that (a) supports multiple formats on GETs and (b) implements POST/PUT with name/value pairs via HTTP POST, then you can say you support multiple formats with a straight face.

It is really the state of our software infrastructure, languages, and frameworks that encourage the web API designs we are seeing out there. We need to collectively do a little work to make it less painful to implement and use truly RESTful APIs by providing some client and server frameworks that make full REST just as easy as the half REST we are seeing out there.

Why is JSON so popular? Developers want out of the syntax business.


There is a reason why JSON is becoming very popular as a data exchange format (more important than it being less verbose than XML): programmers are sick of writing parsers! But “wait”, you say – “surely there are XML parsers available for you to use so that you don’t have to roll your own…”. Yes, there are. But while XML parsers handle the low-level syntactic parsing of XML tags, attributes, etc…, you still need to walk the DOM tree or, worse, build one yourself with nothing but a SAX parser (Objective-C iPhone SDK I’m looking at you!). And that code you write will of course depend on whether the XML you need to make sense of looks like this:

<person first-name="John" last-name="Smith"/>

or this:

<person>    
   <first-name>John</first-name>    
   <last-name>Smith</last-name> 
</person> 

or this:

<object type="Person">
   <property name="first-name">John</property>
   <property name="last-name">Smith</property>
</object> 

or any of the myriad other ways one can conceive of expressing the same concept. The standard XML parser does not help you in this regard. You still need to do some work with the parse tree.
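
To make that concrete, here is a small Java sketch (using only the standard javax.xml.parsers API) of the tree-walking you end up writing for even the simplest, attribute-based variant above:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlExtraction {
    public static void main(String[] args) throws Exception {
        String xml = "<person first-name=\"John\" last-name=\"Smith\"/>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        // The parser hands back a tree, not a Person object, and this code
        // only works for one of the three representations shown above.
        Element person = doc.getDocumentElement();
        String firstName = person.getAttribute("first-name");
        String lastName = person.getAttribute("last-name");
        System.out.println(firstName + " " + lastName);
    }
}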

Working with JSON is a different, and superior, experience. Firstly, the simpler syntax helps you avoid the need to decide between many different ways of representing your data (as we saw above with XML) – much less rope to hang yourself with. Usually there is only one straightforward way to represent something:

{
   "first-name" : "John",
   "last-name" : "Smith"
}

Even more important, if you are working in Javascript (which is very often the case when working with JSON), all you need to do is call eval on a JSON string to obtain a first-class Javascript object. This is huge. The subtle point here is that the output of an XML parser is a parse tree, not an object native to the programming language being used. With XML you are still dealing with syntax to a large degree. When you work with JSON you can go straight from a string representation to an object (and back).

What makes this possible is that Javascript has syntactic constructs for describing composite data types literally. While virtually all languages have syntax for the literal description of objects of primitive types (integers (e.g. 5), strings (e.g. “hello world”)), not all languages have syntax for the literal description of objects of composite types. For instance, if you want to create a map in Java you need to do it procedurally:

Map m = new HashMap();
m.put("a", 1);
m.put("b", 2);
m.put("c", 3);
.
.
.

Java does not have literal syntax for maps. But languages such as Python and Javascript (and others) do. In Javascript we can define our map literally:

{ "a" : 1, "b" : 2, "c" : 3, ... }

As it turns out, such literal-syntax sub-languages are a great match for data interchange formats that are both human- and machine-readable.

So, it makes sense that JSON is so popular. At the same time, I don’t think JSON is the best or final incarnation of this concept, and I expect that, over time, other languages with similar properties will (re)emerge offering improvements over JSON (more on that in a later post).

As for XML… it just might not be the best for structured data interchange (even with some of the cool Object/XML mapping technologies out there). It works well for markup (i.e. HTML), and it can be used for more structured data, but over time I believe it will be supplanted by better technologies that are more like JSON and don’t require developers to walk parse trees. Developers should be out of the syntax business by now.

Why I wish Spring IoC was not marketed as a DI framework


I recently came across Guice, a framework that is widely considered an alternative / competing Dependency Injection (DI) framework to Spring’s IoC container. After reading the documentation (which was very good), and playing around a little, I started to read the numerous articles and blog posts comparing the two, as I was personally very surprised that there was even a comparison to be made.

Most comparisons I read centered around the use of XML vs. annotations (although Spring does allow for the latter as well), and other distinctions that entirely miss the real observation that should be made. While Spring’s IoC container certainly does do dependency injection, it does so as a side-effect of doing something much more general that no other DI framework I have seen does well: creating and configuring instances of Java objects. Sound like a boring and uninteresting statement? Well, this is a very subtle and powerful point whose ramifications are not obvious, but it really should be understood in order to get the most out of Spring’s IoC container and change the way you program (for the better). Let me elaborate.

Most DI frameworks focus on saving you from having to write a lot of object factories to bind concrete implementation classes to the interfaces exposed to and used by the rest of your codebase (i.e. bind the PaypalBillingService class to all places in the codebase where the BillingService interface is used). Spring’s IoC container does this too, and for many, this is all they use it for. But the real power of Spring’s IoC container is in its ability to weave together complex graphs of Java objects and configure them with values.

Consider a simple example of a class meant to represent a JDBC configuration, and a goal of creating JDBC configurations for your development, QA, and production databases.

// I will use the non-compilable shorthand 'property TYPE NAME' to represent Java Bean properties and save me from
// writing getters and setters in this example

public class JdbcConfiguration {
    property String driverClassName;
    property String jdbcUrl;
    property String username;
    property String password;
}

I can then define beans for each configuration:

<bean id="myDevJdbcConfig" class="com.acme.JdbcConfiguration">
   <property name="driverClassName" value="com.mysql.jdbc.Driver"/>
   <property name="jdbcUrl" value="jdbc:mysql://devserver:3306/mydb"/>
   <property name="username" value="admin"/>
   <property name="password" value="admin"/>
</bean>

<bean id="myQAJdbcConfig" class="com.acme.JdbcConfiguration">
   <property name="driverClassName" value="com.mysql.jdbc.Driver"/>
   <property name="jdbcUrl" value="jdbc:mysql://qaserver:3306/mydb"/>
   <property name="username" value="admin"/>
   <property name="password" value="admin"/>
</bean>

<bean id="myProdJdbcConfig" class="com.acme.JdbcConfiguration">
   <property name="driverClassName" value="com.mysql.jdbc.Driver"/>
   <property name="jdbcUrl" value="jdbc:mysql://prodserver:3306/mydb"/>
   <property name="username" value="admin"/>
   <property name="password" value="admin"/>
</bean>
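
Loading and using these is then a one-liner. A minimal sketch, assuming the bean definitions above live in a beans.xml on the classpath and JdbcConfiguration has the getters implied by the shorthand:

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class ConfigDemo {
    public static void main(String[] args) {
        ApplicationContext ctx = new ClassPathXmlApplicationContext("beans.xml");

        // Spring instantiated and populated this object purely from configuration.
        JdbcConfiguration dev = (JdbcConfiguration) ctx.getBean("myDevJdbcConfig");
        System.out.println(dev.getJdbcUrl());
    }
}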

What’s interesting in this simple example is that I used Spring more to configure instances of a single Java class than to provide Dependency Injection in the way that other frameworks like Guice are primarily used.

But this example was somewhat trivial, so let’s kick it up a notch. At my last company, Merced Systems, our professional services team was able to implement incredibly complex customizations of our core platform for our customers using only configuration (no code), via an IoC container I wrote in 2001 that was very similar to Spring’s (enough like Spring’s that I will use Spring to illustrate).

Let’s say we have a simple ETL (Extraction, Transformation, and Loading) framework for moving data from a source database to a target. You could use Spring to completely define an entire ETL process by linking together a set of Java Bean instances:

(I will omit class definitions as they will be obvious from the structure of the bean definitions)

<bean id="myETLConversion" class="com.acme.ETLConversion">
   <property name="source" ref="source">
   <property name="target" ref="target">
   <property name="mapping" ref="mapping">
   <property name="startTime" value="12:00am EST">
   <property name="frequency" value="DAILY">
   <property name="adminEmailForErrorAlerts" value="admin@fooco.com">
</bean>

<bean id="source" class="com.acme.ETLTableEndpoint">
   <property name="tableName" value="PERSON">
   <property name="jdbcConfig" ref="mySourceJdbcConfig">
</bean>

<bean id="target" class="com.acme.ETLTableEndpoint">
   <property name="tableName" value="PERSON">
   <property name="jdbcConfig" ref="myTargetJdbcConfig">
</bean>

<bean id="mapping" class="com.acme.ETLMaping">
   <property name="columnMappings">
      <list>
         <bean class="com.acme.ColumnMapping">
            <property name="sourceColumn" value="PERSON_ID"/>
            <property name="targetColumn" value="ID"/>
         </bean>
         <bean class="com.acme.ColumnMapping">
            <property name="sourceColumn" value="FIRST_NAME"/>
            <property name="targetColumn" value="FN"/>
         </bean>
         <bean class="com.acme.ColumnMapping">
            <property name="sourceColumn" value="LAST_NAME"/>
            <property name="targetColumn" value="LN"/>
         </bean>
      </list>
   </property>
</bean>
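
With that in place, running the conversion requires no ETL-specific Java at all. A minimal sketch, assuming the definitions above live in an etl.xml on the classpath and that ETLConversion exposes a hypothetical run() entry point:

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class EtlRunner {
    public static void main(String[] args) {
        ApplicationContext ctx = new ClassPathXmlApplicationContext("etl.xml");

        // The whole object graph -- endpoints, mapping, schedule -- was
        // assembled from configuration; run() is a hypothetical entry point.
        ETLConversion conversion = (ETLConversion) ctx.getBean("myETLConversion");
        conversion.run();
    }
}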

At Merced we used this technique to allow our customer services team to customize almost every single aspect of our product. In addition to the (simplified) ETL example above, we used it for:

  • Defining Report table layouts and Chart configurations (like Bar Charts vs. Line Charts, font colors and sizes, etc…)
  • Defining the content and layout of Dashboards
  • Defining customizations of our DB schema (which we would then plug in to our ORM framework)
  • Customizations of our URL structure
  • Customizations of left-nav elements for different User roles
  • Access control rules
  • Internationalization and localization
  • More, I just can’t even remember…

And… very importantly, for customizing aspects of the product the engineering team had never even anticipated. Because we had a development discipline of exposing almost every object in our codebase as a configurable Java Bean, our professional services group and customers were able to accommodate numerous unanticipated customization requests without the need to change our codebase. Did this mean a typical deployment of our system had hundreds of XML bean definitions? Yes. Could the configurations get very complex? Yes. Was it scary? No. It allowed us to deliver a highly customizable enterprise software product with ONE codebase, and we never had to support and rationalize Java code written by professional services, customers, or outside integration shops. All customization was done via configuration and it was beautiful.

The point is, using Spring’s IoC framework to inject class dependencies as a substitute for class factories is just the beginning. Its real power is in creating object graphs of components to drive the functionality and behavior of your system in ways that most think require code. Sure, it can result in an enormous amount of XML, and yes, the Java compiler can catch a lot more typos than Spring’s bean XML parser, but the less Java code you have, the better, because writing code causes bugs (even when you have a compiler keeping you type-safe).

I have come to realize that Spring’s IoC framework is often compared to DI frameworks like Guice and others because of its name. The term “Inversion of Control” is pretty much used interchangeably with “Dependency Injection”, and hence the comparisons. And so, I think Spring’s IoC framework suffers because of its name. Its name does not do its power justice and results in naive comparisons to other frameworks. Maybe it should be called a Bean Configuration Framework, or Component Configuration Framework… not sure… but I wish it was not simply marketed, and therefore perceived, as a means for doing Dependency Injection just so that I might use a PaypalBillingService as my program’s implementation of my BillingService interface. It’s much more than that.

Java Generics – were they a good idea?


No. I think that all told, while implemented with the best of intentions, Java Generics were a bad idea. While they do provide welcome and useful functionality in some cases, overall the costs outweigh the benefits.

Cost

This new language feature and its use throughout the new Java 1.5 libraries have added a significant amount of complexity to the world of the Java developer. From the slightly counterintuitive (e.g. List<String> is not a subtype of List<Object>) to the profoundly metalicious (e.g. Enum<E extends Enum<E>>), generics can be subtle and difficult to fully understand for the average Java developer. Ask the average Java 1.4 programmer to read and explain this to you:

TypeVariable is the common superinterface for type variables of kinds. A type variable is created the first time it is needed by a reflective method, as specified in this package. If a type variable t is referenced by a type (i.e, class, interface or annotation type) T, and T is declared by the nth enclosing class of T (see JLS 8.1.2), then the creation of t requires the resolution (see JVMS 5) of the ith enclosing class of T, for i = 0 to n, inclusive. Creating a type variable must not cause the creation of its bounds. Repeated creation of a type variable has no effect. (http://java.sun.com/j2se/1.5.0/docs/api/java/lang/reflect/TypeVariable.html)

Yea.

Benefits

So what do we get for all this? The way I see it, there are really only two benefits to Java generics:

(1) They allow the compiler to catch some programming errors that would otherwise be detected at runtime as ClassCastExceptions (i.e. generics can prevent you from placing an Integer into a List of Strings).

(2) They make *certain* declarations more “self-documenting”. If I see a method signature that declares a parameter as a List of Strings (List<String>) it is much clearer to me how to use the method correctly and is more reliable than the documentation, if it even exists.
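
A quick sketch of benefit (1) in action, showing the same mistake before and after generics:

import java.util.ArrayList;
import java.util.List;

public class GenericsDemo {
    public static void main(String[] args) {
        // Pre-generics: the mistake compiles, and blows up later at runtime.
        List raw = new ArrayList();
        raw.add("hello");
        raw.add(Integer.valueOf(42));          // compiles fine
        // String s1 = (String) raw.get(1);    // ClassCastException at runtime

        // With generics: the same mistake is rejected at compile time.
        List<String> typed = new ArrayList<String>();
        typed.add("hello");
        // typed.add(Integer.valueOf(42));     // does not compile
        String s2 = typed.get(0);              // no cast needed
        System.out.println(s2);
    }
}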

Net net

For me, it’s not worth it. So I have to cast more and perhaps catch some programming errors at runtime. In my experience the vast majority of ClassCastExceptions are the result of programming errors that you will catch very early while running and testing your code anyway. You fix the code and move on. Also, if you structure your error handling correctly, you should not need to wrap each cast in a try/catch block. Bottom line: the core problem generics aim to solve is not that bad of a problem.

As for (2), I would have to say that for every example of code that generics have made easier to understand there is at least one example where they have done just the opposite. Furthermore, in real-world programming with real-world class names, method signatures become multi-line monsters that are quite ugly, hard to read, and a real burden to type (so much for the keystroke savings of not needing casts).

Adding a language feature is a big deal. You can’t undo it, and it can have far-reaching consequences. Java as a language, its documentation, the number of concepts its user has to manage, and the user’s learning curve have all become much more complex for marginal benefit. Any addition to the language should have the effect of making programs easier to both read and write, making concepts simpler to express, and enabling the programmer to be more productive. If anything I would have liked to have seen the addition of multiple dispatch to Java (sometimes called generic methods (yes, confusing) – perhaps more on this in another post).

Do I use it?

For the cases where I have no choice (grrr) and for the cases where it truly is a harmless perk (like Map<String, List>), yes, sometimes. Otherwise I avoid it.
