Jorgen’s Weblog: Modern Databases

I finally understood the point of the so-called NoSQL databases. For a long time I dismissed them as buzzword hype, but it turns out there are good technical reasons for them. The hype part is the claim that they’re supposed to replace relational databases. That’s wrong. But they do have a purpose.

Let me explain.

File Systems

Historically, applications have used the file system as their primary database. A file system can be described as a key/value store with hierarchical keys and free-form values. This format has a few advantages and a few disadvantages.

For example, the API is well-defined and easily available. When used the way it was designed to be used, it’s also extremely fast, and it has no problems with very large data blobs.

On the other hand, the lack of structure in the values really bites in more complex uses. Every application has to implement serialization and deserialization of the values itself; that’s not part of the API. Also, file systems have very poor support for concurrent writes. File locking support is very limited, and atomic operations are rare. Some of these problems can be worked around, for example by writing data to a temporary key and then using rename to create the actual key in a single atomic operation.
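The rename workaround can be sketched in a few lines of Python. This is a minimal illustration, not a hardened implementation: the `put` and `get` helpers and the choice of JSON as the serialization format are my own assumptions, and the atomicity of the final step relies on the temporary file living on the same file system as the target (which is what POSIX rename needs).

```python
import json
import os
import tempfile


def put(root, key, value):
    """Store a JSON-serializable value under a hierarchical key like 'users/42'."""
    path = os.path.join(root, *key.split("/"))
    directory = os.path.dirname(path)
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(value, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the new file into place in one step: readers see
        # either the old value or the new one, never a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def get(root, key):
    """Read a value back; deserialization is the application's job as well."""
    path = os.path.join(root, *key.split("/"))
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Note how much of this is plumbing the application has to carry itself: key-to-path mapping, serialization, and the temp-file dance are all outside the file system API.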

Where file systems really stink is networked operations. There is simply no good solution for applications on different hosts using the same file system. (NFS, SMB, SSHFS? Come on.)

Hence …

Relational Databases

A lot of the problems file systems have are solved by relational databases. They provide structured data and a very powerful API, are robust under concurrent use, and have good support for networked access. And the operations they were designed for are a lot faster than on file systems.
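To make "structured data and a powerful API" concrete, here is a small sketch using Python's built-in `sqlite3` module. SQLite is an embedded database rather than a networked server, so it only stands in for the data model and the query language here; the `authors` and `posts` tables are made up for illustration.

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY,"
    " author_id INTEGER NOT NULL REFERENCES authors(id),"
    " title TEXT NOT NULL)"
)

# A transaction: either all of these inserts are committed or none are.
with conn:
    conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
    conn.executemany(
        "INSERT INTO posts (author_id, title) VALUES (?, ?)",
        [(1, "File Systems"), (1, "NoSQL")],
    )

# The query works on the structure directly; no hand-written parsing needed.
rows = conn.execute(
    "SELECT authors.name, COUNT(posts.id)"
    " FROM authors JOIN posts ON posts.author_id = authors.id"
    " GROUP BY authors.id"
).fetchall()
print(rows)  # [('Ada', 2)]
```

Compare this with the file system case: the schema, the transaction, and the join are all things the database does for you.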

But they, too, have problems. For example, support for large blobs is generally vendor-specific, if it exists at all. This goes so far that a widespread recommendation is to store files on the file system and only the file names in the relational database.

Random access to individual keys is generally slower than on file systems. And they also impose quite a few restrictions on how you can structure your data.

Can we combine some of the advantages of these two? Why, yes!

NoSQL

As the “No” in NoSQL implies, these databases are not really defined by what they do, but by what they don’t do. They’re not relational. Even the subtypes vary a lot.

In-memory key/value stores like Redis or Memcached, for example, provide a filesystem-like key/value store that’s optimized for speed and networked access, at the expense of reliability. They’re basically temporary file systems, for data that’s good to have around but whose loss is not exactly a tragedy. Caching is an excellent example. They also provide a richer API for atomic operations, like an increment operation for integer fields, which avoids the archetypal concurrent access pitfall.
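The pitfall in question is the read-modify-write race: two clients read the same counter, each adds one, and one update is lost. Redis offers an `INCR` command for exactly this. The sketch below is not Redis, just a toy in-process stand-in (the `TinyStore` class and `count_hits` helper are hypothetical names) that shows why doing the whole increment inside the store fixes the problem.

```python
import threading


class TinyStore:
    """Toy in-memory key/value store; not the API of any real product."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=0):
        with self._lock:
            return self._data.get(key, default)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def incr(self, key, amount=1):
        # Read, add, and write happen under one lock, so no other client
        # can slip in between the read and the write.
        with self._lock:
            value = self._data.get(key, 0) + amount
            self._data[key] = value
            return value


def count_hits(store, key, clients=8, hits_per_client=1000):
    """Simulate several clients bumping the same counter concurrently."""

    def worker():
        for _ in range(hits_per_client):
            store.incr(key)

    threads = [threading.Thread(target=worker) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.get(key)
```

Had each client instead done `store.set(key, store.get(key) + 1)`, another client could update the counter between the `get` and the `set`, and the final count could come up short.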

Document-based databases like CouchDB, MongoDB or ElasticSearch, on the other hand, focus on rich structured data and an API that can work with the structure.
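What "an API that can work with the structure" means is easiest to see in a toy. The following is plain Python over a list of dicts, not the query API of CouchDB, MongoDB or ElasticSearch; `lookup` and `find` are hypothetical helpers, and the dotted-path convention is just an assumption for the sketch.

```python
def lookup(document, path):
    """Follow a dotted path like 'author.name' into nested dicts."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(documents, criteria):
    """Return the documents whose fields equal every value in criteria."""
    return [
        doc
        for doc in documents
        if all(lookup(doc, path) == wanted for path, wanted in criteria.items())
    ]


# Documents need not share a schema; the third one has no author at all.
posts = [
    {"title": "File Systems", "author": {"name": "Ada"}, "tags": ["storage"]},
    {"title": "NoSQL", "author": {"name": "Grace"}, "tags": ["storage", "hype"]},
    {"title": "Untitled draft"},
]

matches = find(posts, {"author.name": "Grace"})
print([doc["title"] for doc in matches])  # ['NoSQL']
```

The point is that the store understands the nesting: you query by field inside the document, without forcing every document into the same set of columns first.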

Summary

None of these is strictly better than the others. What the so-called NoSQL revolution (terrible name) did was move us beyond the holy duality of storing data either in a relational database or in the file system. Each option has its strengths and weaknesses, and depending on your application’s requirements, you might pick one over the others.

But at least you now have a choice.

Sunday, September 29, 2013
