Most articles on the topic of NoSQL revolve around the theme of RDBMS vs. NoSQL. DBAs defend RDBMS by stating that NoSQL solutions are all dumb, immature data stores without any standards. Many NoSQL proponents react with the argument that RDBMS does not scale and that today everybody needs to deal with huge amounts of data.
I think NoSQL is sold short here. Yes, Big Data plays a big role, but it is not the primary driver in all NoSQL solutions. There are no standards because there really is no single NoSQL solution, only different types of solutions that cater for different use cases. In fact, nearly all of them state that theirs is not a replacement for a traditional RDBMS! When we compare RDBMS against them, we need to do so on a use case basis. There are very good reasons for choosing an RDBMS as long as the amount of data is not prohibitive. There are, however, equally good reasons not to do so and to choose one of the following solution types:
- Distributed Key Value Stores
- Distributed Column Family Stores
- (Distributed) Document Databases
- Graph Databases
It has to be said, however, that there are very simple and specific reasons why traditional RDBMS solutions cannot scale beyond a handful of database nodes, and even that is painful. However, before we look at why NoSQL solutions tend not to have that problem, we will take a look at why and when you should choose an RDBMS and when you shouldn’t.
When and Why you (should) choose an RDBMS
While data durability is an important aspect of an RDBMS, it is not a differentiator compared to other solutions. So I will concentrate first and foremost on the unique features of an RDBMS that also have an impact on application design and performance:
- Table based
- Relations between distinct Table Entities and Rows (the R in RDBMS)
- Referential Integrity
- ACID Transactions
- Arbitrary Queries and Joins
If you really need all or most of these features, then an RDBMS is certainly right for you, although the amount of data you have might force you in another direction. But do you really need them? Let’s look closer.
The table based nature of an RDBMS is not a real feature; it is just the way it stores data. While I can think of use cases that specifically benefit from this, most of them are simple in nature (think of Excel spreadsheets). That nature, however, requires a relational concept between rows and tables in order to make up complex entities.
There are genuine relations between otherwise standalone entities (like one person being married to another) and relationships that really define hierarchical context or ownership of some sort (a room is always part of a house). The first kind is a real feature; the second is a result of the storage nature. It can be argued that a Document (e.g. an XML document) stores such a “relation” more naturally because the House Document contains the Room instead of having the Room as a separate document.
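To make the House and Room example concrete, here is a minimal sketch that stores the same entity both ways. The in-memory lists stand in for tables and the JSON string stands in for a stored document; all names and sample values are made up for illustration.

```python
import json

# Relational style: rooms live in their own "table" and point back at a house.
houses = [{"id": 1, "address": "1 Main Street"}]
rooms = [
    {"id": 10, "house_id": 1, "name": "kitchen"},
    {"id": 11, "house_id": 1, "name": "bedroom"},
]


def load_house_relational(house_id):
    """Reassemble the logical entity by joining the two tables on house_id."""
    house = next(h for h in houses if h["id"] == house_id)
    room_names = [r["name"] for r in rooms if r["house_id"] == house_id]
    return {"address": house["address"], "rooms": room_names}


# Document style: the house owns its rooms, so the entity is stored whole.
house_document = json.dumps(
    {"address": "1 Main Street", "rooms": ["kitchen", "bedroom"]}
)


def load_house_document():
    """Read the entity back in one step; no join is needed."""
    return json.loads(house_document)
```

Both functions return the same logical house. The difference is that the relational version has to put it back together from two places on every read.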
Referential Integrity is really one of the cornerstones of an RDBMS; it ensures the logical consistency of my domain model. Not only does it ensure consistency within a certain logical entity (which might span multiple rows/tables) but, more importantly, cross-entity consistency. If you access the same data via different applications and need to enforce integrity at a central location, this is the way to go. We could check this in the application as well, but the database often acts as the final authority on consistency.
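As a small illustration of the database acting as the final authority, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The house and room tables are hypothetical, and SQLite is only a convenient stand-in for a full RDBMS server (note that it enforces foreign keys only after the PRAGMA shown).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE house (id INTEGER PRIMARY KEY, address TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE room ("
    " id INTEGER PRIMARY KEY,"
    " house_id INTEGER NOT NULL REFERENCES house(id),"
    " name TEXT NOT NULL)"
)
conn.execute("INSERT INTO house (id, address) VALUES (1, '1 Main Street')")
conn.execute("INSERT INTO room (house_id, name) VALUES (1, 'kitchen')")
conn.commit()


def try_insert_orphan_room():
    """Try to add a room for a house that does not exist."""
    try:
        conn.execute("INSERT INTO room (house_id, name) VALUES (99, 'attic')")
        return True
    except sqlite3.IntegrityError:
        # The database rejects the row, whichever application sent it.
        return False
```

Any application that talks to this database gets the same answer, which is exactly the point of enforcing the rule centrally.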
The final aspect of consistency comes in the form of ACID transactions. They ensure that either all of my changes are seen by others consistently and in their entirety, or none of my changes are committed at all. Consistency really is the hallmark of an RDBMS. However, we often set commit points for reasons other than consistency. How often did I use a bulk update for the simple reason of increased performance? In many cases I did not care about the visibility of those changes, but just wanted to have them done fast. In other cases we would deliberately commit more often in order to decrease locking and increase concurrency. The question is: do I care whether Peter shows up as married while Alice is still seen as unmarried? The government for sure does; Facebook, on the other hand, does not!
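The Peter and Alice case can be sketched with sqlite3 as well. This is an illustration under assumptions: the person table and the marry helper are made up, and the failure between the two updates is simulated by raising an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT PRIMARY KEY, spouse TEXT)")
conn.execute("INSERT INTO person VALUES ('Peter', NULL), ('Alice', NULL)")
conn.commit()


def marry(a, b, fail_midway=False):
    """Update both rows in one transaction; undo both if anything fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE person SET spouse = ? WHERE name = ?", (b, a))
            if fail_midway:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute("UPDATE person SET spouse = ? WHERE name = ?", (a, b))
    except RuntimeError:
        pass  # the partial update has already been rolled back


def spouses():
    """Return the current name -> spouse mapping."""
    return dict(conn.execute("SELECT name, spouse FROM person ORDER BY name"))
```

Calling `marry("Peter", "Alice", fail_midway=True)` leaves both people unmarried rather than leaving Peter married to someone who is still single. That is the guarantee the government cares about and Facebook can live without.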
SELECT count(e.isbn) AS "number of books", p.name AS publisher FROM editions AS e INNER JOIN publishers AS p ON (e.publisher_id = p.id) GROUP BY p.name;
The final defining feature of an RDBMS is its ability to execute arbitrary queries: SQL Selects. Very often NoSQL is understood as not being able to execute queries. That is not true, but RDBMS solutions do offer a far superior query language. Especially the ability to group and join data from unrelated entities into a new view on the data is something that makes an RDBMS a powerful tool. If your business is defined by the underlying structured data and you need the ability to ask different questions all the time, then this is a key reason to use an RDBMS.
However, if you know in advance how you will access the data, or if you would have to change your application anyway in order to access it differently, then a lot of that advantage is overkill.
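To see the join and group query quoted above in action, here is a minimal sketch that runs it against an in-memory SQLite database. The table layout is inferred from the query itself and the sample rows are invented; an ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE editions (isbn TEXT PRIMARY KEY, publisher_id INTEGER);
    INSERT INTO publishers VALUES (1, 'Acme Press'), (2, 'Example Books');
    INSERT INTO editions VALUES ('111', 1), ('222', 1), ('333', 2);
""")

# The query from the text, plus ORDER BY for a stable result order.
QUERY = """
    SELECT count(e.isbn) AS "number of books", p.name AS publisher
    FROM editions AS e
    INNER JOIN publishers AS p ON (e.publisher_id = p.id)
    GROUP BY p.name
    ORDER BY p.name
"""


def books_per_publisher():
    """Return (count, publisher) rows computed at query time."""
    return conn.execute(QUERY).fetchall()
```

Nothing in the schema was designed for this particular question. The database works out the answer from the two tables when asked, and would do the same for a different question tomorrow.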
Why an RDBMS might not be right for you
These features come at the price of complexity in terms of data model, storage, data retrieval and administration and, as we will see shortly, a built-in limit on horizontal scalability. If you do not need most of these features, you should not use an RDBMS.
- If you just want to store your application entities in a persistent and consistent way then an RDBMS is overkill. A Key Value Store might be perfect for you. Note that the Value can be a complex entity in itself!
- If you have hierarchical application objects and need some query capability into them, then any of the NoSQL solutions might be a fit. With an RDBMS you can use an ORM to achieve the same, but at the cost of adding complexity to hide complexity.
- If you have ever tried to store large trees or networks, you will know that an RDBMS is not the best solution here. Depending on your other needs, a Graph database might suit you.
- You are running in the Cloud and need to run a distributed database for durability and availability. This is what Dynamo and Bigtable based data stores were built for. RDBMS solutions, on the other hand, do not do well here.
- You might already use a data warehouse for your analytics. This is not too dissimilar from a Column Family database. If your data grows too large to be processed on a single machine, you might look into Hadoop or any other solution that supports distributed Map/Reduce.
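For readers who have not met Map/Reduce, the idea from the last bullet can be sketched in a single process. This is not how Hadoop is programmed; the partitions, record layout and function names are hypothetical and only show the shape of the computation.

```python
from collections import Counter
from functools import reduce

# Hypothetical sales records, already split into partitions as if each
# partition lived on a different machine.
partitions = [
    [("books", 3), ("music", 1)],
    [("books", 2), ("games", 5)],
    [("music", 4)],
]


def map_partition(records):
    """Map step: each node totals its own records independently."""
    totals = Counter()
    for category, amount in records:
        totals[category] += amount
    return totals


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def total_by_category(parts):
    """Run the map step per partition, then fold the partial totals together."""
    return dict(reduce(merge, map(map_partition, parts), Counter()))
```

The map step never needs to look at another partition, which is why this kind of processing can be spread across many machines.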
There are many scenarios where a fully ACID-driven, relational, table based database is simply not the best or the simplest option to go with. Now that we got that out of the way, let’s look at the big one: amount of data and scalability.
Why an RDBMS does not scale and many NoSQL solutions do
The real problem with RDBMS is the horizontal distribution of load and data. The fact is that RDBMS solutions cannot easily achieve automatic data sharding. Data sharding would require distinct data entities that can be distributed and processed independently. An ACID based relational database cannot do that due to its table based nature. This is where NoSQL solutions differ greatly. They do not distribute a logical entity across multiple tables; it is always stored in one place. A logical entity can be anything from a simple value, to a complex object or even a full JSON document. They do not enforce referential integrity between these logical entities. They only enforce consistency inside a single entity and sometimes not even that.
This is what allows them to automatically distribute data across a large number of database nodes and also write them independently. If I were to write 20 entities to a database cluster with 3 nodes, chances are I can evenly spread the writes across all of them. The database does not need to synchronize between the nodes for that to happen and there is no need for a two phase commit, with the visible effect that Client 1 might see changes on Node 1 before Client 2 has written all 20 entities. A distributed RDBMS solution, on the other hand, needs to enforce ACID consistency across all three nodes. That means that Client 1 will either not see any changes until all three nodes have acknowledged a two phase commit or will be blocked until that has happened. In addition to that synchronization, the RDBMS also needs to read data from other nodes in order to ensure referential integrity. All of that happens during the transaction and blocks Client 2. NoSQL solutions do no such thing for the most part.
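The 20 entities on 3 nodes scenario can be sketched with simple hash based routing. This is an illustration only: the node names and keys are hypothetical, and real systems typically use consistent hashing or similar schemes rather than a plain modulo, so that adding a node does not reshuffle almost every key.

```python
import hashlib

NODES = ["node-1", "node-2", "node-3"]  # hypothetical cluster


def node_for(key, nodes=NODES):
    """Pick a node from the key alone, so nodes never have to coordinate."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def write_all(entities):
    """Route each entity to its node and store it there as a whole."""
    cluster = {node: {} for node in NODES}
    for key, value in entities.items():
        cluster[node_for(key)][key] = value
    return cluster


entities = {f"user:{i}": {"id": i} for i in range(20)}
cluster = write_all(entities)
```

Each write touches exactly one node and needs no agreement from the others. That only works because every entity is self-contained and nothing checks references between them.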
The fact that such a solution can scale horizontally also means that it can leverage its distributed nature for high availability. This is very important in the cloud, where every single node might fail at any moment.
Another key factor is that these solutions do not allow joins and groups across entities, as that would not be possible in a scalable way if your data ranges in the millions and is distributed across 10 nodes or more. I think this is something that a lot of us have trouble with. We have to start thinking about how we will access the data and store it accordingly, and not the other way around.
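Here is a minimal sketch of what storing data according to the way you access it can look like. A plain dictionary stands in for a key-value store and the key naming scheme is made up. Instead of grouping at read time, the per-publisher count is maintained at write time.

```python
store = {}  # stand-in for a distributed key-value store


def add_edition(isbn, publisher):
    """Write the entity and update the precomputed answer we know we will need."""
    store[f"edition:{isbn}"] = {"isbn": isbn, "publisher": publisher}
    counter_key = f"publisher_count:{publisher}"
    # These are two separate writes; a real store would not make them atomic
    # across keys unless it offers a mechanism for that.
    store[counter_key] = store.get(counter_key, 0) + 1


def books_for(publisher):
    """Answer the question with a single key lookup instead of a join."""
    return store.get(f"publisher_count:{publisher}", 0)
```

The read is now trivial and scales with the store, but a new kind of question means changing the application and possibly rewriting the stored data.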
So it is true that NoSQL solutions lack some of the features that define an RDBMS solution. They do so for the sake of scalability. That does not mean, however, that they are dumb data stores. Document, Column Family and Graph databases are far from unstructured and simple.
What about Application Performance?
The fact that all these solutions scale in principle does not, however, mean that they do so in practice or that your application will perform better because of it! Indeed, the overall performance depends to a very large degree on choosing the right implementation for your use case. Key/Value stores are very simple, but you can still use them wrongly. Column Family Stores are very interesting and also very different from a table based design. Because of this, it is easy to end up with a bad data model design, and this will kill your performance.
Besides the obvious factors of disk I/O, network and caching (which you must of course take into consideration), both application performance and scalability depend heavily on the data itself; more specifically on the distribution across the database cluster. This is something that you need to monitor in live systems and take into consideration during the design phase as well. I will talk more about this and specific implementations in the coming months.
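One simple way to put a number on the distribution across the cluster is to compare the busiest node with the average. The following sketch assumes you can obtain a mapping from key to node (how you get it depends entirely on the product); the function names and the ratio itself are my own illustration, not a standard metric.

```python
from collections import Counter


def load_per_node(assignments):
    """Count how many keys each node holds, given a key -> node mapping."""
    return Counter(assignments.values())


def imbalance(assignments, nodes):
    """Ratio of the busiest node's load to the mean load (1.0 means even)."""
    counts = load_per_node(assignments)
    loads = [counts.get(node, 0) for node in nodes]
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0
```

For example, `imbalance({"a": "n1", "b": "n1", "c": "n1", "d": "n2"}, ["n1", "n2"])` yields 1.5, meaning the busiest node carries 50% more than its fair share.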
There is one other factor that will play a key role in the choice between NoSQL and more traditional databases. Companies are used to RDBMS; they have experts and DBAs for them. NoSQL is new and not well understood yet. The administration is different. Performance tuning and analysis are different, as are the problem patterns that we see. More importantly, performance and setup are more than ever governed by the applications that use them and not by index tuning.
Application Performance Management as a discipline is well equipped to deal with this. In fact, by looking at the end-to-end application performance it can handle the different NoSQL solutions just like any other database. Actually, as we have seen in my last blog, we can often do better!