Performance of a Distributed Key/Value Store, or Why Simple is Complex

Last time I talked about the key differences between RDBMS and the most important NoSQL databases. The key reason why NoSQL databases can scale the way they do is that they shard based on the entity. The "simplest" form of NoSQL database shows this best: the distributed Key/Value Store. Last week I had the chance to talk to one of the Voldemort developers at LinkedIn. Voldemort is a pure Dynamo implementation. We discussed its key characteristics, and we also talked about some of its problems. Funnily enough, its biggest problem is rooted in its very cleanness, simplicity, and modular architecture.

Performance of a Key/Value Store

A Key/Value Store has a very simple API. All there really is to it is a put, a get, and a delete. In addition, Voldemort, like most of its cousins, supports a batch get and a form of optimistic batch put. That very simplicity makes the response time very predictable and should also make performance analysis and tuning rather easy. After all, one call is like the other, and there are only so many factors that a single call can be impacted by:

  • I/O in the actual store engine on the server side
    This is pluggable in Voldemort, but BerkeleyDB is the default
  • Network I/O to the Voldemort Instance
  • Cache Hit Rate
  • Load distribution
  • Garbage Collection
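The API itself is small enough to sketch in a few lines. Here is a minimal in-memory stand-in (hypothetical names, not the actual Voldemort client interface) that shows the whole surface area: put, get, delete, and a batch get:

```python
# Minimal in-memory stand-in for the Key/Value API described above.
# Hypothetical names; not the actual Voldemort client interface.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def get_all(self, keys):
        # batch get: only keys that are present show up in the result
        return {k: self._data[k] for k in keys if k in self._data}

store = KeyValueStore()
store.put("user:1", "Alice")
store.put("user:2", "Bob")
print(store.get_all(["user:1", "user:2", "user:3"]))  # {'user:1': 'Alice', 'user:2': 'Bob'}
store.delete("user:1")
print(store.get("user:1"))  # None
```

Every call follows one of these three or four shapes, which is exactly why one call looks like any other from a monitoring perspective.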

Both disk and network I/O are driven by data size (key and value) and load. Voldemort is a distributed store that uses Dynamo's consistent hashing, so the load distribution across multiple nodes can vary based on key hotness. Voldemort provides a very comprehensive JMX interface to monitor these things. At first glance this looks rather easy. But on closer inspection Voldemort is a perfect example of why simple systems can be especially complex to monitor and analyze. Let's talk about the distribution factor and the downside of a simple API.
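Consistent hashing is what drives that load distribution: each key hashes to a point on a ring and is stored on the next N distinct nodes clockwise. A rough sketch (my own simplified version, not Voldemort's actual partitioning code) shows why a hot key range maps to the same few nodes every time:

```python
import hashlib
from bisect import bisect

def ring_hash(s):
    # any stable hash works for the sketch; md5 gives a wide, even spread
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # several virtual points per node smooth out the distribution
        self.ring = sorted((ring_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def preference_list(self, key, replicas=2):
        """The first `replicas` distinct nodes clockwise from the key's hash."""
        idx = bisect(self.points, ring_hash(key))
        nodes = []
        while len(nodes) < replicas:
            node = self.ring[idx % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            idx += 1
        return nodes

ring = HashRing(["voldemort1", "voldemort2", "voldemort3"])
print(ring.preference_list("user:42"))  # the two distinct nodes holding this key
```

The mapping is deterministic: the same key always lands on the same replica set, which is exactly why a hot key overloads specific nodes while the others stay idle.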

Performance of distributed Key/Value Stores

Voldemort, like most distributed Key/Value Stores, does not have a master. This is good for scalability and failover, but it means that the client has a little more work to do. Even though Voldemort (and most of its counterparts) does support server-side routing, usually the client communicates with all server instances. A put call will communicate with a certain number of instances that hold a replica of the key (the number is configurable): it makes a synchronous call to the first node and, once it gets a reply, calls the remaining required nodes in parallel and waits for their replies. A get request, on the other hand, calls the required number of nodes in parallel right away.
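That routing behavior can be sketched as follows; this is a simplified model with in-process fake nodes (real Voldemort clients also deal with vector-clock versioning and node failures):

```python
from concurrent.futures import ThreadPoolExecutor

class FakeNode:
    """Stand-in for a remote Voldemort instance."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

def routed_put(replicas, key, value):
    # synchronous call to the first replica ...
    first, rest = replicas[0], replicas[1:]
    first.put(key, value)
    # ... then the remaining required replicas in parallel, waiting for all
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda n: n.put(key, value), rest))

def routed_get(replicas, key):
    # a get calls all required replicas in parallel right away
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda n: n.get(key), replicas))

nodes = [FakeNode(), FakeNode()]
routed_put(nodes, "user:42", "Alice")
print(routed_get(nodes, "user:42"))  # ['Alice', 'Alice']
```

Note that both functions only return once the slowest of the contacted replicas has answered, which is the crux of the next section.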

Transaction Flow of a single Benchmark Thread showing that it calls both instances every time

What this means is that the client-side performance of Voldemort does not depend only on the response time of a single server instance. It actually depends on the slowest one, or in the case of a put, the slowest plus one other. This can hardly be monitored via the JMX interface of the Voldemort instances. Let's understand why.

What the Voldemort server sees is a series of put and get calls. Each and every one can be measured. But we are talking about a lot of them, and what we get via JMX are moving averages and maximums:

Average and Maximum Get and Put Latency as measured on the Voldemort instance

Voldemort also comes with a small benchmarking tool which reports client side performance of the executed test:

[reads] Operations: 899
[reads] Average(ms): 11.3326
[reads] Min(ms): 0
[reads] Max(ms): 1364
[reads] Median(ms): 4
[reads] 95th(ms): 13
[reads] 99th(ms): 70
[transactions]  Operations: 101
[transactions]  Average(ms): 74.8119
[transactions]  Min(ms): 6
[transactions]  Max(ms): 1385
[transactions]  Median(ms): 18
[transactions]  95th(ms): 70
[transactions]  99th(ms): 1366

Two facts stick out. First, the client-side average performance is a lot worse than what the server side reports. This can be due to the network, or due to the fact that every call averages over the slower replica rather than over all calls (remember, we call multiple server instances for each read/write). The second important piece of data is the relatively high volatility. Neither of the two can be explained by looking at the server-side metrics!
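The effect of averaging over the slower replica can be demonstrated with a small simulation. Even though every server draws from the same latency distribution (synthetic numbers, assumed for illustration), a client that must wait for the slowest of two replicas sees a noticeably worse average:

```python
import random
random.seed(1)  # fixed seed so the run is reproducible

def server_latency_ms():
    # mostly fast, with a rare slow response (synthetic distribution)
    return random.expovariate(1 / 4.0) + (100 if random.random() < 0.01 else 0)

N = 10_000
server_avg = sum(server_latency_ms() for _ in range(2 * N)) / (2 * N)
# a client call has to wait for the slower of its two replicas
client_avg = sum(max(server_latency_ms(), server_latency_ms()) for _ in range(N)) / N

print(f"per-server average: {server_avg:.1f} ms")
print(f"client-observed average: {client_avg:.1f} ms")  # noticeably higher
```

This is the same arithmetic that makes the benchmark's client-side averages look worse than the JMX averages on any individual node, without anything being wrong on either side.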
The performance of a single client request depends on the response time of the replicas that hold the specific key. In order to get an understanding of client-side performance we would need to aggregate response time on a per-key and per-instance basis. The volume of statistical data would be rather large. Capturing response times for every single key read and write is a lot of data, but more to the point, analyzing it would be a nightmare. What's even more important: the key alone might tell us which key range is slow, but not why. It is not actionable.

As we have often explained, context is important for performance monitoring and analysis. In the case of a Key/Value Store the context of the API alone is not enough, and the context of the key is far too much and not actionable. This is the downside of a simple API. The distributed nature only makes this worse, as key hotness can lead to an uneven distribution across your cluster.

Client-Side Monitoring of Distributed Key/Value Stores

To keep things simple I used the performance benchmark that comes with Voldemort to show things from the client side.

Single Benchmark Transaction showing the volatility of single calls to Voldemort

As we can see, the client does indeed call several Voldemort nodes in parallel and has to wait for all of them (at least in my example) to return. By looking at things from the client side we can understand why a certain piece of client functionality has to wait for Voldemort even though server-side statistics would never show that. Even more, we can show the contribution of Voldemort operations overall, or of a specific Voldemort instance, to a particular transaction type. In the picture we see that Voldemort (at least end-to-end) contributes 3.7% to the response time of doit. We also see that the vast majority is in the put calls of applyUpdate. And we also see that the response time of the nodes in the put calls varies by a factor of three!

Identifying the root cause of Key Hotness

According to a Voldemort expert, there are two key issues that are hard to track, analyze, and monitor with Voldemort. The first is key hotness. Key hotness is a problem for all Dynamo implementations. If a certain key range is requested or written much more often than others, it can lead to over-utilization of specific nodes while others sit idle.

It is very hard to determine which keys are hot at any given time and why. If the application is mostly user-driven it might be nigh impossible to predict up front. One way to overcome this is to correlate end-user Business Transactions with the Voldemort load and response time they trigger. The idea is that an uneven load distribution on your distributed Key/Value Store should be triggered by one of three scenarios:

  • All the excessive load is triggered by the same application functionality:
    This is pretty standard and means that the keys you use in that functionality are not evenly spread, monotonically increasing, or otherwise unbalanced.
  • All the excessive load is triggered by a certain end-user group or a specific dimension of a business transaction:
    One example would be that the user group is part of the key(s) and that user group is much more active than the others. Restructuring the key might help to make it more diverse.
    Another example is that you are accessing data sets like city information and for whatever reason New York, London, and Vienna are accessed much more often than anything else (e.g. more people book trips to these three cities than anywhere else).
  • A combination of the two above:
    Either the same data set is accessed by several different business transactions (in which case you need a cross-cut analysis) or the same data structure is accessed by the same business transaction.

The key factor is that all of this can be identified by tracing your application and monitoring it via your business transactions. The number of discrete business transactions and their dimensions (Booking per Location, Search by Category) is smaller than the number of keys you use in your store. More importantly, it is actionable! The fact that 80% of the load on your 6 overloaded store instances results from the business transaction "Search Books – Thriller" enables you to investigate further. You might change the structure of the keys, optimize the access pattern, or set up a separate store for the specific area if necessary.
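A toy version of this correlation (with made-up trace records) shows why the business-transaction dimension is so much more actionable than raw keys: a handful of transaction names explains the load on the hottest instance directly:

```python
from collections import Counter

# made-up trace records: (business transaction, store instance that served it)
requests = [
    ("Search Books - Thriller", "voldemort3"),
    ("Search Books - Thriller", "voldemort3"),
    ("Search Books - Thriller", "voldemort3"),
    ("Search Books - Romance", "voldemort2"),
    ("Checkout", "voldemort1"),
]

# total load per instance, and load per (instance, transaction) pair
load = Counter(node for _, node in requests)
per_node_tx = Counter((node, tx) for tx, node in requests)

hot_node, _ = load.most_common(1)[0]
breakdown = {tx: n for (node, tx), n in per_node_tx.items() if node == hot_node}
print(hot_node, breakdown)  # voldemort3 {'Search Books - Thriller': 3}
```

The same aggregation over millions of raw keys would be enormous and would still not tell you which functionality to fix; a few dozen transaction names do.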

Identifying Outliers

The second area of issues that are hard to track down is outliers. These are often considered to be environmental factors. Again, JMX metrics aren't much help here, but a look at the internals quickly reveals what is happening:

PurePath showing the root cause of an outlier

In my load test of two Voldemort instances (admittedly a rather small "cluster") the only outliers were instantly tracked down to synchronization issues within the chosen store engine: BerkeleyDB. What is interesting is that I could see that all requests to that particular Voldemort instance during that time frame were similarly blocked in BerkeleyDB.

Seven were waiting for the lock in that synchronized block, and the 8th was blocking the rest while waiting for the disk.
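The pattern is easy to reproduce in miniature: one thread holds a lock while waiting on slow I/O, and everyone else queues up behind it. Here is a sketch with plain threads, where a sleep stands in for both the BerkeleyDB synchronization and the disk wait:

```python
import threading
import time

lock = threading.Lock()      # stands in for the synchronized block in the store engine
wait_times = []

def slow_delete():
    with lock:
        time.sleep(0.2)      # holding the lock while "waiting for the disk"

def blocked_get():
    start = time.monotonic()
    with lock:
        pass                 # the operation itself is trivial; the cost is the lock wait
    wait_times.append(time.monotonic() - start)

writer = threading.Thread(target=slow_delete)
writer.start()
time.sleep(0.05)             # let the writer grab the lock first
readers = [threading.Thread(target=blocked_get) for _ in range(7)]
for r in readers:
    r.start()
for t in [writer, *readers]:
    t.join()

# every reader waited roughly as long as the writer's disk wait, not its own work
print(max(wait_times))
```

The readers' own work is negligible; their observed latency is almost entirely the writer's disk wait, which is exactly what the outlier transactions showed.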

Hotspot showing where 7 transactions were all waiting for a lock in BerkeleyDB
The root cause for the lock was that a delete had to wait for the disk

This issue happened randomly, always on the same node, and was unrelated to concurrent access of the same key. By seeing the client side (which has to wait and is impacted), the corresponding server side (which shows where the problem is), and all impacted transactions (in this case I had 8 transactions going to Voldemort1 which were all blocked), I was able to pinpoint the offending area of code immediately.

Granted, as a Voldemort user who doesn't want to dig into BerkeleyDB I cannot fix it, but it does tell me that the root cause of the long synchronization block is disk wait, and I can work with that.


Key/Value Stores like Voldemort are usually very fast and have very predictable performance. The key to this is the very clean and simple interface (usually get and put) that does not allow for much volatility in the execution path on the server side. This also means that they are much easier to configure and optimize, at least as far as the speed of a single instance goes.

However, this very simplicity can also be a burden when trying to understand end-user performance, contribution, and outliers. In addition, even simple systems become complex when you make them distributed, add more and more instances, and throw millions of requests at them. Luckily the solution is easy: focus on your own application and its usage of the Key/Value Store, instead of on the Key/Value Store alone.

Those who know me know that I'm passionate about 3 things: rock climbing, physics, and performance. I've worked in performance monitoring and optimizations in enterprise environments for the better part of the last 10 years. Now as a Product Manager I am doing my best to build those experiences into Dynatrace.