At Dynatrace, we live and breathe application performance. This includes our development process; the dev teams routinely use our own software to test application code for performance at every development cycle. But it doesn’t mean we don’t run into surprises.
Recently, one of the Dynatrace NAM development teams experienced such a surprise. Early one morning, automated performance tests raised a red flag: system capacity had dropped significantly between two CI builds.
The performance problem
The subject system is the Dynatrace NAM console that controls NAM report servers (still wondering what’s in a NAMe?). Technically speaking, it’s a multi-tier web application where the user (the NAM administrator in this case) invokes commands against the control tier (the NAM configuration server), which manages multiple controlled nodes (NAM report servers) using web APIs. Communication between controller and nodes uses REST calls, secured with TLS 1.2.
The team quickly cleared the first suspect – recent code commits – as there were no relevant changes. Suspect two was more promising; a Java upgrade on part of the end-to-end system. Here, they found a smoking gun: a parallel test environment, still running on a previous Java version, did not experience any performance degradation. The investigation stalled, however, as the code executed with the same performance on both the old and new JVMs.
Dynatrace AppMon showed clearly that there were no differences in CPU cycles spent in our code on those two JVMs. So who’s to blame?
Must be the network
If it’s not the code, then it must be the proverbial network! So the dev team – who develops Dynatrace NAM – instrumented the environment with NAM, looking for differences in how the two systems communicate. This process mirrors production performance analysis; internal server and code bottlenecks are often easy to isolate – if they exist. Otherwise, an outside-in view of the end-to-end system is important to identify more obscure performance anomalies. NAM’s wire data analytics provide exactly that perspective.
As the team looked for network anomalies, they found that the JVM upgrade on one of the hosts within the end-to-end system broke a best practice of TLS connection maintenance: when possible, use short TLS handshakes, which resume a previously negotiated session and are therefore more efficient. The upgraded JVM no longer used these, instead requiring a full TLS handshake every time one was needed. This subtle change significantly degraded overall system performance. Read on to see how they found out.
Starting with a comparison of response times from the controlled nodes, the investigation confirmed that response times were significantly higher in the suspect system.
The team learned immediately that network quality was not an issue. In fact, NAM reports showed that the faster-running system, console-db2 (the environment still running the previous Java version), experienced slightly poorer network quality between the controller server and the cluster nodes. Even so, the difference was insignificant to the system’s end-to-end performance.*
What mattered most to the performance difference was the type of operations performed within the monitored systems.
Clearly, these two systems, which are supposed to work the same way and execute the same application code, are behaving differently. The console-db2 system uses short TLS handshakes that are 8x faster than full TLS handshakes. A side-by-side look at the systems’ operation details reveals that the type of handshake indeed matters to overall system performance.
The API connections between the console and the report server require a TLS handshake about once every 10 information exchanges. That’s a safety precaution and an expected cost of security overhead. What matters is the kind of handshake. Negotiating a new TLS session with a full TLS handshake is far more computationally expensive than reusing a previously negotiated key in a new secure session. NAM shows that full handshakes are on average 8x slower than an agreement to reuse the previous session key (a short handshake).
This turned out to be the culprit: the Java upgrade on console-db1 caused an incompatibility between the TLS stack configurations on the NAM console server and the controlled cluster nodes. This incompatibility forced a full TLS handshake every time a TLS handshake was needed.
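If you want to check from the client side whether a server agrees to short handshakes, a quick probe can help. The following is a minimal sketch in Python rather than the team's actual tooling: the function name handshake_is_resumed is made up for illustration, and pinning the connection to TLS 1.2 is an assumption chosen to match the setup described here. It connects twice and offers the first session on the second connection.

```python
import socket
import ssl


def handshake_is_resumed(host, port=443, timeout=5.0):
    """Connect twice; return True if the second handshake resumed the first session.

    Hypothetical helper for illustration only.
    """
    ctx = ssl.create_default_context()
    # Assumption: cap the protocol at TLS 1.2, the version used in the article.
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2

    # First connection: full handshake; keep the negotiated session object.
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as first:
            session = first.session

    # Second connection: offer the saved session (must use the same context).
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host, session=session) as second:
            # True means the server accepted a short (abbreviated) handshake.
            return second.session_reused
```

Calling handshake_is_resumed with the host name and port of your own server returns False when every connection falls back to a full handshake. Note that the default context verifies certificates, so an internal server with a self-signed certificate would need its CA loaded into the context first.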
What does a handshake cost?
Performance measurements reveal that on average, a full TLS handshake alone takes 4x longer than the actual data exchanges over the secured connection! Application developers can’t ignore this overhead in the application delivery channel; it can all too easily nullify even the best efforts to make the app code perform well. Short TLS handshakes improve the situation a lot; these take less time than the data exchange, though they’re still long enough to correspond to typical REST call response times.
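To get a feel for what these ratios mean per REST call, here is a back-of-the-envelope calculation. It is only a sketch: it takes one data exchange as the unit of time, reads the 4x figure as relative to a single exchange (an assumption), and spreads one handshake over the roughly 10 exchanges it covers.

```python
# Relative costs taken from the measurements quoted in the article.
DATA_EXCHANGE = 1.0                      # one data exchange is the unit of time
FULL_HANDSHAKE = 4.0 * DATA_EXCHANGE     # full handshake: 4x one data exchange (assumed reading)
SHORT_HANDSHAKE = FULL_HANDSHAKE / 8.0   # short handshake: 8x faster than a full one
EXCHANGES_PER_HANDSHAKE = 10             # about one handshake per 10 exchanges


def cost_per_exchange(handshake_cost):
    """Average time per exchange with the handshake amortized over its exchanges."""
    total = EXCHANGES_PER_HANDSHAKE * DATA_EXCHANGE + handshake_cost
    return total / EXCHANGES_PER_HANDSHAKE


full = cost_per_exchange(FULL_HANDSHAKE)
short = cost_per_exchange(SHORT_HANDSHAKE)
print(f"full handshakes:  {full:.2f} units per exchange")
print(f"short handshakes: {short:.2f} units per exchange")
print(f"slowdown from losing short handshakes: {full / short:.2f}x")
```

Under these assumptions, the average exchange costs 1.40 units with full handshakes versus 1.05 units with short ones, so losing short handshakes makes every exchange about a third slower without a single line of application code changing.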
An application can’t be declared “performant” until it’s performing well in the eyes of the user. That requires consideration of the delivery channel, which comprises the network, encryption, Citrix, WAN optimization, etc. – all the elements that sit between your code and its consumer. Measuring code response time at an app server entry point doesn’t build a complete picture without including the complexities of the network and the application delivery channel. Dynatrace Network Application Monitoring measures and quantifies exactly that. In our example, identifying the problem would have been difficult – hit or miss – without NAM’s insights into TLS connection setup times, cipher strengths, connection errors and effective throughput.
TLS 1.3, an upcoming standard set to replace TLS 1.2, proposes a new, shortened TLS handshake for the very reasons we uncovered here: to make connection setup faster and lighter. Similar TLS connection optimizations are also a cornerstone of the emerging HTTP/2 standard.
*Let’s broaden the perspective: what if our TLS connections were established over the WAN or the Internet?
Our case benefits from a second thought: here we had a situation that occurred entirely within the data center, where network latencies are very low – a few milliseconds or less. Therefore, we hardly ever saw network time forming a significant part of overall transaction time. With minimal network delays, the computational bottleneck incurred by the misconfigured TLS stack quickly became clear.
However, if the same REST API communication occurred over a WAN or the Internet, where latencies measured in tens or hundreds of milliseconds are not uncommon, different bottlenecks would have amplified the problem. Full TLS handshakes require multiple packet exchanges, and the volume of data transferred requires multiple packets in each data exchange. What does this mean for connection delay? Just try to vertically expand the TLS connection setup bounce diagrams from the illustrations and see how increased network latency has a multiplier effect on the connection setup time.
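The multiplier effect can be sketched with a simple model. In TLS 1.2, a full handshake needs two network round trips and a short handshake needs one, on top of the round trip for the TCP connection itself. The processing times and latency values below are hypothetical, chosen only to keep the 8x ratio mentioned earlier; treat the results as a lower bound, since packet loss and large certificate chains add further round trips.

```python
# Round trips needed before application data can flow (TLS 1.2 over TCP).
TCP_RTTS = 1          # TCP connection setup
FULL_TLS12_RTTS = 2   # full TLS 1.2 handshake
SHORT_TLS12_RTTS = 1  # short (resumed) TLS 1.2 handshake

# Hypothetical processing costs in milliseconds; only the 8x ratio is from the article.
FULL_CPU_MS = 8.0
SHORT_CPU_MS = 1.0


def setup_time_ms(rtt_ms, tls_rtts, cpu_ms):
    """Connection setup time: network round trips plus handshake processing."""
    return (TCP_RTTS + tls_rtts) * rtt_ms + cpu_ms


# Illustrative round-trip times for three kinds of network paths.
for label, rtt_ms in [("data center", 0.5), ("WAN", 50.0), ("Internet", 200.0)]:
    full = setup_time_ms(rtt_ms, FULL_TLS12_RTTS, FULL_CPU_MS)
    short = setup_time_ms(rtt_ms, SHORT_TLS12_RTTS, SHORT_CPU_MS)
    print(f"{label:12s} rtt={rtt_ms:6.1f} ms  full={full:7.1f} ms  short={short:7.1f} ms")
```

In the data center the processing cost dominates, which is why the misconfigured TLS stack stood out so clearly. Over a 200 ms path, the extra round trip of a full handshake alone adds 200 ms to every new connection, dwarfing the processing difference.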
Handshake matters even more over long distances
High-latency network paths would cause each new TLS connection to take much longer, perhaps even seconds – quickly masking the processing overhead. High WAN latencies are more common than many of us think – and not because of poor network quality, but rather because of WAN optimization complexity. Have a look at the blog post Why WAN optimization requires performance management to see an example of the real-life challenges we have to face.
As our industry moves steadily towards increased security, you can expect the unexpected.