Modern Application Performance Management (APM) solutions can be tremendously helpful in delivering end-to-end visibility into the application delivery chain: across all tiers and network sections, all the way to the end user. In previous blog posts we showed how to narrow down the various root causes of the problems that end users might experience. Those issues ranged from infrastructure, application, and network problems to the end-user client application and inefficient use of the software. When the problem comes from the end-user application, e.g., a Web 2.0 website, user experience management (UEM) solutions can offer broad analysis of possible root causes. Similarly, when an APM fault domain algorithm points to the application, the DevOps team can dig into the code and database queries that were actually executed to identify the root cause of the problem.
But what do you do when your APM tool points to the network as the fault domain? How do you identify the real cause behind the network problems? Most of the APM tools stop there, forcing the network team to use separate solutions to monitor the actual network packets.
In this article we show how an Application-Aware Network Performance Management (AANPM) suite can be used not only to zero in on the network as the fault domain, but also to dive deeper and show the actual trace of network packets in the selected context, captured at the time the problem happened.
Isolating Fault Domain to the Network
In one of our blog posts we described how Fonterra used our APM tools to identify a problem with the SAP application used in the milk churn scanning process. The operations team could easily isolate the fault domain to the network (see Fig. 1); they needed further analysis, however, to identify the root cause behind that network problem.
Figure 1. The performance report indicates network problems as the fault domain
In some cases, information about the loss rate or zero window events is enough to find and resolve the problem. In general, finding the root cause may require analyzing more detailed, packet-level views to see exactly what is causing the network performance problem. These details can help determine not only why we experienced packet loss or zero window events, but also whether the problem was gradually ramping up or whether there was a sudden flow control blockage, which would indicate congestion.
For example, suppose a number of users start to experience performance degradation of a service and the APM tool points to the network as the fault domain. Detailed, packet-level analysis can show that the whole service delivery process was blocked by a failed initial name resolution.
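To illustrate what such a packet-level check might look like, here is a minimal Python sketch that scans simplified DNS events for lookups that failed or never got an answer. The `DnsEvent` record, the `failed_lookups` helper, and the two-second timeout are assumptions made for this example and are not part of any APM or AANPM product; the response codes are the standard DNS RCODE values.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Standard DNS response codes (RCODE); 0 means the lookup succeeded.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


@dataclass
class DnsEvent:
    """Hypothetical, simplified summary of one DNS packet."""
    timestamp: float             # seconds since the start of the capture
    txn_id: int                  # DNS transaction ID
    name: str                    # queried host name
    is_response: bool
    rcode: Optional[int] = None  # only set on responses


def failed_lookups(events: List[DnsEvent], capture_end: float,
                   timeout: float = 2.0) -> List[Tuple[str, str]]:
    """Return (name, reason) for every lookup that failed or went unanswered."""
    pending = {}   # queries still waiting for a response
    failures = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.txn_id, ev.name)
        if not ev.is_response:
            pending[key] = ev
        elif key in pending:
            query = pending.pop(key)
            if ev.rcode != 0:
                reason = RCODE_NAMES.get(ev.rcode, str(ev.rcode))
                failures.append((query.name, reason))
    # Assumption: a query unanswered for `timeout` seconds counts as failed.
    for query in pending.values():
        if capture_end - query.timestamp >= timeout:
            failures.append((query.name, "NO_RESPONSE"))
    return failures
```

Running a helper like this over the DNS packets that precede a slow transaction would show whether the client ever obtained an address for the service before it tried to connect.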
So What Really Happened in the Network?
Why is detailed packet-level analysis so important when our AANPM points to the network?
Let’s first consider what happens when the fault domain turns out to be one of the application delivery tiers. The engineers responsible for that application can start analyzing logs or, better, drill down to single transaction execution steps and often isolate the problem to the actual line of code that was degrading the performance of the whole application.
However, when our AANPM tells us it is the network, there are no logs or code execution steps to drill down to. Unless we can deliver conclusive and actionable evidence in the form of detailed, packet-level analysis, the network team might struggle to determine the root cause and may remain skeptical about whether the network is at fault at all.
This is exactly what happened to one of our customers. An APM solution had correctly identified that there was a performance problem with the web server. The reports showed which users were affected and where they were located while the problem was occurring. The system also pointed towards the network as the primary fault domain.
The network team tried to determine the root cause of the problem, and for that they needed packet-level data. But capturing all traffic with a network protocol analyzer after the incident had happened not only overloaded the IT team with unnecessary data, but also turned out to be hit-and-miss.
What the team needed were the network packets from the time the problem occurred, and only those few packets that belonged to the communication carrying the affected transactions.
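The idea of keeping only the packets that matter can be sketched in a few lines of Python. This is an illustration of the filtering principle only, not of how DC RUM or Endace select packets; the `PacketSummary` record, the `packets_in_context` helper, and the one-second margin are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PacketSummary:
    """Hypothetical, simplified summary of one captured packet."""
    timestamp: float  # seconds since epoch
    src_ip: str
    dst_ip: str
    length: int       # bytes on the wire


def packets_in_context(packets: Iterable[PacketSummary], client_ip: str,
                       start: float, end: float,
                       margin: float = 1.0) -> List[PacketSummary]:
    """Keep packets sent to or from one client around the problem window.

    `margin` (an assumption) widens the window slightly so that setup
    traffic just before the slow operation is included as well.
    """
    lo, hi = start - margin, end + margin
    return [
        p for p in packets
        if lo <= p.timestamp <= hi and client_ip in (p.src_ip, p.dst_ip)
    ]
```

Compared with a blanket capture taken after the fact, a filter like this leaves the team with a handful of packets tied to one user and one time window.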
Figure 2. You can drill down to analyze captured network packets in the context of given user operations
Using DC RUM, an AANPM tool from the Dynatrace suite, the IT team could start from the initially reported issues and request a packet capture within the context of the affected user (see Fig. 2). The smart packet capture provided by DC RUM leverages the Endace Network Visibility solution to retrieve historical network packet data (see Fig. 3).
Figure 3. Initiate and track the process of retrieving relevant network packets from the Endace infrastructure
The importance of collecting at the user level was revealed when the team analyzed the retrieved network packets together with all interactions connected with the affected transactions. It became clear that the root cause of the issue was not the web server at all, but the name resolution process itself.
When Do You Need Actionable Network Trace Data?
Not all application performance problems require deep, packet-level network analysis. It is worth knowing when the IT team might need actionable network trace data to get to the bottom of an issue. DC RUM metrics that might indicate that packet-level data is required include:
- Low Server Realized Bandwidth – this metric indicates low throughput as the server attempts to transfer a response to a client. This may be caused by a number of problems (generally considered “network” delays): TCP windowing, the TCP receive window, TCP congestion control caused by dropped packets, network congestion, and … a slow server. You should keep an eye on this metric, as it is sometimes the only one that noticeably degrades when user performance suffers. You will, however, need network packet data to understand the actual reason behind the problem.
- High level of TCP errors and Loss Rate – both metrics are based on error counts, either connection refused errors or retransmissions. They correlate only weakly with the actual end-user experience, as those network problems might or might not affect the end user. You need network packet data to understand and quantify the performance impact on the transaction time.
- General Network Fault Domain Indication (FDI) – from the perspective of slow operations affecting end-user experience, the FDI measure may point to the network even though there are no obvious out-of-tolerance network metrics. A view of the network packet data is indispensable to either isolate the network constraint or absolve the network of blame.
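To make the first two kinds of metric more concrete, here is a rough Python sketch of how a throughput figure and a retransmission-based loss rate could be computed from basic counters. These formulas are generic approximations chosen for illustration and are an assumption; they are not the DC RUM definitions of Server Realized Bandwidth or Loss Rate, and the function names are hypothetical.

```python
def realized_bandwidth_bps(response_bytes: int, transfer_seconds: float) -> float:
    """Approximate throughput of one server response, in bits per second."""
    if transfer_seconds <= 0:
        raise ValueError("transfer_seconds must be positive")
    return response_bytes * 8 / transfer_seconds


def loss_rate(retransmitted_packets: int, total_packets: int) -> float:
    """Approximate loss rate as the share of packets that were retransmitted."""
    if total_packets == 0:
        return 0.0  # nothing was sent, so nothing was lost
    return retransmitted_packets / total_packets
```

Neither number says why throughput dropped or whether the retransmissions actually slowed a transaction, which is the point of the list above: the counters flag a problem, and the packets explain it.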
In this post we showed that knowing the fault domain is not always enough: in some cases, finding the root cause of the problem requires low-level, actionable data. In application performance management, those data consist of measurements taken inside the application source code along the application execution path. When it comes to network problems we need a similar level of detail, which can be delivered by smartly capturing network packets in the context of the network conversations among the services that deliver the application experiencing performance problems.
(This blog post is based on materials contributed by Mike Hicks and draws on the original customer story. Some of the screens presented are customized but deliver the same value as the out-of-the-box reports.)