Building an application performance monitoring solution takes a lot of effort, including developing the actual monitoring and reporting application. But making APM awesome takes even more effort and credibility. That is why, when the product previously known as dynaTrace joined Data Center RUM (DC RUM) under the Dynatrace umbrella, we decided to check (and possibly improve even further) DC RUM’s performance using dynaTrace instrumentation.

Drinking Our Own Champagne

In this article we show how the DC RUM development team instrumented the DC RUM offering with dynaTrace and share the most interesting observations we made. We found several problem patterns that we addressed to improve the performance of individual DC RUM components, thus delivering better performance to our end users.

Preparing for Instrumentation

DC RUM is an enterprise-class application built to deliver state-of-the-art APM reporting to the end user. Behind the scenes it has a multi-tier architecture comprising one or more network probes (AMD), data analysis servers (CAS and ADS), an administrative service (RUM Console) and a security service (CSS). Figure 1 shows an example Transaction Flow diagram for a simple DC RUM deployment, showing all tiers and how they interact with each other.

Figure 1. Example transaction flow diagram for a simple DC RUM setup

When we first instrumented DC RUM components by simply attaching dynaTrace Agents, three things became apparent:

  1. Not all operations were picked up by the dynaTrace PurePath technology. Since some data processing routines are invoked by the internal scheduling service rather than through service calls, they did not start new PurePaths out of the box.
  2. Because we used an in-house thread pooling mechanism, not all PurePaths were traced across all tiers. We used the extension mechanisms provided by the dynaTrace Agent (the dynaTrace ADK) to support our custom implementation of asynchronous thread execution, achieving full end-to-end transaction traces for all of our transactions between CAS and ADS (see the sketch after Figure 2).
  3. Looking at the exceptions reported in PurePaths, one could believe the service was constantly out of control. In fact, parts of DC RUM are written in Groovy and others use a JSON library for processing; both throw exceptions extensively to control processing flow. In order to see the wood for the trees, we had to silence both (see Fig. 2).

Figure 2. Excluding exceptions thrown by Groovy and JSON to focus only on the actual problems 
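
To illustrate point 2 above, here is a minimal sketch of how a work item handed to a custom thread pool can be linked into the originating PurePath using the dynaTrace ADK tagging API. The TaggedTask wrapper is hypothetical and the exact usage in DC RUM differs; only the Tagging calls come from the ADK.

```java
import com.dynatrace.adk.DynaTraceADKFactory;
import com.dynatrace.adk.Tagging;

// Hypothetical wrapper for work items handed to an in-house thread pool.
// DynaTraceADKFactory.initialize() is assumed to have been called at startup.
final class TaggedTask implements Runnable {
    private final Runnable work;
    private final byte[] dtTag; // PurePath tag captured on the submitting thread

    TaggedTask(Runnable work) {
        this.work = work;
        Tagging tagging = DynaTraceADKFactory.createTagging();
        this.dtTag = tagging.getTag();            // capture the current PurePath context
        tagging.linkClientPurePath(true, dtTag);  // mark the asynchronous hand-off
    }

    @Override
    public void run() { // executed later, on a pool thread
        Tagging tagging = DynaTraceADKFactory.createTagging();
        tagging.setTag(dtTag);         // join the originating PurePath
        tagging.startServerPurePath(); // continue the trace on this thread
        try {
            work.run();
        } finally {
            tagging.endServerPurePath();
        }
    }
}
```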

Problem Pattern #1: Unnecessary Database Calls for Status Info

Probably the single most crucial function of the DC RUM system, from the user-experience perspective, is the reporting delivered by the DMI (Data Mining Interface) service. Delays in rendering DMI reports can accumulate when following the Fault Domain Isolation (FDI) algorithm, which takes the form of a workflow through DMI reports. DC RUM usually has to deal with enormous amounts of data that have to be crunched, aggregated and eventually rendered to the user in fractions of a second to ensure flawless execution of the FDI algorithm.

DMI reports are rendered by the LSServlet servlet. That is why we started our analysis by looking at the /LSServlet Entry Points in dynaTrace and drilling down to the corresponding PurePaths.
Most of the time required to render a DMI report is usually spent on accessing and handling data. For the most popular reports, this data should come not from the database but from in-memory caches. Therefore, when analyzing DMI performance we usually check the database calls, looking for, e.g., those that should not be there at all.

Figure 3. Finding the root cause of unnecessary database calls when rendering DMI reports

When analyzing the Database dashlet, to which we drilled down from one of the /LSServlet PurePaths (see Fig. 3), we noticed a number of DB calls that have nothing to do with getting the data required to render the report. We drilled down further to the nodes in the PurePaths Tree where these calls are made. They were in fact invoked in separate threads, which were traced back (thanks to the thread “stitching” done with the dynaTrace ADK, described in the previous section) to calls of the getCommonStatuses() method. This method is implemented by a number of plugins, which deliver the information shown in the tooltip of the status icon rendered next to the report title (see Fig. 4).

Figure 4. Example of a status icon rendered next to the report title 

Information delivered by these methods should be available instantly so as not to delay delivering the report to the user. Hence, the database calls invoked by one of the implementations were an obvious mistake. The only reason this had not been discovered before is that the calls were very quick; still, they were a potential source of delay in delivering DMI reports to the user.

As a solution to this issue we changed the implementation of this plugin to refresh the status every 5 minutes and keep it in an in-memory cache, so that subsequent calls to that method no longer trigger database calls.
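
Below is a minimal sketch of that pattern. The names (CachedStatusPlugin, queryStatusFromDatabase()) are hypothetical; the actual DC RUM plugin interface differs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the status is refreshed from the database at most once
// every 5 minutes by a background task, so getCommonStatuses() only ever reads
// the in-memory copy and report rendering never waits on the database.
final class CachedStatusPlugin {
    enum Status { UNKNOWN, OK, WARNING }

    private volatile Status cached = Status.UNKNOWN;

    CachedStatusPlugin() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::refresh, 0, 5, TimeUnit.MINUTES);
    }

    // Called while rendering every report; must return instantly.
    Status getCommonStatuses() {
        return cached;
    }

    private void refresh() {
        cached = queryStatusFromDatabase(); // the only place that touches the DB
    }

    private Status queryStatusFromDatabase() {
        return Status.OK; // stand-in for the real (and now infrequent) DB query
    }
}
```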

Problem Pattern #2: Unnecessary, Repetitive Data Queries when Rendering a Table Report

In the previous section we showed how dynaTrace helped us track down a non-optimal implementation of a plugin for the component showing the current status of the system. We continued our analysis by looking at which methods are called when generating a DMI report. Perhaps there was some unnecessary processing that the DMI report rendering engine could do without?

Figure 5. The getResultColumns() method is called too many times during a single report rendering

Figure 5 shows the Methods dashlet with the methods called during one /LSServlet PurePath, sorted by the number of times they were called. We were analyzing the PurePath for rendering the Applications Report (see Fig. 3), which consists of only two sections. What is surprising is that the methods of the GetResultColumns class would normally be called once per section to get the rendering specification of all columns present in the given section. As we discovered by looking at the PurePath Tree, however, these methods were called for every column of every section rendered in this report. For reports that take their data directly from the database instead of the cache, this implementation not only introduced unnecessary delay in rendering the report, but also generated extra, repeated database calls, which should otherwise be avoided.
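
The shape of the fix is simple. Here is a sketch using hypothetical stand-ins for the DMI report model (Report, Section, ColumnSpec): the column specification is resolved once per section and reused for every column.

```java
import java.util.List;

// Hypothetical stand-ins for DMI's report model; the real classes differ.
interface ColumnSpec { String name(); }
interface Section { List<ColumnSpec> getResultColumns(); }
interface Report { List<Section> getSections(); }

final class SectionRenderer {
    // After the fix: the column specification is fetched once per section and
    // reused for every column, instead of being looked up again per column
    // (which, for non-cached reports, repeated the same database query).
    void render(Report report) {
        for (Section section : report.getSections()) {
            List<ColumnSpec> columns = section.getResultColumns(); // one lookup per section
            for (ColumnSpec column : columns) {
                renderColumn(section, column);                     // no repeated lookups
            }
        }
    }

    private void renderColumn(Section section, ColumnSpec column) { /* draw one column */ }
}
```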

Problem Pattern #3: Unnecessary Wait

DC RUM is more than the reporting engine delivered by DMI upon user request. The CAS and ADS servers, which use DMI to render reports, are at all times busy processing data generated by the network probe (AMD). Since this processing happens behind the scenes, users are usually not aware of performance problems these servers might have. Such problems are also hard to spot during development, unless you have a tool that gives you better visibility into your application's performance.

Since data processing in CAS/ADS is invoked by the internal thread scheduler rather than by external (Web) calls, we had to additionally instrument the DataReceiveManager.wakeUp() method to start the PurePath (see Fig. 6). This method is called each time CAS/ADS starts to process new data delivered by the AMD.

Figure 6. The DataReceiveManager.wakeUp() method is instrumented to start a new PurePath
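
For reference, here is roughly what this entry point looks like. The class body below is a hypothetical sketch, and the instrumentation itself is a dynaTrace sensor configuration (a custom entry point) rather than a code change.

```java
// Hypothetical sketch of the scheduler hook. In dynaTrace, wakeUp() is marked
// as a custom entry point via a sensor rule, so every scheduler-triggered
// processing run starts its own PurePath without modifying the code.
public class DataReceiveManager {

    // Called by the internal scheduler each time the AMD delivers new data;
    // configured in dynaTrace as the PurePath entry point (see Fig. 6).
    public void wakeUp() {
        processNewData(); // the whole processing run is now traced end to end
    }

    private void processNewData() { /* crunch and aggregate the AMD data */ }
}
```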

As expected, the DataReceiveManager.wakeUp() method shows up at the top of the Entry Points dashlet when sorting by CPU Total time (see Fig. 7). When analyzing the PurePath hotspots of various DataReceiveManager.wakeUp() PurePaths, we noticed that there were often calls to methods that lasted exactly 2 seconds.

Figure 7. Analyzing PurePath hotspots shows 2-second sleep() method calls when processing data from the AMD

Further investigation revealed that some of the sleep() calls were waiting exactly 2 seconds, so we analyzed the source code of these methods. Because the data is processed by multiple threads in parallel, the main thread waits for the results; the waiting loop was set up to sleep() for 2 seconds whenever the results were not yet ready. When this code was written a few years ago, processing often took longer than 2 seconds, but nowadays the processing threads frequently finish much earlier.

Figure 8. Introducing a progressive sleep time in the main thread's waiting procedure

As a solution (see Fig. 8) we introduced progressive incrementing of the sleep time. With this change, the main thread can react much earlier to the results of those threads that finish well before the 2-second mark.
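
Here is a minimal sketch of the change; the method and constant names are reconstructed from the diff in Figure 8 and are not the exact DC RUM source.

```java
import java.util.Collection;

final class ResultWaiter {
    private static final long INITIAL_SLEEP_MS = 10;   // assumed starting value
    private static final long MAX_SLEEP_MS = 2000;     // the original fixed interval

    // Waits for the parallel processing threads. Instead of always sleeping a
    // full 2 seconds, the sleep time starts short (so fast workers are picked
    // up quickly) and grows progressively up to the old 2-second cap.
    void sleepWhileRunning(Collection<Thread> workers) throws InterruptedException {
        long sleepMs = INITIAL_SLEEP_MS;
        while (anyRunning(workers)) {
            Thread.sleep(sleepMs);
            sleepMs = Math.min(sleepMs * 2, MAX_SLEEP_MS); // back off towards 2s
        }
    }

    private boolean anyRunning(Collection<Thread> workers) {
        for (Thread t : workers) {
            if (t.isAlive()) {
                return true;
            }
        }
        return false;
    }
}
```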

Conclusions

By instrumenting DC RUM with dynaTrace, the development team added to their regular activities a powerful process for continuously improving the quality of DC RUM in areas that would otherwise be hard to spot. What the DC RUM development team appreciated about dynaTrace instrumentation is that it was the first tool to give them a real overview of this complex system. The dynaTrace PurePath technology also lets you dive deep into the operations executed in real-life scenarios. Unlike typical profilers, dynaTrace lets you monitor a running system without noticeably affecting its performance and without having to design special, focused testing scenarios before starting the system.

As a by-product of further improving DC RUM, the team also delivered an extension that uses the dynaTrace ADK to tag processed threads, along with sensor packs. Our clients who also own dynaTrace licenses can use these to monitor their DC RUM infrastructure and provide more detail in their support calls.


(This blog post is based on materials contributed by Andrzej Lecki from the DC RUM development team. Some of the screens presented are customized but deliver the same value as the out-of-the-box reports.)