This post illustrates how to automate the analysis of shared infrastructure that supports both your back-end jobs and your applications.
Last week Gerald, one of our Dynatrace AppMon super users, sent me a PurePath as part of my Share Your PurePath program. He wanted my opinion on the high I/O time they sporadically see in some of their key transactions on their Adobe-based documentation management system. The hotspot he was trying to understand was easy to spot in the PurePath for one of the slow transactions: creating a local file took a very long time!
To find out which file took that long to create, I asked Gerald to instrument File.createNewFile, which then captured the actual file name for every transaction. It turned out the files were being written to a temporary directory on the local D: drive.
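To give a feel for what that instrumented call site looks like, here is a minimal sketch of the code pattern the File.createNewFile sensor would see. The class and helper names are hypothetical; the point is that the file name passed to the constructor is exactly the value the instrumentation captured per transaction.

```java
import java.io.File;
import java.io.IOException;

public class TempFileDemo {
    // Hypothetical helper mirroring the instrumented code path: create a
    // file in a temp directory and return its absolute path, which is the
    // value the File.createNewFile sensor captured for each transaction.
    static String createTempFile(File dir, String name) throws IOException {
        File f = new File(dir, name);
        f.createNewFile();           // the instrumented hotspot
        return f.getAbsolutePath();  // the captured file name
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        String path = createTempFile(dir, "demo-" + System.nanoTime() + ".tmp");
        System.out.println("created: " + path);
        new File(path).delete();     // clean up after the demo
    }
}
```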
This by itself didn't explain why creating files in that directory was slow. As a next step, we looked at the process and host metrics that OneAgent delivered for that machine to see whether anything looked suspicious.
The Process Health view indicated very aggressive memory allocation going on within that JBoss instance. Multiple times a minute, the Young Generation heap spiked to almost 5GB before garbage collection (GC) kicked in. These spikes coincided with high CPU utilization of that process, which makes sense because the GC needs to clean up memory and that can be very CPU intensive. Another thing I noticed was a constantly high number of active threads on that JBoss instance, which correlated with the high volume of transactions being actively processed:
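If you don't have a monitoring agent at hand, the same underlying signals are exposed by the JDK's own management beans. The following sketch (not the Dynatrace Process Health view, just plain JMX) prints per-pool heap usage and cumulative GC counts and times; sampled once a minute, the same 5GB Young Generation spikes and GC churn would show up here too.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class HeapPoolSnapshot {
    // Collect the names of all heap memory pools (e.g. the young and old
    // generation pools, whose exact names depend on the collector in use).
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Per-pool heap usage: spikes here match the Process Health chart.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                MemoryUsage u = pool.getUsage();
                System.out.printf("%s: used=%d max=%d%n",
                        pool.getName(), u.getUsed(), u.getMax());
            }
        }
        // Cumulative GC activity: rapidly growing counts point to the
        // aggressive allocation pattern described above.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```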
Looking at the Host Health metrics just confirmed what I'd seen in the process metrics: CPU spikes caused by heavy GC triggered by those memory allocations. I attributed the high number of page faults to the same memory behavior. Since we were dealing with high disk I/O, I also looked at the disk metrics, especially for drive D:. There seemed to be no real problem there, and consulting the sys admin brought no new insights either.
As there was no clear indication of a general I/O issue with the disks, I asked Gerald to look at file handles for that process and on that machine. Something had to be blocking or slowing down file I/O when files were created in that directory. Perhaps other processes on the same machine that were not yet monitored by Dynatrace, or some background threads on that JBoss instance, held too many open file handles, leading to the strange I/O wait times in the core business transactions.
Background scheduler to blame!
It turned out that looking into background activity was the right path to follow! Gerald created a CPU sample of that JBoss instance using Dynatrace. Besides capturing PurePaths for active business transactions, it is really handy that we can also create CPU samples, thread dumps, or memory dumps for any process that has a Dynatrace AppMon Agent injected.
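For readers without an agent in place: a rough, JDK-only stand-in for what a CPU sample or thread dump surfaces is available via ThreadMXBean. The sketch below prints per-thread CPU time plus a truncated stack snapshot for every live thread; a background thread burning CPU in a delete loop would stand out here as well.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadSnapshot {
    // Dump every live thread with its CPU time and up to maxFrames stack
    // frames; returns the number of threads seen.
    static int dump(int maxFrames) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int count = 0;
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            long cpuNanos = threads.isThreadCpuTimeSupported()
                    ? threads.getThreadCpuTime(info.getThreadId()) : -1;
            System.out.printf("\"%s\" cpu=%d ns%n", info.getThreadName(), cpuNanos);
            StackTraceElement[] stack = info.getStackTrace();
            for (int i = 0; i < Math.min(maxFrames, stack.length); i++) {
                System.out.println("    at " + stack[i]);
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(dump(5) + " threads dumped");
    }
}
```

This is a single snapshot, not a statistical CPU sample; a sampler repeats this at a fixed interval and aggregates which frames show up most often.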
The CPU sample showed a very suspicious background thread. The following screenshot of the Dynatrace CPU sample highlights that one background thread is not only causing high file I/O through File.delete operations, but also high CPU through Adobe's WatchFolderUtils.deleteMarkedFiles method, which ultimately deletes all these files. Some Googling taught us that this method is part of a background job that iterates through that temporary directory on D:. The job looks for files that match a certain criterion, marks them, and eventually deletes them.
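Adobe's actual WatchFolderUtils implementation is internal, but the mark-and-delete pattern the CPU sample pointed at can be sketched as follows (class, method, and criterion here are hypothetical). With many leftover files, both the scan and the per-file File.delete calls burn CPU and disk I/O on every run:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class WatchFolderSweep {
    // Sketch of a mark-and-delete cleanup pass: scan the watch folder,
    // collect files older than the cutoff ("mark"), then delete them.
    // Returns the number of files actually deleted.
    static int sweep(File dir, long olderThanMillis) {
        File[] entries = dir.listFiles();
        if (entries == null) return 0;          // dir missing or not a directory
        long cutoff = System.currentTimeMillis() - olderThanMillis;
        List<File> marked = new ArrayList<>();
        for (File f : entries) {                // "mark" phase: full scan every run
            if (f.isFile() && f.lastModified() < cutoff) {
                marked.add(f);
            }
        }
        int deleted = 0;
        for (File f : marked) {                 // "delete" phase: the File.delete hotspot
            if (f.delete()) deleted++;
        }
        return deleted;
    }
}
```

Note that the scan cost grows with the number of leftovers, which is exactly why the backlog described below made each run progressively slower.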
A quick chat with the Adobe administrator resulted in the following conclusions:
- The file cleanup task is scheduled to run every minute, which is probably too frequent.
- Very often the task can't complete within one minute, which leads to a backlog and a clash with the next cleanup run.
- Due to a misconfiguration, the cleanup job didn't remove all the files it was supposed to. That led to many "leftovers" that the watch utility had to iterate over every minute, making each run take even longer.
The longer the system was up and running, the more files were left over in that folder. This increased the impact on the application itself, as file access to that folder was constantly blocked by the background job.
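How Adobe's scheduler is actually configured isn't visible to us, but the clash pattern above maps directly onto the fixed-rate vs. fixed-delay distinction in Java's standard scheduler. With scheduleWithFixedDelay, the next sweep starts only after the previous one finishes, so a slow run can never pile up behind itself, unlike a fixed-rate "run every minute" trigger. A minimal sketch (interval names and the sweep body are placeholders):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class CleanupScheduler {
    // Schedule a sweep so that each run starts only delaySeconds after the
    // previous run COMPLETED, ruling out overlapping cleanup passes.
    static ScheduledFuture<?> start(ScheduledExecutorService exec,
                                    Runnable sweep, long delaySeconds) {
        return exec.scheduleWithFixedDelay(sweep, 0, delaySeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        start(exec, () -> System.out.println("sweep at " + System.nanoTime()), 60);
        Thread.sleep(100);   // let the first run fire, then stop the demo
        exec.shutdownNow();
    }
}
```

Lengthening the interval and fixing the matching criterion (so leftovers actually get removed) attack the other two conclusions from the list above.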
Dynatrace: Monitoring redefined
While this is a great success story, it shows how important it is to monitor all components that share infrastructure with your applications. Gerald and his team were lucky that this background job ran inside the same JBoss instance as the application, and that this instance was monitored with a Dynatrace AppMon Agent. If the job had been running in a separate process, or even on a separate machine, root-cause analysis would have been much harder.
Exactly for that reason we extended the Dynatrace monitoring capabilities. Our Dynatrace OneAgent not only monitors individual applications, but also automatically monitors all services, processes, network connections, and their dependencies on the underlying infrastructure.
Applying artificial intelligence on top of that rich data allows us to find such problems automatically, without experts like Gerald and his team having to perform these forensic steps.
The following screenshot shows how Dynatrace packages such an issue in what we call a "Problem Description", including the full "Problem Evolution". The Problem Evolution is like a film strip showing which data points Dynatrace automatically analyzed to identify the root cause. The architectural diagram also shows how your components relate to and impact each other, all the way down to the bottom line: the end user, performance, and your SLAs.
If you are interested in learning more about Dynatrace, the new OneAgent, and artificial intelligence, simply try it yourself. Sign up for the SaaS-based Dynatrace Free Trial.
If you want to explore Dynatrace AppMon, go ahead and get your own Dynatrace AppMon Personal License for your on-premise installation.