I’ve been offering my help in analyzing performance data for quite a while now. Most of the time when analyzing PurePaths, load testing outputs or production log files, I find very similar problem patterns. This fact inspired us to automate problem detection in Dynatrace AppMon 6.5. Even though I think we cover a big part of common patterns, I am always on the lookout for something new – something I have not previously seen or, at the very least, something not seen very often. Something like the N+1 Asynchronous Thread Problem Pattern.
In early December I was invited by our Gdansk, Poland engineering team to visit their lab. Besides kicking off a cool new Meetup group called Dynatrace TechShare, some coaching sessions, and internal presentations, we dedicated two workshops to my “Share Your PurePath” program. It was great to see that our team in Poland is using our products the same way we expect our users to apply them: early in development, automated in the pipeline, as well as in production monitoring.
A big thank you to Damian and Marcin who allowed me to have a look at the data they captured when analyzing performance from DC-RUM, as well as our AppMon SaaS Orchestration Platform. In addition to several of the “usual suspects”, we found an example in Marcin’s AppMon data that I haven’t seen very often: N+1 Query Pattern for Asynchronous Threads.
Here is the main PurePath we looked at with some explanation on which columns to look at and what the data tells us:
Which problem did they try to solve?
When we identified that issue we discussed why they chose this implementation. The particular transaction here is supposed to deliver the status of all actively running Dynatrace AppMon instances that we host in Amazon EC2. The result is then visualized by the Web UI with a nice tile for each instance and the status information.
The initial implementation was all done in the main thread of the servlet container. An initial SQL statement returned the list of IDs of the active Dynatrace instances in the timeframe passed as an HTTP parameter. They then iterated through that list in a loop, querying the status of each Dynatrace AppMon instance individually. This resulted in the classical N+1 Query Problem towards the database.
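A minimal sketch of that original implementation, assuming hypothetical query names (the real schema and method names were not shown): one query for the ID list, then one additional query per instance, all on the servlet's main thread. The database calls are stubbed with constants here so the shape of the pattern is visible without a real JDBC connection.

```java
import java.util.*;

// Sketch of the classical N+1 pattern: 1 query for the list of IDs,
// plus N per-instance queries executed sequentially on the main thread.
public class NPlusOneSketch {
    // Stand-in for "SELECT id FROM instances WHERE active in timeframe"
    static List<Integer> queryActiveInstanceIds(String timeframe) {
        return Arrays.asList(1, 2, 3);
    }

    // Stand-in for the per-instance status query -- executed N times
    static String queryInstanceStatus(int instanceId) {
        return "RUNNING";
    }

    public static Map<Integer, String> collectStatuses(String timeframe) {
        Map<Integer, String> result = new LinkedHashMap<>();
        // 1 query for the ID list ...
        for (int id : queryActiveInstanceIds(timeframe)) {
            // ... plus one more query per instance: N+1 round trips total
            result.put(id, queryInstanceStatus(id));
        }
        return result;
    }
}
```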
This implementation was good in the early days because we only had a handful of instances running in EC2. The more environments we onboarded the longer that list grew and, therefore, this implementation started to get slower and slower. Classical data-driven performance issue!
In order to speed this up they decided to simply execute the “Query the Status for an Instance” in a separate thread resulting now in the pattern we found: “Threading N+1 Query Problem”.
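The "sped up" variant might have looked roughly like the following sketch (again with stubbed database calls and assumed names): the same per-instance status query, but each one submitted to a small fixed worker pool of five threads, matching the setup described above. Note that the total number of queries is unchanged; they just run five at a time.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the "Threading N+1 Query" variant: the same N status queries,
// now fanned out to a fixed pool of five worker threads.
public class ThreadedNPlusOne {
    static List<Integer> queryActiveInstanceIds() {
        return Arrays.asList(1, 2, 3, 4, 5, 6);
    }

    static String queryInstanceStatus(int instanceId) {
        return "RUNNING";
    }

    public static Map<Integer, String> collectStatuses() {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        try {
            // Submit one work item per instance -- still N queries in total
            Map<Integer, Future<String>> futures = new LinkedHashMap<>();
            for (int id : queryActiveInstanceIds()) {
                futures.put(id, pool.submit(() -> queryInstanceStatus(id)));
            }
            // Collect results; at most five queries run concurrently
            Map<Integer, String> result = new LinkedHashMap<>();
            for (Map.Entry<Integer, Future<String>> e : futures.entrySet()) {
                try {
                    result.put(e.getKey(), e.getValue().get());
                } catch (InterruptedException | ExecutionException ex) {
                    throw new RuntimeException(ex);
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```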
Additional problems detected with this approach
The original N+1 Query Problem wasn’t really addressed; it was just “outsourced” to asynchronous threads. The same SQL statements are still executed, just no longer in sequence: five run in parallel at any time, because five worker threads process the work items. The Dynatrace Database Dashlet indicates this clearly:
From the PurePath we can also see that the implementation didn’t use a shared database connection pool with a set of connections. Every SQL statement was executed on a generic connection, which means that the code couldn’t leverage any client-side result data caching, nor could they reuse existing open database connections.
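To illustrate the idea the PurePath showed was missing, here is a tiny, generic connection-pool sketch: connections are borrowed from a shared pool and returned for reuse instead of a fresh generic connection being used per statement. The connection type is left abstract; a real implementation would use a JDBC DataSource backed by a pooling library such as HikariCP or Apache DBCP rather than hand-rolled code like this.

```java
import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal shared-pool sketch: callers borrow an open connection and
// release it for reuse instead of opening/closing one per statement.
public class TinyConnectionPool<C> {
    private final BlockingQueue<C> idle;

    public TinyConnectionPool(Collection<C> connections) {
        this.idle = new LinkedBlockingQueue<>(connections);
    }

    // Borrow a connection, blocking until one is free
    public C borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    // Return the connection so other callers can reuse it
    public void release(C conn) {
        idle.offer(conn);
    }
}
```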
Because we learned that every request to that particular overview page will eventually consume the main servlet thread and five worker threads, I recommend always looking at two key metrics and correlating them: Number of Concurrent Threads vs. Incoming Requests.
The chart above shows us that there is a certain “base level“ of threads that are always executing. We also see spikes. These spikes are caused by specific requests to certain features that implement exactly the “N+1 Thread Pattern”. We could therefore change that graph: instead of only charting the total number of requests, we could split it into Requests by URI or Requests by Feature.
The fact that they only have a limited number of backend worker threads also explains why, in our initial PurePath, we saw a 47s gap until the first background thread executed. If multiple parallel requests come in, and all of them are competing for a handful of background threads, they ALL have to wait for threads to become available. That can take some time!
Tips and Tricks
One of my favorite charts in Dynatrace is the Layer Breakdown. It gives me an overview of Response Time split by logical layers of my application. In the following chart I am just viewing a short time frame, but the spike at the end occurred when we tried to execute that particular feature with the threading and SQL issue. The chart already highlights JDBC and our own package as the hotspots:
The greatest thing of all is that we do not even have to go through most of the steps I highlighted here. Dynatrace AppMon automatically detects these patterns. While we do not yet call it the “N+1 Thread Pattern”, we already detect the “Heavy Threading” and “N+1 SQL Query” patterns. Simply open the PurePaths view in the Web UI, filter on these patterns, and explore the PurePaths experiencing these issues. This will also work well with your application:
Next Steps: Optimizing the Implementation
I am very pleased my colleagues are open to sharing this story. Why? Because, while nobody is perfect, we have the tools to show us where we can optimize. In our meeting we discussed several options for improvement. Obviously the best approach would be to eliminate the initial N+1 Query Problem by executing a single SQL statement that queries the status of all 128 Dynatrace Server Instances in one roundtrip, instead of retrieving the same data with 128 individual queries.
This can then also be done in the main servlet thread because it will be much faster. It will also reduce the number of threads involved, giving us more threads for other incoming requests.
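The single-roundtrip idea might look roughly like this sketch. Table and column names are pure assumptions for illustration, since the real schema was not shown; the point is simply that a join (or an IN clause) replaces the 128 per-instance queries with one statement.

```java
// Hypothetical single-statement replacement for the N per-instance queries:
// one join between the active-instance list and the status table fetches
// every status in a single database roundtrip.
public class SingleQuerySketch {
    public static String buildStatusQuery() {
        return "SELECT i.instance_id, s.status "
             + "FROM instances i "
             + "JOIN instance_status s ON s.instance_id = i.instance_id "
             + "WHERE i.active = 1";
    }
}
```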
Marcin and team told me they will share their improved version so that we can then execute a “before/after” comparison, which I eagerly anticipate. A first immediate change they attempted was to double the number of available worker threads. While that obviously doesn’t solve the root of the problem it at least improved performance significantly. Keep in mind though that this only works if load is not too high. If you have a high load it simply means that you keep more threads busy all the time:
If you want to give this a try, obtain your own Personal License of Dynatrace AppMon & UEM by completing this form: http://bit.ly/dtpersonal. And if you have PurePaths you want analyzed, in case Dynatrace doesn’t automatically tell you what’s wrong, feel free to share them with me through my Share Your PurePath program.