Hello! My name is Eric Proegler, and I’m privileged to provide a guest blog post. I am a performance tester by background, but I’ve been learning more about Dynatrace recently and have an experience to share.
I work for Forsythe Solutions Group in our Managed Services division. We’ve been using Jira from Atlassian as a ticketing application for years. Jira runs as a Tomcat process with a MySQL backend, and has a robust community of add-ons for things like workflow. We use several.
As we gear up to open our new retail datacenter, we’ve enhanced and upgraded our Jira instance. There have been a couple of hiccups, and we’ve had anecdotal reports of occasional poor performance. Our engineers use this critical system all the time, and we are bringing more teams on to it, so it looked like a good choice for an application to start Dynatracing with.
Each of our engineers has a “Morning Notes” ticket, where they document their planned work each day for review on our daily operations call. They log the ticket handles of the planned work, along with some notes about each work task.
The daily entries from these tickets are pulled into a status page along with completed tickets, tickets due/overdue, and summaries of time entered for each day, displayed for each engineer. The entries are parsed so that each ticket handle can be hyperlinked, allowing users to click through to ticket details. This is the main source of team status and everyone uses it, particularly as we prepare for our morning operations call.
Ticket Load Times ranging from 7- 40s
If your Dyna-Sense is tingling (and telling you this sounds like a lot of database work), you are correct. This status page loads in 3-5 seconds under good conditions. When I first looked at this – conditions were not good. Responses exceeding 15-20 seconds with some timeouts reported; even with engineers, the reports were vague. During the 30 minutes before the operations call when everyone was updating, I saw unpredictable performance as well, measuring consecutive response times ranging from 7 to 40 seconds.
I started with looking at longest Response Times for PurePaths, and could see a cluster of Jira ticket requests (for opening a ticket) taking as long as 24 seconds. Looking at the Transaction Flow for the longest running one, there were 15,837 calls to the DB and 30 seconds of DB time recorded.
Excessive Database Access for opening a single ticket
Drilling down to Database Dashlet, I took a look at what these 15,837 queries were.
The first things I noticed are zero failures and no long-running queries; notice that Exec Max is 66ms. The issue here seems to be the number of calls, not bad queries. But why so many? Let’s look at those.
SELECT 1 is a common mechanism for testing a MySQL database connection before issuing a query. It will cause the database engine to return “1” without actually looking up any data. It appears there is one of these for each “real” query (7,917 *2 = 15,834). Perhaps we should consider the high number of executions a peek into what can happen in abstracted/interpreted code.
Next, I noted that the average for the three most commonly issued queries are barely any longer than the response time for the query that doesn’t even pull data. With executions this fast, the queries are fast and light, but again, there are just a lot of them. The database server was not hard resource-stressed, because these queries were issued serially and weren’t enough to trip CPU alerts.
Root Cause: Ever Growing JIRA Ticket from 2012
I looked at the Jira ticket that had taken 24 seconds to load. What I found was that this was one of our busiest engineer’s daily ticket, and it had entries going back to the beginning of 2012. That meant there were several hundred time entries on the ticket that had to be loaded, each of them with several time lines that might be ticket handles.
Before the DBA could get back to me on my hypothesis, our data center manager saw this analysis, and asked his engineers to close their daily tickets and create new ones. In addition to having the long-running requests stop appearing in the PurePaths list, the performance of the status page improved to under 5 seconds. Anecdotally, performance smoothed out for everyone, so the DBA stopped looking.
The PurePath makes the difference
I think Dynatrace and especially PurePath distinguished itself in a couple of important respects here. Other monitoring approaches didn’t find this issue. Resource monitoring, long-running query reports, and other passive instrumentation didn’t see this issue. Seeing what exactly was slow, and what occurred to cause it shortened troubleshooting tremendously. We didn’t write this code, so we didn’t really know where to look.
Speaking of which, Dynatrace replaced anecdotal reports of occasional slow performance with facts. The value of knowing where to look is huge. I’ve spent significant hours in my career chasing phantoms of “the system is slow sometimes.” This time, I got to skip ahead to seeing what the actual problem was. Thanks, Dynatrace!