As a services architect I have visited a plethora of customers & mature IT shops from around the world. The following story is from one of my latest engagements at a prominent Belgian company that hosts various newspapers using an Apache/PHP infrastructure that is running Drupal. This organization had been struggling with performance issues for a while but had no way to dive into the transactions to analyze them on method level. This was all about to change!

We found major performance hotspots around excessive calls to the database (N+1 Query Issue) as well as performance impacts caused by getimagesize() which was called for images returned from external services such as Facebook, Twitter and YouTube.

So – here is a summary what we did during during this two day engagement:

  • Downloaded and installed the latest version of Dynatrace Application Monitoring
  • Instrumented application processes
  • Performed an initial analysis and report back the findings about these hotspots

Mapping Transactions

The first two steps were executed in no time. The person I was working with download and installed Dynatrace on his own. The only thing we did when I came onsite is providing the latest version of Dynatrace which automatically updated all other components he had set in place before. This saved quite a lot of time. Imagine having to manually update an installation with a thousand agents!

Once all the agents were update and reconnected, the structure of the system was automatically detected, and the following picture showed us the flow of the transactions:

Transaction Flow showing how requests from End Users are executed by the deployed PHP application
Transaction Flow showing how requests from End Users are executed by the deployed PHP application

As we glance over this view, we can immediately notice a couple of items

  1. On average, around 32% of all transaction time was spent in the MySQL database. There were around 4000 requests hitting the Apache servers per minute, generating roughly 255 000 calls to MySQL (or about 65/request).
  2. There were server-side calls made to ap.twitter.com – of which some were failing – and graph.facebook.com.
  3. The average transaction time was about 1s, this also includes very fast transactions and static content. It turned out there were individual requests that were taking a couple of minutes as well. As we will see later, these slow response times were caused by a combination of various factors, including the two we see above.

Let’s start analysing a bit deeper.

Database access

When we look at the 32% contribution and 255 000 calls per minute to MySQL, we have to keep in mind that this number is averaged out across all server-side transactions. This means is that there will be transactions included that will have no database access whatsoever (think static content). More importantly, this also means that there will be transactions where the contribution and number of calls will be a lot higher than the average 65. So let us take a look at the slowest transactions.

PurePath of a single requests visualizes the top hotspot: Database Access
PurePath of a single requests visualizes the top hotspot: Database Access

In the screenshot above we focused on a particular end-user transaction that took 54 seconds to complete. Here we can see that this transaction called a particular MySQL database a whopping 2094 times, which accounted for 41 seconds or 75% of the total time. From there we can simply drill down to the database calls that were executed by this particular PurePath:

Database Dashlet shows full SQL and makes it easy to spot those that are very slow
Database Dashlet shows full SQL and makes it easy to spot those that are very slow

It was as simple as a right mouse click and to select the Drill Down to Database to see this view. Not visible in this screenshot, but we have access to the entire SQL statement. Being able to provide this information to application developers and DBAs is extremely valuable.

Furthermore we discovered that several calls were executed multiple times within the same transaction with the same parameters. Consider using server side caching to get rid of redundant calls.

For queries with different parameters they could be grouped together using an IN(val1, val2, valn) clause. Evaluate what is best for your performance! Check the number of executions and usage of indexes for different options. Especially when it comes to the database the results can be very diverging. The PurePath gives you a very good insight into what’s happening. Even though these statements are quite fast on their own, the fact that there are a couple of thousand redundant calls this puts unnecessary load on the database server.

Classical 1+N Query Problem where same SQL is executed all over again with different bind values
Classical 1+N Query Problem where same SQL is executed all over again with different bind values

Server-side calls to public services

One of the other items noted above was calls that were made from the server to the Twitter API. When we take all the PurePaths that make calls to this API (which is easily achieved by right-clicking on the external Twitter service on the Transaction Flow and then selecting Focus on Passing Transactions) we can narrow down our result list.

PurePath also highlights another common problem patterns: Excessive Calls to External Service APIs such as Twitter
PurePath also highlights another common problem patterns: Excessive Calls to External Service APIs such as Twitter

Here we see a transaction that takes 66 seconds with 30 of those spent on 46 calls to api.twitter.com. With Dynatrace, we can then look at the call stack of this transaction – or what we call the PurePath Tree. Here we see that each call takes around 650ms to execute.

Calls to twitter were not the only ones that went outside of the datacenter. As you can see below there were method calls to getimagesize() which returns … well the image size of the filename past as a parameter. These calls were requesting images from Twitter, Facebook and YouTube, each taking considerable time (up to 1.5s and more), totalling in this case to 18s out of 41 for 19 calls.

Performance Impact of each getImageSize call is clearly visibile in PurePath
Performance Impact of each getImageSize call is clearly visibile in PurePath

The limitation of getimagesize() is that it downloads the entire file first to then return the dimensions. If you have a large image, then this will obviously take longer. Additionally, if this gets executed frequently it will eat up bandwidth. In this case it will be combined with the frequent calls to the Twitter API, further increasing network utilization. There are alternatives techniques available to getimagesize() such as to download only the first couple of bytes which will contain the dimensions in the header.

Next Steps: Try it Yourself or Engage with My Team!

The cool thing is, it’s super easy for you to analyse your system. Get your free trial of Dynatrace, install it on your system and find the performance hotspots of your application. Monitor your transactions and be part of our “Share your PurePath” program. We will be happy to send you our feedback and tips you how you can improve the performance of your application!

If you want our expertise instead of doing this on your own feel free to reach out to us. To give you more background. My team and I are working with companies that range from smaller businesses with a 5 head IT team to large multinationals where the IT staff alone would make up a larger enterprise. As much as the customers vary in size and type, so do the reasons for our involvement at the customer. While Dynatrace pretty much installs itself, any capable system admin that either was involved during the PoC (Proof of Concept) or is able to spend 5 minutes looking at the manual can get Dynatrace Application Monitoring up and running in no time. Our support is usually not required for that exercise.

What our services team does do however, and we take great pride in this, is support the customer with APM best practices. We combine great solution expertise with APM experience to deliver value to customers. And getting feedback like “thanks to our consultant on site, we were able to solve critical issues that have been plaguing us for years in less than a day” is the best confirmation for our efforts. That is what makes us tick: helping customers solve problems and show the true value of the best APM solution in the market. A Porsche is only as good as the driver, and if that driver also happens to know the engine inside-out he or she will certainly put down record lap times!

These couple of findings outlined above were of immense value to the operations team and the developers. In a couple of hours they were able to identify what had been slowing down their application for a long time and how they could fix it. If that isn’t an impressive lap time, I don’t know what is.