Let’s face it, neat code doesn’t always mean improved web app performance. During my career I have worked with many developers who take pride in what they write but, unfortunately, focus too much on theory and forget about performance.
As frameworks become increasingly advanced, developers tend to lose touch with what is happening “under the covers”. What looks like a very clean piece of code on the surface might hide thousands of lines of inefficient code written by someone else, and web app performance suffers!
These days it is so easy to use open-source libraries and bolt your own code on top of them, but I have seen many cases where things start to go wrong once the system is put under load and fails to scale. Recent studies from large e-retailers tell us to focus on performance, as it has a direct impact on conversion:
I was recently called in to help a customer who had received user complaints about the poor performance of the checkout pages on their IBM e-commerce platform. No one could really determine whether it was a user perception issue or a genuine one. The bottom line was that they were potentially losing money if customers were abandoning the site.
Fortunately, they had Dynatrace installed on their production environment, making it easier for me, when I came on site, to help them determine what was going on.
Before attempting to resolve any issue, I needed to determine the extent of the damage. With Dynatrace this is very simple: charts help visualize the problem and allow you to step back a little for a fresh perspective. For IBM e-commerce, Dynatrace also provides the WebSphere Commerce FastPack with pre-configured dashboards.
I decided to build my own dashboard to chart the number of calls (in blue), average response time (orange), and the maximum response time (yellow) for the checkout pages.
As you can see below, when load increases the average response time stays fairly consistent. The maximum response time doesn’t fluctuate much either, and big spikes can be seen even when there is little traffic, so there is no correlation between load and slow response times. This indicates that some users get very long response times regardless of the load, so another factor must be influencing the response time.
Thanks to this chart we know that some of the pages take in excess of 400 seconds to process, but it doesn’t show how widespread the problem is. Moreover, the checkout process goes through several pages, so some of them might be worse than others.
Improved web app performance:
Divide & Conquer with Percentiles
This is why I built a second dashboard which provides more information on what is occurring. I charted all the individual steps and added percentiles for the response times. Again, in blue, I added the total number of requests, with the line chart showing the 99th, 95th, and 90th percentiles, plus the average response time. Because many customers don’t realise how widespread their performance issues really are, I also added the search page, as it was a core step in the user journey.
Now we can see that some pages are worse than others! For example, the billingView section is the worst offender, with response times in excess of 35 seconds for 1% of the user base at 16:00. That’s 10 users out of 1,000 in a single 1-hour period, which isn’t negligible! With the 90th percentile we can see that 100 users still experience response times beyond 20 seconds, which is not acceptable.
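To make the percentile figures concrete, here is a minimal sketch (not Dynatrace code, and with made-up response times) of the nearest-rank percentile statistic the dashboard charts, showing how a couple of slow outliers inflate the average while the percentiles pinpoint them:

```java
import java.util.Arrays;

// Minimal sketch: nearest-rank percentiles over a set of response times,
// the same kind of statistic charted on the dashboard. All data is hypothetical.
public class PercentileSketch {

    // Returns the p-th percentile (nearest-rank method) of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // 1-based rank of the p-th percentile, rounded up.
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical response times in seconds for one checkout step:
        // most requests are fast, two are pathologically slow.
        double[] responseTimes = {0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 21.0, 38.0};
        System.out.println("avg = " + Arrays.stream(responseTimes).average().orElse(0)); // 6.53
        System.out.println("p90 = " + percentile(responseTimes, 90)); // 21.0
        System.out.println("p99 = " + percentile(responseTimes, 99)); // 38.0
    }
}
```

The average of 6.53 seconds hides the fact that one user in ten waits over 20 seconds, which is exactly why the dashboard charts percentiles alongside the average.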
For a multi-million dollar business, disappointing even 5% of the users throughout the year means a potentially massive sales and revenue loss!
At this point, I could easily demonstrate that the checkout was indeed slow for quite a few users. I saved this dashboard because I wanted the customer to be able to measure improvements after I left. All the information is automatically saved in the Dynatrace performance warehouse and can be used for analysis and trending.
Now that I had proven that there was a problem, it was time to find out what was causing it.
Improved web app performance:
Hunting Problematic Code
Dynatrace really shines at showing you where the time is spent in your code. I have worked on many engagements where the problem was clearly highlighted in a matter of minutes but, as you will see, this situation was slightly more complicated. Still, we got there in the end!
As you can see below, something really jumps out! Almost 50% of the processing across all the JVMs is linked to JDBC. That’s a lot of database activity! I knew I was onto something!
The method hotspots dashlet works in a similar fashion but this time is more specific. It will tell you which methods are consuming the most resources.
Here we can see that socketRead is the number one method, and the caller breakdown section gives more information: 74% comes from oracle.net classes, confirming that the application spends significant time talking to the database.
At this point, I knew I had to look at the database more closely. I used the database dashlet, which gives aggregated information about all the SQL statements executed by the application in a given timeframe. On the screenshot below we can see that, over a 15-minute period, the application executed 12,000,000 SQL statements – roughly 13,000 statements per second, an incredible amount of traffic!
Clearly this application relies on the database far too much. The next step was to assess whether all the pages make these calls or only specific ones. I would normally right click and drill down to the PurePaths to see which requests generate the calls but, because there was so much going on, I decided to build a Business Transaction to give me a high-level view.
Business Transactions are great for selecting data (aka PurePaths), and can produce aggregated results broken down into different dimensions. Because Dynatrace captures every transaction end-to-end, we can create Business Transactions based on context data from HTTP Parameters, Method Calls, Return Values or even SQL statements executed. If you want to learn more about Business Transactions check out our YouTube Tutorials on Business Transactions!
I wanted to see which URLs were making the most database calls and how long they were taking.
Here is a screenshot of the one I built:
It selects all the PurePaths which are part of the checkout or the search page (I used a regular expression), and will calculate the number of database calls and total database time as defined in the calculate results table. To obtain the actual URL, I split my results by complete URI path.
Here is the output:
Now we are getting a clearer picture of what’s going on.
OrderShippingBillingView makes on average 1,905 calls to the database, with an average database time of 8.6 seconds. That’s for every single request to that URL! We can see that 155 requests generated 293,000 calls to the database! For comparison, the search page was invoked 3,477 times and generated 274,000 calls: nearly the same number of calls for far more page views.
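The aggregation this kind of Business Transaction performs can be sketched in a few lines (the URIs, call counts, and regular expression below are all hypothetical; Dynatrace does this internally against real PurePath data): select requests whose URI matches a pattern, then total up the database calls per URI.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Minimal sketch of the Business Transaction's aggregation: filter requests
// by a URI regular expression, then sum database calls per URI.
// All names and numbers are made up for illustration.
public class DbCallsByUrl {

    record Request(String uri, int dbCalls) {}

    // Returns uri -> {request count, total database calls} for matching URIs.
    static Map<String, int[]> summarize(List<Request> requests, Pattern select) {
        Map<String, int[]> byUri = new LinkedHashMap<>();
        for (Request r : requests) {
            if (!select.matcher(r.uri()).find()) continue; // not checkout/search: skip
            int[] agg = byUri.computeIfAbsent(r.uri(), k -> new int[2]);
            agg[0] += 1;           // requests seen for this URI
            agg[1] += r.dbCalls(); // total database calls for this URI
        }
        return byUri;
    }

    public static void main(String[] args) {
        Pattern checkoutOrSearch = Pattern.compile("Billing|Search");
        List<Request> requests = List.of(
            new Request("/OrderShippingBillingView", 1900),
            new Request("/OrderShippingBillingView", 1910),
            new Request("/SearchDisplay", 80),
            new Request("/HomePage", 5)); // filtered out by the regex
        summarize(requests, checkoutOrSearch).forEach((uri, agg) ->
            System.out.printf("%s: %d requests, %d DB calls, avg %.0f calls/request%n",
                uri, agg[0], agg[1], (double) agg[1] / agg[0]));
    }
}
```

Splitting the totals by URI is what makes the outlier obvious: a page averaging nearly two thousand calls per request stands out immediately next to one averaging under a hundred.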
This page was now on my hit list!
Improved web app performance:
Engaging Developers for PurePath Data Forensics!
Drilling down to specific PurePaths confirmed it for me!
At this point I engaged with the development team to help me understand how the code was structured and mainly why it was making so many calls.
After some investigation, I knew that the code relied primarily on the IBM e-commerce V7 framework, so it used EJBs heavily and should cache a lot of the data in the “dynaCache”. Nothing to do with Dynatrace here – the name resemblance is purely coincidental!
Other custom classes had been added on top of the framework, so I needed to take a closer look.
Because Dynatrace needs to keep the overhead of the agent as low as possible, certain classes such as EJBs are not instrumented by default. Enabling this in production is possible, but we gave it a shot in the development environment instead. I showed the development team how to install their own free-for-life Dynatrace Personal License on their local server.
The installation only took a few minutes. I showed them how to follow the configuration wizard and within seconds a WebSphere agent was reporting for duty!
We enabled the EJB sensor to see what was going on. Was the core IBM e-commerce framework at fault or the custom code? Was the dynaCache actually doing its job?
Once we ran the test harness, we found the relevant PurePaths and drilled down to the components dashlet. This screen shows all the EJBs invoked within the PurePath.
It confirmed what I was expecting: there are a lot of EJBs involved, and most of them belong to the core IBM objects. Now the big question is: what is getting cached and what isn’t?
Improved web app performance:
Data from all Transactions is Powerful — Compare to Learn
In the example below we examined two identical requests of our favourite page, OrderShippingBillingView. The first was taken just after the WebSphere instance was brought up, and the second after we ran the test several times.
Dynatrace captures 100% of the requests. There is no sampling or aggregation, so if a small number of users suffer from bad performance, the data is available and not hidden among averages!
By drilling down to the database dashlet from the PurePath, we can see exactly which statements have been executed. The first line, which shows 288 calls to the same statement, only appears during the initial load and not afterwards. This tells us that WebSphere is caching the data, so it is good news so far.
Unfortunately, things change for the other statements. The next one is called 239 times for each request. Surely there is room for improvement!
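One obvious improvement for a statement repeated hundreds of times per request is an application-side cache. Here is a minimal sketch of the idea (simulated in memory; this is not WebSphere’s dynaCache API, and the SKUs and counts are invented): the same lookup keyed and memoised so repeated calls stop hitting the database.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of caching a repeated lookup. The "database" here is
// simulated by a counter; names and numbers are hypothetical.
public class QueryCacheSketch {

    static int databaseCalls = 0; // counts simulated round trips

    // Simulated database lookup: each call is one round trip.
    static String fetchFromDb(String sku) {
        databaseCalls++;
        return "details-for-" + sku;
    }

    static final Map<String, String> cache = new HashMap<>();

    // Cached version: only the first lookup per key reaches the database.
    static String fetchCached(String sku) {
        return cache.computeIfAbsent(sku, QueryCacheSketch::fetchFromDb);
    }

    public static void main(String[] args) {
        // A statement executed 239 times per request, but over only 3 distinct keys.
        for (int i = 0; i < 239; i++) {
            fetchCached("SKU-" + (i % 3));
        }
        System.out.println("database calls: " + databaseCalls); // 3 instead of 239
    }
}
```

The win depends entirely on how many distinct keys the requests touch, which is why the size of the working set matters so much later in this story.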
Here is an example taken directly from the PurePath which shows which EJB has to go to the database and which one doesn’t.
The one in red must, but the one in green doesn’t, so we know the cache is doing its job.
From here, the developers clearly understood the value of using Dynatrace to help them find their bottlenecks. In this case, it wasn’t going to be an easy task because the whole framework was at fault. Basically, the code accesses the data via Java objects, so the database calls are invisible to the developers.
Of course the code looks neat:
But under the covers, the framework is probably making two calls to the database when a single one retrieving all the necessary data in one hit would be more efficient.
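To illustrate the difference (with a hypothetical schema simulated in memory; this is not the IBM framework’s actual code), here is a sketch of a neat-looking object traversal that costs one round trip per entity, next to a single joined fetch that retrieves the same data in one hit:

```java
import java.util.Map;

// Minimal sketch of "two calls hidden behind neat object access" versus
// "one joined query". Tables are simulated as in-memory maps and each
// load method counts one simulated round trip. All data is hypothetical.
public class JoinVsTraversal {

    static int roundTrips = 0;

    // Two "tables" simulated in memory.
    static final Map<Integer, String> orders = Map.of(1, "order-1");
    static final Map<Integer, String> shippingAddresses = Map.of(1, "address-1");

    static String loadOrder(int id)   { roundTrips++; return orders.get(id); }
    static String loadAddress(int id) { roundTrips++; return shippingAddresses.get(id); }

    // One simulated query joining both tables.
    static String loadOrderWithAddress(int id) {
        roundTrips++;
        return orders.get(id) + " / " + shippingAddresses.get(id);
    }

    public static void main(String[] args) {
        roundTrips = 0;
        // Framework-style traversal: looks like plain object access,
        // but costs two round trips.
        String a = loadOrder(1) + " / " + loadAddress(1);
        System.out.println(a + " in " + roundTrips + " round trips"); // 2

        roundTrips = 0;
        // Joined fetch: same data, one round trip.
        String b = loadOrderWithAddress(1);
        System.out.println(b + " in " + roundTrips + " round trip"); // 1
    }
}
```

Multiply that hidden extra round trip by the thousands of entity accesses on a page like OrderShippingBillingView and the 1,905 calls per request stop being surprising.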
Additionally, the customer has an extremely large database of products so there is very little chance that the data would fit in the cache anyway!
The good news for this customer is that the bug fixes IBM released for later versions mention performance, cache optimisation, and the reduction of unnecessary EJB calls several times, and they will be patching their application server soon. Now, with Dynatrace, they can see what is going on and, most importantly, they can accurately measure improvements or degradations!