Working with APM solutions often puts us in the spotlight when applications have problems. Many of us follow a common series of “phases” with some key differences depending on how the application lifecycle is implemented both with processes and tooling.
This is a story about a customer I found particularly interesting because they showed the value of not only using APM in production, but also verifying in test, as well as using collaboration functionality between the stakeholders in the lifecycle (dev, test and prod).
In the end, response time for the most important business transaction was reduced from 13.3 seconds to around 4 seconds, which took them from 11.8% of transactions meeting the SLA to more than 90%.
The business side of a large Nordic insurance company noticed a drop in online business for their private insurances. They contacted the IT department, which was unable to identify a cause or even to confirm the issue. Everything looked green on their side. The business side then started testing the webpage manually with a stopwatch and found that the response times for some functions were very slow. They took their concerns and manual findings to the team responsible for the e-commerce part of the webpage, which confirmed: “right, this takes a lot of time but there are so many things that have to be done in the transaction.”
The business side pushed on to try to understand the impact of the problem: is it slow for real end users? Who was affected: everyone or specific geographies, certain time periods, only at peak load or always? They needed to confirm the issue from outside the company walls.
Confirming Suspicions and Understanding Impact
They implemented synthetic monitoring to investigate whether the issue could be seen from the Internet and from different geographies. The more straightforward way of troubleshooting would have been to go directly to real user monitoring, but their outsourcer raised some issues that delayed this. And sure enough, response time for the most important business transaction, producing a price for the insurance, was very high, around 12 seconds. Personally, I would be hitting refresh if things took more than 5 seconds, and studies show that most people would. That of course not only frustrates users, but also generates even more load on the system.
This confirmed that there was an issue. The next step was to verify whether real end users were actually affected, and how many. Application Aware Network Monitoring was deployed on the frontend of the application. Sure enough, real end users were badly affected by the problem!
Many studies show that when the load time of a page increases, the conversion rate decreases dramatically. The graph below is built from the extensive information extracted from our Dynatrace User Experience Management as a Service and covers more than 30 major retailers across the globe.
Being in a competitive market where the competition is just a click away definitely does not allow for these kinds of response times!
The insurance company also had a partner agreement with a bank that said that the ‘get price’ transactions should not take more than 4 seconds. Since the fulfillment of that SLA was not measured, no one knew if it was being fulfilled.
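To make the SLA measurement concrete, here is a minimal sketch of how the share of transactions meeting a 4-second threshold could be computed from a list of measured response times. The function name and the sample values are made up for illustration; they are not the customer's data and not how Dynatrace calculates it.

```python
from typing import Iterable

SLA_SECONDS = 4.0  # threshold taken from the partner agreement described above


def sla_compliance(response_times: Iterable[float],
                   threshold: float = SLA_SECONDS) -> float:
    """Return the percentage of transactions that finished within the threshold."""
    times = list(response_times)
    if not times:
        raise ValueError("no transactions to evaluate")
    # "Should not take more than 4 seconds" is read as: exactly 4.0 still passes.
    within = sum(1 for t in times if t <= threshold)
    return 100.0 * within / len(times)


# Hypothetical 'get price' response times in seconds (illustrative only).
sample = [3.2, 13.1, 12.8, 3.9, 14.0, 11.5, 4.0, 12.2]
print(f"{sla_compliance(sample):.1f}% of transactions met the SLA")
```

Tracking a number like this continuously is what lets you say whether an agreement is being fulfilled, instead of finding out when the partner complains.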
Getting to the Root Cause
Their service provider did not allow them to extend their insight further into the application by placing probe-based Application Aware Network Monitoring inside the datacenter, which ruled out the fault domain isolation they were using for other applications.
It was decided that Dynatrace Application Monitoring should take the place of the APM solution already in use (which had been unable to help with the problem) to see if it could help locate the problem. Dynatrace Application Monitoring was implemented, and after a short session (a few hours) an action list with 7 points was created, including updating the XML parser, optimizing garbage collection, load balancer settings and more.
It was quite easy to see that there was one major point that could potentially decrease the response time substantially, a point that was found using the Transaction Response Hotspot analysis function in literally two clicks. Click one is “Responsetime hotspots” for the specific Business Transaction (so looking at all executions of a specific type and analyzing them):
As can be seen, a lot of time was spent in Reflection and the XML processing API, but what is using them?
Second click took us here:
This answered that question. It is very important to be able to report not only a method that causes a lot of problems, but also *where* that method is called from (Copy could be used in MANY places, but which one is the place to improve?)
Using APM to Resolve the Issue More Quickly, From Production to Development
The information was extracted from the production system and shown to a developer, who could quickly relate it to a caching function they had implemented. This was unexpected, but he would certainly take a look. The information provided by Dynatrace Application Monitoring from production was sufficient to find the issue and gave the developer all the data needed to fix it. Having enough data with enough granularity was crucial in this case. APM solutions that don’t provide the full depth would have pointed them in the right direction, but not to the actual root cause. Also, remember that the response time for this transaction was normal, so it may not have been picked up by solutions that sample. Having *all* transactions helps a lot when looking not only for errors but also for bottlenecks!
Another interesting finding that was done with Dynatrace Application Monitoring is this:
See how many transactions are affected by garbage collection? That affects the performance of *all* transactions running in that .NET application pool. In this case, the pool housed not only this application but also other parts of their internet-facing application!
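The observation above can be illustrated with a small sketch that counts how many transactions overlap at least one garbage collection pause. It is a simplification under stated assumptions: the interval data is invented, the helper names are hypothetical, and it assumes a pause suspends every transaction running in the pool at that moment. It is not how Dynatrace computes its figures.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def share_hit_by_gc(transactions: List[Interval],
                    gc_pauses: List[Interval]) -> float:
    """Percentage of transactions that overlap at least one GC pause."""
    if not transactions:
        raise ValueError("no transactions to evaluate")
    hit = sum(
        1 for tx in transactions
        if any(overlaps(tx, pause) for pause in gc_pauses)
    )
    return 100.0 * hit / len(transactions)


# Hypothetical data: four transactions and two GC pauses in the same pool.
transactions = [(0.0, 1.0), (0.5, 2.5), (3.0, 3.4), (5.0, 9.0)]
gc_pauses = [(0.8, 1.2), (6.0, 6.5)]
print(f"{share_hit_by_gc(transactions, gc_pauses):.0f}% of transactions hit a GC pause")
```

The longer a transaction runs, the more likely it is to be caught by a pause, which is one reason slow transactions and heavy garbage collection tend to make each other worse.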
Using APM in Test to Verify the Fix
When the transaction was fixed, the same transaction was tested in their test environment monitored by Dynatrace Application Monitoring. This is what it looked like in test before the fix:
It was easy to spot the change: comparing the test results in Dynatrace Application Monitoring showed a huge improvement in CPU time. This meant the new release was now safe to put into production.
After the fix:
The release was put into production, and it is pretty easy to spot when the change went live.
For real end users the results now looked like this:
We also compared the results by looking at each individual transaction to make sure it all looked good:
Since it was also a good idea to look at the garbage collection that was affecting the application (and the others in the same application pool), this simple graph was created:
The performance issue had probably been introduced in a release put into production almost a year earlier. Because APM was not used in development, regression testing in test was not combined with APM, and production was not monitored, the issue was missed in every part of the application lifecycle, and it cannot really be determined when it reached production.
Needless to say the project was a great success!
Management and the business side have now seen the impact, and it is also possible to build a fair estimate of the cost that the problem created (comparing revenue now with revenue while the issue was there). This means greater interest and sponsorship to create and maintain processes around APM in the lifecycle at the company.
The company is now also looking into extending the monitoring of the transactions from the distributed side all the way into the mainframe, down to DB2 calls, with Dynatrace APM4MF by simply adding a z/OS agent. Maybe that will be the next chapter?
Call to Action: Do a Sanity Check on your Website!
If you are struggling with performance issues, don’t spend your time on guesswork. Check out our blogs on top performance problems and start with a sanity check on your external websites using services such as Performance Center Test. Once you have proof that individual pages have a problem, dig deeper by using tools such as Dynatrace Application Monitoring, where you can sign up for a 30 Day Free Trial.