On our Dynatrace blog we talk a lot about problem patterns such as too many database statements, wasteful memory management leading to too much garbage collection, web performance worst practices, or the performance of cloud and virtualized environments.
I was recently intrigued by a couple of screenshots my colleague Reinhard showed me, which give a perfect overview of load and response time behavior based on a set of very basic performance metrics. For me the following dashboards are a must-have, especially if you are just starting with application performance management (APM) and performance monitoring, or if your current APM tool does not yet provide this dashboard. Whether you already have an APM product or rely on home-grown or open source software, I am pretty sure you can create a graph similar to this one:
The key metrics we are looking at here are the number of transactions on your system, split into four performance buckets: faster than 1s, between 1 and 3s, between 3 and 5s, and slower than 5s. Depending on the tools you have available, you can do this either for all of your requests or for important groups of URLs, e.g., “/search”, “/products”, “/purchase”. Looking at all requests allows you to identify general load trends, spikes and, very importantly, sudden changes in performance.
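If your tool does not offer this chart out of the box, the bucketing itself is easy to sketch. The following Python snippet is only an illustration under my own assumptions: the input is a list of (timestamp, URL path, response time in seconds) tuples taken from something like an access log, the function names are made up for this example, and a response time that lands exactly on a boundary is counted in the slower bucket.

```python
from collections import Counter

# Upper bounds in seconds, matching the static thresholds described above.
BUCKETS = [(1.0, "<1s"), (3.0, "1-3s"), (5.0, "3-5s")]
SLOWEST = ">=5s"


def bucket_for(response_time_s):
    """Return the label of the performance bucket for one response time."""
    for upper, label in BUCKETS:
        if response_time_s < upper:
            return label
    return SLOWEST


def url_group(path):
    """Group a URL by its first path segment, e.g. '/products/42' -> '/products'."""
    first_segment = path.split("?", 1)[0].strip("/").split("/", 1)[0]
    return "/" + first_segment


def count_by_bucket(requests, interval_s=60):
    """Count requests per (time window, URL group, bucket).

    requests: iterable of (timestamp_s, url_path, response_time_s) tuples.
    """
    counts = Counter()
    for timestamp_s, path, response_time_s in requests:
        window_start = int(timestamp_s // interval_s) * interval_s
        key = (window_start, url_group(path), bucket_for(response_time_s))
        counts[key] += 1
    return counts
```

Plotting these counts as a stacked area chart per time window gives you a graph of the kind described above; leaving out the URL group in the key gives you the view across all requests.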
Static vs. Dynamic Baseline: A more sophisticated approach is a dynamic baseline, where you do not use hard-coded thresholds as in this example but instead use historical data and a sound statistical approach to figure out what fast and slow really mean for your application and your key business transactions. Want to read up on this? Smart Baseline and Smart Alerting Explained.
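To make the difference to hard-coded thresholds concrete, here is a minimal sketch of one possible statistical approach: deriving the "fast" and "slow" limits from percentiles of historical response times. This is my own simplified illustration, not the algorithm any particular APM product uses; the choice of the 50th and 90th percentile and the function names are assumptions.

```python
import statistics


def baseline_thresholds(history_s, fast_pct=50, slow_pct=90):
    """Derive (fast, slow) limits in seconds from historical response times.

    history_s needs at least two values; fast_pct and slow_pct are
    percentiles between 1 and 99.
    """
    # quantiles(n=100) returns the 99 cut points between the percentiles.
    cuts = statistics.quantiles(history_s, n=100)
    return cuts[fast_pct - 1], cuts[slow_pct - 1]


def classify(response_time_s, thresholds):
    """Classify one response time against the derived limits."""
    fast_limit, slow_limit = thresholds
    if response_time_s <= fast_limit:
        return "fast"
    if response_time_s <= slow_limit:
        return "normal"
    return "slow"
```

In practice you would compute the thresholds per business transaction and recompute them regularly, so that "slow" for "/search" can mean something different than "slow" for "/purchase".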
Splitting it by Web and App Server
If you have a distributed application that includes Web servers and application servers, you should look at the load distribution across these servers. The following is an extended version of the dashboard shown above. It now also includes the traffic of 2 load-balanced Apache Web Servers and 6 Java Application Servers that receive their traffic from the two Apaches. The screenshots are taken from a customer running this on their hybris eCommerce Platform using dynaTrace:
The charts above show some very interesting facts:
#1: An actual performance improvement on Aug 18 at 2PM, when the customer deployed a performance fix into their hybris Platform. It is easy to spot the decline of red (=slow) transactions while the load stays the same
#2: At 10PM on the same day a short outage occurred. That is also easy to spot because overall traffic went down quite a bit
#3: The two Apaches seem to be equally load balanced and also reflect the performance improvement with a decline of the red area
#4: The load across the Java Application Servers is distributed almost equally. What is interesting is that the first application server doesn’t show the same improvement in performance as the others. It turned out that the fix was not yet deployed on that machine in the cluster
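An observation like #4 can also be checked automatically instead of by eye. The sketch below is a hypothetical helper, not something taken from the dashboards: it assumes you already have per-server transaction counts by bucket, computes the share of slow transactions per server, and flags every server whose share is more than a chosen factor above the cluster median.

```python
from statistics import median


def slow_share(counts):
    """Fraction of transactions in the 'slow' bucket for one server."""
    total = sum(counts.values())
    return counts.get("slow", 0) / total if total else 0.0


def lagging_servers(per_server_counts, factor=2.0):
    """Return the names of servers whose slow share exceeds factor x the median.

    per_server_counts: dict mapping a server name to a dict of bucket counts,
    e.g. {"app1": {"fast": 70, "slow": 30}}.
    """
    shares = {name: slow_share(c) for name, c in per_server_counts.items()}
    typical = median(shares.values())
    return sorted(name for name, share in shares.items()
                  if share > typical * factor)
```

With six application servers where five report around 5% slow transactions and one still reports 30%, this returns only the one server that stands out, which is the machine you would then check for a missed deployment.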
More Key Metrics for Web and Application Server
The last dashboard I found really interesting shows key metrics that you can easily query from your Web and application servers and that give you a great understanding of internal request handling, throughput and resource consumption:
We very often see deployments of web and application servers where nobody has thought about the number of required worker threads or the required network bandwidth. A look at these charts immediately shows you whether you have reached the current limit. Then you can discuss whether you need to add servers to your cluster or whether you can tweak the settings of your existing instances.
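The same "have we reached the limit?" question can be answered from raw numbers if you sample the busy worker count of a server over time. The following is a rough sketch under stated assumptions: busy_samples is a list of busy worker counts you collected yourself, max_workers is the configured worker limit of that server, and the 90% warning ratio is an arbitrary choice of mine.

```python
def saturation_report(busy_samples, max_workers, warn_ratio=0.9):
    """Summarize how close a server came to its worker thread limit.

    busy_samples: busy worker counts sampled over time (non-empty).
    max_workers: configured maximum number of workers (positive).
    """
    if max_workers <= 0:
        raise ValueError("max_workers must be positive")
    if not busy_samples:
        raise ValueError("busy_samples must not be empty")
    return {
        # Highest observed usage as a fraction of the configured limit.
        "peak_utilization": max(busy_samples) / max_workers,
        # Samples in which every worker was busy.
        "samples_at_limit": sum(1 for b in busy_samples if b >= max_workers),
        # Samples at or above the warning ratio (includes those at the limit).
        "samples_near_limit": sum(
            1 for b in busy_samples if b >= warn_ratio * max_workers
        ),
    }
```

If many samples sit at the limit, requests were most likely queuing, and that is the point where the discussion about more servers versus different settings should start.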
More Dashboards …
Reinhard built this and other dashboards for our hybris FastPack which our customers can download and test out. If you want to give this a try on your own application and you are not yet using a tool that can provide these metrics feel free to download and test the dynaTrace Free Trial. This also gives you access to additional downloads such as the ones Reinhard built.
Image attribution: Leonardo Rizzi