In Part I of this blog I explained which metrics on the Web Server, App Server and Host allow me to figure out how healthy the system and application environment is: Busy vs. Idle Threads, Throughput, CPU, Memory, etc.

In Part II, I focus on the set of metrics captured from within the application server (#Exceptions, Errors, …) as well as on the interaction with the database (connection pools, roundtrips to the database, amount of data loaded, …). Most of the screenshots shown in this blog come from real performance data shared by our Dynatrace Free Trial users who leveraged my Share Your PurePath program, where I helped them analyze the data they captured. I also hope you comment on this blog and share your metrics with the larger performance testing community.

1. Top Database Activity Metrics

The database is accessed by the application. Therefore I capture most of my database metrics from the application itself, by looking at the executed SQL Statements:

  • Average # SQLs per User Over Time
    • If the number of SQLs per average user goes up, we most likely have a data-driven problem: the more data in the database, the more SQLs we execute
    • Do we cache data, e.g. Search Results? Then this number should not go up but rather go down, as data should come from the cache.
  • Total # SQL Statements
    • Should go up at most in line with the number of simulated users
    • Otherwise it is a sign of bad caching or of data-driven problems.
  • Slowest SQL Statements
    • Are there individual SQLs that can be optimized, either on the SQL level or in the database?
    • Do we need additional indices?
    • Can we cache result data of some of these heavy statements?
  • SQLs called very frequently
    • Do we have an N+1 Query Problem? (see the JDBC sketch after this list)
    • Can we cache some of that data if it is requested over and over again?
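
To make the N+1 Query Problem concrete, here is a minimal JDBC sketch. The ORDERS/ORDER_ITEMS tables and column names are purely hypothetical; the point is the number of SQL executions, not the domain model. The first method issues one statement per order, which is exactly the pattern that drives "SQLs called very frequently" up; the second loads the same data with a single JOIN:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Minimal sketch of the N+1 query anti-pattern and its JOIN-based fix.
// Table and column names are hypothetical.
public class NPlusOneExample {

    // Anti-pattern: 1 query for the orders + N queries for their items.
    // The "# SQLs per user" metric grows with the amount of data in the database.
    static void loadOrdersNPlusOne(DataSource ds, long customerId) throws SQLException {
        try (Connection con = ds.getConnection();
             PreparedStatement orders = con.prepareStatement(
                     "SELECT id FROM orders WHERE customer_id = ?")) {
            orders.setLong(1, customerId);
            try (ResultSet rs = orders.executeQuery()) {
                while (rs.next()) {
                    long orderId = rs.getLong("id");
                    try (PreparedStatement items = con.prepareStatement(
                            "SELECT name, price FROM order_items WHERE order_id = ?")) {
                        items.setLong(1, orderId); // executed once per order!
                        try (ResultSet itemRs = items.executeQuery()) {
                            while (itemRs.next()) { /* map the item */ }
                        }
                    }
                }
            }
        }
    }

    // Better: a single JOIN keeps the SQL count constant regardless of data volume.
    static void loadOrdersWithJoin(DataSource ds, long customerId) throws SQLException {
        try (Connection con = ds.getConnection();
             PreparedStatement stmt = con.prepareStatement(
                     "SELECT o.id, i.name, i.price FROM orders o " +
                     "JOIN order_items i ON i.order_id = o.id WHERE o.customer_id = ?")) {
            stmt.setLong(1, customerId);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) { /* map order and item */ }
            }
        }
    }
}
```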

The following screenshot shows a custom dashboard charting the number of database statements executed over time and on average per transaction/user:

Over time the number of SQLs per end user should go down as certain data gets cached. Otherwise we may have data-driven or caching problems.

The following screenshot shows my Database Dashboard, which provides several different diagnostics options to identify problematic database access patterns and slow SQLs:

Optimize individual SQLs but also reduce the execution of SQLs if results can be cached.
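
If results of heavy or frequent statements rarely change, caching them is what makes the "SQLs per user" number go down over time. Here is a minimal in-process sketch; runSearchQuery is a hypothetical stand-in for the code that actually executes the SELECT, and a production setup would use a proper cache with size limits and expiry instead of a plain map:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of caching search results in-process. runSearchQuery(...) stands
// in for the code that actually executes the SQL; a real setup would add size
// limits and expiry (e.g. via a caching library) instead of a plain map.
public class SearchResultCache {

    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    public List<String> search(String term) {
        // computeIfAbsent only hits the database on a cache miss, so repeated
        // searches for the same term no longer add to the SQL count.
        return cache.computeIfAbsent(term, this::runSearchQuery);
    }

    private String placeholder = "hypothetical";

    private List<String> runSearchQuery(String term) {
        // ... execute the SELECT for this search term and map the result ...
        return List.of(); // placeholder result
    }
}
```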

2. Top Connection Pool Metrics

Every application uses Connection Pools to access the database. Connection leaks, holding on to connections for too long, or improperly sized pools can result in performance problems. Here are my key metrics:

  • Connection Pool Utilization
    • Are the pools properly sized based on the expected load per runtime (JVM, CLR, PHP…)?
    • Are pools constantly exhausted? Do we have a connection leak?
  • Connection Acquisition Time
    • Is the pool configured correctly, with just the right number of connections?
    • Or do we see increasing Acquisition Time (the time it takes to get a connection from the pool), which tells us we need more connections to fulfill the demand? A simple way to measure this is sketched below.
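
Connection Acquisition Time is something an APM tool like Dynatrace captures automatically, but the idea is easy to sketch by hand: time the call that borrows a connection from the pool. The DataSource is assumed to come from your application server (e.g. via JNDI), and the 50 ms threshold is just an illustrative value:

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Minimal sketch of measuring connection acquisition time by hand, assuming you
// already have a pooled DataSource (e.g. looked up via JNDI from the app server).
public class ConnectionAcquisitionTimer {

    private final DataSource dataSource;

    public ConnectionAcquisitionTimer(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public Connection getTimedConnection() throws SQLException {
        long start = System.nanoTime();
        Connection connection = dataSource.getConnection(); // blocks if the pool is exhausted
        long waitedMillis = (System.nanoTime() - start) / 1_000_000;

        // In a healthy, correctly sized pool this stays close to zero.
        // Rising values under load mean transactions queue up for connections.
        if (waitedMillis > 50) { // threshold chosen only for illustration
            System.err.println("Waited " + waitedMillis + " ms for a pooled connection");
        }
        return connection;
    }
}
```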

The following screenshot shows a custom dashboard with JDBC Connection Pool Metrics captured from WebLogic via JMX:

Are connection pools correctly sized in relation to incoming transactions? Do we have connection leaks?

The following screenshot shows a Database Dashboard automatically calculating key metrics per connection pool:

Acquisition Time tells us how long a transaction needs to wait to acquire the next connection from the pool. This should be close to zero.

3. Error Detection

When an application starts throwing errors under load, it is time to take a closer look at what these errors are. I look at the following metrics:

  • # of Exceptions
    • How many Exceptions are thrown overall?
    • What’s the ratio of Exceptions to Log Messages? (see the sketch after this list)
    • Do we have lots of “framework internal” exceptions caused by configuration problems?
  • # of Log Entries
    • How many log entries do we write?
    • Do these log statements make sense to the developers, or are they just wasting file space?
    • Is the right information logged?
  • # of Errors
    • Do Errors increase under a certain load? Where is the breaking point?
    • How many HTTP 5xx, 4xx, 3xx do we have?
    • Are they all caused by load or maybe deployment problems?
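
The gap between thrown Exceptions and written Log Messages referenced above typically comes from code, often deep inside frameworks, that catches exceptions without ever logging them. Here is a minimal sketch of the two patterns that make these counts diverge; lookupConfigValue is a hypothetical call that fails on bad configuration:

```java
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal sketch of why the Exception count and the Log Message count diverge.
// lookupConfigValue(...) is a hypothetical call that throws on bad configuration.
public class ExceptionHandlingPatterns {

    private static final Logger LOG = Logger.getLogger(ExceptionHandlingPatterns.class.getName());

    // Pattern 1: the exception is caught and logged -> shows up in both metrics.
    public Optional<String> readSettingLogged(String key) {
        try {
            return Optional.of(lookupConfigValue(key));
        } catch (IllegalStateException e) {
            LOG.log(Level.WARNING, "Could not read setting " + key, e);
            return Optional.empty();
        }
    }

    // Pattern 2: the exception is swallowed "internally" -> it counts as a thrown
    // Exception but never reaches a log file. Many of these per transaction are
    // typically a sign of configuration problems and also cost CPU.
    public Optional<String> readSettingSwallowed(String key) {
        try {
            return Optional.of(lookupConfigValue(key));
        } catch (IllegalStateException e) {
            return Optional.empty(); // no log entry written
        }
    }

    private String lookupConfigValue(String key) {
        // Hypothetical lookup that throws when the setting is missing or misconfigured.
        throw new IllegalStateException("Setting " + key + " is not configured");
    }
}
```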

The following screenshot shows a custom Dynatrace chart correlating the number of thrown Exceptions with the number of Log Messages written:

The number of Exceptions and Log Messages should correlate. The ratio tells us whether many Exceptions never make it to a log file and are handled “internally” in frameworks.

The following screenshot shows the Dynatrace Errors Dashlet, which makes it easy to analyze which types of errors were detected on which types of transactions:

Are errors caused by deployment mistakes or happening only under load due to application problems?

Your Key Metrics?

This concludes the set of top metrics I watch out for during load testing. There are more for specific use cases, and I am sure you have your own metrics that you always take a look at. Feel free to share them with us by commenting on this blog.

I also recorded a How to do Load Testing with Dynatrace Performance Clinic and put it on my YouTube Channel. Check it out to see how I analyze some of these screenshots live.