In my years of experience with WebLogic monitoring, I have come up with a list of key metrics that help determine the health of my server farm. These early indicators allow me to take proactive steps instead of waiting for end users to complain.

The following screenshot shows one of my Dynatrace dashboards containing key health metrics captured through JMX:

Dynatrace Dashboard showing the health status of all main components in the WebLogic domain: Deployment State, Server Health, Threads, JVM, JDBC, JMS etc…
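Dynatrace collects these values over JMX from WebLogic's runtime MBeans. If you ever want to script a similar spot check yourself, the same attributes are reachable through the Domain Runtime MBean Server on the Admin Server. Below is a minimal Java sketch; the host, port, and credentials are placeholders, and it assumes a WebLogic JMX client library (for example wljmxclient.jar) is on the classpath. The snippets later in this post assume an MBeanServerConnection obtained this way.

```java
import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import javax.naming.Context;

public class WlsJmxConnect {
    public static void main(String[] args) throws Exception {
        // Domain Runtime MBean Server on the Admin Server (host, port, credentials are placeholders).
        JMXServiceURL url = new JMXServiceURL("t3", "adminhost", 7001,
                "/jndi/weblogic.management.mbeanservers.domainruntime");
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.SECURITY_PRINCIPAL, "weblogic");
        env.put(Context.SECURITY_CREDENTIALS, "password");
        env.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");

        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Entry point to all runtime MBeans in the domain.
            ObjectName service = new ObjectName("com.bea:Name=DomainRuntimeService,"
                    + "Type=weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean");
            for (ObjectName serverRt : (ObjectName[]) conn.getAttribute(service, "ServerRuntimes")) {
                System.out.printf("%s -> %s%n",
                        conn.getAttribute(serverRt, "Name"),
                        conn.getAttribute(serverRt, "State"));
            }
        } finally {
            connector.close();
        }
    }
}
```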

Metrics to become Proactive instead of Reactive

Monitoring is a proactive, not reactive, approach to system management. My philosophy is that IT should know about system issues before any customer ever calls to notify us. In fact, IT should be notifying internal and external customers that they may experience a decrease in performance before they notice the degradation themselves. This builds a stronger working relationship between IT and users/customers. It also decreases the amount of firefighting and reduces the number of people required to isolate and resolve system issues. While you must initially take the time to learn your system and gather data, it is an invaluable investment that pays huge dividends in terms of system efficiency.

The following is another dashboard I use to analyze a specific metric area. In this case it tells me my system is impacted by hogging/stuck threads on 2 of my 6 JVMs in the cluster:

Detailed information about Thread States makes it easy to identify problems such as stuck threads or even Workmanager capacity issues. This allows us to take proactive steps prior to customer complaints.

My 10 Key Metrics

Now, let me get into some of the key areas I personally monitor and explain why I monitor them. Note that they are not in any specific order. The list below is only partial; it provides an example of the type of data you can use to tune, troubleshoot, and learn how your system performs.

#1: JVM – Percent of time in Garbage Collection

Time spent in GC is a key indicator of how the application uses memory

GC is a stop-the-world process, so it is very important to verify that the system is not spending too much time in this state. This metric is also helpful for validating configuration changes and for capacity management. Factors that play into it include the number of system cores, the memory allocated, etc…
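Dynatrace charts this percentage directly, but the underlying number is easy to derive from the JVM's standard GarbageCollector MXBeans: sample the cumulative collection time at two points and divide the delta by the wall-clock interval. A minimal in-process sketch (the same data is also exposed remotely under java.lang:type=GarbageCollector,* when the platform MBeans are reachable over JMX):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeSampler {
    private long lastGcMillis = totalGcMillis();
    private long lastSampleNanos = System.nanoTime();

    private static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 if this collector does not report a time
            if (t > 0) total += t;
        }
        return total;
    }

    /** Percentage of wall-clock time spent in GC since the previous call. */
    public double percentTimeInGc() {
        long nowGc = totalGcMillis();
        long nowNanos = System.nanoTime();
        double elapsedMillis = (nowNanos - lastSampleNanos) / 1_000_000.0;
        double gcMillis = nowGc - lastGcMillis;
        lastGcMillis = nowGc;
        lastSampleNanos = nowNanos;
        return elapsedMillis > 0 ? (gcMillis / elapsedMillis) * 100.0 : 0.0;
    }
}
```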

#2: Execute Thread Counts

Watch the number of threads, as it gives an indication of how well your system runs

When WebLogic opens up too many threads to service the load, there is a decrease in performance. Threads take resources (CPU, Memory). This metric can be used for monitoring and also for capacity planning.
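If you want to pull these counts yourself, they live on each server's ThreadPoolRuntime MBean. A minimal sketch, assuming the MBeanServerConnection conn from the connection example earlier; the attribute names follow ThreadPoolRuntimeMBean, and the hogging-thread count is what lights up the thread dashboard shown above:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class ThreadPoolCheck {
    public static void report(MBeanServerConnection conn) throws Exception {
        ObjectName service = new ObjectName("com.bea:Name=DomainRuntimeService,"
                + "Type=weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean");
        for (ObjectName serverRt : (ObjectName[]) conn.getAttribute(service, "ServerRuntimes")) {
            // Each server exposes one self-tuning thread pool runtime MBean.
            ObjectName pool = (ObjectName) conn.getAttribute(serverRt, "ThreadPoolRuntime");
            System.out.printf("%s total=%s idle=%s hogging=%s queue=%s%n",
                    conn.getAttribute(serverRt, "Name"),
                    conn.getAttribute(pool, "ExecuteThreadTotalCount"),
                    conn.getAttribute(pool, "ExecuteThreadIdleCount"),
                    conn.getAttribute(pool, "HoggingThreadCount"),
                    conn.getAttribute(pool, "QueueLength"));
        }
    }
}
```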

#3: Workmanager Thread usage

Workmanager Threads ensure resources are properly assigned to applications

A Workmanager is used to limit resources or to ensure that the right application gets them. This is where you can validate that your Workmanagers are not capped at an inappropriate level, etc…
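The corresponding runtime data is exposed on WorkManagerRuntime MBeans, one per Workmanager and (on the Domain Runtime MBean Server) per managed server. A minimal sketch, again assuming conn from the earlier connection example; PendingRequests and CompletedRequests follow WorkManagerRuntimeMBean:

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class WorkManagerCheck {
    public static void report(MBeanServerConnection conn) throws Exception {
        // Pattern query over all Workmanager runtime MBeans in the com.bea domain.
        Set<ObjectName> workManagers =
                conn.queryNames(new ObjectName("com.bea:Type=WorkManagerRuntime,*"), null);
        for (ObjectName wm : workManagers) {
            System.out.printf("%s pending=%s completed=%s%n",
                    wm.getKeyProperty("Name"),
                    conn.getAttribute(wm, "PendingRequests"),
                    conn.getAttribute(wm, "CompletedRequests"));
        }
    }
}
```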

#4: JDBC

Determine whether you have proper sizing and isolate connection leaks in your app

Comparing Current Capacity with Current Capacity High allows you to validate that you have the correct amount of resources available to service the clients’ needs, and helps determine whether you need to increase or decrease the pool size. Connection Delay Time, meanwhile, can be used to gauge database responsiveness.
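These values come from the JDBCDataSourceRuntime MBeans. A minimal sketch, assuming conn from the earlier connection example; attribute names follow JDBCDataSourceRuntimeMBean, and LeakedConnectionCount is included because it is the quickest way to spot a connection leak in the app:

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class DataSourceCheck {
    public static void report(MBeanServerConnection conn) throws Exception {
        // One JDBCDataSourceRuntime MBean per data source per server.
        Set<ObjectName> dataSources =
                conn.queryNames(new ObjectName("com.bea:Type=JDBCDataSourceRuntime,*"), null);
        for (ObjectName ds : dataSources) {
            System.out.printf("%s active=%s capacity=%s capacityHigh=%s delayMs=%s leaked=%s%n",
                    ds.getKeyProperty("Name"),
                    conn.getAttribute(ds, "ActiveConnectionsCurrentCount"),
                    conn.getAttribute(ds, "CurrCapacity"),
                    conn.getAttribute(ds, "CurrCapacityHighCount"),
                    conn.getAttribute(ds, "ConnectionDelayTime"),
                    conn.getAttribute(ds, "LeakedConnectionCount"));
        }
    }
}
```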

#5: Application Health and Applications deployed

Actively monitor your application health and not just rely on the WebLogic Health Status

Validate that all deployments are targeted to the correct servers and are in an active state. I can’t count the number of times WebLogic said everything was Active but the server it was deployed to said “Failed.” IT can also start monitoring active sessions on the servers, which is great data for capacity planning.
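One way to check this without trusting a single health flag is to walk ServerRuntime → ApplicationRuntimes → ComponentRuntimes and look at each component's DeploymentState on every target server. A minimal sketch, assuming conn from the earlier connection example; a DeploymentState of 2 is typically the activated state, and note that WebLogic's internal applications show up in this list as well:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class DeploymentCheck {
    public static void report(MBeanServerConnection conn) throws Exception {
        ObjectName service = new ObjectName("com.bea:Name=DomainRuntimeService,"
                + "Type=weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean");
        for (ObjectName serverRt : (ObjectName[]) conn.getAttribute(service, "ServerRuntimes")) {
            for (ObjectName appRt : (ObjectName[]) conn.getAttribute(serverRt, "ApplicationRuntimes")) {
                for (ObjectName compRt : (ObjectName[]) conn.getAttribute(appRt, "ComponentRuntimes")) {
                    // DeploymentState: 0=UNPREPARED, 1=PREPARED, 2=ACTIVATED, 3=NEW (per ComponentRuntimeMBean).
                    System.out.printf("%s / %s : DeploymentState=%s%n",
                            conn.getAttribute(serverRt, "Name"),
                            conn.getAttribute(appRt, "Name"),
                            conn.getAttribute(compRt, "DeploymentState"));
                }
            }
        }
    }
}
```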

#6a: JMS Oldest Message Age

Old messages in a queue mean that your system can’t keep up with processing them

While it normally doesn’t matter how many messages are on a queue, it is a good idea to pay close attention to how old the oldest message is. A growing message age is usually a key indication of issues: either nothing is consuming the queue, or an excessive message dump has arrived and the system cannot keep up with the load (a capacity problem). This can be verified below.
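There is no single standard attribute for this, which is part of why a monitoring plugin is handy, but one way to compute it yourself is to browse the queue with a plain JMS QueueBrowser and look at the oldest JMSTimestamp. A minimal sketch; the t3 URL and the JNDI names of the connection factory and queue are placeholders, and be aware that browsing a very deep queue has its own cost:

```java
import java.util.Enumeration;
import java.util.Hashtable;
import javax.jms.Message;
import javax.jms.Queue;
import javax.jms.QueueBrowser;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.naming.Context;
import javax.naming.InitialContext;

public class OldestMessageAge {
    public static long oldestAgeMillis() throws Exception {
        // JNDI setup against a managed server; URL and names are placeholders.
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
        env.put(Context.PROVIDER_URL, "t3://managedhost:8001");
        Context ctx = new InitialContext(env);

        QueueConnectionFactory cf = (QueueConnectionFactory) ctx.lookup("jms/MyConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/MyQueue");
        QueueConnection qc = cf.createQueueConnection();
        try {
            QueueSession session = qc.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueBrowser browser = session.createBrowser(queue);
            qc.start();
            long oldest = Long.MAX_VALUE;
            Enumeration<?> messages = browser.getEnumeration();
            while (messages.hasMoreElements()) {
                Message m = (Message) messages.nextElement();
                oldest = Math.min(oldest, m.getJMSTimestamp());   // producer send time
            }
            return oldest == Long.MAX_VALUE ? 0L : System.currentTimeMillis() - oldest;
        } finally {
            qc.close();
        }
    }
}
```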

#6b: JMS Consumers

Make sure to monitor all key metrics for JMS

In the picture above, you can see how many consumers are on the queue. If no consumers are visible, messages are not being processed.
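The consumer and message counts are available on the JMSDestinationRuntime MBeans. A minimal sketch, assuming conn from the earlier connection example; attribute names follow JMSDestinationRuntimeMBean, and the interesting alarm condition is "messages present but zero consumers":

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class JmsDestinationCheck {
    public static void report(MBeanServerConnection conn) throws Exception {
        Set<ObjectName> destinations =
                conn.queryNames(new ObjectName("com.bea:Type=JMSDestinationRuntime,*"), null);
        for (ObjectName dest : destinations) {
            long consumers = ((Number) conn.getAttribute(dest, "ConsumersCurrentCount")).longValue();
            long messages = ((Number) conn.getAttribute(dest, "MessagesCurrentCount")).longValue();
            System.out.printf("%s consumers=%d messages=%d%n",
                    dest.getKeyProperty("Name"), consumers, messages);
            if (consumers == 0 && messages > 0) {
                System.out.println("  WARNING: messages are queued but nothing is consuming them");
            }
        }
    }
}
```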

#7: Cluster Server Alive Count

This metric ensures all servers in the cluster are talking and know about each other.
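AliveServerCount on the ClusterRuntime MBean gives you each member's view of how many cluster members it can see. A minimal sketch, assuming conn from the earlier connection example; expectedServers is whatever your cluster size should be (6 in my case):

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class ClusterCheck {
    public static void report(MBeanServerConnection conn, int expectedServers) throws Exception {
        // On the Domain Runtime MBean Server, each cluster member reports its own ClusterRuntime view.
        Set<ObjectName> clusters =
                conn.queryNames(new ObjectName("com.bea:Type=ClusterRuntime,*"), null);
        for (ObjectName cluster : clusters) {
            int alive = ((Number) conn.getAttribute(cluster, "AliveServerCount")).intValue();
            System.out.printf("%s alive=%d expected=%d%n",
                    cluster.getKeyProperty("Name"), alive, expectedServers);
            if (alive < expectedServers) {
                System.out.println("  WARNING: at least one cluster member is missing");
            }
        }
    }
}
```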

#8: Server Listen Address

Verify availability and responsiveness of the server listening port.

This metric allows you to know whether all servers are communicating properly with the Admin Server. If the listen address is not reporting properly (<host>/<IP>), the managed server is not communicating with the Admin Server, and you lose the ability to monitor and troubleshoot through the console. Normally this happens when the server is under extremely heavy load or, depending on your WebLogic version, because of a WebLogic bug.
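A simple way to script this check is to read each server's ListenAddress from its ServerRuntime MBean and then confirm the listen port actually accepts a TCP connection. A minimal sketch, assuming conn from the earlier connection example; the listenPort parameter is a placeholder (adjust per server if your ports differ), and ListenAddress is typically reported in host/IP form, so the sketch keeps the host part:

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class ListenAddressCheck {
    public static void report(MBeanServerConnection conn, int listenPort) throws Exception {
        ObjectName service = new ObjectName("com.bea:Name=DomainRuntimeService,"
                + "Type=weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean");
        for (ObjectName serverRt : (ObjectName[]) conn.getAttribute(service, "ServerRuntimes")) {
            String name = String.valueOf(conn.getAttribute(serverRt, "Name"));
            String address = String.valueOf(conn.getAttribute(serverRt, "ListenAddress"));
            // ListenAddress is usually reported as host/IP; keep the host part.
            String host = address.contains("/") ? address.substring(0, address.indexOf('/')) : address;
            if (host.isEmpty()) {
                System.out.printf("%s is not reporting a usable listen address (\"%s\")%n", name, address);
                continue;
            }
            boolean reachable;
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, listenPort), 2000); // 2-second timeout
                reachable = true;
            } catch (Exception e) {
                reachable = false;
            }
            System.out.printf("%s %s:%d reachable=%b%n", name, host, listenPort, reachable);
        }
    }
}
```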

#9: Server Running Time

Catch servers that keep restarting due to crashes by watching the total run time

This is a great metric to catch servers that crash and get restarted by Node Manager. 🙂
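ActivationTime on the ServerRuntime MBean is the server's start time in milliseconds since the epoch, so the running time is simply "now minus ActivationTime". A minimal sketch, assuming conn from the earlier connection example; the 15-minute threshold is an arbitrary example:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class UptimeCheck {
    public static void report(MBeanServerConnection conn) throws Exception {
        ObjectName service = new ObjectName("com.bea:Name=DomainRuntimeService,"
                + "Type=weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean");
        for (ObjectName serverRt : (ObjectName[]) conn.getAttribute(service, "ServerRuntimes")) {
            long activationTime = ((Number) conn.getAttribute(serverRt, "ActivationTime")).longValue();
            long uptimeMinutes = (System.currentTimeMillis() - activationTime) / 60_000;
            System.out.printf("%s up for %d minutes%n",
                    conn.getAttribute(serverRt, "Name"), uptimeMinutes);
            if (uptimeMinutes < 15) {
                // A suddenly small uptime usually means Node Manager restarted a crashed server.
                System.out.println("  WARNING: recently (re)started - check Node Manager and the server logs");
            }
        }
    }
}
```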

#10: Monitoring Time

Keep an eye on the overhead of your own monitoring

You want to limit monitoring so that you leave the maximum resources available for the system. I’m often asked how to figure out the right balance of system monitoring. Every system is different, so there is no magic number. A few key things to look at:

  • Are your servers properly sized? Is there enough CPU/Memory available?
  • Make sure you have monitoring in place on the individual servers so you can validate whether or not you are placing undue load on them.
  • With the Dynatrace agents you can compare response times to your monitoring intervals to see if you are straining the system or notice any performance impact.

Those pieces of information can help you determine the appropriate amount of monitoring. In a future blog, maybe we can cover these in more detail.
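One simple sanity check you can build into any home-grown poller is to time each collection cycle and compare it with the polling interval; if a cycle eats more than a few percent of the interval, the monitoring itself is becoming a load. A generic sketch; the one-minute interval and 5% threshold are arbitrary examples, and collectMetrics() is a stand-in for whatever MBean reads you do:

```java
public class PollingLoop {
    private static final long INTERVAL_MS = 60_000;   // one-minute polling interval (example)

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long start = System.nanoTime();
            collectMetrics();                          // stand-in for your JMX reads
            long costMs = (System.nanoTime() - start) / 1_000_000;
            double overheadPct = 100.0 * costMs / INTERVAL_MS;
            System.out.printf("collection took %d ms (%.1f%% of the interval)%n", costMs, overheadPct);
            if (overheadPct > 5.0) {
                System.out.println("  consider polling less often or collecting fewer attributes");
            }
            Thread.sleep(Math.max(0, INTERVAL_MS - costMs));
        }
    }

    private static void collectMetrics() {
        // ...gather whatever MBean attributes you monitor...
    }
}
```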

Follow My Lead: Start Monitoring Today

Hopefully, this has given you some helpful information about how to improve your system monitoring. For more information specifically about the plugin, check out my WebLogic Monitoring Plugin on the Dynatrace Community.

If you want to connect with me directly, please visit me on LinkedIn at: http://www.linkedin.com/in/toddaellis