In my years of experience with WebLogic monitoring, I have come up with a list of key metrics that help determine the health of my server farm. These early indicators let me take proactive steps instead of waiting for end users to complain.
The following screenshot shows one of my Dynatrace dashboards containing key health metrics captured through JMX:
Metrics to become Proactive instead of Reactive
Monitoring is a proactive, not reactive, approach to system management. My philosophy is that IT should know about system issues before any customer ever calls to notify us. In fact, IT should be notifying internal and external customers of our systems that they may experience a decrease in performance before they notice the degradation themselves. This builds a stronger working relationship between IT and users/customers. It also decreases the amount of firefighting needed and reduces the number of people required to isolate and resolve system issues. While you must initially take the time to learn your system and gather data, it is an invaluable investment that pays huge dividends in system efficiency.
The following is another dashboard I use to analyze a specific metric area. In this case it tells me my system is impacted by hogging/stuck threads on 2 of my 6 JVMs in the cluster:
My 10 Key Metrics
Now, let me get into some of the key areas I personally monitor and explain why I monitor them. Note that they are not in any specific order. The list below is only a partial list, meant to illustrate the type of data you can use to tune, troubleshoot, and learn how your system performs.
#1: JVM – Percent of time in Garbage Collection
GC is a stop-the-world process, so it is very important to verify the system is not spending too much time in this state. This metric is also helpful for validating configuration changes and for capacity management. Factors that affect it include the number of system cores, the memory allocated, and so on.
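The percent-of-time-in-GC figure can be derived from two samples of the JVM's cumulative collection time taken a known interval apart. Here is a minimal sketch of that calculation; the function name and sample values are hypothetical, not part of any WebLogic or Dynatrace API:

```python
def gc_time_percent(prev_gc_ms, curr_gc_ms, interval_ms):
    """Percent of wall-clock time spent in GC between two JMX samples.

    prev_gc_ms / curr_gc_ms are cumulative collection times (in ms)
    sampled interval_ms apart.
    """
    if interval_ms <= 0:
        raise ValueError("sampling interval must be positive")
    return 100.0 * (curr_gc_ms - prev_gc_ms) / interval_ms

# Hypothetical samples 60s apart with 1.8s of GC in the window -> 3%.
pct = gc_time_percent(prev_gc_ms=12_000, curr_gc_ms=13_800, interval_ms=60_000)
```

Trending this percentage over time is what makes it useful for validating configuration changes: a tuning change should show up as a sustained drop, not just a momentary one.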
#2: Execute Thread Counts
When WebLogic opens too many threads to service the load, performance decreases, because threads consume resources (CPU and memory). This metric can be used for monitoring and also for capacity planning.
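As a rough illustration of turning this metric into an alert, the sketch below flags JVMs whose execute thread count crosses a warning line. The server names and the threshold are hypothetical; the right cap comes from your own capacity data, not from any WebLogic default:

```python
def thread_count_alerts(thread_counts, warn_threshold=400):
    """Return the names of JVMs whose execute thread count exceeds a
    warning threshold (threshold is illustrative, not a WebLogic default)."""
    return [name for name, count in sorted(thread_counts.items())
            if count > warn_threshold]

# Hypothetical cluster sample: only ms2 is over the line.
alerts = thread_count_alerts({"ms1": 120, "ms2": 450, "ms3": 380})
```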
#3: Work Manager Thread Usage
A Work Manager is used to limit resources or to ensure that the right application gets them. Here is where you can validate that your Work Managers are not capped at an inappropriate level.
#4: JDBC Connection Pool Usage
Comparing Current Capacity with Current Capacity High allows you to validate that you have the correct amount of resources available to service the clients' needs, and it helps you determine whether you need to increase or decrease the pool size. Connection Delay Time, meanwhile, can be used to gauge database responsiveness.
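One way to act on these pool numbers is a simple sizing heuristic that compares the high-water mark against the configured maximum. This is only a sketch with illustrative names and thresholds, not actual WebLogic API calls:

```python
def pool_sizing_hint(capacity_high, max_capacity):
    """Rough sizing hint from a pool's high-water mark and its
    configured maximum (names and thresholds are illustrative)."""
    if capacity_high >= max_capacity:
        return "increase"  # pool hit its ceiling; clients may be waiting
    if capacity_high < max_capacity * 0.5:
        return "decrease"  # over half the pool has never been used
    return "ok"

# Hypothetical pool: high-water mark of 20 against a maximum of 20.
hint = pool_sizing_hint(capacity_high=20, max_capacity=20)
```

The exact cutoffs are a judgment call per system; the point is that the high-water mark, not the instantaneous capacity, is what tells you whether the ceiling is right.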
#5: Application Health and Applications deployed
Validate that all deployments are deployed to the correct servers and are in an Active state. I can't count the number of times WebLogic said everything was Active but the server it was deployed to said "Failed." IT can also start monitoring active sessions on the servers; this is great data for capacity planning.
#6a: JMS Oldest Message Age
While it normally doesn't matter how many messages are on the queue, it is a good idea to pay close attention to how old the oldest message is. This is usually a key indicator of an issue, or a sign that an excessive message dump has hit the queue and the system cannot keep up with the load (a capacity problem). This can be verified with the consumer count (below).
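The age check itself is simple once you have the enqueue timestamps of the messages still on the queue. The helper and threshold below are hypothetical, purely to illustrate the logic:

```python
def oldest_message_age_s(enqueue_times, now):
    """Age in seconds of the oldest message on the queue (0 if empty),
    given epoch enqueue timestamps of the messages still queued."""
    if not enqueue_times:
        return 0.0
    return now - min(enqueue_times)

# Hypothetical queue: messages enqueued 5s, 65s and 600s ago.
age = oldest_message_age_s([995, 935, 400], now=1000)
too_old = age > 300  # e.g. alert once the oldest message passes 5 minutes
```

Note that an empty queue and a fast-draining queue both produce a low age, which is why this metric is more telling than the raw message count.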
#6b: JMS Consumers
In the picture above, you can see how many consumers are on the queue. If no consumers are visible, no messages are being processed.
#7: Cluster Server Alive Count
This metric ensures all servers in the cluster are talking and know about each other.
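A sketch of the corresponding check: compare the alive count against the number of servers configured in the cluster. The function and sample sizes are illustrative only:

```python
def cluster_alive_check(alive_count, configured_size):
    """Compare the alive-server count against the number of servers
    configured in the cluster; any shortfall means a member has dropped
    out or a partition is hiding it from the rest of the cluster."""
    missing = max(configured_size - alive_count, 0)
    return {"healthy": missing == 0, "missing": missing}

# Hypothetical 6-node cluster where only 4 members see each other.
status = cluster_alive_check(alive_count=4, configured_size=6)
```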
#8: Server Listen Address
This metric lets you know whether all managed servers are communicating properly with the Admin Server. If the listen address is not reporting properly (<host>/<IP>), the managed server is not communicating with the Admin Server, and you lose the ability to monitor and troubleshoot it through the console. Normally this happens when the server is under extremely heavy load or, depending on your WebLogic version, because of a WebLogic bug.
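A minimal sketch of that check, assuming you know which host each managed server should report; the hostnames below are hypothetical:

```python
def listen_address_ok(reported, expected_host):
    """Check the listen address a managed server reports (in the
    host/IP form noted above) against the host it should be running on.
    An empty or mismatched value suggests the managed server has lost
    contact with the Admin Server."""
    if not reported:
        return False
    return reported.split("/", 1)[0] == expected_host

# Hypothetical servers: one reports correctly, one reports nothing.
ok = listen_address_ok("app01/10.0.0.5", "app01")
bad = listen_address_ok("", "app01")
```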
#9: Server Running Time
This is a great metric for catching servers that crash and get restarted by Node Manager. 🙂
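The detection logic is simply that uptime should only ever grow between polls. A small sketch, with hypothetical sample values:

```python
def restarted_between_polls(prev_uptime_s, curr_uptime_s):
    """A current uptime sample lower than the previous one means the
    JVM came back up between polls (e.g. Node Manager restarted a
    crashed server that would otherwise look perfectly healthy)."""
    return curr_uptime_s < prev_uptime_s

# Hypothetical samples: a day of uptime, then only two minutes.
bounced = restarted_between_polls(prev_uptime_s=86_400, curr_uptime_s=120)
```

This is exactly the kind of silent restart that never generates a customer complaint but tells you something is wrong underneath.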
#10: Monitoring Time
You want to limit monitoring so that you leave the maximum resources available for the system itself. I'm often asked how to figure out the right balance of system monitoring. Every system is different, so there is no magic number, but here are a few key things to look at:
- Are your servers properly sized? Is there enough CPU and memory available?
- Do you have monitoring in place on the individual servers, so you can validate whether your monitoring is placing undue load on them?
- With the Dynatrace agents, you can compare response times against your monitoring intervals to see whether you are straining the system or causing any performance impact.
Those pieces of information can help you determine the appropriate amount of monitoring. In a future blog, maybe we can cover these in more detail.
Follow My Lead: Start Monitoring Today
Hopefully, this has given you some helpful information about how to improve your system monitoring. For more information specifically about the plugin, check out my WebLogic Monitoring Plugin on the Dynatrace Community.
If you want to connect with me directly, please visit me on LinkedIn at: http://www.linkedin.com/in/toddaellis