If your application is hosted on virtual infrastructure, you need to monitor much more than just the user experience of your customers. This post walks through what it takes to get the full picture of the health of your virtual infrastructure.
About 14 years ago, a guy from HP (or was it Compaq, the company HP later acquired?) showed me how he had essentially reproduced our local IT infrastructure for troubleshooting. At home. In his apartment!
“Wait a minute, you set up four servers just for troubleshooting? How big is your apartment?” I asked. That’s when I first learned about virtualization.
Virtualization is awesome. However, as with all things awesome and new, virtualized environments face their own problems that don’t exist with physical hosts. Here’s the story:
Our environment here at Dynatrace consists of several VMware ESXi hosts, each of which hosts several virtual machines. Inside one of these VMs, we have our web application running. The application consists of a frontend server and multiple backend services. By the way, one of our favorite features is vMotion, which in our setup automatically migrates VMs between hosts when resources run low on a certain ESXi host.
Dynatrace Ruxit notified us that our environment was affected by a performance degradation problem.
We were immediately made aware that 159 user actions per minute were affected, which indicated that this was quite a serious problem. The CPU seemed to be saturated, so we went on to investigate the details of the problem. We assumed we just needed to assign more CPU resources to the VM.
Gaining insight into the problem
By looking at the problem details page, we quickly realized that it was not the virtual machine itself that lacked CPU power. The problem was that the CPU of the underlying ESXi host was exhausted.
A quick look at the Events section revealed what triggered this high CPU consumption incident—around the same time that Dynatrace Ruxit identified this problem (2:28 PM), a virtual machine called cpu-3-m5 was migrated to this ESXi host.
Any monitoring solution could have allowed us to check each guest’s CPU usage (assuming that we had made the required configurations, of course), but Dynatrace Ruxit allowed us to go further. We clicked the Consuming virtual machines button to view a list of all the VMs consuming this host’s resources. We didn’t have to configure anything, by the way.
Now we could confirm that it was, in fact, the virtual machine eT-m5-Win7-64bit that was having trouble. The CPU Ready time measurement for this virtual machine was extremely high. The measurement of 80% told us that eT-m5-Win7-64bit spent most of its time waiting for the hypervisor to assign it some CPU cycles. The other VMs on this ESXi host consumed nearly 96% of the host’s CPU, which left eT-m5-Win7-64bit with only 2.33% of it. That was not nearly enough for it to perform its tasks.
A word about numbers
A Ready time measurement above 5% is a symptom of CPU contention taking place on an ESXi host. A measurement above 10% is a serious problem. This measurement indicates that the host doesn’t have enough resources to satisfy the demand of all the virtual machines it hosts. As a result, VMs are competing for CPU cycles. The machines that lose this competition can’t perform their jobs.
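To make those thresholds concrete, here is a minimal sketch of how you might classify a CPU Ready percentage yourself. The 5% and 10% limits come straight from the paragraph above; the function name and the severity labels are our own illustrative choices, not part of any Dynatrace or VMware API.

```python
# Thresholds from the text above; names and labels are illustrative assumptions.
CONTENTION_THRESHOLD_PCT = 5.0   # above this: CPU contention on the host
SERIOUS_THRESHOLD_PCT = 10.0     # above this: a serious problem


def classify_cpu_ready(ready_pct: float) -> str:
    """Map a VM's CPU Ready percentage to a coarse severity label."""
    if ready_pct < 0:
        raise ValueError("CPU Ready percentage cannot be negative")
    if ready_pct > SERIOUS_THRESHOLD_PCT:
        return "serious"
    if ready_pct > CONTENTION_THRESHOLD_PCT:
        return "contention"
    return "ok"


if __name__ == "__main__":
    # The 80% reading from our incident, plus two made-up sample values.
    for sample in (80.0, 7.5, 2.0):
        print(f"{sample:5.1f}% CPU Ready -> {classify_cpu_ready(sample)}")
```

With the 80% reading from our incident, a check like this lands firmly in the "serious" bucket.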
Note that if you were to only monitor the load of this embattled virtual machine, you wouldn’t detect any significant issue. This is because the problem originates with the ESXi host, which is not capable of providing adequate processing cycles to this virtual machine.
Have you ever had similar vMotion problems?
How do you determine if your host’s resources are depleted?
And like us (even on Facebook, if you want).
Try VMware monitoring for free! You don’t need a credit card and you’ll have it up and running within five minutes!