APM Challenges of Operating in a Virtualized Environment

Today’s IT environments change constantly. Virtualization and agile development, both driven by ever-changing business requirements, lead to frequent application and infrastructure changes. This persistent change presents challenges to those charged with ensuring a good end-user experience.

One of the downsides to virtualization is the impact it has on application performance management (APM) solutions. As new hardware and software technology is introduced into the IT environment, IT needs to maintain a complete 360-degree view of application performance in the new environment in order to ensure that service delivery promises are kept and SLAs are met.

However, virtualization represents a significant “blind spot” for APM. The performance of applications running in this environment can be almost invisible to many legacy APM solutions, making it hard to isolate, analyze and fix performance problems as they occur. The loss of visibility when moving applications to a virtualized environment can dramatically affect the business outcome of a virtualization project, increasing the need for a flexible, extensible APM solution.

For this post, I’m going to focus on the APM challenges when implementing and running a VMware ESX server environment.

Many of the application monitoring approaches that have been in place for the last 10 years use physical server parameters (CPU and memory utilization, disk and network card “health checks,” etc.) as indicators of application performance. For instance, if users start complaining about long login times, a look at CPU utilization on each of the servers involved (the database server and the LDAP server) quickly shows that the LDAP server application has unusually high CPU utilization, which warrants further investigation. High CPU activity on the database server, by contrast, is quite normal.
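As a rough illustration of this legacy approach, here is a minimal Python sketch that compares each server's CPU utilization against what is normal for that server. The class name, the baseline figures and the 20-point margin are all assumptions made up for the example; they are not taken from any particular APM product.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    name: str
    cpu_percent: float       # observed CPU utilization
    baseline_percent: float  # assumed "normal" utilization for this server


def flag_suspects(samples, margin=20.0):
    """Return the names of servers running more than `margin` points above baseline."""
    return [s.name for s in samples if s.cpu_percent - s.baseline_percent > margin]


# Illustrative numbers only: both servers are busy, but only LDAP is unusual.
samples = [
    ServerSample("database", cpu_percent=85.0, baseline_percent=80.0),
    ServerSample("ldap", cpu_percent=90.0, baseline_percent=30.0),
]
suspects = flag_suspects(samples)
print(suspects)
```

The check only works because each application is assumed to own its server, so a per-server number can stand in for a per-application number.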

When applications are virtualized and collapsed inside a single piece of hardware, the “one-to-one” relationship between applications and hardware becomes a “many-to-one” relationship, and legacy monitoring solutions lose their analytical capabilities. Key challenges include:

  • limited visibility into transactions, especially between VMs on the same ESX host
  • limited visibility into the physical-to-virtual relationship between hardware and applications
  • difficulty understanding the performance impact of Virtual Machine Managers (VMM)

Each of these has a serious impact on any APM solution that isn’t configured to operate in a VMware environment, so let’s take a look at each in more detail.

Limited visibility into transactions, especially between VMs on the same ESX host: Legacy monitoring solutions do not provide visibility into the performance and availability of individual business-critical applications running on VMs.

The underlying issue is that not only are the applications themselves virtualized, but the networks and storage systems that they use are also virtualized. The network traffic between two applications on a single ESX server is also virtual, and is not exposed to an APM agent monitoring the physical NIC.

Limited visibility into the physical-to-virtual relationship between hardware and applications: The legacy APM challenge becomes even greater when ESX’s advanced “Dynamic Provisioning” feature is used, allowing applications to be dynamically re-provisioned to any available host in the virtual server pool.

A legacy APM solution only understands static provisioning (applications have a fixed mapping to specific physical servers), so when a dynamically provisioned server shows indications of a problem, the APM solution could easily attribute it to the wrong application.

Even worse, if users start to complain about performance, the APM solution could easily have an application mapped to the wrong server, sending IT troubleshooters off in completely the wrong direction. In both situations, the APM solution provides misleading and inaccurate reports, leading to confusion and significant delays in resolving the problem.
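The misattribution described above can be sketched in a few lines of Python. This is only an illustration: the host and application names are hypothetical, and the "live placement" table stands in for whatever source of current VM placement an APM tool would have to consult.

```python
# Static inventory captured when the APM tool was configured (hypothetical names).
STATIC_MAP = {
    "esx-host-1": ["billing-app"],
    "esx-host-2": ["crm-app"],
}

# Where each application actually runs after dynamic re-provisioning.
LIVE_PLACEMENT = {
    "billing-app": "esx-host-2",  # moved away from esx-host-1
    "crm-app": "esx-host-2",
}


def apps_by_static_map(host):
    """Applications a statically configured tool would blame for an alert on `host`."""
    return sorted(STATIC_MAP.get(host, []))


def apps_by_live_placement(host):
    """Applications that are really running on `host` right now."""
    return sorted(app for app, h in LIVE_PLACEMENT.items() if h == host)


def stale_apps():
    """Applications whose recorded host no longer matches where they run."""
    return sorted(
        app
        for host, apps in STATIC_MAP.items()
        for app in apps
        if LIVE_PLACEMENT.get(app) != host
    )


alert_host = "esx-host-2"
print(apps_by_static_map(alert_host))
print(apps_by_live_placement(alert_host))
print(stale_apps())
```

With the static map, an alert on esx-host-2 points only at crm-app, even though billing-app is now running there as well and may be the real cause.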

Difficulty understanding the performance impact of Virtual Machine Managers: The VMM or “Hypervisor” is an underlying control program which manages the creation, scheduling/prioritization and termination of virtual machines (Guest VM images) on each ESX server.

Although it is completely invisible to each VM, the Hypervisor can have a significant impact on the performance of applications running in the image. For instance, if it assigns a low priority to a particular VM image, the applications in that image may appear to be running slowly.

The APM solution may, however, incorrectly conclude that there is an application issue or that the server is failing. The Hypervisor can also introduce effects which, if not properly understood by the APM solution, can lead to erroneous results. For instance, the passage of time in a VM is not always in sync with the passage of time in the real world. This phenomenon, called “clock skew,” can be accounted for if the APM solution is able to communicate with the Hypervisor.
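To make the clock skew point concrete, here is a minimal Python sketch of one way a measurement taken inside a VM could be corrected. It assumes the APM tool can somehow obtain a reference elapsed time from the Hypervisor for the same sampling window and that the skew is roughly constant over that window; both the function names and the numbers are hypothetical.

```python
def skew_ratio(host_elapsed_s, guest_elapsed_s):
    """Ratio of real (host) time to guest-observed time over the same window."""
    if guest_elapsed_s <= 0:
        raise ValueError("guest_elapsed_s must be positive")
    return host_elapsed_s / guest_elapsed_s


def correct_duration(guest_duration_s, ratio):
    """Scale a duration measured inside the guest to approximate real time."""
    return guest_duration_s * ratio


# Illustrative window: the host saw 60 s pass while the guest clock advanced 48 s.
ratio = skew_ratio(host_elapsed_s=60.0, guest_elapsed_s=48.0)

# A response time the guest measured as 2.0 s took longer in the real world.
corrected = correct_duration(2.0, ratio)
print(ratio, corrected)
```

Without the reference time from the Hypervisor, the tool would report the uncorrected guest figure and understate what users actually experienced.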

Virtualization makes it more difficult to identify most application performance problems, understand the business impact they may cause, and isolate the root cause. Troubleshooting an intermittent slowdown in a virtualized environment becomes a big challenge for even the most seasoned IT professional. Where do you look? What exactly do you look for?

Many of the current metrics typically used as proxies for application performance will return invalid or misleading data in a virtualized environment. Even simple performance metrics such as CPU utilization percentage and memory consumption must be viewed and analyzed in a completely new light.

In my next post, I’ll discuss how to solve the APM challenges in a VMware environment.