Monitoring Applications in Virtualized Environments

Chapter: Virtualization and Cloud Performance

Virtualization vendors provide built-in tools for monitoring virtual machines and their underlying hosts. One can generally obtain metrics for utilization and throughput, and sometimes for latency of the virtual infrastructure. This allows us to maximize utilization while keeping latency measures low, but we lack the context to guarantee that our application runs smoothly. Without understanding the impact of virtualization itself, we can’t understand how hardware latency or utilization actually affect an application’s performance.

In a virtual environment, the concept of real time is problematic. We’d like to continue using transactional performance as our measure of optimization, but how do we go about measuring the response time of a single transaction in a virtual world (Figure 7.4)? The guest system has no reliable awareness of real time, so we must either use some form of virtualization-aware timer, such as a tickless timer, or measure from outside the virtual machine, for example with a network appliance.


Figure 7.4: We need real response time and real latency time to identify the fault domain

To identify latency issues in a distributed system (as depicted in Figure 7.4), we need real response time metrics for every tier, taken on both the client side and the server side of each call (see the red boxes in Figure 7.5). To understand the cause, we also need to measure the transaction load on the application and how that load affects inter-tier latency.
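As a minimal sketch of the idea, inter-tier latency for one call can be estimated as the difference between the duration seen at the caller's exit point and the duration seen at the callee's entry point. The `CallTiming` record and its field names are hypothetical, and the sketch assumes both durations were taken with a clock that is trustworthy under virtualization. Working with durations rather than absolute timestamps avoids having to synchronize clocks across tiers.

```python
from dataclasses import dataclass


@dataclass
class CallTiming:
    """Timings for one remote call, in milliseconds (hypothetical record)."""
    client_side_ms: float  # duration measured at the caller's exit point
    server_side_ms: float  # duration measured at the callee's entry point


def inter_tier_latency_ms(call: CallTiming) -> float:
    """Time spent between the tiers: network plus virtualization overhead."""
    return call.client_side_ms - call.server_side_ms


def mean_inter_tier_latency_ms(calls: list[CallTiming]) -> float:
    """Average inter-tier latency over a batch of calls."""
    if not calls:
        raise ValueError("need at least one call")
    return sum(inter_tier_latency_ms(c) for c in calls) / len(calls)


calls = [CallTiming(120.0, 100.0), CallTiming(150.0, 110.0)]
print(mean_inter_tier_latency_ms(calls))
```

Tracking this value alongside transaction load shows whether the time between tiers grows as load grows.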


Figure 7.5: Real Time on exit and entry points helps identify latency issues and the fault domain

If inter-tier response time is rising with otherwise-stable tier response times, then we most likely face an overloaded virtual network. We can either reduce the load in terms of application communication or transactions, or talk to the network and VM administrators.

If the response time of a tier is rising while transaction load and CPU consumption remain stable, we likely have de-scheduling issues: the hypervisor is not giving the VM CPU time when it needs it. We can check this by correlating CPU ready time (or steal time) with our tier response times. If we see an equivalent rise in steal time, we know that the VM isn’t being allocated enough CPU time, which again results in a walk to our VM administrator’s office.
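One way to make that correlation check concrete is a plain Pearson coefficient over two series sampled at the same intervals. This is only a sketch: the function names are illustrative, and the 0.8 cutoff is an assumed value, not a recommendation from the text.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def looks_like_descheduling(steal_pct: list[float],
                            response_ms: list[float],
                            threshold: float = 0.8) -> bool:
    """True if response time rises together with steal time.

    The threshold is an assumption chosen for illustration.
    """
    return pearson(steal_pct, response_ms) >= threshold


steal = [1.0, 2.0, 4.0, 8.0]                 # steal time in percent
response = [100.0, 110.0, 130.0, 170.0]      # tier response time in ms
print(looks_like_descheduling(steal, response))
```

A strong positive correlation is a hint, not proof; it is worth confirming that transaction load and CPU consumption really were stable over the same window.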

Defining Key Virtualization Metrics

The list of virtualization metrics is long and can be very daunting at the start (see Figure 7.6). For this reason I have identified a list of key metrics that focus on measuring how virtualization and resource shortage impact the application.


Figure 7.6: A typical monitoring dashboard for a VMware instance

CPU Ready Time/CPU Steal Time

A measure of overhead caused by the hypervisor. More specifically, it is the time during which the VM was suspended and therefore unable to execute CPU instructions. There is always a small, measurable overhead, but it should never grow beyond the 5% range. If it does, the underlying hardware is overloaded. Much as garbage-collector suspension time correlates to application performance, there is a direct correlation between steal time and response time. One can even express steal time as a percentage of response-time delay. With this, one can actively monitor the impact of virtualization on application performance.
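On a Linux guest, steal time is exposed as the eighth value on the aggregate `cpu` line of `/proc/stat`. The sketch below computes the steal percentage between two samples of that line and compares it with the 5% guideline above. It assumes a Linux guest; the helper names and the sample values are made up for illustration. CPU ready time on a VMware host is reported by the hypervisor's own tools and is not covered here.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def steal_percent(before: str, after: str) -> float:
    """Share of CPU time stolen by the hypervisor between two samples."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    if len(a) < 8 or len(b) < 8:
        raise ValueError("kernel does not report a steal counter")
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice]
    # Guest time is already counted in user/nice, so only the first eight
    # counters are summed for the total.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * deltas[7] / total


# Illustrative samples; on a real guest, read the first line of /proc/stat
# twice, a few seconds apart.
before = "cpu 100 0 50 800 10 0 0 40 0 0"
after = "cpu 180 0 70 1500 20 0 0 130 0 0"
pct = steal_percent(before, after)
print(f"steal: {pct:.1f}%  over 5% guideline: {pct > 5.0}")
```

Sampling this at the same interval as the tier response times gives the two series needed to relate steal time to response-time delay.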

These metrics enable us to monitor and detect any negative impact that a misconfigured or overloaded virtualized system might have on our application. As a next step we need to understand how the cloud is different and why it represents an even greater challenge.
