VMWorld 2009 – vCloud and Performance Monitoring

It is Day 3 at VMWorld 2009 and the announcements “promised” during yesterday’s keynote have finally hit the wire. More than 1,000 service providers, including AT&T, Verizon, Savvis and Terremark, are going to offer cloud services based on VMWare’s Cloud OS. Read the full press release here: http://www.vmware.com/company/news/releases/vcloud-express-vmworld09.html

About vCloud API, VMWare Studio and AppSpeed

Today I had the chance to attend several sessions and to talk to a number of people in the technology exchange area about different technologies, products and problems. Here is what I picked up in the different areas:

  • vCloud API: VMWare announced that the API will be submitted to become an open standard.
    The APIs allow tool vendors to integrate more closely with the vSphere vCloud platform by querying information from the virtual environment as well as controlling it, e.g. provisioning new virtual instances, defining new vApps, …
    A description of the API interfaces can be found here: http://communities.vmware.com/community/developer/forums/vcloudapi
  • VMWare Studio: http://www.vmware.com/appliances/learn/vmware_studio.html
    I checked out a demo of VMWare Studio, which enables you to define and deploy vApps (Virtual Applications). It is well suited for development and testing, where you want to quickly deploy an application with multiple components (App Server, Web Server, DB, …) on a virtual server.
  • AppSpeed: http://www.vmware.com/products/vcenter-appspeed/
    In the recent webinar about Application Performance Management in virtualized environments, Bernd Harzog talked about VMWare’s new offering for performance management. The tool is basically a network sniffer deployed as a virtual appliance on the vSwitch that analyzes HTTP(S), SQL (MySQL, MSSQL, Oracle) and some MS Exchange traffic. It allows you to identify traffic flow between individual virtual machines and applications. Compared to an APM solution like Dynatrace, AppSpeed has several limitations, at least in its first version. These include: no real transactional tracing (statistical correlation gives only a rough picture), limited protocol support (Java and .NET enterprise applications also use RMI or WCF, not only HTTP), missing databases (e.g. PostgreSQL, DB2, …) and no correlation with performance metrics like CPU and memory (this is planned for upcoming versions).
    The performance metric correlation that is missing in AppSpeed right now brings up a different challenge in virtualized environments: accurate timing.

Timekeeping problem with Performance Monitoring

Accurate application performance monitoring of virtual machines is known to be error prone because of the timekeeping problem introduced by virtualization. A detailed description can be found in the VMWare white paper Timekeeping in VMWare Virtual Machines. In essence, it says that any counter taken from within the virtual machine might be inaccurate because of this issue. The white paper Managing Virtual Application Performance sees this as a fundamental issue with today’s monitoring solutions and therefore questions all results captured by traditional Application Performance Management solutions.
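To make the effect a bit more tangible, here is a minimal Python sketch of how one could estimate how far a guest clock has drifted and rescale a duration measured inside the VM. It is not taken from either white paper. It assumes that a trusted reference clock outside the guest (for example on the host or a time server) is available and that the drift is roughly constant between two samples. The names ClockSample, drift_ratio and corrected_duration are made up for illustration, and so are the numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockSample:
    """One paired reading of the two clocks, in seconds."""
    guest_seconds: float      # clock read inside the virtual machine
    reference_seconds: float  # trusted external clock (assumed to be available)


def drift_ratio(start: ClockSample, end: ClockSample) -> float:
    """Guest-elapsed time divided by reference-elapsed time between two samples.

    1.0 means the guest clock kept pace; below 1.0 means it ran slow.
    """
    guest_elapsed = end.guest_seconds - start.guest_seconds
    reference_elapsed = end.reference_seconds - start.reference_seconds
    if reference_elapsed <= 0:
        raise ValueError("reference clock must advance between samples")
    return guest_elapsed / reference_elapsed


def corrected_duration(guest_duration: float, ratio: float) -> float:
    """Rescale a duration measured inside the guest by the observed drift ratio."""
    if ratio <= 0:
        raise ValueError("drift ratio must be positive")
    return guest_duration / ratio


# Illustrative numbers only: the guest clock advanced 9 s while 10 s really passed.
start = ClockSample(guest_seconds=100.0, reference_seconds=100.0)
end = ClockSample(guest_seconds=109.0, reference_seconds=110.0)
ratio = drift_ratio(start, end)

# A response time measured as 4.5 s inside the guest, rescaled by the ratio.
estimate = corrected_duration(4.5, ratio)
```

With these made-up numbers the guest clock ran at about 90% of real time, so a response time of 4.5 seconds measured inside the VM would correspond to roughly 5 seconds of real time. In practice the drift is unlikely to be constant, so a single ratio like this gives only a rough approximation.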

I’ve been asking around today at VMWorld (VMWare folks as well as implementation partners and VMWare users) whether this is really a problem. The answers ranged from “Yes, it is a problem as it affects accurate time measurement and therefore things like performance management and SLA enforcement” through “We are aware of the technical issue but we don’t know how it impacts us” to “Our performance problems are mainly related to storage and we don’t care about accurate timings within the VM”.

Call to Action

To verify some of the arguments made in this space and to get a broader overview of what is really happening in the field, I encourage you to share your experience and thoughts on this topic. Are you aware of the timekeeping problem? Have you experienced irregularities when monitoring performance? Do you have best practices or other suggestions on this issue?

Andreas Grabner has 20+ years of experience as a software developer, tester and architect and is an advocate for high-performing cloud scale applications. He is a regular contributor to the DevOps community, a frequent speaker at technology conferences and regularly publishes articles on blog.dynatrace.com. You can follow him on Twitter: @grabnerandi