App in a Box – Part 1 – Infrastructure Monitoring

While I was attending a conference session about monitoring, someone from the audience asked the following question: “According to Heisenberg’s uncertainty principle, the observation of an entity changes or destroys what is being observed…”  Aside from the obvious lack of quantum understanding and the increased ego-centrism of the question, the intended point is one I hear a lot: “What is the impact of monitoring on the performance of my applications?”

Let me try to apply a more relevant pseudo-quantum theory to this question. Instead of Heisenberg’s Uncertainty Principle, the audience member probably meant to reference the ‘Observer Effect’, so let us start with a premise we may all agree upon: Schrödinger’s App, an idea that developers have passed down to operations for decades: “an app is in a box; don’t open the box and it will run perfectly (and imperfectly at the same time)”.  In other words, I don’t build monitoring into my application because in my perfect environment, it works perfectly.  It is only broken because you replaced my reality with your own.

All physics humor aside, this is a real problem.  Absolutely, monitoring adds to an application’s resource utilization.  But what is the cost of not monitoring at all?  For example, assume that we have a new airplane that has over 1 billion lines of code, 1 million sensors, redundant radar systems, altimeters and other monitoring components.  Since monitoring is supposedly unnecessary overhead, let us remove them all.  Imagine how much lighter and more performant our shiny new airplane must be as it rockets into the air and subsequently plummets to the ground.

This is not the cloud you are looking for!

This would never be acceptable, so why is it an acceptable argument for critical applications in the enterprise?  The real problem to be solved here isn’t whether to add monitoring, but what level of monitoring to add and at what cost to performance.

Basic Infrastructure Monitoring

Let us start with basic infrastructure monitoring.  This includes network, hardware, operating system, and process monitoring.  Such monitoring is low-hanging fruit in that it is usually easy to implement, standards-based, and generally non-intrusive to application code.  This type of monitoring includes platform monitoring such as Amazon CloudWatch and VMware vCenter, and common tools like Nagios and PRTG.  With the increasing popularity of Kubernetes, Prometheus has become a common collector of such metrics, often used in combination with InfluxDB and Grafana to chart valuable data for dashboards and alerting.
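To make "non-intrusive" a little more concrete, here is a minimal sketch of what a host-level collector gathers. It uses only the Python standard library and is not how any of the tools named above work internally; the function name `collect_infra_snapshot` and the metric keys are made up for illustration.

```python
import os
import shutil
import time


def collect_infra_snapshot(path="."):
    """Return a small dict of host-level metrics (hypothetical metric names)."""
    usage = shutil.disk_usage(path)
    used_pct = round(100.0 * usage.used / usage.total, 1) if usage.total else 0.0
    snapshot = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "disk_used_pct": used_pct,
    }
    # Load average is only available on Unix-like systems, so guard the call.
    try:
        snapshot["load_avg_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        snapshot["load_avg_1m"] = None
    return snapshot


if __name__ == "__main__":
    print(collect_infra_snapshot())
```

Notice that nothing here touches application code: the numbers describe the box, not what is running inside it.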

Enabling basic infrastructure monitoring can tell an operations team a lot about resource utilization and hardware uptime, but not very much about application availability.  There are additional infrastructure tools that can augment this with logs and packet monitoring; however, each is another solution that has to be configured, maintained, and monitored.  When we speak of basic monitoring we are ‘basically’ saying you need to bundle tools together to get a two-dimensional picture of what is going on.  A typical enterprise could have 25+ monitoring tools to gather these infrastructure metrics, and that is just too many screens and silos to make sense of.
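The gap between "the box is up" and "the app is available" is easy to reproduce. The sketch below, again standard-library Python, starts a deliberately broken local web server and runs two checks against it: a TCP port check of the kind infrastructure tools perform, and an HTTP request of the kind a user would make. `BrokenAppHandler`, `port_is_open`, `app_is_healthy`, and `run_demo` are hypothetical names for this illustration only.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class BrokenAppHandler(BaseHTTPRequestHandler):
    """Hypothetical app that is 'up' but fails every request."""

    def do_GET(self):
        self.send_response(500)
        self.end_headers()
        self.wfile.write(b"internal error")

    def log_message(self, *args):
        pass  # keep the demo quiet


def port_is_open(host, port, timeout=2.0):
    """Infrastructure-style check: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def app_is_healthy(host, port, path="/", timeout=2.0):
    """Application-style check: does a real request get a 2xx response?"""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return 200 <= conn.getresponse().status < 300
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def run_demo():
    """Return (infra_check, app_check) for the broken local server."""
    server = HTTPServer(("127.0.0.1", 0), BrokenAppHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return port_is_open("127.0.0.1", port), app_is_healthy("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    infra_ok, app_ok = run_demo()
    print(f"port open: {infra_ok}, app healthy: {app_ok}")
```

The port check passes while the request fails, which is exactly the situation where an infrastructure dashboard stays green and the customers are already complaining.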

Following on from the earlier airplane analogy, applying only infrastructure and platform monitoring to your environment would be like flying a model airplane.  You can see it is in the air and tell that the engines are running and it is moving forward (hopefully on a level path), but what of the health of the engines?  How much fuel is being consumed by the extra load you took on this flight?  Are the passengers alive or did they freeze to death in a depressurized cabin?  Your user experience in this case doesn’t sound so good, does it?


When evaluating the benefit of monitoring, it is important to stop thinking in silos.  In the early days of application monitoring, the Operations team could argue “the servers are up, there is plenty of CPU/RAM/DISK available and the network isn’t saturated.  In fact, the servers are idle… it must be an Application issue.”  By contrast, the Application teams would argue that their “application is running fine from where they sit, but the customers are complaining it’s broken and that this is an Operations issue.”

TL;DR

I want to emphasize that infrastructure monitoring is a valuable gauge in your cockpit; however, it only gives you a subset of the information you need to identify and solve problems that impact your users. What is missing in this scenario is the customer perspective of your applications.

What is your customer’s experience?

Keep an eye out for my next post in the “App in a Box” series on Customer Perspective.  If you are ready to take the leap, just skip the series altogether and experience Dynatrace for yourself. Discover how much more there is to know about monitoring your application stack.
