This story came in from Joseph – one of our fellow Dynatrace users and a performance engineer at a large fleet management service company. Their fleet management software runs on .NET, is developed in-house, is load tested with JMeter and monitored in Production with Dynatrace. A usage and configuration change of their dependency injection library turned out to dramatically impact CPU and memory usage while not yet impacting end user experience. Lessons learned: resource usage monitoring is as important as response time and throughput!
On Wednesday, July 3rd Joseph’s ops team deployed the latest version into their production environment. Load (=throughput) and response time are two key application health measures the application owner team has on their production dashboards. Especially after a new deployment an unchanged behavior is what you are looking/ hoping for:
Joseph is an experienced performance engineer, so he knew to check more than just these two metrics. Looking at host and process monitoring data he identified a 6x increase in CPU and 40% increase in process memory usage which alarmed the folks responsible for the infrastructure that hosted the ASP.NET Services:
Despite the fact that the app still shows the same response time and throughput this is very alarming. Why? Because it means that the application is much more expensive to run as compared to the previous build.
Whether it runs on physical, virtualized or in a cloud environment: more resource consumption means higher operational costs. And – who knows what happens during peak load: will CPU and memory be enough or will it eventually impact response time and throughput? That’s the reason why analyzing changes in resource consumption is as critical to performance engineering then optimizing response time and throughput.
If you are a performance engineer, application owner or developer I suggest you read the rest of the blog to learn from Joseph’s best practices on analyzing and dealing with situations like this and what they changed in their development and deployment process:
Step 1: Analyzing the Impact of Memory Usage on CPU
Runtime engines such as the JVM (more on Java Enterprise Performance) or the .NET CLR (more on .NET Performance Analysis) provide detailed performance counters to start analyzing the resource usage hotspots on CPU and memory. A good start is to look at memory usage in the different heap generations as well as the activity of the garbage collector. The following screenshot shows the .NET garbage collection generations (# of Collections and Heap Size) as well as the time spent in garbage collection metrics of their production environment:
What can you read from these two graphs?
- Most of the objects live in generation 0 and never get promoted to an older generation. That means we are talking about short living objects that are probably created and cleared on a per request basis. Would we see a shift to older generations it is more likely that these objects are part of a cache
- We don’t seem to have a memory leak as the garbage collector seems to clear objects from Gen 0 and not promoting them to higher generations which could eventually lead to an out of memory crash
- The increase in # of collections as well as time spent in garbage collection indicates that the application either uses more short living objects or these objects grew significantly in size
Step 3: More Objects or Larger Objects or Both?
Next step is to find out what has changed in terms of objects on the heap. Joseph therefore takes regular Dynatrace memory snapshots (YouTube Tutorial) and keeps them to later compare heap usage patterns between releases. The next screenshot shows the difference between the previous version and the current version and highlights that there are not only about twice as many ASP.NET MVC controller objects on the heap but these objects are also about 6x larger than before:
It turned out it that this pattern was not exclusive to the PlanningController shown in the screenshot but happened across all ASP.NET MVC Controller objects. Following the Deep Memory Leak Snapshots best practices, Joseph found the actual referenced objects that caused these controller objects to grow in size. It turned out to be related to objects that were instantiated and referenced using dependency injection.
Finding and identifying the problem was the first step which raised two more questions:
a) Why wasn’t this found earlier?
b) Which code change caused this problem?
Question #1: Why Wasn’t This Found Earlier?
Performance engineering is a discipline for the whole development lifecycle. Along the delivery pipeline in a fully automated deployment process you need to execute tests that allow you to analyze not only functionality (via unit and functional tests) but also any performance or resource regressions. Joseph and his team admit that they don’t always get to run all these tests before deploying a release nor did they look at all these memory related metrics. That’s why this problem wasn’t detected before deploying it into production. Want to learn more? Check out the following continuous delivery blogs (Metrics for Continuous Delivery) and slides (Sydney Continuous Delivery Meeting)
Question #2: Which Code Change Caused this Problem?
Many code changes across several builds made it into this new release. In order to figure out which code change was responsible for the changed memory behavior Joseph ran load tests against 5 different builds. He picked builds where he identified major changes in the controller implementations based on the code change sets in their configuration management system. Using JMeter to generate the load and Dynatrace to analyze the performance metrics showed him in which of these builds memory behavior started to change. The following shows several key metrics he looked at for every build. These graphs make it easy to spot in which builds memory behavior changes got introduced:
It is easy to spot that GC executions, time spent in GC and overall CPU time consumption jumped with the changes introduced with build 1.7.14233. More changes got introduced with the builds 14235 and 14252. This allowed Joseph to narrow down the code change set with which he approached the development team. Looking at the code changes and analyzing PurePaths and memory snapshots on their local development environments showed them that the problem was related to usage changes of their dependency injection (DI) library. PurePaths with memory allocations showed them how their DI configuration changes impacted actual code and memory allocations. They optimized the settings and now have the same resource usage as with the previous release.
Lessons Learned: Performance Engineering Must be Continuous and More than Just Response Time
The big lessons learned – and with that the new best practice for Joseph and team are:
- No need to wait for problems like this to happen in production as these patterns can be detected earlier by looking at some key metrics.
- Developers now use Dynatrace AppMon Free Trial on their local machines to better understand code execution and memory allocations done by 3rd party frameworks.
- Test Automation now runs small scale load tests and monitors memory usage and GC Impact on every build
Benefit: With that new approach they eliminate the tedious and time consuming work of re-running tests to identify the code change that introduced a problem.
I am sure there are other stories like this out there. Feel free to share it with us.