As a performance consultant, I get called on to address various performance issues. One of our recent scalability issues happened on a micro service exposing a REST API. The micro service runs on Apache Tomcat, on an AWS EC2 instance, in a VPC. It uses ehcache as in-memory cache and Dynamo DB as persistent data source. DynamoDB gets updates from the source system using data pipeline built in Kinesis and Lambda functions.
In this blog, I’ll walk through the steps taken by our performance engineering Melchor to analyze this scalability issue in our Performance Test environment, which tools were used, and how this problem was resolved. I hope you find this useful!
API Scalability Analysis in Performance Test Environment
Price API has a very tight SLA. 99% of requests processed within a minute must show a Response Time of < 20ms under a load of 2000 tps (transactions per second). To keep the number of EC2 instances and number of DynamoDB calls low, we decided for memory optimized EC2 instances and increased the JVM heap size to 100 GB to cache 80-90% of SKU price details in JVM. In performance test, we noticed that we can only meet this type of SLA if all request are served from the cache. During cache warmup, or in case more than 10% of items are not found in cache (Cache Misses), the service would miss its SLA.
The following diagram visualizes our service flow. Consumer API’s calls Price API to look up prices for multiple items and different location ids. The Price API validates if the requested data is in the ehcache. If not it pulls the data from DynamoDB using AWS DynamoDB Client library.
To monitor individual service health, we log entry and exit calls of each service invocation in Splunk. We can see how much time is spent in Price API and DynamoDB calls. We also look at AWS CloudWatch metrics to validate response time from DynamoDB. We ensure that DynamoDB query time is between 3-4ms and that DynamoDB has enough read/write capacities as well.
Application Monitoring with Dynatrace AppMon
The Price API team also leverages Dynatrace AppMon for end-to end-transactional tracing and deep dive diagnostics. It is the team’s tool of choice because Dynatrace AppMon is already used for live production performance monitoring of all our services in production. Melchor used Dynatrace AppMon to analyze a spike in response time above the accepted SLA, as neither CloudWatch nor the custom-built logging via Splunk provided an answer to the Price API Team.
Next, let’s walk through the steps in Dynatrace AppMon to see how we identified high response time and its root cause. In case you want to try it on your own I suggest you:
- Get your own Dynatrace Personal License
- Watch the Dynatrace YouTube Tutorials
- Read up on Java Enterprise Performance
Issue #1: Time Spent in RandomUUID method of servlet API
Once Dynatrace AppMon collects data you can decide whether to analyze in the Dynatrace AppMon Diagnostics Client or go directly to the Dynatrace AppMon Web interface. In the last blog on Hybrid Cloud Patterns we showed how we analyzed our PurePaths in the Web Interface.
In today’s example, we stick with the Dynatrace AppMon Diagnostics Client as we will perform thread dump analysis which is better to be done in that user interface.
Step 1: Analyze PurePaths to understand hotspots
Dynatrace AppMon captures every single PurePath of every single request that was executed. In our scenario we rely on the 100% transactional coverage because most of our transactions we consider slow (>20ms) are considered fast by other tools in the APM space. Other tools would therefore not capture all the details we need to optimize our critical transactions.
In Dynatrace AppMon we typically start by opening and looking at a PurePath. In the PurePath Tree there is a neat option that is called “show all nodes”. Now we not only see critical methods based on Dynatrace’s hotspot algorithm but we get to see every method executed including its execution time contribution and whether that time was spent on CPU, Sync, Garbage Collection or I/O. The following screenshot shows that extended PurePath Tree and it is easy to spot that the method taking most of the time was the nextBytes method. This method already spent 53.33ms getting a randomUUID in our servlet execution, without even reaching the business API code. Remember – our API SLA is 20ms – so we are already more than twice over the limit. We can also observe that nextBytes spends 95% of its time waiting to enter a synchronized code block instead of actually executing code!
Step 2: Thread Diagnostics to understand dependencies
At this point, we decided to take thread dumps and determine why nextBytes method in SecureRandom class is taking that much time in sync.
Fortunately, Dynatrace AppMon comes with a built-in thread dump analysis feature. Thread dumps can either be triggered on demand, scheduled or triggered by an event. After we executed a thread dump we could immediately see what all threads were doing, and whether they are blocked by other threads.
It turned out that many Tomcat http-nio and threadPoolTaskExecutor (used for calling DynamoDB asynchronously) threads were blocked because of a single thread executing nextBytes, which is a thread safe synchronized method. All the incoming traffic will pass through this bottleneck since getting a secure SSL connection will use nextBytes (synchronized method) to obtain a secure random thus blocking Tomcat threads.
Also all async threads that call DynamoDB (threadPoolTaskExecutor) will end up blocked since AWS DynamoDB client library requires a randomUUID, and will use the same secure random implementation defined in the java.security of Tomcat.
The Price API also accepts multiple SKUs in one HTTP request, but queries DynamoDB for each SKU in single get requests (sounds like the classical N+1 Query pattern that Andreas Grabner has been talking about). During the services warm up phase, or when we see more than 10% cache misses, the number of nextBytes method invocations increases exponentially by both Tomcat’s and the async threadPoolTaskExecutor threads. Since nextBytes is a synchronized thread safe method we see a huge increase in wait time for all other concurrent invocations of nextBytes. Simliar to the PurePath tree, we can also analyze the full call stack for each thread in the dump – showing us who is really calling into these synchronized methods.
Dynatrace also provides a useful feature of “decompiling source code”. Right from the PurePaths, or from the Thread Dumps, we can get the decompiled version of every single method on the call stack. The following shows us the synchronized nextBytes method:
Solution to Issue #1: Time Spent in RandomUUID method of servlet API
We did some digging in the source code of JDK 1.6. Turns out that SecureRandom will seed itself /dev/random or /dev/urnadom. We used strace to identify which source was used in our case. It was /dev/random/. If you want to learn more about this please find more details in the following two links: https://linux.die.net/man/4/random, http://man7.org/linux/man-pages/man4/random.4.html
How we Fixed it
In the $JAVA_HOME/jre/lib/security/java.security configuration file we changed the securerandom from /dev/random to /dev/./urandom which is much faster and does not block the threads as easily as random does.
This can also be achieved by adding the following parameter in the JVM command line
This change allowed our API to operate within define 20ms SLA because we completely eliminated the synchronization overhead!
Issue #2: AWS DynamoDB client metadata cache
After fixing the RandomUUID bottleneck we soon started to see blocked threads again. This time for a different reason. The approach to identify it was similar though.
Step 1: Thread Diagnostics
We went back to creating Thread Dumps using Dynatrace AppMon, which quickly showed us why threads are getting blocked. This time it was due to the add method in ResponseMetaDataCache class in AmazonDynamoDB client library.
Solution to Issue #2: Time Spent in AWS dynamoDB client
The default behavior of the Amazon AWS Http Client libraries is to cache response metadata for troubleshooting. For more details check out setCacheResponseMetadata in the AWS Doc.
We changed the behavior to false to prevent this bottleneck when we made calls to DynamoDB through the Amazon Client Library.
Performance after both fixes
After implementing the two fixes described above, Price API could handle peak load within SLA. Thread Dumps during the test showed no blocked threads as well.
And the PurePaths also looked much better!
Thanks again to Melchor for sharing this story. It shows us that good performance engineers not only understand how to analyze performance issues, but also work with the underlying frameworks and the engineering team to come up with the right solution. It also showed us that even though we built custom log-based monitoring we could only find and fix it thanks to Dynatrace.
If you want to learn more about how Dynatrace can help feel free to get your own Dynatrace AppMon Personal License or try our Dynatrace SaaS offering with Full Stack Cloud and Container monitoring support.