Friday morning at 7:40AM we received the first error from our APMaaS Monitors informing us about our Community Portal being unavailable. It “magically recovered itself” within 20 minutes but just about an hour later was down again. The Potential Root Cause was reported by dynaTrace which captured an Out-of-Memory (OOM) Exception in Confluence’s JVM that hosts our community.
The first thought that came to mind when looking at our Real User Data was: “We are under attack”. Why? Because at the same time the OOM happened we saw a huge increase in incoming web requests to our Application Server.
A further analysis step – as discussed in more detail later in the blog – revealed that “the attack” was not from the outside but actually from the very familiar IP Address 127.0.0.1! It turned out that our Google Connector for Confluence runs on the same machine as our Confluence instance and was scheduled to run at that time period crawling both pages and attachments. This job has been running for weeks but never had such an impact.
The OOM was caused by a full re-index run in a combination with a very large page attachment that was recently uploaded. The SOAP API used to query this very large attachment caused the JVM to go OOM. The Axis SOAP Stack used by Confluence couldn’t allocate enough memory to construct the Base64 Encoded version of the attachment for the SOAP Response Message. That’s why the JVM ran into an OOM and stalled the JVM for about 20 minutes until the Garbage Collector was done with the work and the JVM continued “normal” operation.
Today, I will highlight the steps we took to analyze the root cause, figure out which of our end users were impacted, what data we looked at, what measures we took to solve this problem and the lessons learned in optimizing our deployment to not crash or show performance problems under peak loads.
Step 1: What Alerted Us?
As mentioned above, we have a couple of Dynatrace APMaaS (formerly known as Gomez) synthetic monitors checking the availability of the core use cases of the community: homepage, search, forum, documentation, etc. What was important for us was that these tests really clicked through individual pages and not just checked if the URL could be pinged. That alone wouldn’t have worked because our web servers defaulted to a “We are sorry but we are in a maintenance window right now” message but with an HTTP 200 OK code. So – checking a full click scenario and validating returned content is a key requirement for synthetic monitoring.
At the same time we also received an Out-of-Memory Incident captured by Dynatrace which is monitoring our application, web and database servers.
A closer look at the JVM Process Health Metrics lined up with the synthetic data showing that this problem happened twice in the morning hours. While the JVM was trying to recover it spent a considerable amount of time with Garbage Collection.
Step 2: What Caused the Crash?
A first step to diagnose the problem was to figure out which requests were going on when it happened. A simple drill down from the OOM Exception that happened in that timeframe to the actual request that ran into that exception gives us all the details we need. It’s a Confluence SOAP API to query page attachments. The following screenshot shows the PurePath which includes all context data like SOAP URL (/getAttachmentData), Parameters (Confluence Page, Name of Attachment) all the way to the code reading the file into memory and the code causing the out of memory exception by allocating a large string array for the Base64 Encoded version of that attachment:
Step 3: Who is Making These Calls?
So – who is pulling attachments through the SOAP API? The PurePath also contains the Client IP address where this call comes from. It showed 127.0.0.1! Now – that was interesting. Additionally it showed that the user agent string is Java which means it is probably some Java app running on the same machine as Confluence. The caller was quickly identified. We run a Google Connector for Confluence on the same box that feeds data into our Google Search Appliance. It is currently deployed on the same machine even though we know that this is not best practice because it takes away CPU and Memory on that host. The following screenshot shows our Server Timeline Dashlet which makes it easy to perform some fancy analytics on our PurePaths. It shows how much traffic is coming in from 127.0.0.1 as compared to our external traffic:
Before going back to our Admins I wanted to get more data to give recommendations on how to fix this problem. First I found out that the problematic attachment was a really large file which was just recently uploaded. It made me wonder whether it makes sense to really feed every single attachment to Google or whether we should focus on attachments that really make sense to be indexed as the file in question has a custom format that doesn’t make sense to be index anyway. The second thing I wanted to know is how often the Google Connector queries confluence for data to index. The following screenshot is a dashboard I pulled together showing the requests coming in on that web service vs the regular requests coming in on the community.
It is easy to spot that the Indexer is hitting us with a lot more requests during its indexing times than I would have expected. We should try to “flatten” this index behavior to not put too much stress on our Confluence Server.
Step 4: Who is impacted?
Besides relying on our Synthetic Monitors that help us with Availability Monitoring and Performance Baselining we also have Dynatrace UEM (User Experience Management) monitoring real users and the performance experienced in their browsers. As we know when the crash happened it is easy to lookup the visitors that were on the Community at that time. The following shows a dashboard we can use to identify who is impacted, where these users are from and what they did when the site became unavailable:
For me as the Community owner this list of visitors is gold as I can pro-actively reach out and apologize for any inconvenience they had during that time.
Lessons Learned and resulting Deployment Changes
We had several lessons learned from this incident. Here are the steps we took to optimize our environment for both optimized Google Connector Job Execution as well as even better monitoring:
- Schedule Background Jobs to have the least impact: In our case we throttle the Google Indexer to flatten out the extra load it puts on our system
- Limit Attachment Sizes: This is true for both Google Index Job as well as our end users that upload files. In our case we already received a patch from the company providing the Google Connector. They added a new filter option on file sizes after we told them about this problem.
- Deploy Background Job on different machine: In order to not take away CPU and Memory from our Application Server
- Add additional monitoring on these SOAP calls to automatically detect usage anomalies. Why this is necessary? Because just focusing on your end users is not enough if you have large load coming in from other channels.