As part of my Share Your PurePath and Performance Clinic initiatives I get to see lots of interesting problems out there. This time I picked two examples that came in just this week from Balasz and Daniel. Both wanted my opinion on why their apps show a high response time contribution to their web requests coming from worker threads that seem to be either in I/O or in a Wait state. The question was what these threads are waiting for, and whether this is something that could be optimized to speed up the slow response times they see on some of their critical web requests.

For both apps it turned out that the developers chose to “offload” work items to a pool of worker threads. This is of course a very valid design pattern. The way it was implemented, though, didn’t fully leverage the advantages that multi-threading can give you. We identified two patterns that I will now describe in more detail, in the hope that your multi-threaded applications are not suffering from these performance anti-patterns:

  1. Sequential instead of parallel execution of background threads
  2. Many parallel background threads using “non-shareable” resources

I also want to take the opportunity to give you some tips on how to “read” Dynatrace PurePaths. Over the years I developed my own technique and I hope you find my approach worthwhile to test on your own data.

Pattern #1: Sequential Execution of Threads and a bit more …

I already tweeted about this one last week. I created the following slide for Balasz to present to his development team, including tips on what I look for when analyzing a PurePath:

PurePath showing that the main web request thread executes 3 RMI calls followed by two background threads, where the second one only starts when the first is finished: that’s sequential execution

The many callouts I put on this slide might be a bit overwhelming. Let me explain the highlights.

The overall request response time as measured on the web server is 9.896s. The contributors to this response time are pointed out in the numbered callouts. Here is a short version:

  1. The main service request thread on the Java servlet container takes 3.203s, of which 1.770s is almost entirely I/O time.
  2. The main service request thread makes 3 RMI calls. The “Elapsed Time” (= the timestamp of when that method was called, relative to the start of the request) shows us that the first one executes after 366ms, followed by the second and third in sequential order.
  3. After the 3rd RMI call – at exactly 1.800s Elapsed Time – the first background thread is started. This thread takes 1.429s to execute.
  4. The second background worker thread was supposed to run in parallel – at least based on what Balasz told me. The Elapsed Time column, however, shows that it was executed AFTER the first background thread was done. This was not intended. It also explains the 1.770s that show up as I/O time on the main service thread: it is mainly the time the main thread waited for the first background thread to finish, just to kick off the second one (see the sketch after this list).
  5. The second background thread then really did execute in parallel, letting the main service request thread finish. This second background thread generates the HTML and writes it through the HttpServletResponse output stream. That also explains the long waiting time on the web server: it waited 6.690s for that asynchronous background thread to write data to the output stream.
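To make the anti-pattern concrete, here is a minimal Java sketch of the control flow the PurePath revealed. The class and method names are hypothetical – only the thread choreography mirrors what we saw:

```java
// SequentialThreads.java – hypothetical reconstruction of the observed control flow;
// the method names are made up, only the thread choreography mirrors the PurePath.
public class SequentialThreads {

    static void loadReportData() { /* ... the work of background thread #1 ... */ }
    static void renderHtml()     { /* ... generates HTML, writes the response ... */ }

    public static void main(String[] args) throws InterruptedException {
        // ... the 3 RMI calls happen here on the main service thread ...

        Thread first = new Thread(SequentialThreads::loadReportData, "background-1");
        first.start();
        first.join();   // BUG: main thread blocks here – this wait shows up as I/O time

        // background-2 only starts AFTER background-1 has finished: sequential execution
        Thread second = new Thread(SequentialThreads::renderHtml, "background-2");
        second.start();
        // the main thread returns while background-2 keeps the web server waiting
    }
}
```

The join() on the first thread is what turns the “background” work into a sequential pipeline and shows up as wait/I/O time on the main service thread.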

Lesson Learned: Verify if your threads execute as intended

Looking at the actual timestamp (= Elapsed Time) tells you when your threads really get started and whether they execute in parallel or not. It’s a great way to verify your actual implementation and to learn which component is waiting for which other component.
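If the two work items are truly independent, the fix is simply to start both threads before waiting on either of them. A minimal sketch, replacing the main method body of the code above:

```java
Thread first  = new Thread(SequentialThreads::loadReportData, "background-1");
Thread second = new Thread(SequentialThreads::renderHtml, "background-2");
first.start();
second.start();   // both threads are now running in parallel

first.join();     // only wait once both threads have been started
second.join();
```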

Pattern #2: Too much parallel work causes locks

The second scenario from Daniel was very interesting as it showed that when you try to break up a large chunk of work into smaller pieces and execute them in parallel, you need to be aware of any cross impact. In this case the cross impact was updating/accessing the same database tables from 13 background worker jobs. The PurePath nicely shows how the ASP.NET OnInit method creates 13 different background threads and starts them pretty much all at once (look at the Elapsed Time column). At the end of the PurePath we see the main thread waiting for all threads to complete using a Wait on a Handle:

Very common pattern: multiple parallel worker threads spawned by a main thread which waits for all of them to complete.

Now – this doesn’t look too bad, as it is a perfectly fine architectural pattern. You need to be aware, though, of the total number of worker threads you will need during peak load, when many incoming requests each spawn their own set of workers. You also need to make sure these threads are not accessing “exclusive” resources, which would cause cross impact. That is exactly what happened in this scenario, and it can be seen by taking a closer look at this single PurePath.
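The scenario here is ASP.NET, but the idea is language-independent; for consistency with the earlier sketch, here is what a bounded worker pool could look like in Java. The pool size, class and method names are my assumptions, not the team’s actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedWorkers {

    // One shared, bounded pool for the whole application instead of spawning
    // 13 new threads per request (13 threads x N concurrent requests adds up fast)
    private static final ExecutorService POOL = Executors.newFixedThreadPool(16);

    public static void handleRequest(List<Runnable> workItems) throws InterruptedException {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (Runnable item : workItems) {
            tasks.add(() -> { item.run(); return null; });
        }
        // invokeAll blocks until every work item is done – the same
        // "main thread waits for all workers" pattern seen in the PurePath
        POOL.invokeAll(tasks);
    }
}
```

A shared, fixed-size pool caps the total thread count at peak load instead of letting it grow by 13 threads for every concurrent request.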

Expanding the PurePath tree to see what is going on in these background threads reveals that they are all INSERTing data into the same tables using complex statements with nested SELECTs. In other words: they are accessing an “exclusive” resource, as parallel complex INSERTs and UPDATEs with nested SELECTs on the same database tables can lead to table locks, which block such parallel access and slow the whole operation down. The PurePath shows that each SQL statement got slower:

Each thread INSERTs data into the same table using a nested SELECT. This seems to cause locks on the database, slowing down these INSERTs the more of them execute in parallel

The more parallel threads access “exclusive” data, the more time it takes the database to complete these queries. A closer look at these statements, with a focus on optimizing the calls and reducing any potential impact of database triggers, is still on the list for Daniel’s team. Right now this access pattern essentially delivers the same overall performance as doing all of these INSERTs in a single thread – which would actually save a lot of threads.
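Until then, one option worth testing – my suggestion, not something Daniel’s team has implemented – is to stop competing for the table altogether and batch the INSERTs on a single connection. A JDBC sketch with a placeholder connection URL, table and column names:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchedInserts {

    public static void main(String[] args) throws SQLException {
        // connection URL, table and column names are placeholders, not the real app's
        try (Connection con = DriverManager.getConnection("jdbc:your-db-url");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO work_results (job_id, payload) VALUES (?, ?)")) {
            con.setAutoCommit(false);
            for (int jobId = 0; jobId < 13; jobId++) {   // the 13 former worker jobs
                ps.setInt(1, jobId);
                ps.setString(2, computePayload(jobId));
                ps.addBatch();       // queue the INSERT instead of racing 13 threads for locks
            }
            ps.executeBatch();       // execute all INSERTs on one connection, no cross impact
            con.commit();
        }
    }

    static String computePayload(int jobId) { return "result-" + jobId; }
}
```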

Another look at the Dynatrace Database dashlet shows the individual SQL statements and how their execution times got longer and longer:

It’s easy to spot how each individual INSERT statement became slower the more parallel threads were hitting the same table at the same time.

Lessons Learned: Watch out for shared resources

If you break up large work items into smaller chunks that you execute in parallel, make sure that the resources you require (e.g., database, shared data set) can handle parallel work and are not “exclusive.” Updating the same database tables makes your database the shared resource that voids your efforts to make your algorithms faster.

Do you agree with my findings?

I am very much interested in your thoughts on these two use cases. Do you agree with my findings? Do you see different problems? How do you leverage background threads and parallel execution? How do you analyze your implementation?

If you have never done that type of analysis on your application feel free to get your Dynatrace Free Trial and Share Your PurePath so that I can help you with the fact finding.