Application logs: How Dish Network leverages Log Analytics to resolve performance incidents

The importance of integrating log data with performance metrics and events

Earlier this year at Perform 2018, we spoke at length about the importance of integrating the troubleshooting information contained in application logs with measurements, events, and other performance data related to end-user experience and application performance. In the past, log management was typically handled by separate, specialized tools. Some of these tools could also pull in unorganized, unrelated performance metrics and show them next to information from logs.

That was good enough for single monolithic applications. With the recent proliferation of dynamic, elastic architectures based on microservices, containers, and cloud, however, keeping track of logs and finding the ones needed for troubleshooting has become a tedious manual task.

Finding the right logs for troubleshooting

At Dynatrace, we realized there must be a better way, so we built log management functionality that directly leverages information about the topology, relationships, and dependencies in your application architecture. It auto-detects and presents data from the logs generated by affected components.

What is Dynatrace Log Analytics?

Dynatrace OneAgent automatically discovers and connects logs written by application processes, regardless of whether those processes write to local files on disk, write to a Docker logging driver, or stream logs over the Syslog protocol (support for the Syslog protocol will be available in an upcoming release). The same monitoring code measures application performance, ties it to processes, attaches log entries to it, and detects whether anything alarming was present in the logs during the problem time frame.

Dynatrace Log Analytics

Everything is presented on a single page (see the Log viewer example below), where you can see full problem analysis with impact and root cause as well as inspect appropriate logs with a single click. No configuration or know-how of application architecture is required.

To access the Log viewer, select Log files from the navigation menu and click the Analyze log files button.

Dynatrace log file viewer

Log information at your fingertips saves time


Here’s an illustrative example: Dish’s operations team received alerts from Dynatrace about a new problem starting to develop in their credit qualification application infrastructure. A quick look at the problem details showed that multiple services had higher-than-normal response times. It wasn’t immediately clear what caused the high response times, but a closer look revealed that a middleware process supporting these services had an error message appearing in its logs (highlighted in the images below). This error message is a known result of related third-party services not working properly.

Dish network list of impacted services

Error log pattern for affected Dish service

Dynatrace identified an increase in response time for the CreditQualification service and created a problem after correlating multiple events that were affecting different services. Reviewing the problem details and the Visual Resolution Path made it clear that, without Log Analytics capturing these errors and exceptions, Dish would not have known what was causing the high response times. The log data captured by Log Analytics was the only way to accurately identify this specific issue.

Error log pattern for affected Dish services

The Dish operations team shared their findings from Dynatrace with the third party so that it could take action and resolve the problem before it started to significantly impact systems. Within a few minutes, the third-party service was restored to normal operation.

Here’s how the story looks from the perspective of Jonathan Kennedy, manager of the Middleware Applications Administration team at Dish:

“There were no other indications that this third-party service was having a problem apart from the errors in the log files. That is what made the log analysis valuable in this case. We configured a custom log event to look for the known error that can occur when the third-party service is not functioning properly.

Setting up detection rules for custom log events

Dynatrace alerted us to the problem when the error messages started appearing in the logs. We know that when this specific error appears, we need to contact the third party to investigate.

Without Dynatrace Log Analytics catching these errors, the problem would have continued until our back office teams noticed, which could have taken days or weeks. We contacted the third party, and they confirmed an issue on their side, which was quickly resolved, saving us time and money.”
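
The idea behind a custom log event like the one Jonathan describes can be pictured roughly as follows. This is only a minimal sketch in Python, not Dynatrace's implementation or API: the `LogEventRule` class, its `threshold` and `window` parameters, and the sample error text are all hypothetical and exist purely for illustration. The rule watches log lines for a known error pattern and fires once the pattern has appeared a given number of times within a sliding time window.

```python
import re
from collections import deque
from datetime import datetime, timedelta


class LogEventRule:
    """Hypothetical rule: fire when `pattern` matches `threshold` times within `window`."""

    def __init__(self, pattern, threshold=1, window=timedelta(minutes=1)):
        self.pattern = re.compile(pattern)
        self.threshold = threshold
        self.window = window
        self._hits = deque()  # timestamps of recent matching lines

    def observe(self, timestamp, line):
        """Record one log line; return True if it causes the rule to fire."""
        if not self.pattern.search(line):
            return False
        self._hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self._hits and timestamp - self._hits[0] > self.window:
            self._hits.popleft()
        return len(self._hits) >= self.threshold


# Hypothetical usage: the error text is invented, not taken from Dish's logs.
rule = LogEventRule(r"ERROR .*third-party service unavailable",
                    threshold=3, window=timedelta(minutes=5))
```

Each call to `rule.observe(timestamp, line)` returns `True` only when the known error has recurred often enough to be worth an alert, which keeps a single stray error line from paging anyone.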

Summary

We all know logs can contain a wealth of useful information. They are valuable not only for ad-hoc, post factum analysis, but also as a source of real-time application performance information streamed to APM systems and correlated with classic metrics like response time. But we need to use this information intelligently to avoid wasting time and resources on analyzing, browsing, and searching through an ocean of unrelated data just to find (or not) a few lines that we think might be useful for solving a problem.

Using information intelligently

Log data should be provided in context, only from the affected application architecture component, and from the problem time frame. This requires automation, auto-discovery, zero-configuration, and intelligence that can connect the dots to help us find the information we need. That’s Dynatrace Log Analytics.
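
To make "in context" concrete, here is a small Python sketch of the filtering idea: keep only the log entries that come from an affected component and fall within the problem time frame. It is an illustration under stated assumptions rather than how Dynatrace works internally; the `LogEntry` type, the `logs_in_context` function, and the component names are hypothetical, and in practice the list of affected components and the time frame would come from automatic problem detection rather than being passed in by hand.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    component: str  # hypothetical name of the process or service that wrote the line
    message: str


def logs_in_context(entries, affected_components, start, end):
    """Keep entries written by affected components within [start, end], inclusive."""
    affected = set(affected_components)
    return [
        entry for entry in entries
        if entry.component in affected and start <= entry.timestamp <= end
    ]


# Hypothetical data: three entries, one of which is relevant to the problem.
entries = [
    LogEntry(datetime(2018, 6, 1, 9, 55), "middleware", "INFO started"),
    LogEntry(datetime(2018, 6, 1, 10, 5), "middleware", "ERROR upstream call failed"),
    LogEntry(datetime(2018, 6, 1, 10, 6), "billing", "INFO invoice created"),
]
relevant = logs_in_context(
    entries,
    affected_components=["middleware"],
    start=datetime(2018, 6, 1, 10, 0),
    end=datetime(2018, 6, 1, 10, 30),
)
```

Here `relevant` contains only the middleware error logged at 10:05. The entry from before the problem started and the entry from the unaffected component are filtered out.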
