
Automated incident management and remediation drive organizational resiliency

With automated incident management capabilities now woven into its IT, Dynatrace customer Parker Hannifin has reduced system outages and recovery time.

Incident management and wrangling with remediation tools used to dominate the time of the IT staff at Parker Hannifin (Parker), a motion and control technology leader. “We had a lot of technical debt and outages,” says Venkatesh Harikrishnan, the Parker team’s enterprise resource manager.

Like most IT professionals, Harikrishnan and the Parker team’s cybersecurity engineer Tom Hood have multiple requirements to manage, from implementing digital transformation and ensuring operational efficiency to securing IT systems. But the Parker team transformed their organization from being controlled by technical debt and system downtime to providing a friction-free user experience across its global digital platform. In a breakout session at Dynatrace Perform 2022, they explained how.

Incident management to avoid outages

Unplanned system outages are costly, in both time and resources. Indeed, according to recent Uptime Institute data, outages have become far more expensive over the past several years. More than 60% of survey respondents reported losing more than $100,000 to downtime.

That’s why organizations like Parker Hannifin need to be able to monitor incidents proactively, before they escalate into outages. With an observability platform powered by contextual software intelligence, organizations can gather incident data in real time. That enables them to weave automated incident management into their IT systems and prevent outages before they occur.

For Harikrishnan and Hood, technical debt and unplanned outages were having an impact on the customer experience. They knew they needed to improve mean time to resolution (MTTR), but their siloed tools created bottlenecks and provided limited visibility into areas they could improve.

Technical debt was causing multiple problems for the IT team at Parker.

Implementing Dynatrace’s automated incident management has had a “huge impact,” says Harikrishnan. Here’s how they did it.

Initial instrumentation to tame legacy remediation tools

The Parker team’s first step was installing Dynatrace OneAgent into a Tier 1 production application. Once monitoring began, Dynatrace provided early-warning signals of a potential outage on customer-facing digital assets. But they didn’t trust the results at first. “We didn’t really buy into it,” Harikrishnan said. “We didn’t listen to it and it led to a serious issue in our environment.”

Dynatrace cemented its value early on by identifying an issue their legacy remediation tools missed.

After teams spent a substantial amount of time using legacy remediation tools to diagnose the root cause, they discovered that the Dynatrace platform had already identified the source of the problem. That helped win over those who were skeptical of Dynatrace’s capabilities. “It created much more buy-in of Dynatrace and Davis AI behind the scenes,” Hood says.

Automatic action aids incident management with early warning signals

The increased buy-in led to organization-wide change. To understand the state of its DevOps practice and measure itself against the industry standard, the Parker team assessed its current Apdex score and set an end-of-year target.
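For readers unfamiliar with Apdex, the score is conventionally computed from response-time samples against a target threshold T: samples at or under T count as satisfied, samples up to 4T count as tolerating at half weight, and the rest count as frustrated. The sketch below illustrates that standard formula only. The sample values and threshold are made up for illustration; they are not Parker's figures, and this is not how Dynatrace computes the score internally.

```python
def apdex(response_times, threshold):
    """Standard Apdex: (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("need at least one response-time sample")
    # Satisfied: at or under the target threshold T.
    satisfied = sum(1 for t in response_times if t <= threshold)
    # Tolerating: above T but no more than 4T.
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Frustrated samples (above 4T) contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times in seconds, with an assumed T of 0.5 s.
samples = [0.2, 0.4, 0.5, 1.1, 1.9, 2.5]
score = apdex(samples, threshold=0.5)
print(f"Apdex: {score:.2f}")
```

A score of 1.0 means every sampled user action was satisfied, so a team setting an end-of-year target is effectively committing to move more user actions under the threshold.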

On the technical side, change meant “making sure alerts were meaningful and went to the right personnel,” explains Hood. Parker integrated Dynatrace with an existing enterprise alerting tool to automatically route notifications directly to the product team members, resulting in faster resolution and improved team communication.
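The idea of making alerts meaningful and sending them to the right people can be sketched as a severity filter followed by an ownership lookup. Everything here is hypothetical: the service names, team names, severity levels, and the `Alert` type are assumptions for illustration, and the article does not name Parker's alerting tool or describe its configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str   # hypothetical name of the affected service
    severity: str  # assumed levels: "info", "warning", "error", "critical"
    message: str


# Hypothetical ownership map: which product team owns which service.
OWNERS = {"b2b-portal": "portal-team", "product-catalog": "catalog-team"}
FALLBACK_TEAM = "it-operations"
ACTIONABLE = {"error", "critical"}


def dispatch(alerts):
    """Drop non-actionable alerts, then group the rest by owning team."""
    routed = {}
    for alert in alerts:
        if alert.severity not in ACTIONABLE:
            continue  # keep noise away from product teams
        team = OWNERS.get(alert.service, FALLBACK_TEAM)
        routed.setdefault(team, []).append(alert)
    return routed


alerts = [
    Alert("b2b-portal", "critical", "response time degrading"),
    Alert("b2b-portal", "info", "deployment finished"),
    Alert("unknown-service", "error", "process unavailable"),
]
routed = dispatch(alerts)
for team, items in routed.items():
    print(team, [a.message for a in items])
```

Routing on ownership rather than broadcasting to a shared channel is what lets the responsible team act first, which is the effect the Parker team describes.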

Dynatrace AI and automatic incident response have improved operational efficiency for the Parker IT team.

Dynatrace raised, then resolved alerts automatically through the platform. These enhanced incident management capabilities helped eliminate outage problems and gave time back to team members. “We made a significant improvement in operational efficiency — and we’re not always in a war room scenario,” says Harikrishnan. “Outages are now often prevented and MTTR has reduced significantly.”

Automated incident management to rebuild confidence with customers

Automating alert notifications provided bandwidth for the team at Parker to improve customer experience. “We can now slice and dice the data Dynatrace Davis AI collects and monitors for each and every user action that happens across our global digital platforms,” Hood says.

Insights gained by analyzing the data collected through Dynatrace include the following:

  • User demographics. Where is a user coming from? What device are they using over what type of network?
  • Usability. Are our digital assets too large?
  • Accessibility. How does the system behave when a user is coming from a low bandwidth area using a mobile device and is looking for a huge technical specification?
  • Efficiency. How much time does it take for a user to access and download information?

These insights were not previously available to the Parker team. “We had the data, but no time to take a deep dive into that data,” Hood explains.

Dynatrace provides benefits for incident management and reducing legacy remediation tools

Part of incident management is troubleshooting operational bottlenecks that affect overall system performance. Two technical issues the Parker team struggled with prior to implementing Dynatrace were garbage collection and time-consuming maintenance on a high-value business-to-business portal.

Garbage collection

“We had a substantial issue with garbage collection,” Hood says. During a conference call with a third-party vendor, the Parker team shared the issue onscreen, and Dynatrace was able to identify the root cause. “We were able to…fine-tune our systems in a very performant way,” notes Hood. “Our brain, the source of that knowledge, was the information coming from Dynatrace.”

B2B portal

Parker’s B2B portal once required continuous maintenance. Now, Dynatrace automatically alerts the Parker team before an issue occurs and runs a simple reboot of the legacy platform. “That started us down the path of looking at auto-remediation,” Hood says. “We thought, ‘Why can’t these issues be solved autonomously with a platform that has information about these processes?'”

Automated remediation for known risks

The fourth step in Parker’s Dynatrace journey was initiating automated problem remediation. “We wanted to fully automate process restarts based on bad trending as opposed to waiting for problems to occur,” Hood explains. “We used Davis AI to identify the root cause and provide the intelligence…to restart the process on the correct host and validate success with a series of post-checks.”

Davis AI and automatic root-cause analysis have paved the way for automated problem remediation.
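The flow Hood describes (act on a bad trend, restart the process on the right host, then validate with post-checks) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Parker's workflow or a Dynatrace API: the trend rule, the host name, the metric values, and the stubbed restart and check functions are all hypothetical.

```python
def is_trending_bad(samples, limit, window=3):
    """True when the last `window` samples rise strictly and the latest exceeds `limit`."""
    recent = list(samples)[-window:]
    if len(recent) < window:
        return False
    rising = all(a < b for a, b in zip(recent, recent[1:]))
    return rising and recent[-1] > limit


def remediate(host, restart, post_checks):
    """Restart the process on `host`, then run every named post-check."""
    restart(host)
    failed = [name for name, check in post_checks if not check(host)]
    return {"host": host, "success": not failed, "failed_checks": failed}


# Stubs standing in for real restart and health-check calls (assumptions).
restarted = []
post_checks = [
    ("process_up", lambda host: host in restarted),
    ("responds", lambda host: True),
]

memory_percent = [61.0, 72.5, 88.0]  # hypothetical metric samples
result = None
if is_trending_bad(memory_percent, limit=85.0):
    result = remediate("app-host-01", restarted.append, post_checks)
print(result)
```

Returning the list of failed checks, rather than a bare boolean, gives an engineer something concrete to look at when an automated restart does not restore the service.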

Adding Dynatrace has introduced major improvements to the Parker team’s entire incident management workflow and profile of remediation tools.

“There is a drastic difference in MTTR from manual to automated remediation,” says Harikrishnan. “And now that we don’t need to babysit the servers, engineers can work on real-life problems they want to solve for our customers.”

The future of Dynatrace and Parker collaboration

Parker aims to improve its digital customer experience and create a benchmark system that meets industry standards, one the team can update confidently without affecting the system's performance or resilience.

“Dynatrace is a significant player in our tool kit moving forward to achieve our goals,” Harikrishnan says.

For more about the Parker team’s experience, check out Harikrishnan and Hood’s full breakout session, “Drive production resiliency through automatic incident management and remediation.”

Watch this session to learn how Dynatrace’s automatic problem detection can integrate with ITSM and remediation tools like Ansible to trigger repeatable remediation workflows and help you regain time.