What's the difference between problems and events?

Problems

Problems are used to report and alert on abnormal situations. Abnormal system behavior within complex environments typically results in a flood of individual events that share the same root cause. To avoid event and alert spamming, the Dynatrace AI correlates all individual events that share the same root cause into a single, trackable problem.

A "problem" in Dynatrace includes the AI-driven analysis, environmental context, root cause analysis, and other details provided for one or more incidents in your environment. Problems can express themselves in your environment as performance degradations, improper functionality, or lack of availability. A problem can be the result of a single event or multiple events. Dynatrace problems show impact and root cause and allow you to replay all collected events that are correlated within a problem. Problems have defined lifecycles and are updated in real time with all incoming events and findings.

Events

Events are used to indicate many different types of individual incidents, such as metric-threshold breaches, baseline degradations, and point-in-time events (for example, process crashes). Dynatrace also detects and processes informational events such as deployment events and configuration changes.

Within Dynatrace each event type has a defined severity level that indicates the significance of the incident. Resulting problems aggregate all included event severities. Based on a problem's lifecycle, a problem might raise its severity level (for example, a problem might begin in slowdown level and then be automatically raised to the availability level when an outage is detected).

A problem’s severity level can increase over time but never decreases to a lower severity level. The defined severity levels are:

  1. Availability: Availability events indicate a severe incident within your environment, such as a complete outage or unavailability of servers or processes. These event types have the highest severity level.

Availability events

  2. Error: Error events are used to inform you of increased error rates or other error-related incidents that interfere with the regular operation of your environment.

Error events

  3. Slowdown: A slowdown event indicates a decrease of performance in one of your operational services or applications. Slowdown events are less severe than error or availability events. Nevertheless, they inform you of potential issues with the performance of your services.

Slowdown events

  4. Resource: Any situation that leads to resource contention is reported as a resource event. Typical examples are CPU saturation and memory saturation events.

Resource contention events

  5. Custom alerts: Custom alerts and severity levels are used to enable alerting on any user-defined threshold. Custom alerts for user-defined thresholds can be set for any Dynatrace metric. Custom alerts aren't correlated or modified by the AI though they are automatically alerted on.

Custom alerts

  6. Info: Informational events are a specific type of event that is used to report manual events that don't result in the creation of a new problem. These events are used to mark important deployments or configuration changes as well as administrative events such as the automatic migration of a virtual machine. Informational events aren't sent out as alerts and no problems are opened as this type of event doesn't indicate an abnormal situation.
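
The escalation rule described above (a problem can rise to a higher severity level but never falls back) can be sketched in a few lines of Python. This is a minimal illustration, not Dynatrace's implementation: `Severity` and `problem_severity` are hypothetical names, only the four levels that can open AI-correlated problems are modeled, and ranking the resource level below slowdown is an assumption taken from the order of the list; the text only states that availability is highest and that slowdown ranks below error.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Higher value means more severe (RESOURCE below SLOWDOWN is assumed)."""
    RESOURCE = 1
    SLOWDOWN = 2
    ERROR = 3
    AVAILABILITY = 4


def problem_severity(event_severities):
    """Return the problem's severity level after each incoming event."""
    current = None
    history = []
    for severity in event_severities:
        # Raise the level when a more severe event arrives; never lower it.
        if current is None or severity > current:
            current = severity
        history.append(current)
    return history


levels = problem_severity(
    [Severity.SLOWDOWN, Severity.RESOURCE, Severity.AVAILABILITY, Severity.ERROR]
)
print([level.name for level in levels])
# ['SLOWDOWN', 'SLOWDOWN', 'AVAILABILITY', 'AVAILABILITY']
```

The problem starts at the slowdown level, ignores the less severe resource event, and stays at the availability level once an outage has been seen.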

Understand important event types

Following is an overview of the most important event types, their severity levels, and the logic behind raising them.

Availability events

Unexpected low traffic (UNEXPECTED_LOW_LOAD) Dynatrace collects a multidimensional baseline for application and user action traffic and therefore learns the typical traffic pattern of all your applications and user actions. Alerting on abnormally low application traffic is enabled by default because such situations can indicate full application outages. Refer to the application anomaly detection settings under Settings > Anomaly detection > Applications to adapt the sensitivity of low-traffic alerting. If enabled, Dynatrace follows a time-interval comparison pattern (daily or weekly) in alerting on abnormal traffic situations: it compares the monitored application traffic with the previous interval and alerts if the comparison shows unusually low traffic.

The image below shows a typical example of an unexpected low traffic event.

Unexpected low traffic event

Host or monitoring unavailable (OSI_UNEXPECTEDLY_UNAVAILABLE) This event is detected when a host is abruptly shut down or Dynatrace loses the network connection to the host’s OneAgent. If the operating system is shut down in the regular way, Dynatrace won't open a problem. Dynatrace will show the host in an unavailable state.

Conditions:

  • Network connection to the monitored host is lost unexpectedly while OneAgent and host are still running. The connection must be lost for more than 5 minutes, before OneAgent starts sending signals again and cached metric data fills in the missing chart data.
  • OneAgent isn't able to catch and send the regular operating system shutdown message.
  • Decommissioning of virtual hosts (for example, in AWS auto-scaling groups) when the operating system isn't shut down and therefore Dynatrace detects this as a connection-lost event.

Closing conditions:

  • Event is resolved when the host is available again.
  • Timeout of event occurs after 5 days for hosts and after 3 days for virtualized hosts.

An example host-unavailable event is shown below.

Host unavailable

Process unavailable (PROCESS_UNAVAILABLE, PROCESS_GROUP_LOW_INSTANCE_COUNT) Dynatrace OneAgent automatically detects each running process on hosts and reports the availability state of all processes. As many hosts run volatile processes that are restarted on a regular schedule, Dynatrace doesn't open a problem for each process shutdown by default. However, this event is shown as soon as any other event is triggered. Alerting on process unavailability is an opt-in setting available via process-group configuration. Process-group configuration allows you to choose between three options:

  1. If service requests are impacted: Open a process unavailable event only if Dynatrace detects active client requests hitting the selected process.
  2. If any process becomes unavailable: Open the event if any of the processes within the process group become unavailable.
  3. If the minimum threshold isn't met: Open the event if a minimum number of running processes within that process group isn't met within an observation period of at least 2 minutes.

Process group availability setting
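
The three options above can be read as a small decision function. The following Python sketch is an illustration under stated assumptions: `AlertMode`, `should_open_event`, and all parameter names are hypothetical, the expected instance count is supplied by the caller, and the observation period of at least 2 minutes for the third option is not modeled.

```python
from enum import Enum


class AlertMode(Enum):
    IF_REQUESTS_IMPACTED = 1   # option 1
    IF_ANY_UNAVAILABLE = 2     # option 2
    IF_MINIMUM_NOT_MET = 3     # option 3


def should_open_event(mode, running, expected, minimum=0,
                      has_client_requests=False):
    """Decide whether a process unavailable event should be opened.

    running/expected are process counts within one process group.
    """
    unavailable = expected - running
    if mode is AlertMode.IF_REQUESTS_IMPACTED:
        # Only alert when clients are actively calling the missing process.
        return unavailable > 0 and has_client_requests
    if mode is AlertMode.IF_ANY_UNAVAILABLE:
        return unavailable > 0
    if mode is AlertMode.IF_MINIMUM_NOT_MET:
        return running < minimum
    raise ValueError(f"unknown mode: {mode!r}")


# One of four processes is down and no client traffic is hitting it.
print(should_open_event(AlertMode.IF_REQUESTS_IMPACTED, running=3, expected=4))
print(should_open_event(AlertMode.IF_ANY_UNAVAILABLE, running=3, expected=4))
print(should_open_event(AlertMode.IF_MINIMUM_NOT_MET, running=3, expected=4,
                        minimum=2))
```

With one of four processes down and no client requests, only the second option opens an event.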

A typical example of a process unavailable event is shown below.

Process unavailable event

Synthetic monitor global outage (WEB_CHECK_GLOBAL_OUTAGE) Synthetic monitors check your websites from multiple geographic regions on a regular schedule. Dynatrace raises a synthetic global outage event if your website stops responding from all configured geographic regions.

See an example synthetic-monitor global outage event below.

Synthetic monitors global outage

Synthetic monitor local outage (WEB_CHECK_LOCAL_OUTAGE) Synthetic monitors regularly check your websites from multiple geographic regions. Dynatrace raises a synthetic local outage event if your website stops responding from at least one configured geographic region. Once your website stops responding from all configured regions, the problem is escalated to report a global outage. See below for an example of a synthetic monitor local outage event:

Synthetic monitors local outage event
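
The distinction between local and global synthetic outages can be sketched as follows. This is a simplified Python illustration, not the product's logic: `classify_outage` and the location names are hypothetical, and each location is reduced to a single responded/failed flag.

```python
def classify_outage(results):
    """Classify one round of synthetic checks.

    results maps a location name to True (site responded) or False.
    Returns the event type name, or None if there is no outage.
    """
    if not results:
        return None
    failing = [location for location, ok in results.items() if not ok]
    if len(failing) == len(results):
        # Every configured location failed: global outage.
        return "WEB_CHECK_GLOBAL_OUTAGE"
    if failing:
        # At least one, but not all, locations failed: local outage.
        return "WEB_CHECK_LOCAL_OUTAGE"
    return None


print(classify_outage({"Frankfurt": True, "Boston": False, "Sydney": True}))
print(classify_outage({"Frankfurt": False, "Boston": False, "Sydney": False}))
```

The first call reports a local outage because only one location fails; the second reports a global outage.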

Synthetic monitor outage (SYNTHETIC_AVAILABILITY) The Dynatrace API can receive synthetic test results and synthetic events from ‘Dynatrace advanced synthetic’. This availability event is raised if the advanced synthetic monitor reports the unavailability of a system under test. The lifecycle of advanced synthetic events is managed through the advanced synthetic monitor. Refer to your alerting settings within advanced synthetic. An example of an advanced synthetic availability event is shown below:

Advanced synthetic outage

Availability log pattern found (HOST_LOG_AVAILABILITY) Dynatrace log analytics allows you to define log patterns that indicate availability-related problems on a host. Every appearance of a configured pattern is registered, and if the number of detected patterns exceeds your critical threshold, an availability log pattern event is raised. A typical availability log event is shown below:

Host log availability event

Availability log pattern found (PROCESS_LOG_AVAILABILITY) Dynatrace log analytics allows you to define log patterns that indicate availability-related problems on a process level. Every appearance of a configured pattern is counted over time, and if the number of detected patterns exceeds your critical threshold, an availability log pattern event is raised. A typical availability log event on process level is shown below:

Process log availability event

Custom availability event (AVAILABILITY_EVENT) This generic availability event can be used by monitoring plugins or through the Dynatrace REST API to raise a customized availability event with a user defined title. An example could be a custom availability event with a user defined title ‘Batch process schedule outage’.

Custom availability event
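
A custom availability event such as the one described above could be prepared for the Dynatrace REST API roughly as follows. The Python sketch only builds the JSON body and sends nothing. The field names (`eventType`, `title`, `entitySelector`, `properties`) are assumed to follow the Events API v2 ingest format and should be checked against the API reference for your Dynatrace version; `build_custom_event`, the host ID, and the `job` property are made up for illustration.

```python
import json


def build_custom_event(event_type, title, entity_selector, properties=None):
    """Build the JSON body for a custom event (field names are assumptions)."""
    if not title:
        raise ValueError("a custom event needs a user-defined title")
    return {
        "eventType": event_type,            # for example AVAILABILITY_EVENT
        "title": title,                     # the user-defined title
        "entitySelector": entity_selector,  # which entity the event refers to
        "properties": dict(properties or {}),
    }


payload = build_custom_event(
    "AVAILABILITY_EVENT",
    "Batch process schedule outage",
    'type(HOST),entityId("HOST-0000000000000000")',  # placeholder entity ID
    {"job": "nightly-billing"},                      # hypothetical context
)
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the HTTP call makes the body easy to inspect before it is posted with whichever HTTP client and API token you use.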

Error events

JavaScript error rate increase (JAVASCRIPT_ERROR_RATE_INCREASED) By default, JavaScript error rate increase events are detected by the automatic baselining that learns the typical JavaScript error rates for each application and user action. If the JavaScript error rate degrades from the learned baseline, Dynatrace raises a JavaScript error rate increase event. Navigate to the baselining settings in Settings > Anomaly detection > Applications to tweak the sensitivity of the alerting on top of the baselining. If you need a static threshold rather than a baseline-based alerting approach, change from automatic mode to ‘using fixed thresholds’, as shown below. The sensitivity controls the level of statistical confidence required to raise an event. Low sensitivity means a high confidence is required and vice versa. For example, to see events immediately even when only a few data points have breached the threshold, choose a high sensitivity.

JavaScript error event settings

See below a typical example of a JavaScript error rate increase event:

JavaScript error rate increase event

Mobile app crash rate increased (MOBILE_APP_CRASH_RATE_INCREASED) Mobile app (Android and iOS) crashes are recorded along with the crash stack traces and context information. Dynatrace learns a baseline of the number of crashes per app version and alerts if one of your mobile app versions degrades in terms of number of crashes. The observation period for alerting on mobile app crashes is 10 minutes: if Dynatrace observes an unusually high crash rate for one of your mobile app versions within a sliding window of 10 minutes, a mobile app crash rate increase event is raised. To avoid over-alerting during low-load periods, at least 10 concurrent users of the same app version must be active before an event is raised.
See below a typical example of a mobile crash rate increase event:

Mobile app crash rate increase event

High rate of dropped packets (HIGH_DROPPED_PACKETS_RATE) By default, Dynatrace alerts if the percentage of dropped packets on TCP network level is higher than 10% and the total number of dropped packets is higher than 10 packets/s in 3 out of 5-minute samples. Navigate to your infrastructure anomaly detection settings under Settings > Anomaly detection > Infrastructure to adapt the alerting sensitivity. See below a typical example of a high rate of dropped packets event:

High rate of dropped packets event
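
The default rule for dropped packets combines two per-sample conditions with a "3 out of 5" window. The Python sketch below is one possible reading of that rule, not the product's implementation: it assumes the rule means that at least 3 of the 5 most recent samples must violate both thresholds, and `SampleWindow` and `dropped_packets_violation` are hypothetical names.

```python
from collections import deque


def dropped_packets_violation(dropped_pct, dropped_per_s):
    """A sample violates only if both default thresholds are exceeded."""
    return dropped_pct > 10 and dropped_per_s > 10


class SampleWindow:
    """Tracks the most recent samples and applies an M-out-of-N rule."""

    def __init__(self, size=5, required=3):
        self.flags = deque(maxlen=size)  # oldest sample drops out automatically
        self.required = required

    def add(self, violating):
        """Record one sample; return True if an alert should be raised."""
        self.flags.append(bool(violating))
        return sum(self.flags) >= self.required


window = SampleWindow()
samples = [(12, 15), (5, 20), (11, 11), (20, 2), (15, 30)]  # (percent, packets/s)
alerts = [window.add(dropped_packets_violation(p, r)) for p, r in samples]
print(alerts)  # [False, False, False, False, True]
```

The second and fourth samples each break only one of the two thresholds, so the alert fires only on the fifth sample, when three violating samples are inside the window.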

High number of network errors (HIGH_NETWORK_ERROR_RATE) By default, Dynatrace alerts if the percentage of failed connection attempts on TCP network level is higher than 3% and the total number of failed connections is higher than 10 connections/min in 3 out of 5-minute samples. Navigate to your infrastructure anomaly detection settings under Settings > Anomaly detection > Infrastructure to adapt the alerting sensitivity. See below a typical example of a high number of network errors event:

High number of network errors event

Lambda high error rate (LAMBDA_FUNCTION_HIGH_ERROR_RATE) A Lambda high error rate event reports a high rate of failed invocations for a specific AWS Lambda function. By default, this event is raised if the rate of failed invocations is higher than 5% in 3 out of 5-minute samples. Navigate to Settings > Anomaly detection > Infrastructure > AWS Functions to adapt the sensitivity of Lambda error rate alerting. A typical example of a Lambda high error rate event is shown below:

Lambda high error rate event

Elastic load balancer has a high backend failure rate (ELASTIC_LOAD_BALANCER_HIGH_BACKEND_FAILURE_RATE) Dynatrace automatically reports the number of failed connection attempts into the backend of an AWS load balancer. By default, Dynatrace opens an alert if the number of failed backend connection attempts is higher than 10 per minute for at least 3 out of 5-minute observation samples. A typical example of this event is shown below:

Elastic load balancer has a high backend failure rate event

Error log pattern found (HOST_LOG_ERROR) Dynatrace log analytics allows you to define log patterns that indicate error related problems on host level. Every appearance of a configured pattern is counted over time and if the number of detected patterns exceeds your critical threshold an error log pattern event is raised. Find below an example for a typical error log event on host level:

Error log pattern found event

Error log pattern found (PROCESS_LOG_ERROR) Dynatrace log analytics allows you to define log patterns that indicate error related problems on a process or process group level. Every appearance of a configured pattern is counted over time and if the number of detected patterns exceeds your critical threshold an error log pattern event is raised. Find below an example for a typical error log event on process level:

Error log pattern found event

Custom error event (ERROR_EVENT) This generic error event can be used by monitoring plugins or through the Dynatrace REST API to raise a customized error event with a user defined title. An example could be a custom error event with a user defined title ‘Batch process schedule high error rate’, as shown below:

Custom error event

Restart sequence (RDS_RESTART_SEQUENCE) Dynatrace detects and alerts on abnormal relational database service (RDS) restarts. By default, a restart sequence event is raised if an RDS instance shows more than 2 restarts within an observation period of 20 minutes.
The screenshot below shows a typical RDS restart sequence event:

RDS restart sequence event
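
The default restart-sequence rule (more than 2 restarts within 20 minutes) maps naturally onto a time-based sliding window. The Python sketch below is illustrative only: `RestartSequenceDetector` is a hypothetical name, timestamps are plain seconds, and restarts are assumed to be reported in chronological order.

```python
from collections import deque


class RestartSequenceDetector:
    """Counts restarts inside a sliding time window."""

    def __init__(self, max_restarts=2, window_s=20 * 60):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.timestamps = deque()

    def record_restart(self, t):
        """Record a restart at time t (seconds); return True to raise an event."""
        self.timestamps.append(t)
        # Forget restarts that are older than the observation period.
        while t - self.timestamps[0] > self.window_s:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_restarts


detector = RestartSequenceDetector()
print([detector.record_restart(t) for t in (0, 300, 600)])
# [False, False, True]
```

Three restarts within ten minutes exceed the default of 2, so the third restart raises the event.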

Slowdown events

User action duration degradation (USER_ACTION_DURATION_DEGRADATION) User action duration degradation events are detected in multiple ways within Dynatrace. By default, Dynatrace uses automatic baselining to detect degradation events of either the 50th percentile (median) or the 90th percentile of an application’s user action performance. On application level, the baseline distinguishes between three main categories of user actions: load actions, XHR actions, and custom user actions. If the performance of one of those categories degrades, the event is raised with a reference to the affected category. Dynatrace automatically collects multidimensional baselines for user actions, where the dimensions are the user action, geographic region, operating system, and browser type. If a slowdown is detected within a specific combination of those dimensions (for example, user action: payment, browser: Firefox), the resulting event contains a reference to the violating dimensions. Refer to the Dynatrace help for details about how the baseline is calculated and events are raised. The screenshot below shows a typical user action duration degradation event:

User action duration degradation event

Response time degradation (SERVICE_RESPONSE_TIME_DEGRADED) Response time degradation events are detected in multiple ways within Dynatrace. By default, Dynatrace uses automatic baselining to detect degradation events of either the 50th percentile (median) or the 90th percentile of the service and service method response time. Refer to the Dynatrace help for details about how the baseline is calculated and events are raised. Navigate to Settings > Anomaly detection > Services to find the automatic baselining settings. Alternatively, you can use a static threshold instead of the dynamic baselining. The static threshold is likewise compared to the monitored 50th and 90th percentiles. See below for a typical service response time degradation event.

Response time degradation event

Performance log pattern found (HOST_LOG_PERFORMANCE) Dynatrace log analytics allows you to define log patterns that indicate performance-related problems on host level. Every appearance of a configured pattern is counted over time, and if the number of detected patterns exceeds your critical threshold, a performance log pattern event is raised. A typical performance log event on host level is shown below:

Performance log pattern found event

Performance log pattern found (PROCESS_LOG_PERFORMANCE) Dynatrace log analytics allows you to define log patterns that indicate performance-related problems on a process or process group level. Every appearance of a configured pattern is counted over time, and if the number of detected patterns exceeds your critical threshold, a performance log pattern event is raised. A typical performance log event on process level is shown below:

Performance log pattern found event

Synthetic monitor performance threshold violation (SYNTHETIC_SLOWDOWN) You can define specific performance thresholds for synthetic monitors that check an application from multiple global locations at regular intervals. If the user-defined performance threshold isn't met at one of the configured locations, a synthetic monitor performance threshold violation event is raised. A typical synthetic monitor performance threshold violation event is shown below:

Synthetic monitor performance threshold violation

Custom performance event (PERFORMANCE_EVENT) This generic performance event can be used by monitoring plugins or through the Dynatrace REST API to raise a customized performance event with a user defined title. An example could be a custom performance event with a user defined title ‘Batch process schedule slowdown’, as shown below:

Custom performance event

Resource events

Unexpected high traffic (UNEXPECTED_HIGH_LOAD) Dynatrace collects a multidimensional baseline for application and user action traffic and therefore learns the typical traffic pattern of all your applications and user actions. Alerting on abnormally high application traffic is an opt-in option within the application settings, which you find under Settings > Anomaly detection > Applications. If enabled, Dynatrace follows a two-seasonal pattern (daily and weekly) in alerting on abnormal traffic situations. Dynatrace compares the monitored application traffic with the same period of the previous week and alerts if the comparison shows unusually high traffic. Refer to the Dynatrace help for details of load and traffic prediction within Dynatrace. A typical example of an unexpected high traffic event is shown below:

Unexpected high traffic event

CPU saturation (CPU_SATURATED) CPU saturation events are raised on host level whenever the CPU usage is higher than a critical threshold. By default, Dynatrace alerts if the CPU usage is higher than 95% in 3 out of 5-minute samples. In addition to the typical OneAgent-related CPU saturation event, AWS cloud management can also report CPU saturation events for EC2 instances. For an EC2 instance that is monitored by OneAgent, Dynatrace always prioritizes the OneAgent event over the AWS event. If one of your EC2 instances isn't monitored by OneAgent but you have an active AWS integration with Dynatrace set up, a CPU saturation event is raised from the AWS integration. Find below a typical example of a CPU saturation event:

CPU saturation event

Find below a typical example of a CPU saturation event that was raised through AWS integration:

AWS CPU saturation event

Memory saturation (MEMORY_SATURATED) By default, Dynatrace alerts if the memory usage is higher than 90% on Windows or 80% on Linux AND memory page fault rate is higher than 100 faults/s on Windows or 20 faults/s on Linux in 3 out of 5 samples. Find below a typical example of a memory saturation event:

Memory saturation event
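
The default memory saturation rule depends on the operating system and requires both conditions at once. A minimal Python sketch of that rule follows; the thresholds come from the text above, while the function names, the tuple layout of a sample, and the reading of "3 out of 5 samples" as the 5 most recent samples are assumptions.

```python
# (memory usage limit in percent, page fault limit in faults/s) per OS
THRESHOLDS = {
    "windows": (90.0, 100.0),
    "linux": (80.0, 20.0),
}


def memory_sample_violates(os_name, mem_used_pct, page_faults_per_s):
    """Both the usage AND the page fault threshold must be exceeded."""
    usage_limit, fault_limit = THRESHOLDS[os_name.lower()]
    return mem_used_pct > usage_limit and page_faults_per_s > fault_limit


def memory_saturated(os_name, samples):
    """samples: list of (mem_used_pct, page_faults_per_s), oldest first."""
    recent = samples[-5:]
    violations = sum(memory_sample_violates(os_name, u, f) for u, f in recent)
    return violations >= 3


samples = [(85, 25), (85, 10), (90, 30), (70, 50), (81, 21)]
print(memory_saturated("linux", samples))    # True
print(memory_saturated("windows", samples))  # False
```

The same five samples saturate a Linux host (three of them exceed 80% usage and 20 faults/s) but not a Windows host, where none exceeds both 90% usage and 100 faults/s.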

Low disk space (LOW_DISK_SPACE) By default, Dynatrace alerts if free disk space on any of your disks is lower than 3% in at least 3 out of 5-minute observation samples. Disk thresholds within Dynatrace are highly configurable either on host level or on global settings level. To support large numbers of disks, you can define global disk threshold rules along with flexible tag filters to group subsets of hosts. Find below a typical example of a low disk space event:

Low disk space event

Slow disk (SLOW_DISK) By default, Dynatrace alerts if disk read and write time on any of your disks is higher than 200ms in at least 3 out of 5-minute observation samples. Disk thresholds within Dynatrace are highly configurable either on host level or on global settings level. To support large numbers of disks, you can define global disk threshold rules along with flexible tag filters to group subsets of hosts. Find below a typical example of a slow disk event:

Slow disk event

Low number of inodes available (HOST_DISK_LOW_INODES) By default, Dynatrace alerts if the percentage of available inodes on any of your disks is lower than 5% in at least 3 out of 5-minute observation samples. Disk thresholds within Dynatrace are highly configurable either on host level or on global settings level. To support large numbers of disks, you can define global disk threshold rules along with flexible tag filters to group subsets of hosts. Find below a typical example of a low number of inodes available event:

Low number of inodes available event

High network utilization (HIGH_NETWORK_UTILIZATION) By default, Dynatrace alerts if sent/received traffic utilization is higher than 90% in 3 out of 5-minute samples. Find below a typical example of a high network utilization event:

High network utilization event

Long garbage-collection time (HIGH_GC_ACTIVITY) By default, Dynatrace alerts on long garbage-collection times if garbage-collection time is higher than 40% or suspension is higher than 25% in 3 out of 5-minute samples. A typical example of a long garbage-collection time event is shown below:

Long garbage-collection time event

High latency (HIGH_LATENCY) By default, Dynatrace detects and alerts on high latency within your relational database services (RDS).

See below a typical example for a high latency event on RDS instances:

High latency event

I/O commands queued (INSUFFICIENT_DISK_QUEUE_DEPTH)

I/O commands queued event

Custom resource contention event (RESOURCE_EVENT) This generic resource contention event can be used by monitoring plugins or through the Dynatrace REST API to raise a customized resource contention event with a user defined title. An example could be a custom resource contention event with a user defined title ‘Low batch job pool’, as shown below:

Custom resource event

Custom alerts

Custom alerts (CUSTOM_ALERT) You can define custom alerts when you need a threshold notification on a specific metric. Unlike all other events, custom alerts aren't analyzed by the Dynatrace AI and don't show a root cause, because the threshold was defined by a user. A custom alert is a simple way of defining a threshold on a given metric along with a sliding window size; Dynatrace sends out an alert whenever that metric breaches the user-defined threshold. You can define alerts for when the actual metric value is either above or below the user-defined threshold. As a metric can be recorded by multiple components within your environment, Dynatrace always alerts with a reference to the component that shows the violating metric. See below an example of a custom alert that was defined for the metric ‘CPU Usage’:

Custom alert

Info events

Annotation (CUSTOM_ANNOTATION) Annotation events can be sent to Dynatrace by third-party toolchains to highlight interesting periods in time. There are multiple channels that can be used to automatically report annotations into Dynatrace, such as the Dynatrace REST API as well as monitoring plugins. Refer to the Dynatrace help page that explains how to report annotations from a third-party system.
See below an example for an annotation event that can be filled with custom context information:

Custom annotation event

Deployment (CUSTOM_DEPLOYMENT) Deployment events can be sent to Dynatrace by third-party toolchains to report software deployments along with important business-process-related meta information, such as the authorizer, product owner, and version details. There are multiple channels that can be used to automatically report deployment events into Dynatrace, such as the Dynatrace REST API as well as monitoring plugins. Refer to the Dynatrace help page that explains how to report deployment events from a third-party system.
See below an example for a custom deployment event that is filled with context information about an important software deployment:

Custom deployment event

Info (CUSTOM_INFO) Info events can be sent to Dynatrace by third-party toolchains to report general-purpose information about an important period. There are multiple channels that can be used to automatically report information events into Dynatrace, such as the Dynatrace REST API as well as monitoring plugins. Refer to the Dynatrace help page that explains how to report events from a third-party system.
See below an example for an info event that can be filled with custom context information:

Custom info event

Log pattern matched on host or process (LOG_MATCHED) You can define custom log pattern matching rules that apply to either host-level or process-level log files. If the user-defined threshold of matched patterns per minute is breached, a log info event is raised and shown either on host or on process level. See the following example of a log info event:

Log info event

Elastic load balancer has a high failure rate (ELASTIC_LOAD_BALANCER_HIGH_FAILURE_RATE) Dynatrace automatically reports the number of failed connection attempts to an AWS load balancer. By default, Dynatrace opens an alert if the number of failed connection attempts is higher than 10 per minute for at least 3 out of 5-minute observation samples.

Elastic load balancer has high failure rate event

Elastic load balancer has a high unhealthy host rate (ELASTIC_LOAD_BALANCER_HIGH_UNHEALTHY_HOST_RATE) Dynatrace automatically opens an info event if a high rate of unhealthy hosts is detected within an AWS load balancer. See how the event is shown within the events section of the affected AWS load balancer:

Elastic load balancer has a high unhealthy host rate event

JavaScript framework changes (APPLICATION_JS_FRAMEWORK_DETECTED) The real user monitoring agent keeps track of all the JavaScript frameworks that are used within your applications. A JavaScript framework change event is opened if either a new JavaScript framework is detected or an existing one was recently removed. A typical example of a JavaScript framework change event is shown below:

JavaScript framework detected event