Site Reliability Guardian
Latest Dynatrace
Site Reliability Guardian provides automated change impact analysis to validate service availability, performance, and capacity objectives across various systems. It helps DevOps platform engineers make the right release decisions and lets SREs apply Service-Level Objectives (SLOs) to their critical services.
Site Reliability Guardian concepts
Site Reliability Guardian is based on the following concepts:
Guardian
A guardian is a group of objectives. It is built around a set of entities reflecting a service or application you want to safeguard.
A guardian provides you with a default automation workflow that performs the objective validation. As a result, a guardian always represents the latest validation result derived from the objectives.
Objective
Objectives are a means of measuring the performance, availability, and capacity of your services. Objectives are measured by indicators.
Indicator
An indicator is the value against which the warning and failure thresholds are checked using a comparison operator. To retrieve the indicator value, use DQL or reference an existing SLO.
Warning and failure thresholds
The warning and failure thresholds determine whether the measured value of the indicator meets the objective, is close to violating the objective, or violates the objective.
Both the warning and failure thresholds are optional, so the objective validation varies depending on which thresholds are set:
- If both the warning and failure thresholds are set, the objective validation can return a warning or failed status.
- If just the warning threshold is set, the objective validation can return a warning status.
- If no threshold is set, the objective validation does not return a status but is used for informational purposes.
Operator
The comparison operator defines whether the objective is met: the indicator must be less than or equal to (Lower is better) or greater than or equal to (Higher is better) the warning and failure thresholds.
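The interplay of thresholds and operator described above can be sketched in Python. This is an illustration only, not Dynatrace's implementation: the function name `classify` and its parameters are hypothetical, and the handling of a failure threshold set without a warning threshold is an assumption, since the text above does not describe that case.

```python
from typing import Optional


def classify(value: float,
             higher_is_better: bool,
             warning: Optional[float] = None,
             failure: Optional[float] = None) -> str:
    """Classify one indicator value as 'fail', 'warning', 'pass', or 'info'."""
    # No thresholds at all: the objective is informational only.
    if warning is None and failure is None:
        return "info"

    def meets(threshold: float) -> bool:
        # Higher is better: value >= threshold.
        # Lower is better:  value <= threshold.
        return value >= threshold if higher_is_better else value <= threshold

    # Assumption: a failure threshold is checked first, whether or not
    # a warning threshold is also set.
    if failure is not None and not meets(failure):
        return "fail"
    if warning is not None and not meets(warning):
        return "warning"
    return "pass"


# Availability objective, higher is better: warn below 99, fail below 98.
print(classify(98.5, higher_is_better=True, warning=99, failure=98))
```

With these example thresholds, a value of 98.5 still satisfies the failure threshold but not the warning threshold, so the sketch classifies it as a warning.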
Create a guardian with objectives
Within the Site Reliability Guardian, you can manage multiple guardians.
- In the Dynatrace Launcher, select Site Reliability Guardian.
- On the overview page, select + Guardian to create a new guardian. A new guardian is displayed in the editor.
- Provide a name for the guardian. Optionally, add a description or tags.
- Add objectives by defining a custom DQL query or referencing an existing SLO.
For more information, see Site Reliability Guardian objective examples.
Tags
To keep your guardians organized, you can assign tags to them. You can use simple value-only tags and the key:value format.
To assign a tag to your guardian, specify it in the Add tags to your guardian section during creation or in the edit mode and select Add. To assign a value-only tag, type it in the Key field.
To filter the list of all guardians by a tag, type the tag in the Search by name or tag field. The page automatically updates to show only guardians with matching tags.
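As a rough illustration of the two tag formats, the following Python sketch splits a tag into key and value and filters a list of guardians by tag. The function names, the dictionary layout, and the exact, case-insensitive matching rule are all assumptions made for this example. The matching logic the app actually uses is not documented here.

```python
from typing import Dict, List, Optional, Tuple


def parse_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split 'key:value' into (key, value); a value-only tag yields (tag, None)."""
    key, sep, value = tag.partition(":")
    return (key, value) if sep else (tag, None)


def filter_by_tag(guardians: List[Dict], query: str) -> List[Dict]:
    """Keep guardians carrying a tag equal to the query (assumed matching rule)."""
    wanted = query.strip().lower()
    return [g for g in guardians
            if any(t.lower() == wanted for t in g.get("tags", []))]


# Hypothetical guardians, used only to exercise the helpers.
guardians = [
    {"name": "carts", "tags": ["stage:production", "critical"]},
    {"name": "orders", "tags": ["stage:staging"]},
]
print([g["name"] for g in filter_by_tag(guardians, "critical")])
```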
Create and customize automation workflow
To automate the execution of a guardian, use a workflow that subscribes to events acting as triggers.
- To create the workflow, open your guardian, select the three-dot menu in the top-right corner, and choose Create workflow.
By default, a new workflow is created that:
- Subscribes to BizEvents and triggers the workflow when an event matches the filter tag.service == "carts" AND tag.stage == "production". Adapt this filter depending on the event you expect as a workflow trigger.
- Performs a validation of the guardian for which this workflow was created.
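To make the default trigger filter concrete, the following Python sketch expresses the same condition as a predicate over an event. It assumes the event is available as a flat dictionary with dotted keys such as `tag.service`. That representation and the function name are hypothetical and are not how Workflows evaluates filters internally.

```python
from typing import Any, Dict


def matches_default_filter(event: Dict[str, Any]) -> bool:
    """Mirror of: tag.service == "carts" AND tag.stage == "production"."""
    # Both conditions must hold; a missing field never matches.
    return (event.get("tag.service") == "carts"
            and event.get("tag.stage") == "production")


# Hypothetical events, used only to exercise the predicate.
deploy_event = {"tag.service": "carts", "tag.stage": "production"}
staging_event = {"tag.service": "carts", "tag.stage": "staging"}
print(matches_default_filter(deploy_event), matches_default_filter(staging_event))
```

Adapting the filter for another service or stage amounts to changing the two compared values.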
In Workflows, you can customize the workflow depending on your needs. Do not remove the workflow action run_validation, since this would disconnect the guardian from the workflow.
To learn more about workflows for a guardian, open the Getting started with Automation guide in the help menu in the top right corner of the Site Reliability Guardian.
Validate guardian and its objectives
If a workflow is created, your guardian is validated automatically. You can also perform the validation manually.
Automated validation
The event subscriptions in the workflow define when the validation of a guardian is triggered automatically.
Manual validation
You can perform a validation of a guardian by selecting the Validate button on the overview screen or within the validation details screen.
- Select the validation timeframe.
- Click the Validate button.
Individual objective result
For each objective, the validation returns the derived value and a classification. Severity ranges from the highest (1) to the lowest (5).
Severity | Name | Description |
---|---|---|
1 | Error | The objective could not be validated due to an error deriving the indicator. |
2 | Fail | The value violates the failure threshold; the objective is not met. |
3 | Warning | The value is in the warning range; the objective is met, but close to failure. |
4 | Pass | The value is within the target range; the objective is met. |
5 | Info | No classification, but the objective's value can be used for informational purposes. |
Overall validation result
After each objective has been validated, the overall validation result is the most severe of the individual objective results.
Leverage the overall validation result of a guardian to:
- Make a release decision in your delivery pipeline.
- Report on the current status of your service.
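The aggregation rule and its use as a release gate can be sketched as follows. The severity numbers come from the table above. The function names, the lowercase result strings, and the `allow_warning` switch are hypothetical choices for this example, and the gating policy is an assumption rather than a rule prescribed by Site Reliability Guardian.

```python
from typing import Iterable

# Severity per the table above: 1 is the most severe, 5 the least.
SEVERITY = {"error": 1, "fail": 2, "warning": 3, "pass": 4, "info": 5}


def overall_result(results: Iterable[str]) -> str:
    """Return the most severe individual result (lowest severity number)."""
    results = list(results)
    if not results:
        raise ValueError("at least one objective result is required")
    return min(results, key=lambda r: SEVERITY[r])


def release_allowed(results: Iterable[str], allow_warning: bool = True) -> bool:
    """Assumed gate policy: block on error or fail, optionally on warning too."""
    blocked = {"error", "fail"}
    if not allow_warning:
        blocked.add("warning")
    return overall_result(results) not in blocked


objective_results = ["pass", "warning", "info"]
print(overall_result(objective_results), release_allowed(objective_results))
```

A stricter pipeline could call `release_allowed(results, allow_warning=False)` so that a warning also blocks the release.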
Install, update, or uninstall
To install, update, or uninstall the Site Reliability Guardian, use the Dynatrace Hub.
- Search for the Site Reliability Guardian in the Dynatrace Hub.
- Select the entry and choose Install, Update, or Uninstall to perform that action in your environment.
Note that uninstalling the Site Reliability Guardian will delete all guardians and their configuration. Workflows that reference the workflow action of the Site Reliability Guardian will stay but fail due to the missing guardian.
Permissions
The Dynatrace Hub entry for the Site Reliability Guardian lists the required permissions.
- Go to Dynatrace Hub.
- Find and open Site Reliability Guardian.
- Select the Technical Information tab.
- Check the User Permissions section for a list of all the permissions you need to include in the policies bound to user groups that are allowed to use Site Reliability Guardian.
For more information, see Manage user permissions with IAM policies and Workflow authorization settings.