Testing the webhook
The webhook can be tested locally with a curl command, like so:
curl -i 'http://localhost:9393/webhook' \
-d '{
"receiver":"dynatrace-receiver",
"status":"firing",
"alerts":[
{
"status":"firing",
"labels":{
"alertname":"TargetDown",
"job":"kubelet",
"namespace":"kube-system",
"prometheus":"kubelet",
"service":"kubelet",
"severity":"warning"
},
"annotations":{
"message":"11.11% of the kubelet/kubelet targets in kube-system"
},
"startsAt":"2021-03-19T01:35:45.72Z",
"endsAt":"0001-01-01T00:00:00Z",
"generatorURL":"http://openshift.com",
"fingerprint":"e425bb91067b6c9e"
}
],
"groupKey":"{}:{\"alertname\": \"Test Alert\", \"cluster\": \"Cluster 02\", \"service\": \"Service 01\"}",
"groupLabels":{
"alertname":"Test Alert",
"cluster":"Cluster 02",
"service":"Service 02"
},
"commonLabels":{
"alertname":"Test Alert",
"cluster":"Cluster 02",
"service":"Service 02"
},
"commonAnnotations":{
"annotation_01":"annotation 01",
"annotation_02":"annotation 03"
},
"externalURL":"http://8598cebf58a1:9093"
}'
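The same request can also be scripted, which makes it easier to follow a `firing` notification with a matching `resolved` one. The following is a minimal Python sketch using only the standard library. The helper names `build_payload` and `send` are illustrative, the payload is a trimmed subset of the fields in the curl example, and it assumes the receiver accepts a JSON body sent with `Content-Type: application/json`.

```python
import json
import urllib.request

# Local endpoint, copied from the curl example above.
WEBHOOK_URL = "http://localhost:9393/webhook"


def build_payload(status="firing"):
    """Build a trimmed Alertmanager-style notification (illustrative subset of fields)."""
    group_labels = {"alertname": "Test Alert", "cluster": "Cluster 02", "service": "Service 02"}
    return {
        "receiver": "dynatrace-receiver",
        "status": status,
        "alerts": [
            {
                "status": status,
                "labels": {"alertname": "TargetDown", "namespace": "kube-system", "severity": "warning"},
                "annotations": {"message": "11.11% of the kubelet/kubelet targets in kube-system"},
                "startsAt": "2021-03-19T01:35:45.72Z",
                "endsAt": "0001-01-01T00:00:00Z",
                "fingerprint": "e425bb91067b6c9e",
            }
        ],
        "groupKey": '{}:{"alertname": "Test Alert", "cluster": "Cluster 02", "service": "Service 01"}',
        "groupLabels": group_labels,
        "commonLabels": group_labels,
    }


def send(payload, url=WEBHOOK_URL):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Usage (requires the receiver to be running locally):
#   send(build_payload("firing"))
#   send(build_payload("resolved"))  # same groupKey as the firing notification
```

Because both notifications share the same `groupKey`, the second call exercises the problem-closing path as well as the event-sending path.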
Details
A binary called dynatrace-receiver always runs on the ActiveGate machine, listening on the port configured in the extension settings.
This binary is responsible for several tasks:
- Listening for POST requests from Alertmanager
- Periodically retrieving Problem IDs from Dynatrace and correlating those IDs with the events that were sent
- Periodically resending events to Dynatrace, to keep problems open
- Periodically deleting stale events (events that have been open for more than 5 days)
- Maintaining an on-disk cache of Problems (and alerts) and Custom Devices
  - These caches are thread-safe
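To make the idea of a thread-safe on-disk cache concrete, here is a minimal Python sketch. It is not the receiver's actual implementation: the `DiskCache` class, the JSON file format, and the write-then-rename strategy are all assumptions chosen for illustration.

```python
import json
import os
import tempfile
import threading


class DiskCache:
    """A tiny JSON-backed key/value cache guarded by a lock (illustrative only)."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._data = {}
        # Reload whatever a previous run persisted, if anything.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self._data = json.load(handle)

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
            self._flush()

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)
            self._flush()

    def _flush(self):
        # Write to a temporary file first, then swap it in, so a crash
        # mid-write does not leave a truncated cache file behind.
        directory = os.path.dirname(self._path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(self._data, handle)
        os.replace(tmp_path, self._path)
```

Holding one lock around both the in-memory update and the flush keeps the request handler and the periodic jobs from interleaving writes, at the cost of serializing all cache access.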
For task number 1 (handling POST requests), several steps are involved when a new request arrives:
- Parse the request to attempt to construct a Custom Device Name
  - If this cannot be done, a default Custom Device Name is used
- Calculate the Custom Device ID from the Custom Device Name and check whether this Custom Device ID already exists in the local cache
  - This is done without calling the Dynatrace API
- If the Custom Device ID does not exist in the cache, use the Dynatrace API to create the Custom Device ID
- Calculate a hash of the GroupKey of the request
  - This is later used to correlate open problems with events
- Determine whether this event opens a problem, based on the severity label
- Send the event to Dynatrace and store it locally in the Problem cache
- Determine whether this event closes a problem, based on the Status field (firing or resolved)
  - If the event closes a problem, check the cache to see whether a Problem ID was already obtained for this GroupKey
    - If the Problem ID does not exist yet, attempt to get it from Dynatrace by using the GroupKey hash
    - If the Problem ID was found, close the Problem with a comment
    - If the Problem ID was not found, nothing can be done, and the event is deleted from the cache
      - This can happen, for instance, if the event that opened the problem was sent while the binary was not running, so only a resolved event arrives without a firing one
    - After the Problem is closed, the event is deleted from the cache
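The request-handling steps above can be sketched roughly as follows. This is a simplified Python illustration, not the receiver's code: the hash algorithm (SHA-256), the labels used to build the Custom Device Name, the default name, the set of severities that open a problem, and the cache layout are all assumptions, and the Dynatrace API calls are left out entirely.

```python
import hashlib

# Assumption: the text only says the decision is "based on the severity label";
# this concrete set of problem-opening severities is made up for illustration.
PROBLEM_SEVERITIES = {"critical", "error"}
DEFAULT_DEVICE_NAME = "alertmanager"  # hypothetical default Custom Device Name


def group_key_hash(group_key):
    """Stable hash of the GroupKey, used to correlate events with problems."""
    return hashlib.sha256(group_key.encode("utf-8")).hexdigest()


def custom_device_name(payload):
    """Derive a device name from common labels, or fall back to the default."""
    labels = payload.get("commonLabels", {})
    parts = [labels[k] for k in ("cluster", "service") if k in labels]
    return "-".join(parts) if parts else DEFAULT_DEVICE_NAME


def opens_problem(payload):
    labels = payload.get("commonLabels", {})
    return labels.get("severity", "").lower() in PROBLEM_SEVERITIES


def closes_problem(payload):
    return payload.get("status") == "resolved"


def handle_notification(payload, problem_cache):
    """Return the action taken; problem_cache maps GroupKey hash -> event dict."""
    key = group_key_hash(payload["groupKey"])
    if closes_problem(payload):
        event = problem_cache.pop(key, None)
        if event is None or event.get("problem_id") is None:
            # No firing event was seen, or no Problem ID is known yet. The real
            # receiver would first query Dynatrace by the GroupKey hash.
            return "dropped"
        return "closed " + event["problem_id"]
    problem_cache[key] = {
        "device": custom_device_name(payload),
        "opens_problem": opens_problem(payload),
        # Keep a Problem ID that an earlier lookup may already have stored.
        "problem_id": problem_cache.get(key, {}).get("problem_id"),
    }
    return "sent"
```

In this sketch a resolved notification always removes the cached event, whether or not a Problem could be closed, which mirrors the two "deleted from the cache" outcomes described above.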
Tasks number 2, 3 and 4 are implemented as cron jobs running inside the same binary:
- Task number 2 runs every 2 minutes. It updates all events in the cache that do not yet have a Problem ID
  - This task is important because closing a Problem requires its Problem ID
- Task number 3 runs every 30 minutes. It resends events to Dynatrace before they expire (they expire after 2 hours if not refreshed)
- Task number 4 runs every 2 hours. It deletes events older than 5 days from the cache
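The schedule above can be expressed as a few small functions. The intervals and the 5-day limit come from the description; everything else in this Python sketch (function names, the `opened_at` and `problem_id` fields, the `last_run` bookkeeping) is hypothetical and only meant to show the logic.

```python
from datetime import datetime, timedelta, timezone

# Intervals and limits taken from the description above.
UPDATE_PROBLEM_IDS_EVERY = timedelta(minutes=2)
RESEND_EVENTS_EVERY = timedelta(minutes=30)
PURGE_STALE_EVERY = timedelta(hours=2)
STALE_AFTER = timedelta(days=5)


def events_missing_problem_id(events):
    """Task 2: keys of events that still need a Problem ID lookup."""
    return [key for key, event in events.items() if event.get("problem_id") is None]


def purge_stale(events, now):
    """Task 4: drop events opened more than STALE_AFTER ago; return the removed keys."""
    stale = [key for key, event in events.items() if now - event["opened_at"] > STALE_AFTER]
    for key in stale:
        del events[key]
    return stale


def due_tasks(last_run, now):
    """Return the names of tasks whose interval has elapsed since their last run."""
    schedule = {
        "update_problem_ids": UPDATE_PROBLEM_IDS_EVERY,
        "resend_events": RESEND_EVENTS_EVERY,
        "purge_stale": PURGE_STALE_EVERY,
    }
    return [name for name, every in schedule.items() if now - last_run[name] >= every]
```

With a 30-minute resend interval and a 2-hour expiry, an event gets several refresh attempts before it would expire, so a single failed resend does not by itself close the problem.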
The logs and the caches are stored at
TEMP_FOLDER/dynatrace-receiver
By default, on Linux this is /tmp/dynatrace-receiver
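When looking for the logs from a script, the directory can be located with a short helper. This Python sketch assumes that TEMP_FOLDER corresponds to the operating system's temporary directory as Python resolves it; the receiver itself may resolve it differently, and `receiver_data_dir` is an illustrative name.

```python
import os
import tempfile


def receiver_data_dir():
    """Best-guess location of the receiver's logs and caches."""
    # tempfile.gettempdir() honours the TMPDIR, TEMP and TMP environment
    # variables and otherwise falls back to a platform default such as /tmp.
    return os.path.join(tempfile.gettempdir(), "dynatrace-receiver")
```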