Troubleshoot issues on Kubernetes/OpenShift
Find out how to troubleshoot issues you might encounter in the following situations.
General troubleshooting
Debug logs
By default, OneAgent logs are located in /var/log/dynatrace/oneagent.
To debug Dynatrace Operator issues, run one of the following commands, depending on your platform (Kubernetes or OpenShift).
kubectl -n dynatrace logs -f deployment/dynatrace-operator
oc -n dynatrace logs -f deployment/dynatrace-operator
You might also want to check the logs from OneAgent pods deployed through Dynatrace Operator.
kubectl get pods -n dynatrace
NAME READY STATUS RESTARTS AGE
dynatrace-operator-64865586d4-nk5ng 1/1 Running 0 1d
dynakube-oneagent-<id> 1/1 Running 0 22h
kubectl logs dynakube-oneagent-<id> -n dynatrace
oc get pods -n dynatrace
NAME READY STATUS RESTARTS AGE
dynatrace-operator-64865586d4-nk5ng 1/1 Running 0 1d
dynakube-classic-8r2kq 1/1 Running 0 22h
oc logs dynakube-classic-8r2kq -n dynatrace
Troubleshoot common Dynatrace Operator setup issues using the troubleshoot subcommand
Dynatrace Operator version 0.9.0+
Run the command below to retrieve a basic output on DynaKube status, such as:
- Namespace: If the dynatrace namespace exists (name can be overwritten via parameter)
- DynaKube:
  - If the CustomResourceDefinition exists
  - If the CustomResource with the given name exists (name can be overwritten via parameter)
  - If the API URL ends with /api
  - If the secret name is the same as the DynaKube name (or .spec.tokens if used)
  - If the secret has apiToken and paasToken set
  - If the secret for customPullSecret is defined
- Environment: If your environment is reachable from the Dynatrace Operator pod using the same parameters as the Dynatrace Operator binary (such as proxy and certificate)
- OneAgent and ActiveGate image: If the registry is accessible; if the image is accessible from the Dynatrace Operator pod using the registry from the environment with the (custom) pull secret
kubectl exec deploy/dynatrace-operator -n dynatrace -- dynatrace-operator troubleshoot
If you use a different DynaKube name, add the --dynakube <your_dynakube_name> argument to the command.
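For example, combining the base command with this argument (using your own DynaKube name in place of the placeholder):
kubectl exec deploy/dynatrace-operator -n dynatrace -- dynatrace-operator troubleshoot --dynakube <your_dynakube_name>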
Example output if there are no errors for the above-mentioned fields:
{"level":"info","ts":"2022-09-12T08:45:21.437Z","logger":"dynatrace-operator-version","msg":"dynatrace-operator","version":"<operator version>","gitCommit":"<commithash>","buildDate":"<release date>","goVersion":"<go version>","platform":"<platform>"}
[namespace ] --- checking if namespace 'dynatrace' exists ...
[namespace ] √ using namespace 'dynatrace'
[dynakube ] --- checking if 'dynatrace:dynakube' Dynakube is configured correctly
[dynakube ] CRD for Dynakube exists
[dynakube ] using 'dynatrace:dynakube' Dynakube
[dynakube ] checking if api url is valid
[dynakube ] api url is valid
[dynakube ] checking if secret is valid
[dynakube ] 'dynatrace:dynakube' secret exists
[dynakube ] secret token 'apiToken' exists
[dynakube ] customPullSecret not used
[dynakube ] pull secret 'dynatrace:dynakube-pull-secret' exists
[dynakube ] secret token '.dockerconfigjson' exists
[dynakube ] proxy secret not used
[dynakube ] √ 'dynatrace:dynakube' Dynakube is valid
[dtcluster ] --- checking if tenant is accessible ...
[dtcluster ] √ tenant is accessible
Generate a support archive using the support-archive subcommand
Dynatrace Operator version 0.11.0+
Use support-archive to generate a support archive containing all the files that can be potentially useful for the RFA analysis:
- operator-version.txt: a file containing the current Operator version information
- logs: logs from all containers of the Dynatrace Operator pods in the Dynatrace Operator namespace (usually dynatrace); this also includes logs of previous containers, if available:
  - dynatrace-operator
  - dynatrace-webhook
  - dynatrace-oneagent-csi-driver
- manifests: the Kubernetes manifests for Dynatrace Operator components and deployed DynaKubes in the Dynatrace Operator namespace
- troubleshoot.txt: output of the troubleshooting command that is automatically executed by the support-archive subcommand
- supportarchive_console.log: complete output of the support-archive subcommand
Usage
To create a support archive, execute the following command.
kubectl exec -n dynatrace deployment/dynatrace-operator -- dynatrace-operator support-archive
The collected files are stored in a zipped tarball and can be downloaded from the pod using the kubectl cp command.
kubectl -n dynatrace cp <operator pod name>:/tmp/dynatrace-operator/operator-support-archive.tgz ./tmp/dynatrace-operator/operator-support-archive.tgz
The recommended approach is to use the --stdout parameter to stream the tarball directly to your disk.
kubectl exec -n dynatrace deployment/dynatrace-operator -- dynatrace-operator support-archive --stdout > operator-support-archive.tgz
If you use the --stdout parameter, all support archive command output is written to stderr so as not to corrupt the support archive tar file.
Sample output
The following is sample output from running support-archive with the --stdout parameter.
kubectl exec -n dynatrace deployment/dynatrace-operator -- dynatrace-operator support-archive --stdout > operator-support-archive.tgz
[support-archive] dynatrace-operator {"version": "v0.11.0", "gitCommit": "...", "buildDate": "...", "goVersion": "...", "platform": "linux/amd64"}
[support-archive] Storing operator version into operator-version.txt
[support-archive] Starting log collection
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-bdnpc/server.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-bdnpc/provisioner.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-bdnpc/registrar.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-bdnpc/liveness-probe.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-cb4pc/server.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-cb4pc/provisioner.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-cb4pc/registrar.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-cb4pc/liveness-probe.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-k8bl5/server.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-k8bl5/provisioner.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-k8bl5/registrar.log
[support-archive] Successfully collected logs logs/dynatrace-oneagent-csi-driver-k8bl5/liveness-probe.log
[support-archive] Successfully collected logs logs/dynatrace-operator-6d9fd9b9fc-sw5ll/dynatrace-operator.log
[support-archive] Successfully collected logs logs/dynatrace-webhook-7d84599455-bfkmp/webhook.log
[support-archive] Successfully collected logs logs/dynatrace-webhook-7d84599455-vhkrh/webhook.log
[support-archive] Starting K8S object collection
[support-archive] Collected manifest for manifests/injected_namespaces/Namespace-default.yaml
[support-archive] Collected manifest for manifests/dynatrace/Namespace-dynatrace.yaml
[support-archive] Collected manifest for manifests/dynatrace/Deployment-dynatrace-operator.yaml
[support-archive] Collected manifest for manifests/dynatrace/Deployment-dynatrace-webhook.yaml
[support-archive] Collected manifest for manifests/dynatrace/StatefulSet-dynakube-activegate.yaml
[support-archive] Collected manifest for manifests/dynatrace/DaemonSet-dynakube-oneagent.yaml
[support-archive] Collected manifest for manifests/dynatrace/DaemonSet-dynatrace-oneagent-csi-driver.yaml
[support-archive] Collected manifest for manifests/dynatrace/DynaKube-dynakube.yaml
Debug configuration and monitoring issues using the Kubernetes Monitoring Statistics extension
The Kubernetes Monitoring Statistics extension can help you:
- Troubleshoot your Kubernetes monitoring setup
- Troubleshoot your Prometheus integration setup
- Get detailed insights into queries from Dynatrace to the Kubernetes API
- Receive alerts when your Kubernetes monitoring setup experiences issues
- Get alerted on slow response times of your Kubernetes API
Monitoring setup errors
Pods stuck in Terminating state after upgrade
If your CSI driver and OneAgent pods get stuck in Terminating state after upgrading from Dynatrace Operator version 0.9.0, you need to manually delete the pods that are stuck.
Run the appropriate command below (Kubernetes or OpenShift).
kubectl delete pod -n dynatrace --selector=app.kubernetes.io/component=csi-driver,app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/version=0.9.0 --force --grace-period=0
oc delete pod -n dynatrace --selector=app.kubernetes.io/component=csi-driver,app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/version=0.9.0 --force --grace-period=0
Unable to retrieve the complete list of server APIs
Dynatrace Operator
Example error:
unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
If the Dynatrace Operator pod logs this error, you need to identify and fix the problematic services. To identify them:
- Check the available resources.
kubectl api-resources
- If the command returns the same error, list all API services and make sure none of them report False availability (see the sketch below).
kubectl get apiservice
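For example (a minimal sketch; the API service name below is taken from the example error and may differ in your cluster), you can narrow down and remove a stale API service like this:
# Show only API services that are not available
kubectl get apiservice | grep False
# If the backing service is permanently gone (for example, a removed metrics adapter),
# deleting the stale APIService object resolves the discovery error
kubectl delete apiservice v1beta1.external.metrics.k8s.io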
CrashLoopBackOff: Downgrading OneAgent is not supported, please uninstall the old version first
Dynatrace Operator
If you get this error, the OneAgent version installed on your host is later than the version you're trying to run.
Solution: First uninstall OneAgent from the host, and then select your desired version in the Dynatrace web UI or in DynaKube. To uninstall OneAgent, connect to the host and run the uninstall.sh script (the default location is /opt/dynatrace/oneagent/agent/uninstall.sh).
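For example, on a host reachable over SSH (user and host names are placeholders), the uninstall might look like this:
# Connect to the affected host
ssh <user>@<host>
# Run the OneAgent uninstall script from its default location
sudo /opt/dynatrace/oneagent/agent/uninstall.sh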
For CSI driver deployments, follow these steps instead (a command-line sketch follows the list):
- Delete the DynaKube custom resources.
- Delete the CSI driver manifest.
- Delete the /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com directory from all Kubernetes nodes.
- Reapply the CSI driver and DynaKube custom resources.
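A minimal sketch of these steps, assuming the CSI driver was installed from kubernetes-csi.yaml into the dynatrace namespace and that you can run commands on every node (for example over SSH):
# 1. Delete the DynaKube custom resources
kubectl delete dynakube --all -n dynatrace
# 2. Delete the CSI driver manifest (use openshift-csi.yaml on OpenShift)
kubectl delete -f kubernetes-csi.yaml
# 3. On every node, remove the CSI driver plugin directory
sudo rm -rf /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com
# 4. Reapply the CSI driver manifest and your DynaKube custom resources
kubectl apply -f kubernetes-csi.yaml
kubectl apply -f <your-dynakube>.yaml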
Crash loop on pods when installing OneAgent
Application-only monitoring
If you get a crash loop on the pods when you install OneAgent, you need to increase the CPU and memory limits of the pods.
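As an illustration only (the container name and values are placeholders; size them for your workload), the resources section of the application's pod spec could be raised like this:
apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    spec:
      containers:
        - name: my-app            # placeholder container name
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi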
Deployment seems successful but the dynatrace-oneagent container doesn't show up as ready
DaemonSet
kubectl get ds/dynatrace-oneagent --namespace=kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE-SELECTOR AGE
dynatrace-oneagent 1 1 0 1 0 beta.kubernetes.io/os=linux 14m
kubectl logs -f dynatrace-oneagent-abcde --namespace=kube-system
09:46:18 Started agent deployment as Docker image, PID 1234.
09:46:18 Agent installer can only be downloaded from secure location. Your installer URL should start with 'https': REPLACE_WITH_YOUR_URL
Change the value REPLACE_WITH_YOUR_URL in the dynatrace-oneagent.yml DaemonSet with the Dynatrace OneAgent installer URL.
Deployment seems successful, however the dynatrace-oneagent image can't be pulled
DaemonSet
Example error:
oc get pods
NAME READY STATUS RESTARTS AGE
dynatrace-oneagent-abcde 0/1 ErrImagePull 0 3s
oc logs -f dynatrace-oneagent-abcde
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: image can't be pulled
This is typically the case if the dynatrace service account hasn't been allowed to pull images from the RHCC.
Deployment seems successful, but the dynatrace-oneagent container doesn't produce meaningful logs
DaemonSet
Example error:
kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
dynatrace-oneagent-abcde 0/1 ContainerCreating 0 3s
kubectl logs -f dynatrace-oneagent-abcde --namespace=kube-system
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: ContainerCreating
oc get pods
NAME READY STATUS RESTARTS AGE
dynatrace-oneagent-abcde 0/1 ContainerCreating 0 3s
oc logs -f dynatrace-oneagent-abcde
Error from server (BadRequest): container "dynatrace-oneagent" in pod "dynatrace-oneagent-abcde" is waiting to start: ContainerCreating
This is typically the case if the container hasn't yet fully started. Simply wait a few more seconds.
Deployment seems successful, but the dynatrace-oneagent container isn't running
DaemonSet
oc process -f dynatrace-oneagent-template.yml ONEAGENT_INSTALLER_SCRIPT_URL="[oneagent-installer-script-url]" | oc apply -f -
daemonset "dynatrace-oneagent" created
Please note that quotes are needed to protect the special shell characters in the OneAgent installer URL.
oc get pods
No resources found.
This is typically the case if the dynatrace service account hasn't been configured to run privileged pods.
oc describe ds/dynatrace-oneagent
Name: dynatrace-oneagent
Image(s): dynatrace/oneagent
Selector: name=dynatrace-oneagent
Node-Selector: <none>
Labels: template=dynatrace-oneagent
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------------ -------
6m 3m 17 {daemon-set } Warning FailedCreate Error creating: pods "dynatrace-oneagent-" is forbidden: unable to validate against any security context constraint: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used]
Deployment was successful, but monitoring data isn't available in Dynatrace
DaemonSet
Example:
kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
dynatrace-oneagent-abcde 1/1 Running 0 1m
oc get pods
NAME READY STATUS RESTARTS AGE
dynatrace-oneagent-abcde 1/1 Running 0 1m
This is typically caused by a timing issue that occurs if application containers start before OneAgent is fully installed on the system. As a consequence, some parts of your application run uninstrumented. To be on the safe side, make sure OneAgent is fully integrated before you start your application containers. If your application is already running, restarting its containers has the same effect.
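For example, assuming your application runs as a Deployment (name and namespace are placeholders), a rolling restart re-creates the containers so OneAgent can instrument them:
kubectl rollout restart deployment <your-app> -n <your-namespace>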
No pods scheduled on control-plane nodes
DaemonSet
Kubernetes version 1.24+
Kubernetes versions 1.24+ changed the taints on master and control-plane nodes, so the OneAgent DaemonSet is missing the appropriate tolerations in the DynaKube custom resource.
To add the necessary tolerations, edit the DynaKube YAML as follows.
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists
Error when applying the custom resource on GKE
Example error:
Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.dynatrace.com": Post "https://dynatrace-webhook.dynatrace.svc:443/validate?timeout=2s": context deadline exceeded
If you are getting this error when trying to apply the custom resource on your GKE cluster, the firewall is blocking requests from the Kubernetes API to the Dynatrace Webhook because the required port (8443) is blocked by default.
The default allowed ports (443 and 10250) on GCP refer to the ports exposed by your nodes and pods, not the ports exposed by any Kubernetes services. For example, if the cluster control plane attempts to access a service on port 443 such as the Dynatrace webhook, but the service is implemented by a pod using port 8443, this is blocked by the firewall.
To fix this, add a firewall rule to explicitly allow ingress to port 8443.
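A sketch using the gcloud CLI (the rule name is arbitrary; the network, node tag, and control-plane CIDR are placeholders you need to look up for your cluster):
# Allow the GKE control plane to reach the webhook pod on port 8443
gcloud compute firewall-rules create allow-dynatrace-webhook \
  --direction INGRESS \
  --network <your-cluster-network> \
  --source-ranges <control-plane-cidr> \
  --target-tags <your-node-tag> \
  --allow tcp:8443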
For more information about this issue, see API request that triggers admission webhook timing out.
CannotPullContainerError
If you get errors like this on your pods when installing Dynatrace OneAgent, your Docker download rate limit has been exceeded.
CannotPullContainerError: inspect image has been retried [X] time(s): httpReaderSeeker: failed open: unexpected status code
For details, consult the Docker documentation.
Limit log timeframe
cloudNativeFullStack applicationMonitoring
Dynatrace Operator version 0.10.0+
If there's DiskPressure on your nodes, you can configure the CSI driver log garbage collection interval to lower the storage usage of the CSI driver. By default, logs are kept for 7 days before they are deleted from the file system. To edit this timeframe, select one of the options below, depending on your deployment mode.
Be careful when setting this value; you might need the logs to investigate problems.
- Edit the manifests of the CSI driver DaemonSet (kubernetes-csi.yaml, openshift-csi.yaml) by replacing the placeholder (<your_value>) with your value.
apiVersion: apps/v1
kind: DaemonSet
...
spec:
...
template:
...
spec:
...
containers:
...
- name: provisioner
...
env:
- name: MAX_UNMOUNTED_VOLUME_AGE
value: <your_value> # defined in days, must be a plain number. `0` means logs are immediately deleted. If not set, defaults to `7`.
- Apply the changes.
Edit values.yaml to set the maxUnmountedVolumeAge parameter under the csidriver section.
csidriver:
enabled: true
...
maxUnmountedVolumeAge: "" # defined in days, must be a plain number. `0` means logs are immediately deleted. If not set, defaults to `7`.
Connectivity issues between Dynatrace and your cluster
Problem with ActiveGate token
Example error on the ActiveGate deployment status page:
Problem with ActiveGate token (reason:Absent)
Example error on Dynatrace Operator logs:
{"level":"info","ts":"2022-09-22T06:49:17.351Z","logger":"dynakube-controller","msg":"reconciling DynaKube","namespace":"dynatrace","name":"dynakube"}
{"level":"info","ts":"2022-09-22T06:49:17.502Z","logger":"dynakube-controller","msg":"problem with token detected","dynakube":"dynakube","token":"APIToken","msg":"Token on secret dynatrace:dynakube missing scopes [activeGateTokenManagement.create]"}
Example error on DynaKube status:
status:
...
conditions:
- message: Token on secret dynatrace:dynakube missing scopes [activeGateTokenManagement.create]
reason: TokenScopeMissing
status: "False"
type: APIToken
Starting with Dynatrace Operator version 0.9.0, Dynatrace Operator handles the ActiveGate token by default. If you're getting one of these errors, follow the instructions below, according to your Dynatrace Operator version.
- For Dynatrace Operator versions earlier than 0.7.0: you need to upgrade to the latest Dynatrace Operator version.
- For Dynatrace Operator version 0.7.0 or later, but earlier than version 0.9.0: you need to create a new API token. For instructions, see Tokens and permissions required: Dynatrace Operator token.
ImagePullBackoff error on OneAgent and ActiveGate pods
The underlying host's container runtime doesn't contain the certificate presented by your endpoint.
The skipCertCheck field in the DynaKube YAML does not control this certificate check.
Example error (the error message may vary):
desc = failed to pull and unpack image "<environment>/linux/activegate:latest": failed to resolve reference "<environment>/linux/activegate:latest": failed to do request: Head "<environment>/linux/activegate/manifests/latest": x509: certificate signed by unknown authority
Warning Failed ... Error: ErrImagePull
Normal BackOff ... Back-off pulling image "<environment>/linux/activegate:latest"
Warning Failed ... Error: ImagePullBackOff
In this example, if the description on your pod shows x509: certificate signed by unknown authority, you must fix the certificates on your Kubernetes hosts, or use the private repository configuration to store the images.
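As a rough sketch only, assuming Debian/Ubuntu-based nodes running containerd (other distributions use different trust-store paths and tools), adding your CA certificate to a node's trust store could look like this:
# On every affected node: add the CA certificate to the system trust store
sudo cp my-registry-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
# Restart the container runtime so image pulls use the updated trust store
sudo systemctl restart containerd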
There was an error with the TLS handshake
The certificate for the communication is invalid or expired. If you're using a self-signed certificate, check the mitigation procedures for the ActiveGate.
Invalid bearer token
The bearer token is invalid and the request has been rejected by the Kubernetes API. Verify the bearer token and make sure it doesn't contain any whitespace. If you're connecting to a Kubernetes cluster API via a centralized external role-based access control (RBAC) mechanism, consult the documentation of the Kubernetes cluster manager. For Rancher, see the guidelines on the official Rancher website.
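For instance, if the token is stored in a Kubernetes secret (the secret name below is a placeholder), you can print it with non-printing characters made visible to spot stray whitespace or line breaks:
# "$" marks line ends and "^I" marks tabs; the token should be a single unbroken string
kubectl get secret <your-monitoring-token-secret> -n dynatrace -o jsonpath='{.data.token}' | base64 -d | cat -A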
Could not check credentials. Process is started by other user
There is already a pending request for this integration with an ActiveGate. Wait a couple of minutes and check again.
Internal error occurred: failed calling webhook (…) x509: certificate signed by unknown authority
If you get this error after applying the DynaKube custom resource, your Kubernetes API server may be configured with a proxy. You need to exclude https://dynatrace-webhook.dynatrace.svc from that proxy.
OneAgent unable to connect when using Istio
cloudNativeFullStack applicationMonitoring
Example error in the logs on the OneAgent pods: Initial connect: not successful - retrying after xs.
You can fix this problem by increasing the OneAgent timeout. Add the following feature flag to DynaKube:
Be sure to replace the placeholder (<...>) with the name of your DynaKube custom resource.
kubectl annotate dynakube <name-of-your-DynaKube> feature.dynatrace.com/oneagent-initial-connect-retry-ms=6000 -n dynatrace
Connectivity issues when using Calico
If you use Calico to handle or restrict network connections, you might experience connectivity issues, such as:
- The operator, webhook, and CSI driver pods are constantly restarting
- The operator cannot reach the API
- The CSI driver fails to download OneAgent
- Injection into pods doesn't work
If you experience these or similar problems, use our GitHub sample policies for common problems.
- For the activegate-policy.yaml and dynatrace-policies.yaml policies, if Dynatrace Operator isn't installed in the dynatrace namespace (Kubernetes) or project (OpenShift), you need to adapt the metadata and namespace properties in the YAML files accordingly.
- The purpose of the agent-policy.yaml and agent-policy-external-only.yaml policies is to let OneAgents that are injected into pods open external connections. Only agent-policy-external-only.yaml is required, while agent-policy.yaml additionally allows internal connections to be made, such as pod-to-pod connections, where needed.
- Because these policies are needed for all pods where OneAgent injects, you also need to adapt the podSelector property of the YAML files (a sketch of this follows the list).
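To illustrate what adapting the podSelector might look like (labels and namespace are placeholders, and this is not the actual content of the sample policies), a policy could be scoped to the injected pods like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-policy-external-only   # illustrative name, matching the sample policy
  namespace: <your-app-namespace>
spec:
  podSelector:
    matchLabels:
      <your-label-key>: <your-label-value>   # select the pods OneAgent injects into
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0            # allow external connections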
Potential issues when changing the monitoring mode
- Changing the monitoring mode from classicFullStack to cloudNativeFullStack affects the host ID calculations for monitored hosts, leading to new IDs being assigned and no connection between old and new entities.
- If you want to change the monitoring mode from applicationMonitoring or cloudNativeFullStack to classicFullStack or hostMonitoring, you need to restart all the pods that were previously instrumented with applicationMonitoring or cloudNativeFullStack.