Kafka monitoring

Deprecation notice

This extension documentation is now deprecated and will no longer be updated. We recommend using the new Kafka extension for improved functionality and support.

Apache Kafka is an open-source, distributed publish-subscribe message bus designed to be fast, scalable, and durable. Dynatrace automatically recognizes Kafka processes and instantly gathers Kafka metrics on the process and cluster levels.

For information on general Kafka message queue monitoring, see Custom messaging services.

Prerequisites

Dynatrace SaaS/Managed version 1.155+
Apache Kafka or Confluent-supported Kafka 0.9.0.1+
If you have more than one Kafka cluster, separate the clusters into individual process groups via an environment variable in Dynatrace settings
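
To illustrate the cluster separation described above: each broker can be started with an environment variable whose value names its cluster, and a process group detection rule in Dynatrace settings can then key on that variable. The following is a minimal sketch only. The variable name `KAFKA_CLUSTER_NAME` and the cluster names are hypothetical; use whatever variable you configure in your own Dynatrace settings.

```python
import os

# Hypothetical variable name; it must match the variable referenced
# by the rule you define in Dynatrace settings.
CLUSTER_ENV_VAR = "KAFKA_CLUSTER_NAME"


def broker_environment(cluster_name, base_env=None):
    """Return a copy of the environment with the cluster identifier set."""
    env = dict(os.environ if base_env is None else base_env)
    env[CLUSTER_ENV_VAR] = cluster_name
    return env


# Hypothetical cluster names: every broker of one cluster shares a value,
# and different clusters use different values.
env_orders = broker_environment("kafka-orders", base_env={})
env_billing = broker_environment("kafka-billing", base_env={})
```

The resulting dictionary could be passed as the `env` argument of `subprocess.Popen` when launching a broker, so that brokers of different clusters are distinguishable by that variable.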

Activation

Go to Settings.
Select Monitoring > Monitored technologies.
Find Kafka and turn on the Global monitoring switch.
After you turn Kafka monitoring on, Dynatrace automatically activates Kafka monitoring on all hosts and monitors all Kafka components.

Events

Name	Condition	Dynatrace event
Under-replicated partitions	Partition followers are out-of-sync with the leader	Performance (PERFORMANCE_EVENT)
Offline partitions	There are no partition leaders	Performance (PERFORMANCE_EVENT)
Cluster controller mismatch	There are multiple controllers detected by brokers	Error (ERROR_EVENT)
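
The conditions in the table above can be sketched as simple checks on cluster-level values. This is an illustration, not the extension's implementation: the `ClusterSnapshot` type and the thresholds (any value above zero, more than one controller) are assumptions, and the actual thresholds are configurable.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical container for cluster-level values at one point in time."""
    under_replicated_partitions: int
    offline_partitions: int
    active_controllers: int


def evaluate_events(snapshot):
    """Return (event name, event type) pairs for every condition that holds."""
    events = []
    # Assumed threshold: any under-replicated partition triggers the event
    if snapshot.under_replicated_partitions > 0:
        events.append(("Under-replicated partitions", "PERFORMANCE_EVENT"))
    # Assumed threshold: any partition without a leader triggers the event
    if snapshot.offline_partitions > 0:
        events.append(("Offline partitions", "PERFORMANCE_EVENT"))
    # More than one controller reported across the brokers
    if snapshot.active_controllers > 1:
        events.append(("Cluster controller mismatch", "ERROR_EVENT"))
    return events
```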

To customize problem detection thresholds for Kafka

Go to Settings.
Open Anomaly detection > Extension events and find Kafka in the list.

Metrics

Cluster metrics

Metric	Description
Partitions	All partition replicas available on this broker. The leader partition counts as a partition replica. This should be even across the cluster.
Under replicated partitions	The number of under-replicated partitions in the cluster. Under-replicated partitions indicate that replication is falling behind, so consumers may not be getting data and latency may grow.
Offline partitions	The number of partitions without active leaders and thus not writable.
Active cluster controllers	The number of active controllers in the cluster. An alert is raised if the aggregated sum across all brokers in the cluster is anything other than 1, because there should be exactly one controller per cluster.
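
The two cluster-level checks described in the table, the controller sum and the even distribution of partitions, can be sketched as follows. The function names and the per-broker dictionaries are hypothetical; the sketch only shows the aggregation, assuming each broker reports its own controller count and partition count.

```python
def active_controller_total(controllers_by_broker):
    """Sum the active controller count reported by each broker."""
    return sum(controllers_by_broker.values())


def controller_alert(controllers_by_broker):
    """True when the cluster-wide sum is anything other than 1."""
    return active_controller_total(controllers_by_broker) != 1


def partition_spread(partitions_by_broker):
    """Difference between the most and least loaded broker.

    A value close to zero means partitions are evenly distributed.
    """
    counts = list(partitions_by_broker.values())
    return max(counts) - min(counts)
```

For example, three brokers reporting controller counts of 0, 1, and 0 sum to 1 and raise no alert, while 1, 1, and 0 (or all zeros) would.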

Broker metrics

Metric	Description
Mean time	The mean time taken to flush the partition log to disk. A flush is triggered when either the configured flush interval or the maximum size is exceeded.
95th percentile	The 95th percentile of log flush time. Even a slight log flush time change can drastically affect Kafka performance.
Incoming byte rate	The rate of bytes the broker receives from clients (consumers, producers, and connectors).
Outgoing byte rate	The rate of bytes the broker sends to clients (consumers, producers, and connectors).
Partitions	All partition replicas available on this broker. The leader partition counts as a partition replica. This should be even across the cluster.
Under replicated partitions	The number of under-replicated partitions.
Produce request rate	The produce request rate.
Failed produce requests	The rate of produce requests that failed.
Produce latency	The produce latency.
Fetch request rate	The fetch request rate.
Failed fetch requests	The number of failed fetch requests.
Leader election rate	Election rates go up when there are broker failures.
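Unclean election rate	The rate of unclean leader elections, in which an out-of-sync replica becomes leader. Unclean elections can cause data loss.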
Leader count	Partition leaders on this broker.
Request queue size	Size of the request queue. A congested request queue will not be able to process incoming or outgoing requests.
Messages in rate	Messages in rate.
Max follower lag	The maximum lag, in messages, between follower and leader replicas. Lag is measured as the offset difference between the follower broker and the leader broker; the maximum lag is the lag of the partition that is most out of sync. In older Kafka versions the in-sync threshold was set by `replica.lag.max.messages`; that setting was removed in Kafka 0.9.0 in favor of the time-based `replica.lag.time.max.ms`.
ZooKeeper disconnects	The ZooKeeper client is disconnected from the ensemble: the client has lost its connection to a server and is trying to reconnect. The session is not necessarily expired.
ZooKeeper expires	The ZooKeeper session expiration rate. When a session expires, leader changes and even the election of a new controller can follow, so keep an eye on the number of such events across a Kafka cluster. If the overall number is high: (1) check the health of your network; (2) check for garbage collection issues and tune garbage collection accordingly; (3) if necessary, increase the session timeout by setting `zookeeper.session.timeout.ms`.
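
The max follower lag described in the table is an offset difference taken per partition, followed by a maximum. A minimal sketch, assuming hypothetical function names and that leader and follower offsets are already available per partition:

```python
def follower_lag(leader_offset, follower_offset):
    """Lag in messages: how far the follower's offset trails the leader's."""
    return max(0, leader_offset - follower_offset)


def max_follower_lag(offsets_by_partition):
    """Return the lag of the partition that is most out of sync.

    offsets_by_partition maps a partition name to a
    (leader_offset, follower_offset) pair.
    """
    return max(
        (follower_lag(leader, follower)
         for leader, follower in offsets_by_partition.values()),
        default=0,  # no partitions means no lag
    )
```

With offsets of (1500, 1500), (980, 950), and (2000, 1993) for three partitions, the per-partition lags are 0, 30, and 7, so the reported maximum is 30.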

Request metrics

Metric	Description
Requests per second	Requests per second.
Total time per request	Total time per request.

Kafka producer, consumer, and connect metrics

Metric	Description
Requests	Number of requests processed per second by client.
Request size	Average size of request in a one-minute frame.
Incoming/outgoing byte rate	Processed byte rate by client.
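
To make the "per second" and "one-minute frame" wording concrete, the sketch below keeps request samples in a sliding 60-second window and derives both values from it. It is an illustration under stated assumptions, not how Kafka clients or Dynatrace compute these metrics; the `RequestWindow` class and its methods are hypothetical.

```python
from collections import deque


class RequestWindow:
    """Sliding window of (timestamp, size in bytes) request samples."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._samples = deque()

    def record(self, timestamp, size_bytes):
        """Add one request and drop samples that left the window."""
        self._samples.append((timestamp, size_bytes))
        self._evict(timestamp)

    def _evict(self, now):
        # Remove samples older than the window length
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def requests_per_second(self, now):
        """Requests in the window divided by the window length."""
        self._evict(now)
        return len(self._samples) / self.window_seconds

    def average_request_size(self, now):
        """Mean request size over the window, or 0.0 if it is empty."""
        self._evict(now)
        if not self._samples:
            return 0.0
        return sum(size for _, size in self._samples) / len(self._samples)
```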
