Databricks monitoring & observability

Metric name	Metric key	Description	Unit
RDD Count	databricks.spark.rdd_count.gauge	Total number of Resilient Distributed Datasets currently tracked by the Spark application	Count
RDD Partitions	databricks.spark.rdd.num_partitions	Total number of partitions across all Resilient Distributed Datasets	Count
RDD Cached Partitions	databricks.spark.rdd.num_cached_partitions	Number of Resilient Distributed Dataset partitions currently cached in memory or disk	Count
RDD Memory Used	databricks.spark.rdd.memory_used	Amount of memory used to store Resilient Distributed Dataset data	Byte
RDD Disk Used	databricks.spark.rdd.disk_used	Amount of disk space used to store Resilient Distributed Dataset data	Byte

Metric name	Metric key	Description	Unit
Streaming Batch Duration	databricks.spark.streaming.statistics.batch_duration	Time interval configured for each streaming batch	MilliSecond
Streaming Receivers	databricks.spark.streaming.statistics.num_receivers	Total number of receivers configured for the streaming job	Count
Streaming Active Receivers	databricks.spark.streaming.statistics.num_active_receivers	Number of receivers actively ingesting data	Count
Streaming Inactive Receivers	databricks.spark.streaming.statistics.num_inactive_receivers	Number of receivers that are currently inactive	Count
Streaming Completed Batches	databricks.spark.streaming.statistics.num_total_completed_batches.count	Total number of batches that have been fully processed	Count
Streaming Retained Completed Batches	databricks.spark.streaming.statistics.num_retained_completed_batches.count	Number of completed batches retained in memory for monitoring or debugging	Unspecified
Streaming Active Batches	databricks.spark.streaming.statistics.num_active_batches	Number of streaming batches currently being processed	Count
Streaming Processed Records	databricks.spark.streaming.statistics.num_processed_records.count	Total number of records processed across all batches	Count
Streaming Received Records	databricks.spark.streaming.statistics.num_received_records.count	Total number of records received from all sources	Count
Streaming Avg Input Rate	databricks.spark.streaming.statistics.avg_input_rate	Average number of records received per second across batches	Byte
Streaming Avg Scheduling Delay	databricks.spark.streaming.statistics.avg_scheduling_delay	Average delay between batch creation and start of processing	MilliSecond
Streaming Avg Processing Time	databricks.spark.streaming.statistics.avg_processing_time	Average time taken to process each batch	MilliSecond
Streaming Avg Total Delay	databricks.spark.streaming.statistics.avg_total_delay	Average total delay from data ingestion to processing completion	MilliSecond
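
As a rough sketch of how the streaming metrics above might be read together: a streaming job generally keeps up with its input when the average processing time stays below the configured batch duration, and a scheduling delay larger than one batch interval suggests that batches are queuing. The function name, the result keys, and the one-batch-interval threshold below are illustrative assumptions, not part of the extension.

```python
def streaming_keeps_up(batch_duration_ms: float,
                       avg_processing_time_ms: float,
                       avg_scheduling_delay_ms: float) -> dict:
    """Summarize whether a streaming job appears to keep up with its input.

    Inputs correspond to the batch_duration, avg_processing_time and
    avg_scheduling_delay metrics, all in milliseconds.
    """
    if batch_duration_ms <= 0:
        raise ValueError("batch_duration_ms must be positive")
    # Fraction of each batch interval spent processing; >= 1.0 means falling behind.
    utilization = avg_processing_time_ms / batch_duration_ms
    return {
        "utilization": utilization,
        "keeping_up": utilization < 1.0,
        # Assumed threshold: waiting longer than one batch interval counts as queuing.
        "queuing": avg_scheduling_delay_ms > batch_duration_ms,
    }
```

For example, a job with a 1000 ms batch duration and a 1500 ms average processing time would be reported as not keeping up.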

Metric name	Metric key	Description	Unit
Application Count	databricks.spark.application_count.gauge	Number of applications running on Databricks	Count

Metric name	Metric key	Description	Unit
Stage Active Tasks	databricks.spark.job.stage.num_active_tasks	Number of tasks currently running in the stage	Count
Stage Completed Tasks	databricks.spark.job.stage.num_complete_tasks	Number of tasks that have successfully completed in the stage	Count
Stage Failed Tasks	databricks.spark.job.stage.num_failed_tasks	Number of tasks that failed during execution in the stage	Count
Stage Killed Tasks	databricks.spark.job.stage.num_killed_tasks	Number of tasks that were killed (e.g., due to job cancellation or speculative execution)	Count
Stage Executor Run Time	databricks.spark.job.stage.executor_run_time	Total time executors spent running tasks in the stage	MilliSecond
Stage Input Bytes	databricks.spark.job.stage.input_bytes	Total number of bytes read from input sources in the stage	Byte
Stage Input Records	databricks.spark.job.stage.input_records	Total number of records read from input sources in the stage	Count
Stage Output Bytes	databricks.spark.job.stage.output_bytes	Total number of bytes written to output destinations in the stage	Byte
Stage Output Records	databricks.spark.job.stage.output_records	Total number of records written to output destinations in the stage	Count
Stage Shuffle Read Bytes	databricks.spark.job.stage.shuffle_read_bytes	Total bytes read from other executors during shuffle operations	Byte
Stage Shuffle Read Records	databricks.spark.job.stage.shuffle_read_records	Total records read from other executors during shuffle operations	Count
Stage Shuffle Write Bytes	databricks.spark.job.stage.shuffle_write_bytes	Total bytes written to other executors during shuffle operations	Byte
Stage Shuffle Write Records	databricks.spark.job.stage.shuffle_write_records	Total records written to other executors during shuffle operations	Count
Stage Memory Bytes Spilled	databricks.spark.job.stage.memory_bytes_spilled	In-memory size of the data spilled during shuffle or aggregation operations	Byte
Stage Disk Bytes Spilled	databricks.spark.job.stage.disk_bytes_spilled	Amount of data spilled to disk due to insufficient memory during task execution	Byte

Metric name	Metric key	Description	Unit
Job Status	databricks.spark.job.status	Current status of the job (e.g., running, succeeded, failed)	Unspecified
Job Duration	databricks.spark.job.duration	Total time taken by the job from start to finish	Second
Job Total Tasks	databricks.spark.job.total_tasks	Total number of tasks planned for the job	Count
Job Active Tasks	databricks.spark.job.active_tasks	Number of tasks currently executing within the job	Count
Job Skipped Tasks	databricks.spark.job.skipped_tasks	Number of tasks skipped due to earlier failures or optimizations	Count
Job Failed Tasks	databricks.spark.job.failed_tasks	Number of tasks that failed during job execution	Count
Job Completed Tasks	databricks.spark.job.completed_tasks	Total number of tasks that have successfully completed	Count
Job Active Stages	databricks.spark.job.active_stages	Number of stages currently running in a Spark job	Count
Job Completed Stages	databricks.spark.job.completed_stages	Total number of stages that have successfully completed	Count
Job Skipped Stages	databricks.spark.job.skipped_stages	Number of stages skipped due to earlier failures or optimizations	Count
Job Failed Stages	databricks.spark.job.failed_stages	Number of stages that failed during job execution	Unspecified
Job Count	databricks.spark.job_count.gauge	Total number of Spark jobs submitted	Count
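
The task counters above can be combined into simple ratios for dashboards or alerts. The following is a minimal sketch, assuming the values of total_tasks, completed_tasks and failed_tasks have already been fetched for one job; the function name and the returned keys are hypothetical.

```python
def job_task_ratios(total_tasks: int, completed_tasks: int, failed_tasks: int) -> dict:
    """Return completed and failed task fractions for a single Spark job."""
    if total_tasks <= 0:
        # No tasks planned yet; report zero progress instead of dividing by zero.
        return {"completed": 0.0, "failed": 0.0}
    return {
        "completed": completed_tasks / total_tasks,
        "failed": failed_tasks / total_tasks,
    }
```

A job with 200 planned tasks, 150 completed and 10 failed yields a completed ratio of 0.75 and a failed ratio of 0.05.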

Metric name	Metric key	Description	Unit
CPU User %	databricks.hardware.cpu.usr	Percentage of CPU time spent on user processes	Percent
CPU Nice %	databricks.hardware.cpu.nice	Percentage of CPU time used by processes that have a positive niceness, meaning a lower priority than other tasks	Percent
CPU System %	databricks.hardware.cpu.sys	Percentage of CPU time spent on system processes	Percent
CPU IOWait %	databricks.hardware.cpu.iowait	Percentage of time the CPU spends idle while waiting for I/O operations to complete	Percent
CPU IRQ %	databricks.hardware.cpu.irq	Percentage of CPU time spent handling hardware interrupt requests (IRQs)	Percent
CPU Steal %	databricks.hardware.cpu.steal	Percentage of time a virtual CPU waits for a physical CPU while the hypervisor is servicing another virtual processor	Percent
CPU Idle %	databricks.hardware.cpu.idle	Percentage of time the CPU is idle	Percent
Memory Used	databricks.hardware.mem.used	Total memory currently in use, including buffers and cache	Byte
Memory Total	databricks.hardware.mem.total	Total physical memory installed on the system	KiloByte
Memory Free	databricks.hardware.mem.free	Portion of memory that is completely unused and available	KiloByte
Memory Buff/Cache	databricks.hardware.mem.buff_cache	Memory used by the system for buffers and cache to improve performance	KiloByte
Memory Shared	databricks.hardware.mem.shared	Memory shared between processes	KiloByte
Memory Available	databricks.hardware.mem.available	Total amount of memory available for use by the system	KiloByte
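
Note that the table lists Memory Used in Byte while the other memory metrics are listed in KiloByte. If the reported values follow those units, they need to be normalized before being compared. The sketch below shows one way to do that; the function name is hypothetical, and the conversion factor of 1000 bytes per kilobyte is an assumption that should be checked against the actual metric values (use 1024 if the values turn out to be kibibytes).

```python
BYTES_PER_KILOBYTE = 1000  # assumption: decimal kilobytes; pass 1024 for kibibytes


def memory_used_percent(mem_used_bytes: float,
                        mem_total_kilobytes: float,
                        bytes_per_kilobyte: int = BYTES_PER_KILOBYTE) -> float:
    """Return memory utilization in percent after converting both values to bytes."""
    total_bytes = mem_total_kilobytes * bytes_per_kilobyte
    if total_bytes <= 0:
        raise ValueError("total memory must be positive")
    return 100.0 * mem_used_bytes / total_bytes
```

With 4,000,000 bytes used and a total of 8,000 kilobytes, the function returns 50.0.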

Metric name	Metric key	Description	Unit
Executor RDD Blocks	databricks.spark.executor.rdd_blocks	Number of Resilient Distributed Dataset blocks stored in memory or disk by the executor	Count
Executor Memory Used	databricks.spark.executor.memory_used	The amount of memory currently used by the executor for execution and storage tasks	Byte
Executor Disk Used	databricks.spark.executor.disk_used	Disk used by the Spark executor	Byte
Executor Active Tasks	databricks.spark.executor.active_tasks	Number of tasks currently executing on the executor	Count
Executor Failed Tasks	databricks.spark.executor.failed_tasks	Number of failed tasks on the Spark executor	Count
Executor Completed Tasks	databricks.spark.executor.completed_tasks	Number of completed tasks on the Spark executor	Count
Executor Total Tasks	databricks.spark.executor.total_tasks	Total number of tasks executed by the executor	Count
Executor Duration	databricks.spark.executor.total_duration.count	Total time the Spark executor has spent running tasks	MilliSecond
Executor Input Bytes	databricks.spark.executor.total_input_bytes.count	Total number of bytes read from input sources by tasks on the executor	Byte
Executor Shuffle Read	databricks.spark.executor.total_shuffle_read.count	Total data read by the executor during shuffle operations (from other executors)	Byte
Executor Shuffle Write	databricks.spark.executor.total_shuffle_write.count	Total data written by the executor during shuffle operations (to other executors)	Byte
Executor Max Memory	databricks.spark.executor.max_memory	The maximum amount of memory allocated to the executor by Spark	Byte
Executor Alive Count	databricks.spark.executor.alive_count.gauge	Number of executors currently alive on the Databricks cluster	Count
Executor Dead Count	databricks.spark.executor.dead_count.gauge	Number of dead executors in the Spark application	Count

Metric name	Metric key	Description	Unit
Databricks Cluster Upsizing Time	databricks.cluster.upsizing_time	Time spent upsizing cluster	MilliSecond
