Databricks

Monitor your Databricks Clusters via its multiple APIs!


Overview

This OneAgent Extension allows you to collect metrics from your embedded Ganglia instance, the Apache Spark APIs, and/or the Databricks API on your Databricks Cluster.

NOTE: Databricks Runtime v13+ no longer supports Ganglia; use the Spark and Databricks API options in the configuration instead.

This is intended for users who:

  • Have Databricks cluster(s) for which they would like to monitor job statuses and other important job- and cluster-level metrics

  • Want to analyze uptime and autoscaling issues of their Databricks cluster(s)

This enables you to:

  • Monitor job, cluster, and infrastructure metrics
  • Detect long upscaling times
  • Detect and filter Driver and Worker types

Get started

  1. Define in the configuration which metrics you'd like to collect from your Databricks Clusters

  2. Set up a global init script on your Databricks Cluster to download the Dynatrace OneAgent

  3. Start or Restart your Databricks Cluster to enable the Dynatrace OneAgent and this extension.

Details

  1. Ensure the EEC is enabled on each host. This can be done globally by turning on the first two options under
    • Settings -> Preferences -> Extension Execution Controller (/ui/settings/builtin:eec.local)

    A sketch for checking this via the Settings API follows at the end of this section.

  2. Create a Databricks API token from inside your Databricks workspace

    • User Settings -> Create API Token
  3. Copy your Databricks URL

  4. Copy the Linux OneAgent installation wget command from

    • Deploy Dynatrace -> Start Installation button -> Linux button -> Enter or Create PaaS Token (#install/agentlinux;gf=all)

NOTE: Databricks clusters can go up and down quickly, causing multiple HOST entities within Dynatrace. Databricks reuses IP addresses, so if you'd like to keep the same HOST entities for your clusters, add the flag --set-host-id-source="ip-addresses" to the OneAgent installation command in your global init script. For example:

/bin/sh Dynatrace-OneAgent-Linux.sh --set-infra-only=true --set-app-log-content-access=true --set-host-id-source="ip-addresses"
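
If you'd rather verify step 1 without clicking through the UI, the Dynatrace Settings API can list the persisted objects for the builtin:eec.local schema shown above. This is a minimal sketch, not part of the extension itself; it assumes an API token with the settings.read scope, and the <TENANT> and token placeholders are illustrative.

    # Minimal sketch: list the EEC (Extension Execution Controller) settings objects.
    # Assumes an API token with the settings.read scope; replace <TENANT> and the token placeholder.
    curl -G "https://<TENANT>.live.dynatrace.com/api/v2/settings/objects" \
      -H "Authorization: Api-Token <settings.read API-TOKEN>" \
      --data-urlencode "schemaIds=builtin:eec.local" \
      --data-urlencode "fields=objectId,scope,value"

The returned value objects let you confirm the relevant options are enabled per host or host group without opening each settings page.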

Configuration for Apache Spark & Databricks API Metrics (Recommended)

  1. Set up the global init script on your Databricks cluster (a sketch for registering it via the Databricks API follows this list)

    • Change the Dynatrace tenant & API token values
    • NOTE: If your Databricks cluster does not have network access to your Dynatrace cluster or ActiveGate, the Dynatrace-OneAgent-Linux.sh file can be manually uploaded to your Databricks DBFS and the script below can be modified to use that location instead of the wget command.
    #!/usr/bin/env bash
    
    wget  -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
    /bin/sh Dynatrace-OneAgent-Linux.sh --set-infra-only=true --set-app-log-content-access=true --set-host-id-source="ip-addresses"
    
    
  2. Configure the OneAgent extension in your Dynatrace cluster from

    • Extensions -> Databricks (/ui/hub/ext/com.dynatrace.databricks) -> Add Monitoring Configuration -> Select Databricks Hosts ->
    • Enable the Call Spark API slider
    • Enable the Call Databricks API slider
      • Enter your Databricks URL and user token
  3. Select which feature sets of metrics you'd like to capture

  4. Start (or restart, if you're using an existing All-purpose compute cluster) your Databricks clusters and ensure the OneAgent is connected

  5. Verify metrics show up on the HOST screen of your Databricks cluster's driver node. All the metrics will be attached to that HOST entity.
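
If you manage Databricks configuration as code, the init script from step 1 can also be registered through the Databricks global init scripts REST API instead of the workspace UI. The sketch below is an example under stated assumptions: the file name dynatrace-oneagent-init.sh is hypothetical, and DB_WS_URL / DB_WS_TOKEN stand in for your workspace URL and a personal access token.

    # Minimal sketch: register the step 1 script as an enabled global init script via the Databricks API.
    # DB_WS_URL / DB_WS_TOKEN and the script file name are placeholders -- adjust to your workspace.
    DB_WS_URL="https://adb-XXXXXXXXX.XX.azuredatabricks.net"
    DB_WS_TOKEN="dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX"

    # The API expects the script body base64-encoded (on macOS, use plain `base64` without -w0).
    SCRIPT_B64=$(base64 -w0 dynatrace-oneagent-init.sh)

    curl -X POST "${DB_WS_URL}/api/2.0/global-init-scripts" \
      -H "Authorization: Bearer ${DB_WS_TOKEN}" \
      -H "Content-Type: application/json" \
      -d "{\"name\": \"install-dynatrace-oneagent\", \"script\": \"${SCRIPT_B64}\", \"enabled\": true}"

New or restarted clusters pick the script up on their next start, which lines up with step 4 above.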

Configuration for Ganglia (Legacy)

  1. Create a Dynatrace API token with ReadConfig permissions

  2. Set up Global Init Script on Databricks Cluster

    • Change Dynatrace Tenant & API Token values
    • Change DB_WS_URL & DB_WS_TOKEN Values (from steps above)
    • NOTE: If your Databricks cluster does not have network access to your Dynatrace cluster or ActiveGate, the OneAgent.sh and extension zip file can be manually uploaded to your Databricks DBFS and the script below can be modified to use those locations instead of the wget commands.
    #!/usr/bin/env bash
    
    wget  -O Dynatrace-OneAgent-Linux.sh "https://<TENANT>.live.dynatrace.com/api/v1/deployment/installer/agent/unix/default/latest?arch=x86&flavor=default" --header="Authorization: Api-Token <Installer API-TOKEN>"
    /bin/sh Dynatrace-OneAgent-Linux.sh --set-infra-only=true --set-app-log-content-access=true --set-host-id-source="ip-addresses"
    
    # token with 'ReadConfig' permissions
    wget -O custom_python_databricks_ganglia.zip "https://<TENANT>.live.dynatrace.com/api/config/v1/extensions/custom.python.databricks_ganglia/binary" --header="Authorization: Api-Token <ReadConfig API-TOKEN>"
    unzip custom_python_databricks_ganglia.zip -d /opt/dynatrace/oneagent/plugin_deployment/
    
    # Add Databricks Workspace URL Environment Variable
    cat <<EOF | sudo tee /etc/databricks_env
    DB_WS_URL=https://adb-XXXXXXXXX.XX.azuredatabricks.net
    DB_WS_TOKEN=dapiXXXXXXXXXXXXXXXXXXXXXXXXXXX
    EOF
    
  3. Create a Dynatrace API token with entities.read & entities.write permissions (a sketch for doing this via the API follows this list).

  4. Configure the OneAgent extension in your Dynatrace cluster from

    • Extensions -> Databricks (/ui/hub/ext/com.dynatrace.databricks) -> Add Monitoring Configuration -> Select Databricks Hosts -> Enable the Call Ganglia API slider
    1. Enter your Dynatrace tenant URL and API token
    2. Select which metrics you'd like to capture from Ganglia
  5. Start (or restart if you're using an existing All-purpose compute cluster) your Databricks Clusters and ensure the OneAgent is connected

  6. Verify Metrics are showing up on the included Dashboard
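
The token from step 3 can also be created through the Dynatrace API Tokens endpoint rather than the UI. This is a minimal sketch, assuming you already hold a token with the apiTokens.write scope; the token name used here is purely illustrative.

    # Minimal sketch: create the entities.read / entities.write token required in step 3.
    # Assumes an existing token with the apiTokens.write scope; replace <TENANT> and the token placeholder.
    curl -X POST "https://<TENANT>.live.dynatrace.com/api/v2/apiTokens" \
      -H "Authorization: Api-Token <apiTokens.write API-TOKEN>" \
      -H "Content-Type: application/json" \
      -d '{"name": "databricks-ganglia-extension", "scopes": ["entities.read", "entities.write"]}'

The response JSON contains the new token value, which is what you enter along with your tenant URL in step 4.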


Extension content

Content type | Number of items included
screen metric tables | 1
metric metadata | 74
dashboards | 1
screen injections | 9
screen chart groups | 8

Feature sets

Below is a complete list of the feature sets provided in this version. To ensure a good fit for your needs, individual feature sets can be activated and deactivated by your administrator during configuration.

Metric name | Metric key | Description | Unit
Stage Active Tasks | databricks.spark.job.stage.num_active_tasks | - | Count
Stage Completed Tasks | databricks.spark.job.stage.num_complete_tasks | - | Count
Stage Failed Tasks | databricks.spark.job.stage.num_failed_tasks | - | Count
Stage Killed Tasks | databricks.spark.job.stage.num_killed_tasks | - | Count
Stage Executor Run Time | databricks.spark.job.stage.executor_run_time | - | MilliSecond
Stage Input Bytes | databricks.spark.job.stage.input_bytes | - | Byte
Stage Input Records | databricks.spark.job.stage.input_records | - | Count
Stage Output Bytes | databricks.spark.job.stage.output_bytes | - | Byte
Stage Output Records | databricks.spark.job.stage.output_records | - | Count
Stage Shuffle Read Bytes | databricks.spark.job.stage.shuffle_read_bytes | - | Byte
Stage Shuffle Read Records | databricks.spark.job.stage.shuffle_read_records | - | Count
Stage Shuffle Write Bytes | databricks.spark.job.stage.shuffle_write_bytes | - | Byte
Stage Shuffle Write Records | databricks.spark.job.stage.shuffle_write_records | - | Count
Stage Memory Bytes Spilled | databricks.spark.job.stage.memory_bytes_spilled | - | Byte
Stage Disk Bytes Spilled | databricks.spark.job.stage.disk_bytes_spilled | - | Byte

Metric name | Metric key | Description | Unit
Job Status | databricks.spark.job.status | - | Unspecified
Job Duration | databricks.spark.job.duration | - | Second
Job Total Tasks | databricks.spark.job.total_tasks | - | Count
Job Active Tasks | databricks.spark.job.active_tasks | - | Count
Job Skipped Tasks | databricks.spark.job.skipped_tasks | - | Count
Job Failed Tasks | databricks.spark.job.failed_tasks | - | Count
Job Completed Tasks | databricks.spark.job.completed_tasks | - | Count
Job Active Stages | databricks.spark.job.active_stages | - | Count
Job Completed Stages | databricks.spark.job.completed_stages | - | Count
Job Skipped Stages | databricks.spark.job.skipped_stages | - | Count
Job Failed Stages | databricks.spark.job.failed_stages | - | Unspecified

Metric name | Metric key | Description | Unit
CPU User % | databricks.hardware.cpu.usr | - | Percent
CPU Nice % | databricks.hardware.cpu.nice | - | Percent
CPU System % | databricks.hardware.cpu.sys | - | Percent
CPU IOWait % | databricks.hardware.cpu.iowait | - | Percent
CPU IRQ % | databricks.hardware.cpu.irq | - | Percent
CPU Steal % | databricks.hardware.cpu.steal | - | Percent
CPU Idle % | databricks.hardware.cpu.idle | - | Percent
Memory Used | databricks.hardware.mem.used | - | Byte
Memory Total | databricks.hardware.mem.total | - | KiloByte
Memory Free | databricks.hardware.mem.free | - | KiloByte
Memory Buff/Cache | databricks.hardware.mem.buff_cache | - | KiloByte

Metric name | Metric key | Description | Unit
Executor RDD Blocks | databricks.spark.executor.rdd_blocks | - | Count
Executor Memory Used | databricks.spark.executor.memory_used | - | Byte
Executor Disk Used | databricks.spark.executor.disk_used | - | Byte
Executor Active Tasks | databricks.spark.executor.active_tasks | - | Count
Executor Failed Tasks | databricks.spark.executor.failed_tasks | - | Count
Executor Completed Tasks | databricks.spark.executor.completed_tasks | - | Count
Executor Total Tasks | databricks.spark.executor.total_tasks | - | Count
Executor Duration | databricks.spark.executor.total_duration.count | - | MilliSecond
Executor Input Bytes | databricks.spark.executor.total_input_bytes.count | - | Byte
Executor Shuffle Read | databricks.spark.executor.total_shuffle_read.count | - | Byte
Executor Shuffle Write | databricks.spark.executor.total_shuffle_write.count | - | Byte
Executor Max Memory | databricks.spark.executor.max_memory | - | Byte
Executor Alive Count | databricks.spark.executor.alive_count.gauge | - | Count
Executor Dead Count | databricks.spark.executor.dead_count.gauge | - | Count

Metric name | Metric key | Description | Unit
Databricks Cluster Upsizing Time | databricks.cluster.upsizing_time | Time spent upsizing cluster | MilliSecond

Metric name | Metric key | Description | Unit
RDD Count | databricks.spark.rdd_count.gauge | - | Count
RDD Partitions | databricks.spark.rdd.num_partitions | - | Count
RDD Cached Partitions | databricks.spark.rdd.num_cached_partitions | - | Count
RDD Memory Used | databricks.spark.rdd.memory_used | - | Byte
RDD Disk Used | databricks.spark.rdd.disk_used | - | Byte

Metric name | Metric key | Description | Unit
Streaming Batch Duration | databricks.spark.streaming.statistics.batch_duration | - | MilliSecond
Streaming Receivers | databricks.spark.streaming.statistics.num_receivers | - | Count
Streaming Active Receivers | databricks.spark.streaming.statistics.num_active_receivers | - | Count
Streaming Inactive Receivers | databricks.spark.streaming.statistics.num_inactive_receivers | - | Count
Streaming Completed Batches | databricks.spark.streaming.statistics.num_total_completed_batches.count | - | Count
Streaming Retained Completed Batches | databricks.spark.streaming.statistics.num_retained_completed_batches.count | - | Unspecified
Streaming Active Batches | databricks.spark.streaming.statistics.num_active_batches | - | Count
Streaming Processed Records | databricks.spark.streaming.statistics.num_processed_records.count | - | Count
Streaming Received Records | databricks.spark.streaming.statistics.num_received_records.count | - | Count
Streaming Avg Input Rate | databricks.spark.streaming.statistics.avg_input_rate | - | Byte
Streaming Avg Scheduling Delay | databricks.spark.streaming.statistics.avg_scheduling_delay | - | MilliSecond
Streaming Avg Processing Time | databricks.spark.streaming.statistics.avg_processing_time | - | MilliSecond
Streaming Avg Total Delay | databricks.spark.streaming.statistics.avg_total_delay | - | MilliSecond

Metric name | Metric key | Description | Unit
Application Count | databricks.spark.application_count.gauge | - | Count
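
Once a feature set is enabled and the OneAgent on the driver node is connected, any of the metric keys above can be queried directly. The sketch below uses the Dynatrace Metrics v2 API and assumes an API token with the metrics.read scope; databricks.spark.job.duration is just one example key from the tables.

    # Minimal sketch: query one of the extension's metric keys for the last two hours.
    # Assumes an API token with the metrics.read scope; replace <TENANT> and the token placeholder.
    curl -G "https://<TENANT>.live.dynatrace.com/api/v2/metrics/query" \
      -H "Authorization: Api-Token <metrics.read API-TOKEN>" \
      --data-urlencode "metricSelector=databricks.spark.job.duration" \
      --data-urlencode "from=now-2h"

An empty result typically means the corresponding feature set is disabled in the monitoring configuration or the cluster has not run any jobs in the selected timeframe.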

Related to Databricks

Databricks Workspace

Remotely monitor your Databricks Workspaces!

Full version history

For more information on how to install the downloaded package, follow the instructions on this page.

v1.5.6

  • DXS-3250
    • Update Library Versions


v1.5.5

  • New Feature Set - Hardware Metrics

  • DXS-1597

    • Adds new configuration option - Aggregate Dimensions for Spark API Metrics
  • Updates to how Spark API is called

  • UA Screen updates

  • DXS-1920

    • Adds retry logic to determine driver node during start up of extension
  • Adds ability to ingest Spark Jobs as traces

    • NOTE: Depending on the number of Spark jobs, this could be a significant number of traces and could increase licensing costs.
  • Adds ability to ingest Spark Config as Log Messages


v1.02

  • Initial Release of Extensions 2.0 version of Databricks Extension
  • Offers Support for Ganglia APIs (Legacy), Apache Spark APIs, and Databricks APIs