Amazon SageMaker (Batch Transform Jobs, Endpoint Instances, Endpoints, Ground Truth, Processing Jobs, Training Jobs) monitoring
Dynatrace ingests metrics for multiple preselected namespaces, including Amazon SageMaker. You can view metrics for each service instance, split metrics into multiple dimensions, and create custom charts that you can pin to your dashboards.
Prerequisites
To enable monitoring for this service, you need
- ActiveGate version 1.181+, as follows:
- For Dynatrace SaaS deployments, you need an Environment ActiveGate or a Multi-environment ActiveGate.
- For Dynatrace Managed deployments, you can use any kind of ActiveGate.
Note: For role-based access (whether in a SaaS or Managed deployment), you need an Environment ActiveGate installed on an Amazon EC2 host.
- Dynatrace version 1.182+
- An updated AWS monitoring policy that includes the additional AWS services.
To update the AWS IAM policy, use the JSON below, which contains the monitoring policy (permissions) for all supported services.
If you don't want to add permissions for all services and prefer to grant permissions for certain services only, consult the table below. The table contains a set of permissions that are required for all services (All monitored Amazon services) and, for each supported service, a list of optional permissions specific to that service.
Example of a JSON policy for a single service.
In this example, from the complete list of permissions, you need to select:
- "apigateway:GET" for Amazon API Gateway
- "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", and "ec2:DescribeAvailabilityZones" for All monitored Amazon services.
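For orientation, here is a minimal sketch of what the resulting IAM policy document could look like for this example. The statement ID is a hypothetical placeholder, and the structure follows the standard IAM JSON policy format; verify the final permission list against the table above.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DynatraceMonitoringExample",
      "Effect": "Allow",
      "Action": [
        "apigateway:GET",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "sts:GetCallerIdentity",
        "tag:GetResources",
        "tag:GetTagKeys",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    }
  ]
}
```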
Enable monitoring
To enable monitoring for this service, you first need to integrate Dynatrace with Amazon Web Services.
Add the service to monitoring
To view the service metrics, you must add the service to monitoring in your Dynatrace environment.
Note: Once AWS cloud services are added to monitoring, you might have to wait 15-20 minutes before the metric values are displayed.
All cloud services consume Davis data units (DDUs). The amount of DDU consumption per service instance depends on the number of monitored metrics and their dimensions (each metric dimension results in the ingestion of 1 data point; 1 data point consumes 0.001 DDUs).
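As a back-of-the-envelope illustration (the metric and dimension counts below are hypothetical), the per-instance consumption adds up as follows:

```python
# Hypothetical estimate of DDU consumption for one service instance, using
# the rule stated above: each metric dimension ingests 1 data point, and
# 1 data point consumes 0.001 DDUs.
DDUS_PER_DATA_POINT = 0.001

monitored_metrics = 10      # hypothetical: number of monitored metrics
dimensions_per_metric = 2   # hypothetical: dimensions per metric

data_points = monitored_metrics * dimensions_per_metric
ddus = data_points * DDUS_PER_DATA_POINT
print(f"{data_points} data points -> {ddus:.3f} DDUs per ingestion cycle")
# 20 data points -> 0.020 DDUs per ingestion cycle
```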
Monitor resources based on tags
You can choose to monitor resources based on existing AWS tags, as Dynatrace automatically imports them from service instances. However, the transition from AWS to Dynatrace tagging isn't supported for all AWS services. Expand the table below to see which cloud services can be filtered by tagging. (An AWS-side tagging sketch is shown after the steps below.)
To monitor resources based on tags
- In the Dynatrace menu, go to Settings > Cloud and virtualization > AWS and select Edit for the desired AWS instance.
- For Resources to be monitored, select Monitor resources selected by tags.
- Enter the Key and Value.
- Select Save.
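The Key and Value you enter must match tags that already exist on the AWS resources. As a minimal sketch, assuming boto3 and a hypothetical endpoint ARN, tag key, and tag value, you could attach such a tag on the AWS side like this:

```python
# Hypothetical sketch: attach a tag to a SageMaker endpoint on the AWS side
# so that the Dynatrace tag filter (Key/Value above) can match the resource.
# The resource ARN, tag key, and tag value are placeholders.
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.add_tags(
    ResourceArn="arn:aws:sagemaker:us-east-1:123456789012:endpoint/my-endpoint",
    Tags=[{"Key": "monitored-by", "Value": "dynatrace"}],
)
```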
Configure service metrics
Once you add a service, Dynatrace automatically starts collecting a suite of recommended metrics for that service. Apart from the recommended metrics, most services also offer optional metrics that you can enable. You can remove or edit any of the existing metrics and, where multiple dimensions are available, any of their dimensions. Metrics with only one dimension can't be edited; they can only be added or removed.
Service-wide metrics are metrics for the whole service across all regions. Typically, these metrics include dimensions containing Region in their name. If selected, these metrics are displayed on a separate chart when viewing your AWS deployment in Dynatrace. Keep in mind that available dimensions differ among services.
To change a metric's statistics, you have to recreate that metric and choose different statistics. You can choose among the following statistics: Sum, Minimum, Maximum, Average, and Sample count. The Average + Minimum + Maximum statistics enable you to collect all three statistics as one metric instead of three separate metrics with one statistic each. This can reduce your expenses for retrieving metrics from your AWS deployment.
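To picture why combining statistics is cheaper, here is a minimal sketch, assuming boto3 and hypothetical endpoint and variant names, that retrieves Average, Minimum, and Maximum for a SageMaker metric in a single CloudWatch request instead of three:

```python
# Minimal sketch: retrieve Average, Minimum, and Maximum for a SageMaker
# metric in a single CloudWatch request instead of three separate ones.
# The endpoint and variant names are hypothetical placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Minimum", "Maximum"],  # three statistics, one call
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Minimum"], point["Maximum"])
```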
To be able to save a newly added metric, you need to select at least one statistic and one dimension.
Note: Once AWS cloud services are configured, you might have to wait 15-20 minutes before the metric values are displayed.
View service metrics
You can view the service metrics in your Dynatrace environment either on the custom device overview page or on your Dashboards page.
View metrics on the custom device overview page
To access the custom device overview page
- In the Dynatrace menu, go to Technologies and processes.
- Filter by service name and select the relevant custom device group to open the custom device group overview page.
- The custom device group overview page lists all instances (custom devices) belonging to the group. Select an instance to view its custom device overview page.
View metrics on your dashboard
You can also view metrics in the Dynatrace web UI on dashboards. There is no preset dashboard available for this service, but you can create your own dashboard.
To check the availability of preset dashboards for each AWS service, see the list below.
Available metrics
Amazon SageMaker Batch Transform Jobs
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | ✔️ |
GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
Amazon SageMaker Processing Jobs, Amazon SageMaker Training Jobs
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | Region, Host | ✔️ |
GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | ✔️ |
Amazon SageMaker Endpoint Instances
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | ✔️ |
DiskUtilization | The percentage of disk space used by the containers on an instance. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | ✔️ |
GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | ✔️ |
GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | ✔️ |
LoadedModelCount | The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. | None | Average | EndpointName, VariantName | |
LoadedModelCount | | None | Sum | EndpointName, VariantName | |
MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | EndpointName, VariantName | ✔️ |
Amazon SageMaker Endpoints
Name | Description | Unit | Statistics | Dimensions | Recommended |
---|---|---|---|---|---|
Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
Invocation4XXErrors | | None | Sum | EndpointName, VariantName | |
Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
Invocation5XXErrors | | None | Sum | EndpointName, VariantName | ✔️ |
Invocations | The number of InvokeEndpoint requests sent to a model endpoint. | None | Sum | EndpointName, VariantName | ✔️ |
Invocations | | None | Count | EndpointName, VariantName | |
InvocationsPerInstance | The number of invocations sent to a model, normalized by InstanceCount in each ProductionVariant. 1/numberOfInstances is sent as the value on each request, where numberOfInstances is the number of active instances for the ProductionVariant behind the endpoint at the time of the request. | None | Sum | EndpointName, VariantName | |
ModelCacheHit | The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded. | None | Sum | EndpointName, VariantName | |
ModelCacheHit | | None | Average | EndpointName, VariantName | |
ModelCacheHit | | None | Count | EndpointName, VariantName | |
ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. | Microseconds | Multi | EndpointName, VariantName | ✔️ |
ModelLatency | | Microseconds | Sum | EndpointName, VariantName | |
ModelLatency | | Microseconds | Count | EndpointName, VariantName | |
ModelLoadingTime | The interval of time that it took to load the model through the container's LoadModel API call. | Microseconds | Multi | EndpointName, VariantName | |
ModelLoadingTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelLoadingTime | | Microseconds | Count | EndpointName, VariantName | |
ModelLoadingWaitTime | The interval of time that an invocation request has waited for the target model to be downloaded, loaded, or both in order to perform inference. | Microseconds | Multi | EndpointName, VariantName | |
ModelLoadingWaitTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelLoadingWaitTime | | Microseconds | Count | EndpointName, VariantName | |
ModelDownloadingTime | The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). | Microseconds | Multi | EndpointName, VariantName | |
ModelDownloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelDownloadingTime | | Microseconds | Count | EndpointName, VariantName | |
ModelUnloadingTime | The interval of time that it took to unload the model through the container's UnloadModel API call. | Microseconds | Multi | EndpointName, VariantName | |
ModelUnloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
ModelUnloadingTime | | Microseconds | Count | EndpointName, VariantName | |
OverheadLatency | The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. | Microseconds | Multi | EndpointName, VariantName | ✔️ |
OverheadLatency | | Microseconds | Sum | EndpointName, VariantName | |
OverheadLatency | | Microseconds | Count | EndpointName, VariantName | |
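To make the InvocationsPerInstance normalization above concrete, here is a tiny sketch with hypothetical request and instance counts:

```python
# Hypothetical illustration of the InvocationsPerInstance normalization:
# each request contributes 1/numberOfInstances, so the Sum approximates
# the number of invocations handled per instance.
requests = 100            # hypothetical request count in the period
number_of_instances = 4   # hypothetical active instances behind the variant

invocations_per_instance = sum(1 / number_of_instances for _ in range(requests))
print(invocations_per_instance)  # 25.0 invocations per instance
```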
Amazon SageMaker Ground Truth
Name | Description | Dimensions | Statistics | Unit | Recommended |
---|---|---|---|---|---|
ActiveWorkers | The number of workers on a private work team performing a labeling job | Region, LabelingJobName | Maximum | None | |
DatasetObjectsAutoAnnotated | The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled. | Region, LabelingJobName | Maximum | None | ✔️ |
DatasetObjectsHumanAnnotated | The number of dataset objects annotated by a human in a labeling job | Region, LabelingJobName | Maximum | None | ✔️ |
DatasetObjectsLabelingFailed | The number of dataset objects that failed labeling in a labeling job | Region, LabelingJobName | Maximum | None | ✔️ |
JobsFailed | The number of labeling jobs that failed | Region | Count | None | |
JobsFailed | | Region | Sum | None | ✔️ |
JobsStopped | The number of labeling jobs that were stopped | Region | Count | None | |
JobsStopped | | Region | Sum | None | |
JobsSucceeded | The number of labeling jobs that succeeded | Region | Count | None | |
JobsSucceeded | | Region | Sum | None | ✔️ |
TasksSubmitted | The number of tasks submitted/completed by a private work team | Region, LabelingJobName | Maximum | None | |
TimeSpent | Time spent on a task completed by a private work team | Region, LabelingJobName | Maximum | Seconds | |
TotalDatasetObjectsLabeled | The number of dataset objects labeled successfully in a labeling job | Region, LabelingJobName | Maximum | None | ✔️ |