Amazon SageMaker (Batch Transform Jobs, Endpoint Instances, Endpoints, Ground Truth, Processing Jobs, Training Jobs)

Dynatrace ingests metrics for multiple preselected namespaces, including Amazon SageMaker. You can view metrics for each service instance, split metrics into multiple dimensions, and create custom charts that you can pin to your dashboards.

Prerequisites

To enable monitoring for this service, you need

  • An Environment or Cluster ActiveGate version 1.181+
  • Dynatrace version 1.182+
  • An updated AWS monitoring policy that includes the additional AWS services.
    To update the AWS IAM policy, use the JSON below, which contains the monitoring policy (permissions) for all supporting services.

If you don't want to add permissions for all services and prefer to select permissions only for certain services, consult the table below. The table contains the set of permissions that are required for all services (All monitored Amazon services) and, for each supporting service, a list of optional permissions specific to that service.

Example of a JSON policy for a single service.

In this example, from the complete list of permissions, you need to select only the following (a sketch of the combined policy follows the list):

  • "apigateway:GET" for Amazon API Gateway
  • "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "sts:GetCallerIdentity", "tag:GetResources", "tag:GetTagKeys", and "ec2:DescribeAvailabilityZones" for All monitored Amazon services.

Add the service to monitoring

To view the service metrics, you must add the service to monitoring in your Dynatrace environment.

Note: Once AWS supporting services are added to monitoring, you might have to wait 15-20 minutes before the metric values are displayed.
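
If you manage your AWS credentials through the Dynatrace configuration API rather than the web UI, supporting services are declared as part of the AWS credentials configuration object. The fragment below is only an illustrative sketch: the field name supportingServicesToMonitor and the service-name value are assumptions and may differ from the schema in your environment and API version.

```json
{
  "supportingServicesToMonitor": [
    {
      "name": "<supporting service name, for example an Amazon SageMaker namespace>"
    }
  ]
}
```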

Configure service metrics

Once you add a service, Dynatrace automatically starts collecting a suite of recommended metrics for that service. Apart from the recommended metrics, most services also offer optional metrics that you can enable. You can remove or edit any of the existing metrics or any of their dimensions, where multiple dimensions are available. Metrics consisting of only one dimension can't be edited; they can only be added or removed.

Service-wide metrics are metrics for the whole service across all regions. Typically, these metrics include dimensions containing Region in their name. If selected, these metrics are displayed on a separate chart when viewing your AWS deployment in Dynatrace. Keep in mind that available dimensions differ among services.

To change a metric's statistics, you have to recreate the metric and choose different statistics. You can choose among the following statistics: Sum, Minimum, Maximum, Average, and Sample count. The Average + Minimum + Maximum statistic lets you collect all three statistics as a single metric instead of three separate metrics with one statistic each. This can reduce your costs for retrieving metrics from your AWS deployment.

To be able to save a newly added metric, you need to select at least one statistic and one dimension.
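
To illustrate these rules, a single metric entry in the supporting-service configuration sketched above needs at least one statistic and at least one dimension; Average + Minimum + Maximum counts as one combined statistic rather than three. The field names and statistic identifier below (monitoredMetrics, statistic, AVG_MIN_MAX) are assumptions, not confirmed API identifiers; the metric name and dimensions are taken from the tables later in this page.

```json
{
  "monitoredMetrics": [
    {
      "name": "CPUUtilization",
      "statistic": "AVG_MIN_MAX",
      "dimensions": ["Region", "Host"]
    }
  ]
}
```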

Note: Once AWS supporting services are configured, you might have to wait 15-20 minutes before the metric values are displayed.

View service metrics

Once you add the service to monitoring, you can view the service metrics in your Dynatrace environment either on your dashboard page or on the custom device overview page.

Available metrics

Amazon SageMaker Batch Transform Jobs

| Name | Description | Unit | Statistics | Dimensions | Recommended |
|------|-------------|------|------------|------------|-------------|
| CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | ✔️ |
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |

Amazon SageMaker Processing Jobs, Amazon SageMaker Training Jobs

| Name | Description | Unit | Statistics | Dimensions | Recommended |
|------|-------------|------|------------|------------|-------------|
| CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
| DiskUtilization | The percentage of disk space that the containers on an instance use. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | Region, Host | ✔️ |
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | Region, Host | ✔️ |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | Region, Host | ✔️ |

Amazon SageMaker Endpoint Instances

| Name | Description | Unit | Statistics | Dimensions | Recommended |
|------|-------------|------|------------|------------|-------------|
| CPUUtilization | The percentage of CPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | ✔️ |
| DiskUtilization | The percentage of disk space that the containers on an instance use. This value can range between 0% and 100%. This metric is not supported for batch transform jobs. | Percent | Average | EndpointName, VariantName | ✔️ |
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUMemoryUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | ✔️ |
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0% and 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, GPUUtilization can range from 0% to 400%. | Percent | Average | EndpointName, VariantName | ✔️ |
| LoadedModelCount | The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. | None | Average | EndpointName, VariantName | |
| LoadedModelCount | | None | Sum | EndpointName, VariantName | |
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value can range between 0% and 100%. | Percent | Average | EndpointName, VariantName | ✔️ |

Amazon SageMaker Endpoints

| Name | Description | Unit | Statistics | Dimensions | Recommended |
|------|-------------|------|------------|------------|-------------|
| Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
| Invocation4XXErrors | | None | Sum | EndpointName, VariantName | |
| Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | None | Average | EndpointName, VariantName | |
| Invocation5XXErrors | | None | Sum | EndpointName, VariantName | ✔️ |
| Invocations | The number of InvokeEndpoint requests sent to a model endpoint. | None | Sum | EndpointName, VariantName | ✔️ |
| Invocations | | None | Count | EndpointName, VariantName | |
| InvocationsPerInstance | The number of invocations sent to a model, normalized by InstanceCount in each ProductionVariant. 1/numberOfInstances is sent as the value on each request, where numberOfInstances is the number of active instances for the ProductionVariant behind the endpoint at the time of the request. | None | Sum | EndpointName, VariantName | |
| ModelCacheHit | The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded. | None | Sum | EndpointName, VariantName | |
| ModelCacheHit | | None | Average | EndpointName, VariantName | |
| ModelCacheHit | | None | Count | EndpointName, VariantName | |
| ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. | Microseconds | Multi | EndpointName, VariantName | ✔️ |
| ModelLatency | | Microseconds | Sum | EndpointName, VariantName | |
| ModelLatency | | Microseconds | Count | EndpointName, VariantName | |
| ModelLoadingTime | The interval of time that it took to load the model through the container's LoadModel API call. | Microseconds | Multi | EndpointName, VariantName | |
| ModelLoadingTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelLoadingTime | | Microseconds | Count | EndpointName, VariantName | |
| ModelLoadingWaitTime | The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference. | Microseconds | Multi | EndpointName, VariantName | |
| ModelLoadingWaitTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelLoadingWaitTime | | Microseconds | Count | EndpointName, VariantName | |
| ModelDownloadingTime | The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). | Microseconds | Multi | EndpointName, VariantName | |
| ModelDownloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelDownloadingTime | | Microseconds | Count | EndpointName, VariantName | |
| ModelUnloadingTime | The interval of time that it took to unload the model through the container's UnloadModel API call. | Microseconds | Multi | EndpointName, VariantName | |
| ModelUnloadingTime | | Microseconds | Sum | EndpointName, VariantName | |
| ModelUnloadingTime | | Microseconds | Count | EndpointName, VariantName | |
| OverheadLatency | The interval of time added to the time taken to respond to a client request by SageMaker overheads. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. | Microseconds | Multi | EndpointName, VariantName | ✔️ |
| OverheadLatency | | Microseconds | Sum | EndpointName, VariantName | |
| OverheadLatency | | Microseconds | Count | EndpointName, VariantName | |

Amazon SageMaker Ground Truth

| Name | Description | Unit | Statistics | Dimensions | Recommended |
|------|-------------|------|------------|------------|-------------|
| ActiveWorkers | The number of workers on a private work team performing a labeling job. | None | Maximum | Region, LabelingJobName | |
| DatasetObjectsAutoAnnotated | The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled. | None | Maximum | Region, LabelingJobName | ✔️ |
| DatasetObjectsHumanAnnotated | The number of dataset objects annotated by a human in a labeling job. | None | Maximum | Region, LabelingJobName | ✔️ |
| DatasetObjectsLabelingFailed | The number of dataset objects that failed labeling in a labeling job. | None | Maximum | Region, LabelingJobName | ✔️ |
| JobsFailed | The number of labeling jobs that failed. | None | Count | Region | |
| JobsFailed | | None | Sum | Region | ✔️ |
| JobsStopped | The number of labeling jobs that were stopped. | None | Count | Region | |
| JobsStopped | | None | Sum | Region | |
| JobsSucceeded | The number of labeling jobs that succeeded. | None | Count | Region | |
| JobsSucceeded | | None | Sum | Region | ✔️ |
| TasksSubmitted | The number of tasks submitted/completed by a private work team. | None | Maximum | Region, LabelingJobName | |
| TimeSpent | Time spent on a task completed by a private work team. | Seconds | Maximum | Region, LabelingJobName | |
| TotalDatasetObjectsLabeled | The number of dataset objects labeled successfully in a labeling job. | None | Maximum | Region, LabelingJobName | ✔️ |