OpenStack monitoring with Dynatrace is now GA

We’re happy to announce the General Availability (GA) of OpenStack monitoring with Dynatrace. This brings to a close our long Early Access Program (EAP), which began in February 2017, and our analysis of customer requirements. The Dynatrace OpenStack monitoring solution is GA as of Dynatrace version 1.162 and OneAgent version 1.161.

This blog post shows you how to get the most value out of Dynatrace monitoring when you use the OpenStack cloud to provision infrastructure components.

Enable full-stack monitoring of OpenStack environments easily

The structure of this article follows the discovery path, from application performance and availability monitoring, through the monitoring of underlying services, all the way to the supporting infrastructure and its management. In true Dynatrace fashion, we’ve placed our bets on ease of deployment and automation of discovery. How much you learn about your OpenStack-managed environment depends only on how deep you want to drill down and, in particular, on where you place OneAgents within your monitored environment.

Let’s walk through the different levels of OpenStack awareness in these simple steps, each of which provides additional insight into your OpenStack infrastructure:

Step 1: OneAgent on VMs managed by OpenStack—awareness of OpenStack as an orchestration layer

Full-stack monitoring of applications with Dynatrace is only possible if you deploy full-stack OneAgents on important hosts in your environment, specifically those that host your applications’ services and resources.

When OneAgent is deployed on virtual machines operated by OpenStack, you can take advantage of the powerful Dynatrace APM value proposition: zero-configuration detection of applications, services, problems, and root cause analysis. Over and above that, we identify OpenStack as the underlying cloud technology and provide information about the OpenStack compute node that hosts each VM.

OpenStack identification as a cloud technology with compute node information

Smartscape analysis shows you how your VMs interact with each other and gives you an understanding of the vertical dependencies between your application components—virtual machines, processes, and services.

Smartscape analysis of OpenStack VMs

Step 2: OneAgent on OpenStack compute nodes—awareness of services and resource utilization of OpenStack services on VMs

If needed, OneAgents can also be deployed on OpenStack compute nodes. In such cases, we recommend that OneAgents be configured for cloud infrastructure-only monitoring mode. This is because compute nodes typically run no injectable technologies to monitor, and this mode helps reduce the cost of host units consumed by OneAgents.

When deployed on compute nodes, OneAgents provide valuable insight into the existence and resource allocation of VMs managed by OpenStack, as well as their availability, responsiveness, associated worker processes, I/O operations, and more.

Additionally, all OpenStack services running on the compute node are properly discovered and measured for availability and resource consumption.

OpenStack compute node process list

OpenStack compute node processes

OpenStack compute node-managed VM metrics

Step 3: OneAgent on OpenStack controller nodes—awareness of services and their resource utilization for important OpenStack services

When OneAgents are deployed on OpenStack controller nodes, it’s possible to detect and monitor the remaining OpenStack services—those that are not typically found on compute nodes but are important elements of OpenStack.

Dynatrace provides out-of-the-box alerting on resource allocation and availability for these processes.

OpenStack controller node processes

Step 4: Deep insight into OpenStack via plugins

Under the hood of OpenStack, there are several popular technologies that we can also monitor with Dynatrace OneAgents through the use of their respective plugins. These technologies include RabbitMQ, MySQL, HAProxy, and Memcached. The plugins require additional configuration (namely, access to these services’ APIs), but in return, provide technology-specific measurements.

OpenStack supporting technology monitoring: RabbitMQ

To illustrate the challenges involved in monitoring the technologies that support OpenStack, here’s a problem we ran into within our own OpenStack environment. The RabbitMQ process in the example below was launched using the default file descriptor limit of 1024. Once this limit was exceeded, RabbitMQ stopped accepting new connections. This resulted in a Connectivity problem.

RabbitMQ connectivity problem

We wouldn’t have known about this problem if it weren’t for the RabbitMQ-specific measurements that Dynatrace provides. All details are included in the same view, so there is no need to use multiple tools to get the complete picture.

RabbitMQ connectivity problem details
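To make the file descriptor scenario above more concrete, here is a minimal Python sketch of how descriptor exhaustion could be spotted from RabbitMQ node statistics. It is not how the Dynatrace plugin is implemented. It assumes a JSON payload shaped like the response of the RabbitMQ management API's nodes endpoint, which reports `fd_used` and `fd_total` per node. The function name, the 90% warning ratio, and the sample values are hypothetical and chosen only for illustration.

```python
import json


def fd_utilization(nodes_json, warn_ratio=0.9):
    """Return (name, used, total, ratio, over_threshold) for each node."""
    results = []
    for node in json.loads(nodes_json):
        used = node["fd_used"]
        total = node["fd_total"]
        # Guard against a zero limit so the sketch never divides by zero.
        ratio = used / total if total else 0.0
        results.append((node["name"], used, total, ratio, ratio >= warn_ratio))
    return results


# Hypothetical payload, trimmed to the fields this sketch reads.
# The limit of 1024 mirrors the default mentioned in the text.
SAMPLE = '[{"name": "rabbit@controller", "fd_used": 1010, "fd_total": 1024}]'

for name, used, total, ratio, warn in fd_utilization(SAMPLE):
    print(f"{name}: {used}/{total} file descriptors ({ratio:.0%}), warn={warn}")
```

In a real setup the payload would come from the management API rather than a constant, and the threshold would be tuned to how quickly connections grow in your environment.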

Step 5: Log Analytics

Dynatrace comes with a powerful Log Analytics module that can be applied to monitor OpenStack services. When configured, it picks up symptoms of problems specific to OpenStack and takes them into account during root-cause analysis.

In the example below, the Log viewer has uncovered numerous warnings in the keystone.log file indicating that the authentication process has been failing.

Log viewer for Keystone log file on controller node

In this particular case, the root cause of these problems was related to memory saturation on the controller node. As illustrated below, the memory was indeed exhausted: it had reached almost 100% saturation.

Note further down in the Processes section that all OpenStack services running on the controller are listed. You can click any of these individual processes to analyze their connections and understand their relationship to other processes.

Memory saturation problem on controller node
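As a rough illustration of what "memory saturation" means on a Linux controller node, the following Python sketch derives a saturation ratio from text in the format of `/proc/meminfo`. This is one common way to compute the figure, not a description of how Dynatrace measures it. The function name and the sample values are hypothetical.

```python
def memory_saturation(meminfo_text):
    """Return the fraction of memory in use, based on MemTotal and MemAvailable."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value (kB for memory fields).
            fields[key.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 1.0 - available / total


# Hypothetical sample in /proc/meminfo format, trimmed to a few fields.
SAMPLE = """MemTotal:       16384000 kB
MemFree:          120000 kB
MemAvailable:     163840 kB
"""

print(f"Memory saturation: {memory_saturation(SAMPLE):.0%}")
```

On a live host you would pass in the contents of `/proc/meminfo` instead of the sample string.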

The Log Analytics module is fully configurable. Below are a dozen example configurations that can be easily changed and adapted to your local OpenStack environment. They were tested to work with older versions of OpenStack, so some updates might be required for more recent releases.

For Glance service (log path /var/log/glance/glance-api.log):

  1. Glance registry can’t connect to SQL database because connection pool is empty
    search pattern: ERROR AND "OperationalError:" AND "pymysql.err.OperationalError" AND "Too many connections"
    threshold: 0.0
  2. Glance registry can’t retrieve list of images
    search pattern: ERROR AND "glance.registry.api.v1.images" AND "Unable to get images"
    threshold: 0.0
  3. Glance API returned an error while using Glance registry
    search pattern: ERROR AND "glance.common.wsgi ServerError" AND "The request returned 500 Internal Server Error"
    threshold: 0.0
  4. Glance API authorization issue: Unable to validate token
    search pattern: CRITICAL AND "Unable to validate token"
    threshold: 0.0
  5. Glance API authorization-configuration issue
    search pattern: DiscoveryFailure AND "Could not determine a suitable URL for the plugin"
    threshold: 0.0
  6. Glance API can’t connect to SQL database
    search pattern: ERROR AND DBConnectionError
    threshold: 0.0

For Neutron service:

  1. Neutron agent can’t connect to SQL database
    search pattern: ERROR AND "neutron.agent.dhcp.agent" AND "DBConnectionError" AND "Can't connect to MySQL"
    threshold: 0.0
    log paths: /var/log/neutron/dhcp-agent.log
  2. Neutron can’t connect to SQL server
    search pattern: ERROR AND "OperationalError:" AND "Too many connections"
    threshold: 0.0
    log paths: /var/log/neutron/metadata-agent.log, /var/log/neutron/neutron-server.log, /var/log/neutron/openvswitch-agent.log, /var/log/neutron/neutron-ns-metadata-proxy-#.log, /var/log/neutron/l3-agent.log, /var/log/neutron/dhcp-agent.log
  3. Neutron: l3 agent configuration issue
    search pattern: ERROR AND "neutron.agent.l3.agent" AND "An interface driver must be specified"
    threshold: 0.0
    log paths: /var/log/neutron/l3-agent.log
  4. Neutron server is overloaded and unable to respond quickly: Timeout in RPC method get_service_plugin_list
    search pattern: ERROR AND "neutron.common.rpc" AND "Timeout in RPC method get_service_plugin_list"
    threshold: 0.0
    log paths: /var/log/neutron/l3-agent.log

For Keystone service:

  1. Keystone can’t connect to SQL database
    search pattern: DBConnectionError
    threshold: 0.0
    log paths: /var/log/keystone/keystone-wsgi-admin.log, /var/log/keystone/keystone-manage.log, /var/log/keystone/keystone.log, /var/log/keystone/keystone-wsgi-public.log, /var/log/apache2/error.log, /var/log/apache2/keystone_access.log, /var/log/apache2/keystone.log
  2. Keystone: Apache WSGI configuration is broken
    search pattern: "Target WSGI script not found" AND keystone-wsgi
    threshold: 0.0
    log paths: /var/log/apache2/keystone.log
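
Before adding a search pattern to your configuration, it can help to check it against a few log lines from your own environment. The Python sketch below is a simplified stand-in for that check, not the Dynatrace search engine. It assumes that a pattern is a list of bare words and double-quoted phrases joined by AND, and that each term is matched as a case-sensitive substring. The function names and the sample log lines are hypothetical.

```python
import re

# A term is either a double-quoted phrase or a run of non-space characters.
_TERM = re.compile(r'"([^"]*)"|(\S+)')


def parse_pattern(pattern):
    """Split a pattern such as: ERROR AND "Too many connections" into its terms."""
    terms = []
    for quoted, bare in _TERM.findall(pattern):
        if quoted:
            terms.append(quoted)
        elif bare and bare != "AND":
            terms.append(bare)
    return terms


def matches(pattern, line):
    """Return True when every term of the pattern occurs in the log line."""
    return all(term in line for term in parse_pattern(pattern))


PATTERN = 'ERROR AND "OperationalError:" AND "Too many connections"'

# Made-up log lines for illustration only.
LINES = [
    'ERROR example.db OperationalError: (1040, "Too many connections")',
    'INFO example.api request completed',
]

hits = [line for line in LINES if matches(PATTERN, line)]
print(f"{len(hits)} of {len(LINES)} lines match")
```

If a pattern that should match real log lines returns no hits, the message wording has probably changed in your OpenStack release and the pattern needs updating.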

Potential improvements and further steps

When we defined the original scope of the OpenStack monitoring EAP, we developed a number of specific plugins for OpenStack services. The goal of these plugins was to provide additional insight into specific metrics for Keystone, Horizon, and Glance. They are currently not part of the out-of-the-box solution, but can be retrofitted and included in OneAgents with some effort related to the exposure of their respective configurations.

Glance plugin metrics

Keystone plugin metrics

The data provided by these plugins can also be analyzed by Dynatrace AI and taken into account during root-cause analysis. It can also be used for alerting and for integrations with external services.

We want to hear from you

We’re always happy to receive your feedback and ideas. Reach out to us via Dynatrace Community, Dynatrace Support, or your Dynatrace representative to share your thoughts. Please let us know how you are using OpenStack infrastructure monitoring by Dynatrace in your environment and how we can improve it.