Deployment challenges with large enterprise systems

Dynatrace provides automated end-to-end monitoring of applications on a single platform. Even so, in practice there is only so much that OneAgent can auto-discover, and fine-tuning might be required.

For instance, not all deployments follow best practice. I have worked with many customers and found that, in most cases, about 90% of what was mapped automatically was fine, while the remaining 10% needed fine-tuning because of poor deployment practices.

The most common example is when the IIS default application pool is used for deploying a web site instead of using a dedicated pool.

The same applies to other application servers such as Tomcat.

For small deployments, this isn’t a problem. However, when scaling up to hundreds or even thousands of systems, things can become complicated.

Even when all the systems are mapped correctly by Dynatrace, identifying these systems is a real challenge. Which team should receive the alerts for a given component, for instance? Where is my application amongst thousands of hosts?

Why process groups are not perfect

Dynatrace automatically detects identical processes and puts them in the same process group. This is very similar to a cluster with nodes underneath.

Let’s consider a backend API service deployed on an IIS application pool called backend-API.

When deploying on multiple machines, OneAgent will group all the instances of the same system together.

Host-A:

application pool -> backend-API

Host-B:

application pool -> backend-API

Dynatrace will create a single process group called backend-API with 2 process instances underneath (host-A and host-B).

This sounds fine; however, things can go wrong in other cases.

Let’s extend our example.

As the Dynatrace deployment is very successful, a third host is added with the same backend API, but this time the host is for pre-production rather than production.

Because the deployment is identical, Dynatrace will add a third instance to the existing process group called backend-API. The agent has no way of knowing that host-C is unrelated to host-A and host-B because it belongs to a different environment.

Even if it doesn’t look bad at first sight, this could lead to confusion and create false alarms.

For instance, if a new version of the code is deployed on pre-production only and the code breaks the functionality, Dynatrace will generate an alert against the underlying service deployed on the process group. As Dynatrace relies on dynamic baselines, it will see an increase in failure rate and incorrectly report it for the whole service group (production and pre-production).
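The effect of mixing environments can be illustrated with simple arithmetic. The sketch below uses made-up request counts (every number is an assumption) and does not model how Dynatrace computes its baselines; it only shows how one broken pre-production host inflates the failure rate of a merged group.

```python
# Hypothetical request counts per host: (requests, failed requests).
traffic = {
    "host-A": (1000, 10),   # production, healthy
    "host-B": (1000, 10),   # production, healthy
    "host-C": (1000, 400),  # pre-production, broken release
}

def failure_rate(hosts):
    """Return failed requests divided by total requests for the given hosts."""
    requests = sum(traffic[h][0] for h in hosts)
    failures = sum(traffic[h][1] for h in hosts)
    return failures / requests

production_only = failure_rate(["host-A", "host-B"])
merged_group = failure_rate(["host-A", "host-B", "host-C"])

print(f"production only: {production_only:.0%}")
print(f"merged group:    {merged_group:.0%}")
```

With these assumed numbers, production on its own stays at its usual level, while the merged group shows a sharp jump that production never experienced.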

If host-C had been separated from the other 2, the alert would only be reported for pre-production. Typically, non-production alerts could be suppressed to avoid unnecessary 24/7 call-outs.

You may think that having 2 tenants (production and non-production) would avoid this type of problem but you would be wrong. The above example will no longer be an issue with separate tenants, but other problems could arise.

Let’s look at another example.

We now have 2 separate applications deployed on 2 hosts, but the application pool name hasn’t been updated to reflect the underlying functionality. To make matters worse, the default name has been kept, so our application is called default application!

Dynatrace will automatically group both systems. Each service will be detected properly but when using the Smartscape, things will look confusing to the users. It will show a single process group called default application with 2 services. I have seen so many customers complain about this because they know that the systems are not related and “Dynatrace shows it incorrectly”!

This situation can easily be avoided by using host groups (see /support/help/how-to-use-dynatrace/hosts/configuration/organize-your-environment-using-host-groups/).

Separating systems correctly is a great step forward but it could be better.

When a process is automatically detected, Dynatrace will check if there is a host group associated with the process. If it sees two identical processes with different host groups, it will create two separate process groups. This is why customers typically set the host group to the environment name, which ensures that all environments are split.
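The difference between the two behaviours can be sketched in a few lines. This is a toy model of the grouping idea only, not Dynatrace's actual detection logic; the inventory and the `group_processes` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory: (host, host group, application pool name).
processes = [
    ("host-A", "production", "backend-API"),
    ("host-B", "production", "backend-API"),
    ("host-C", "pre-production", "backend-API"),
]

def group_processes(processes, use_host_group):
    """Group hosts by process name, optionally also by host group."""
    groups = defaultdict(list)
    for host, host_group, pool in processes:
        key = (host_group, pool) if use_host_group else (pool,)
        groups[key].append(host)
    return dict(groups)

without_host_groups = group_processes(processes, use_host_group=False)
with_host_groups = group_processes(processes, use_host_group=True)

print(without_host_groups)
print(with_host_groups)
```

Without the host group in the key, all three hosts land in one group; with it, host-C ends up in its own pre-production group.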

I recently worked with a large customer who was using the “default application pool” on hundreds of unrelated systems.

Host groups helped split them out; however, it was very hard to identify which system each one belonged to, as they were all called the same thing, “default application”, and renaming the pool at the source wasn’t an option!

This is when tagging comes to the rescue!

Automated Tagging

Tagging is a very powerful way to identify components. Sadly, many companies want to use tags but don’t really have a company-wide naming standard. Most rely on manual tags, which are prone to errors and fail completely in highly dynamic cloud environments where new hosts appear and disappear throughout the day.

With manual tags, there is no validation. Anyone can add tags without considering the big picture.

I recently saw an example where the customer needed to identify their Tibco estate and I found 3 Tibco tags trying to achieve the same thing.

Tibo.ems, tibco, Tibco.

When considering tagging, all entities must be tagged which means not just the hosts! This must apply to processes, services and applications.

If you start using management zones based on tags, when the tag is missing, the entity is simply not visible, so it can be a problem!

Therefore, tagging must be automated.

Before we start, we need to ask ourselves, what are we trying to tag?

Another important thing to note is that tags can be key-value pairs. Therefore, instead of having separate tags for Tomcat, WebSphere and WebLogic, you could have a single tag called appServer with multiple values (Tomcat, WebSphere, WebLogic).
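To show why a key-value tag is easier to query than several flat tags, here is a small sketch. The host names, the tag data and the `hosts_with_tag` helper are all hypothetical and are not part of any Dynatrace API.

```python
# Hypothetical tags on four hosts, stored as key-value pairs.
host_tags = {
    "host-1": {"appServer": "Tomcat"},
    "host-2": {"appServer": "WebSphere"},
    "host-3": {"appServer": "WebLogic", "Environment": "production"},
    "host-4": {"Environment": "production"},
}

def hosts_with_tag(tags, key, value=None):
    """Return hosts carrying the tag key, optionally restricted to one value."""
    return sorted(
        host
        for host, pairs in tags.items()
        if key in pairs and (value is None or pairs[key] == value)
    )

all_app_servers = hosts_with_tag(host_tags, "appServer")
tomcat_only = hosts_with_tag(host_tags, "appServer", "Tomcat")

print(all_app_servers)
print(tomcat_only)
```

A single key selects every application server, and the value narrows the selection when needed.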

Everybody has different requirements but here are a few ideas:

  • Environment (production, pre-production, UAT)
  • Application (e-commerce, CMS, batch)
  • System (Tibco, API-gateway, Weblogic, shared-middleTier)
  • Team (team responsible for this system)
  • CMDB-ID (unique identifier to cross reference with the internal CMDB)

Once the tags have been agreed, they need to be populated automatically or, if agents are to be deployed manually, at least recorded somewhere such as an Excel spreadsheet indexed by host name.

My approach works if a single machine only hosts a single application as the tagging is done at the host level. If it is not the case, an alternative solution is required via environment variables for instance.

Dynatrace supports tagging files such as hostautotag.conf (/support/help/how-to-use-dynatrace/hosts/configuration/define-tags-and-metadata-for-hosts), so the tags could be stored in that file. However, this wouldn’t address the incorrect grouping of processes I described earlier. Moreover, it can complicate the deployment of OneAgent: the file needs to be created and copied to the right place. If the deployment is fully automated, this isn’t an issue, but if it is manual, it requires more steps.

Why not use host groups as tags?

The host group parameter can be used to store the tags. This means that when deploying a new agent, all the tags can be passed in at installation time.

The only complication is that the host group value is a single string and spaces are not allowed. Luckily Dynatrace can use regular expressions to separate out the tags from within the string.

Here is a format I have used in the past which passes 3 tags:

HOST-GROUP=Production_e-commerce_backend

The order is very important. The first field is the environment name, the second the application name and the last is the system ID.

The separator is “_”.
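Because the order of the fields matters and spaces are not allowed, it can help to build and validate the string in the deployment tooling rather than typing it by hand. The following is a minimal sketch under that assumption; `build_host_group`, `parse_host_group` and `HostGroupTags` are hypothetical helper names, not part of Dynatrace.

```python
from typing import NamedTuple

SEPARATOR = "_"

class HostGroupTags(NamedTuple):
    environment: str
    application: str
    system: str

def build_host_group(environment: str, application: str, system: str) -> str:
    """Join the three fields, rejecting empty fields, separators and whitespace."""
    fields = (environment, application, system)
    for field in fields:
        if not field or SEPARATOR in field or any(c.isspace() for c in field):
            raise ValueError(f"invalid field: {field!r}")
    return SEPARATOR.join(fields)

def parse_host_group(value: str) -> HostGroupTags:
    """Split a host group string back into its three ordered fields."""
    parts = value.split(SEPARATOR)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 3 fields, got: {value!r}")
    return HostGroupTags(*parts)

host_group = build_host_group("Production", "e-commerce", "backend")
print(host_group)
print(parse_host_group(host_group))
```

Rejecting the separator inside a field keeps the string unambiguous, which matters because the tagging rules later depend on the field positions.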

You can use a similar pattern such as:

E_Production_A_e-commerce_S_backend

E_ is the delimiter for the environment

_A_ for the application

_S_ for the system
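The prefixed variant can be checked with a single regular expression. This sketch assumes that each value contains only letters, digits and hyphens; the pattern and the `parse_prefixed_host_group` helper are illustrative, not something Dynatrace provides.

```python
import re

# Assumption: values contain only letters, digits and "-".
PREFIXED_HOST_GROUP = re.compile(
    r"E_(?P<environment>[A-Za-z0-9-]+)"
    r"_A_(?P<application>[A-Za-z0-9-]+)"
    r"_S_(?P<system>[A-Za-z0-9-]+)"
)

def parse_prefixed_host_group(value: str) -> dict:
    """Return the three named fields, or raise if the string is malformed."""
    match = PREFIXED_HOST_GROUP.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected host group format: {value!r}")
    return match.groupdict()

fields = parse_prefixed_host_group("E_Production_A_e-commerce_S_backend")
print(fields)
```

The explicit prefixes make the string longer, but a misplaced or missing field is easier to spot than in the purely positional format.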

If the string is populated correctly, we now need a tagging rule to dynamically retrieve the values from the host group.

Tagging rules are extremely powerful and can use many predefined placeholders such as the host group to populate the value dynamically.

An example from the field

Let’s create an environment tag with an optional value (the environment name).

As the tag is located inside the host group, we must use a regular expression to extract the first parameter.

The dynamic parameter is {Hostgroup:name} and the regular expression must be added inside the parameter after a “/”.

The regular expression simply starts from the beginning of the string and extracts any alphanumeric character, including “-”, until the first “_”.
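Before saving a rule, the expression can be tried offline against a few sample host group strings. The pattern below is my reading of the description above, not a copy of the original rule, and Python's regex engine may differ in details from the one Dynatrace uses; `environment_tag` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed pattern: from the start of the string, capture letters, digits
# and "-" up to (but not including) the first "_".
ENVIRONMENT = re.compile(r"^([A-Za-z0-9-]+)_")

def environment_tag(host_group: Optional[str]) -> Optional[str]:
    """Return the environment field, or None if there is no usable host group."""
    if not host_group:  # mirrors the "host group exists" condition
        return None
    match = ENVIRONMENT.match(host_group)
    return match.group(1) if match else None

for sample in ("Production_e-commerce_backend", "pre-production_cms_frontend", ""):
    print(repr(sample), "->", environment_tag(sample))
```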

Just to avoid issues, we add a condition to ensure that the host group exists.

The last trick I’m using relies on the Smartscape. Because Dynatrace knows the dependencies between hosts, processes and services it can propagate the tag to the processes and services running on the machine.

It is important to set the scope to process, not host or service.

There should always be a process on a machine but there might not always be a service so if the scope was set to service, the tag might not appear on the host.

I didn’t select host because the inheritance is only propagated to the processes and not the services.

See full example below.

Here is another example for the application tag:

This time, we extract the string between the underscores.

For the system ID, we extract from the end of the string: ([A-Za-z0-9-]*+)$
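The application and system extractions can be tested the same way. In this sketch the application pattern is an assumption based on the description ("between the underscores"), and the system pattern uses a plain greedy `+` instead of the possessive `*+` shown above, since possessive quantifiers are only accepted by newer Python versions; for this input the captured value is the same.

```python
import re

# Assumed pattern: the field enclosed by the first pair of underscores.
APPLICATION = re.compile(r"_([A-Za-z0-9-]+)_")
# Greedy variant of the end-of-string pattern quoted in the text.
SYSTEM = re.compile(r"([A-Za-z0-9-]+)$")

def application_and_system(host_group: str):
    """Return (application, system); a field is None when its pattern fails."""
    application = APPLICATION.search(host_group)
    system = SYSTEM.search(host_group)
    return (
        application.group(1) if application else None,
        system.group(1) if system else None,
    )

application, system = application_and_system("Production_e-commerce_backend")
print(application, system)
```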

Once all the rules are in place, the tags will be automatically populated. Because we used the inheritance, all processes and services running on this host will be tagged too!

See the example below for a host; the same tags will be propagated to the process groups and services running on this host.

Now that all the tags are in place, we can use management zones and alerting profiles!

Problem solved!

With 3 tagging rules, we automatically tagged all our entities. If the host group string is populated correctly, ideally via automation, everything is taken care of once and for all.

Moreover, because we have used host groups, all our processes are mapped correctly.

Your data is now very structured and easy to access.
