Application development and delivery methods have undergone radical changes in recent years to improve scalability and resiliency. Container images are the new build and deployment artifacts that are used to ship and run software. While startups have long been comfortable experimenting with and embracing new technologies, even large enterprises are now re-architecting their software systems so that they can benefit from container-enabled microservices architectures.

With the launch of DC/OS, we see these container strategies becoming an order of magnitude easier to adopt.

Orchestration layer is the backbone for microservices

When it comes to running and managing microservices environments, the real challenge isn’t the container technology. The real challenge is controlling and mastering the new approach to container orchestration. This is where container management tools and frameworks come into play.

Based on observations we’ve made in large real-world environments, Mesos with Marathon (the core rock-solid production packages of DC/OS) is an ideal solution for reliably deploying container-backed microservices in clusters with hundreds of 16+ vCPU nodes. The orchestration layer manages available resources in the cluster and follows a “supply-driven” approach to allocating the resources required to spin up Docker containers with Marathon. This makes the orchestration layer crucial for running clustered environments and underscores the requirement that automated fail-over mechanisms be put in place in case nodes fail.
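The “supply-driven” idea can be illustrated with a toy model: nodes advertise their free resources, and the scheduler accepts whichever offer still has room for a task. The sketch below is a simplified first-fit illustration, not the Mesos or Marathon implementation. The `Offer`, `Task`, and `place_tasks` names and all node sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Free resources a node currently advertises (toy model)."""
    node: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    """Resources a single container needs."""
    name: str
    cpus: float
    mem_mb: int


def place_tasks(offers, tasks):
    """Assign each task to the first offer that still has room.

    Returns a dict of task name -> node name. Tasks that fit nowhere
    map to None. Offers are shrunk in place as tasks are accepted.
    """
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for offer in offers:
            if offer.cpus >= task.cpus and offer.mem_mb >= task.mem_mb:
                offer.cpus -= task.cpus
                offer.mem_mb -= task.mem_mb
                placement[task.name] = offer.node
                break
    return placement


# Hypothetical cluster: one small node and one 16 vCPU node.
offers = [
    Offer("node-1", cpus=2.0, mem_mb=2048),
    Offer("node-2", cpus=16.0, mem_mb=32768),
]
tasks = [
    Task("web", cpus=1.0, mem_mb=1536),
    Task("api", cpus=1.0, mem_mb=1536),
    Task("huge", cpus=32.0, mem_mb=1024),
]
placement = place_tasks(offers, tasks)
```

Here `web` lands on `node-1`, `api` no longer fits there and moves to `node-2`, and `huge` stays unplaced because no node offers 32 CPUs.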

Enhance visibility into your DC/OS cluster

While DC/OS is an ideal solution for battle-hardened container orchestration, Dynatrace Ruxit enables you to get deep visibility into the applications you deploy on your Mesos clusters. If you’re already using DC/OS, simply issue dcos package install --options=ruxitoptions.json ruxit in your terminal to have DC/OS deploy Dynatrace Ruxit on your nodes. You’ll need to provide your Dynatrace Ruxit credentials in ruxitoptions.json.

  {
    "ruxit": {
      "environment": "your-environment-id",
      "token": "your-token",
      "instances": NUMBER_OF_NODES
    }
  }
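
If you would rather generate the options file than edit it by hand, a minimal sketch might look like the following. The key names (`ruxit`, `environment`, `token`, `instances`) are the ones shown above. The `write_ruxit_options` helper and the example values are hypothetical.

```python
import json
from pathlib import Path


def write_ruxit_options(path, environment_id, token, instances):
    """Write a DC/OS package options file using the keys shown in the article."""
    options = {
        "ruxit": {
            "environment": environment_id,
            "token": token,
            # One instance per node, so this must be a whole number.
            "instances": int(instances),
        }
    }
    Path(path).write_text(json.dumps(options, indent=2) + "\n")
    return options
```

Calling `write_ruxit_options("ruxitoptions.json", "your-environment-id", "your-token", 5)` produces a valid JSON file that can be passed to the `--options` flag of the install command.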

Dynatrace Ruxit on DC/OS Marathon tasks

Resource optimization based on actual consumption

The key metric for scheduling containers (beyond constraints, resource pools, etc.) is the amount of memory to be allocated on the target node. For instance, if you run a Java service with an embedded Tomcat, you may want to allocate 1.5 GB of memory for each container to avoid under-allocation. Running ten containers of the same service results in a 15 GB memory allocation on your cluster. However, the memory a single container actually requires might not exceed 700 MB. With ten containers using 7 GB in total, this results in a cluster-wide over-allocation of 8 GB.
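
The arithmetic behind that example can be written out as a short sketch. The figures are the ones from the paragraph above (1 GB is treated as 1,000 MB, as the article's numbers imply). The function name is hypothetical.

```python
def over_allocation_mb(containers, allocated_mb, used_mb):
    """Return (total allocated, total used, over-allocation) in MB."""
    allocated_total = containers * allocated_mb
    used_total = containers * used_mb
    return allocated_total, used_total, allocated_total - used_total


# Ten containers, 1.5 GB allocated each, about 700 MB actually used each.
allocated, used, wasted = over_allocation_mb(10, 1500, 700)
print(f"allocated={allocated} MB, used={used} MB, over-allocated={wasted} MB")
```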

Docker containers by image name

To find the ideal amount of allocated memory for the Marathon task you need to work with actual monitoring data from the respective container or process. Working from real data enables you to optimize container deployments and get more out of your cluster nodes by safely slimming down memory allocations for containers and Marathon tasks.
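
One possible way to turn monitoring data into an allocation is to take the observed peak and add a safety margin. The sketch below assumes you have exported per-container memory samples in MB. The 25% headroom and the 64 MB rounding step are illustrative assumptions, not recommendations from Marathon or Ruxit.

```python
import math


def suggest_memory_mb(samples_mb, headroom=0.25, step_mb=64):
    """Suggest an allocation: observed peak plus headroom, rounded up to step_mb."""
    if not samples_mb:
        raise ValueError("need at least one memory sample")
    peak = max(samples_mb)
    target = peak * (1 + headroom)
    return int(math.ceil(target / step_mb) * step_mb)


# Hypothetical samples for one container of the Tomcat-based service.
samples = [612, 655, 700, 688]
suggested = suggest_memory_mb(samples)
```

With a 700 MB peak this suggests 896 MB, well below the 1.5 GB that was allocated up front. Choose the headroom to match how spiky the service's memory usage really is.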

Optimize capacity management with real usage data

Cluster nodes require adequate CPU and memory for optimum performance, yet both are limited resources on every node. Even if you scale nodes vertically and run huge instances, you need to manage container memory allocations carefully to ensure the optimal number and type of containers. Operating clusters with huge instances can also cause new issues with disk latency and network traffic, which can lead to cluster bottlenecks.

Mesos nodes with slow disk

The more tasks and containers you run on a single node, the more likely they are to fight for limited resources and impede each other performance-wise. This is particularly true for microservices environments, which make heavy use of shared network links that have limited bandwidth.

Co-locate containers based on communication behavior

To reduce network lag and prevent poor communication between chatty containers you should co-locate related containers on the same node by configuring Marathon application groups.

Request flow through services

In the example above, the two highlighted services should be co-located in an application group to prevent the network from being swamped with ~2M service calls in 2 hours, which equates to roughly 16.7k calls per minute. As you can see, real usage data from Ruxit service flow helps in optimizing container and service deployments with Marathon.
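
As a rough sketch, the call rate and a matching group definition could be produced like this. The service names, image names, hostname, and resource values are hypothetical. Pinning both apps to one node with a `hostname` `CLUSTER` constraint is this sketch's own way of forcing co-location, since grouping apps does not by itself guarantee placement on the same node.

```python
import json


def calls_per_minute(total_calls, hours):
    """Average call rate per minute over the observation window."""
    return total_calls / (hours * 60)


def colocated_group(group_id, apps, hostname):
    """Build a Marathon group whose apps are all pinned to one hostname.

    apps is a list of (name, docker_image) pairs.
    """
    return {
        "id": group_id,
        "apps": [
            {
                "id": f"{group_id}/{name}",
                "container": {"type": "DOCKER", "docker": {"image": image}},
                "cpus": 1.0,
                "mem": 1024,
                "instances": 1,
                # Assumption: pin every app in the group to the same node.
                "constraints": [["hostname", "CLUSTER", hostname]],
            }
            for name, image in apps
        ],
    }


rate = calls_per_minute(2_000_000, 2)
group = colocated_group(
    "/shop",
    [("frontend", "example/frontend:1.0"), ("catalog", "example/catalog:1.0")],
    "node-1.example.internal",
)
payload = json.dumps(group, indent=2)
```

The resulting `payload` is the kind of document you would submit to Marathon's groups endpoint (`/v2/groups`). Pinning to a single host trades resilience for latency, so it suits only services whose traffic pattern justifies it.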

Make sure your applications are healthy

Maintaining cluster health with Marathon health checks is only one part of the story. End users are focused on application performance, while DevOps teams are focused on finding and resolving performance hotspots and failing services. In a highly distributed system with an ephemeral infrastructure, this means you need to track all dependencies between services, service calls, databases, containers, and processes so that DevOps teams get actionable information about the components that need their attention.

In a microservices environment that includes dependent microservices, versions, APIs, load balancers, and caches, you need to be able to spot issues quickly, as they often impact functionality and slow down services used by other teams.
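
For reference, the Marathon health checks mentioned above are declared per application. A minimal sketch of an HTTP health check follows. The `/health` path, the app id, and the timing values are assumptions chosen for illustration.

```python
import json


def http_health_check(path="/health", port_index=0):
    """Return a Marathon HTTP health check definition as a dict."""
    return {
        "protocol": "HTTP",
        "path": path,
        "portIndex": port_index,
        "gracePeriodSeconds": 300,   # time allowed for the task to start up
        "intervalSeconds": 60,       # how often Marathon probes the task
        "timeoutSeconds": 20,        # how long one probe may take
        "maxConsecutiveFailures": 3, # failures tolerated before the task is killed
    }


# Hypothetical app fragment carrying the health check.
app_fragment = {"id": "/shop/frontend", "healthChecks": [http_health_check()]}
fragment_json = json.dumps(app_fragment, indent=2)
```

A check like this only tells Marathon whether a task responds. It says nothing about response times or failing downstream calls, which is why dependency tracking is still needed.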

Service dependencies and problem resolution

Tracking slowdowns and outages down to the code level of service methods and database queries helps your team resolve bugs quickly and deploy new versions of affected services and containers. This includes investigating failures in service calls, where you need to identify the root causes of HTTP 4xx and 5xx response codes.

Dynatrace Ruxit is the ideal solution for monitoring application and cluster health in highly dynamic container environments. It’s the quickest way to get the actionable information your team needs to fix problems.