Goal-Oriented Auto Scaling in the Cloud

The ability to scale your environment on demand is one of the key advantages of a public cloud like Amazon EC2. Amazon provides a lot of functionality, such as Auto Scaling groups, to make this easy. The one downside in my mind is that basing auto scaling on system metrics is a little naive and, in my experience, only works well in a limited number of scenarios. I wouldn’t want to scale indefinitely either, so I need to choose an arbitrary upper boundary for my auto scaling group. Both the upper boundary and the system metric are unrelated to my actual goal, which is always application related, e.g. throughput or response time.

Some time back Amazon added the ability to publish custom metrics to CloudWatch. This opens up interesting possibilities. One of them is goal-oriented auto scaling.

Scale for desired throughput

A key use case that I see for a public cloud is batch processing. This is throughput oriented rather than response time oriented. I can easily upload the measured throughput to CloudWatch and trigger auto scaling events on lower and upper boundaries. But of course I don’t want to base scaling events on throughput alone: if my application isn’t doing anything, I wouldn’t want to add instances. On the other hand, defining the desired throughput statically might not make sense either, as it depends on the current job. My actual goal is to finish the batch in a specific timeframe. So let’s size our EC2 environment based on that!

I wrote a simple Java program that takes the current throughput, the remaining time and the remaining number of transactions, and calculates the throughput needed to finish in time. It then expresses the actual throughput as a percentage of the needed throughput and pushes this out to CloudWatch.

public void setCurrentSpeed(double transactionsPerMinute, long remainingTransactions,
                            long remainingTimeInMinutes, String JobName)
{
  double targetTPM;
  double currentSpeed;
  if (remainingTimeInMinutes > 0 && remainingTransactions > 0)
  { // time left and something to be done
    // cast to double: long / long would truncate the required throughput
    targetTPM = (double) remainingTransactions / remainingTimeInMinutes;
    currentSpeed = transactionsPerMinute / targetTPM;
  }
  else if (remainingTransactions > 0) // no time left but transactions left?
    throw new SLAViolation(remainingTransactions);
  else // all done
    currentSpeed = 2; // tell our algorithm that we are too fast
                      // if we don't have anything left to do

  PutMetricDataRequest putMetricDataRequest = new PutMetricDataRequest();
  MetricDatum o = new MetricDatum();
  o.setDimensions(Collections.singleton(new Dimension().withName("JobName").

After that I started my batch job with a single instance and started measuring the throughput. When putting the “CurrentSpeed” into a chart it looked something like this:

The speed would start at 200% and go down according to the target time after the start

It started at 200%, which my Java code reports if the remaining transactions are zero. Once I started the load, the calculated speed went down to indicate the real relative speed. It quickly dropped below 100%, indicating that it was not fast enough to meet the time window. The longer the run took, the less time it had to finish, which meant that the throughput required to finish in time kept growing. In other words, the relative speed was decreasing. So I went ahead and created three Auto Scaling actions and the respective alarms.

The first doubled the number of instances if the current speed was below 50%. The second added 10% more instances as long as the current speed was below 105% (a little safety margin). Both actions had proper thresholds and cooldown periods attached to prevent unlimited sprawl. The result was that the number of instances grew quickly until the throughput was a little more than required. I then added a third policy. This one would remove one instance as long as the relative current speed was above 120%.
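
The three policies described above can be sketched as a single decision function. This is only an illustration of the thresholds from the text. In a real setup the decisions are made by Auto Scaling policies triggered by CloudWatch alarms, not by application code. The class name, the method name and the maxInstances cap are hypothetical, and rounding the 10% step up to at least one instance is my assumption.

```java
class ScalingPolicySketch {
    // Thresholds taken from the text, expressed as fractions of the target speed.
    static final double DOUBLE_BELOW = 0.50;  // policy 1: double the instances
    static final double GROW_BELOW = 1.05;    // policy 2: add 10% more instances
    static final double SHRINK_ABOVE = 1.20;  // policy 3: remove one instance

    // Returns the instance count one policy evaluation would ask for.
    // maxInstances is a hypothetical upper bound for the group.
    static int desiredInstances(int current, double relativeSpeed, int maxInstances) {
        int desired = current;
        if (relativeSpeed < DOUBLE_BELOW) {
            desired = current * 2;
        } else if (relativeSpeed < GROW_BELOW) {
            // 10% of the current count, rounded up with integer arithmetic (assumption)
            desired = current + Math.max(1, (current + 9) / 10);
        } else if (relativeSpeed > SHRINK_ABOVE) {
            desired = current - 1;
        }
        // Never go below one instance or above the cap.
        return Math.max(1, Math.min(maxInstances, desired));
    }

    public static void main(String[] args) {
        System.out.println(desiredInstances(1, 0.40, 20));  // far too slow
        System.out.println(desiredInstances(4, 0.90, 20));  // slightly too slow
        System.out.println(desiredInstances(5, 1.10, 20));  // inside the target band
        System.out.println(desiredInstances(5, 1.25, 20));  // faster than needed
    }
}
```

Between 105% and 120% nothing changes. That band is what keeps the group from flapping between adding and removing instances.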

The adjustments result in higher throughput, which adjusts the relative speed

As the number of instances increased, so did my application’s throughput, until it achieved the required speed. As it was faster than required, the batch would eventually be done ahead of time. That means that for every minute it stayed faster than needed, the required throughput kept shrinking, which is why you see the relative speed increasing in the chart although no more instances were added.

Remaining Time (min)   Remaining Transactions   Required Throughput (per min)   Actual Throughput (per min)   Relative Speed
60                     600                      10                              12                            120%
59                     588                      9.97                            12                            120.4%
58                     576                      9.93                            12                            120.8%
57                     564                      9.89                            12                            121.2%
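
The arithmetic behind the table can be replayed with a few lines of Java. This is a minimal sketch: the class and method names are hypothetical, and it assumes a constant actual throughput of 12 transactions per minute, as in the table.

```java
import java.util.Locale;

class RelativeSpeedTable {
    // Throughput needed to finish the remaining work in the remaining time.
    static double requiredThroughput(long remainingTransactions, long remainingMinutes) {
        return (double) remainingTransactions / remainingMinutes;
    }

    // Actual throughput as a fraction of the required throughput (1.2 means 120%).
    static double relativeSpeed(double actualPerMinute, long remainingTransactions,
                                long remainingMinutes) {
        return actualPerMinute / requiredThroughput(remainingTransactions, remainingMinutes);
    }

    public static void main(String[] args) {
        long actualPerMinute = 12;  // constant throughput, as in the table
        long remaining = 600;       // transactions left at the start
        for (long minutes = 60; minutes >= 57; minutes--) {
            System.out.printf(Locale.ROOT,
                "%d min left, %d transactions: required %.2f/min, relative speed %.1f%%%n",
                minutes, remaining,
                requiredThroughput(remaining, minutes),
                100 * relativeSpeed(actualPerMinute, remaining, minutes));
            remaining -= actualPerMinute;  // one minute of work done
        }
    }
}
```

The printed values are rounded rather than truncated, so the last digit can differ slightly from the table. With an actual throughput below the required one, the same calculation yields a relative speed that falls minute by minute, which is the behavior seen at the start of the run.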

Upon breaching the 120% barrier, the last auto scaling policy removed an instance and the relative speed dropped. This led to a number of instances closer to what was actually required to finish the job.


Elastic scaling is very powerful and especially useful if we couple it with goal-oriented policies. The provided example does of course need some fine tuning, but it shows why it makes sense to use application-specific metrics instead of indirectly related system metrics to meet an SLA target.

Those who know me know that I'm passionate about 3 things: rock climbing, physics, and performance. I've worked in performance monitoring and optimizations in enterprise environments for the better part of the last 10 years. Now as a Product Manager I am doing my best to build those experiences into Dynatrace.