A look behind the scenes of AWS Lambda and our new Lambda monitoring extension

In this practitioner’s blog post, the first in a series on AWS Lambda, we cover how Lambda works under the hood and how it's different from traditional host- or container-based systems. We also explore the considerations we made at Dynatrace in building the extension, along with examples of how we tackled the unique requirements of monitoring Lambda functions.

Since its introduction by AWS in 2014, AWS Lambda has revolutionized the compute space and boosted the entire serverless movement. Gartner predicts that by 2025, 50% of all global enterprises will have deployed serverless function platforms as a service (fPaaS), up from only 20% today.

Dynatrace has offered a Lambda code module for Node.js since 2017, and many customers have used it with great success while we collected requirements for the next iteration of our Lambda extension.

This has led to the recent release of our new Lambda monitoring extension supporting Node.js, Java, and Python. This extension was built from scratch to take into account all we’ve learned and the special requirements for monitoring ephemeral, auto-scaling, micro VMs like AWS Lambda.

A look under the hood of AWS Lambda

When using a platform, it always helps to have a rough idea of its inner workings. At AWS re:Invent in 2018, the Lambda team presented an excellent talk. Much of what’s covered in this blog post is taken from that talk.

What the Lambda team introduced in 2018 (for example, the new Firecracker VM) has since been fully rolled out. So while some details might have changed, the following overview is still accurate and relevant.

Lambda worker nodes

Lambda functions run on EC2 instances with Amazon Linux as the host OS and a hypervisor. These EC2 instances, the so-called “Lambda worker nodes,” host many Lambda functions from different customer accounts.

Distributing accounts across the infrastructure is an architectural decision, as a given account often has similar usage patterns, languages, and sizes for their Lambda functions. This can lead to a concentration of functions of the same language, size, and workload. Distributing functions regardless of the underlying account mitigates this. Of course, this requires a VM that provides rock-solid isolation, and in AWS Lambda, this is the Firecracker microVM. One VM executes a single Lambda function, as shown in the image below.

Figure 1: Conceptual overview of Lambda functions

The guest OS runs in a shielded sandbox, from which the language-specific Lambda runtime is executed. This runtime, in turn, executes the user’s function and does not allow concurrency. This means that at any given time, an instance of a Lambda function executes exactly one request. To handle N parallel requests, N Lambda instances need to be available; for example, AWS will automatically spin up as many as 1,000 instances to handle 1,000 parallel requests.

The provided function is usually a module or class with a specific interface known by the runtime. The function’s handler setting tells the runtime which file and exported function serve as the entry point (for example, index.handler in Node.js).

The following is a simple Lambda function for Node.js; it would typically live in an index.js file uploaded to AWS Lambda.

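exports.handler = async function(event, context) {
  // Log the incoming event payload as pretty-printed JSON
  console.log("EVENT: \n" + JSON.stringify(event, null, 2));
  // Return the name of the CloudWatch log stream for this function instance
  return context.logStreamName;
};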

The Lambda execution life cycle

AWS Lambda has a lot of moving parts that take care of executing your functions. When a new request comes in, the AWS Lambda infrastructure looks for an idle instance of the requested Lambda function or provisions a new one on a worker node with capacity and then executes the request.

The diagram below shows the complete provisioning life cycle of a Lambda invocation. The red boxes in this diagram show operations that only take place on a so-called “cold start.”

Figure 2: Lambda provisioning life cycle

Understanding cold starts and the life cycle of a single function

To understand cold starts, let’s take a closer look at the life cycle of an individual Lambda function instance.

Figure 3: Life cycle of an individual Lambda function

A cold start occurs when there’s no instance of the requested Lambda function available. In this case, as the provisioning life cycle diagram shows, the worker node has to be provisioned with the given Lambda function, which does, of course, take some time. We’re talking about a few hundred milliseconds, but this can have a considerable performance impact when cold starts occur too often. If there’s an idle instance of a Lambda function available, no cold start is needed, and the function can be executed right away.
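The routing rule described above can be illustrated with a toy model. This sketch is not how AWS implements it: the `FunctionPool` class and its method names are hypothetical, and it only models the rule that a request goes to an idle instance if one exists and otherwise triggers a cold start.

```javascript
// Toy model of Lambda instance routing. All names here are hypothetical;
// this illustrates the idle-or-provision rule, not AWS internals.
class FunctionPool {
  constructor() {
    this.idle = [];      // frozen instances waiting for work
    this.nextId = 1;
    this.coldStarts = 0;
  }

  // Begin handling a request: reuse an idle instance or provision a new one.
  acquire() {
    if (this.idle.length > 0) {
      return { instance: this.idle.pop(), coldStart: false };
    }
    this.coldStarts += 1;
    return { instance: { id: this.nextId++ }, coldStart: true };
  }

  // The request finished: the instance freezes and becomes reusable.
  release(instance) {
    this.idle.push(instance);
  }
}

const pool = new FunctionPool();

// Three requests arrive in parallel, so each needs its own instance.
const running = [pool.acquire(), pool.acquire(), pool.acquire()];
console.log(running.map((r) => r.coldStart)); // [ true, true, true ]

// All three finish and freeze; the next request reuses a warm instance.
running.forEach((r) => pool.release(r.instance));
console.log(pool.acquire().coldStart); // false
console.log(pool.coldStarts); // 3
```

In this model, a burst of parallel requests is what drives the cold-start count up, while sequential traffic keeps reusing the same warm instance.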

After each execution, AWS Lambda puts the instance to sleep. In other words, the instance freezes (similar to a laptop in hibernate mode). The virtual CPU is turned off. This frees up resources on the worker node. The overhead from waking up such a function is negligible.

Of course, the Lambda infrastructure also needs to take care of disposing of instances that are frozen for a certain length of time as the sandbox still allocates resources on disk. The time it takes until an unused function is disposed of is solely up to AWS, but a ballpark number is around one hour. Additionally, Lambda functions, regardless of whether they’re constantly utilized or not, will be disposed of after about six hours. This is another measure to evenly redistribute the load within the AWS Lambda infrastructure.

Special challenges when monitoring Lambda functions

In theory, an existing code module or agent can be used to monitor a Lambda function if there’s a way to load it into the running Lambda process. Usually, this can be done by defining a wrapping function that instruments the actual user function and then runs it.
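A minimal sketch of such a wrapper in Node.js might look like the following. The `instrument` helper and the `report` callback are hypothetical names used for illustration; this shows the general wrapping pattern, not the Dynatrace code module itself.

```javascript
// Hypothetical wrapper: times the user's handler and reports the outcome.
// `report` stands in for whatever a monitoring agent does with the data.
function instrument(userHandler, report) {
  return async function wrappedHandler(event, context) {
    const start = Date.now();
    try {
      const result = await userHandler(event, context);
      report({ ok: true, durationMs: Date.now() - start });
      return result;
    } catch (err) {
      report({ ok: false, durationMs: Date.now() - start, error: err.message });
      throw err; // rethrow so the Lambda runtime still sees the original failure
    }
  };
}

// The user's function stays unchanged.
async function userHandler(event, context) {
  return context.logStreamName;
}

// The wrapped function is what gets exported as the entry point.
exports.handler = instrument(userHandler, (record) => console.log(record));
```

Because the wrapper returns the user's result and rethrows the user's errors unchanged, callers of the function should not be able to tell that it is instrumented.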

However, as mentioned already, there’s a cold-start phase when the function code has to be loaded into the sandbox, and a function can freeze at any time. This means that the instrumenting code itself, which is the agent contained in the monitoring extension, should be small in size so it doesn’t add too much time to the download and unpacking phase.

Additionally, the reserved memory size of a Lambda function is a cost factor, and a small memory footprint is another requirement. Also, agents optimize communication and, therefore, send monitoring data to back ends in batches. As a Lambda function can freeze anytime without warning, a Lambda code module needs to try to get the data out-of-process as quickly as possible without extending the runtime of the Lambda function too much.
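One common way to cope with freezes is to flush buffered data before the handler returns. The sketch below assumes a hypothetical `send` function that ships a batch out of process, and the `MonitoringBuffer` and `withFlush` names are likewise made up for illustration; it shows the pattern in general, not how the Dynatrace code module does it.

```javascript
// Hypothetical in-memory buffer that is flushed at the end of every invocation,
// because the instance may be frozen as soon as the handler returns.
class MonitoringBuffer {
  constructor(send) {
    this.send = send;   // assumed: async function that ships a batch out of process
    this.records = [];
  }

  add(record) {
    this.records.push(record);
  }

  async flush() {
    if (this.records.length === 0) return;
    const batch = this.records;
    this.records = [];
    await this.send(batch);
  }
}

function withFlush(userHandler, buffer) {
  return async function (event, context) {
    try {
      return await userHandler(event, context);
    } finally {
      // Runs on success and on failure. Errors from sending are swallowed
      // so that monitoring never fails the invocation itself.
      await buffer.flush().catch(() => {});
    }
  };
}
```

The trade-off is visible in the `finally` block: every invocation pays for one batch send, which is why the amount of data and the speed of the send matter so much in this environment.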

On the other hand, features needed for monitoring large applications, like memory dumps, code-level visibility, or event loop metrics on Node.js, are not relevant for functions that usually run for just a few milliseconds and aren’t complex.

These are requirements and constraints that don’t exist in traditional environments, not even when running microservices in containers. Dynatrace tackled these challenges by writing our Lambda code module from scratch to include the following:

  • A small file size
  • A small memory footprint
  • A fast cold start

Monitoring multiple ephemeral instances of the same function

It would be easy to treat a single instance of a Lambda function like a host or a container and report its data individually, but this would provide limited value in this computing model.

Dynatrace aggregates the data from all instances of a single Lambda function within a single service. The result is overall performance metrics that provide valuable insights into how a function is performing regardless of the instance that handles the request.
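The idea of aggregating per function rather than per instance can be sketched as follows. The record shape (`functionName`, `instanceId`, `durationMs`) is an assumption made for illustration and does not reflect the Dynatrace data model.

```javascript
// Group invocation records by function name, ignoring which instance served them.
function aggregateByFunction(records) {
  const services = new Map();
  for (const { functionName, durationMs } of records) {
    const s = services.get(functionName) || { invocations: 0, totalMs: 0 };
    s.invocations += 1;
    s.totalMs += durationMs;
    services.set(functionName, s);
  }
  return [...services].map(([functionName, s]) => ({
    functionName,
    invocations: s.invocations,
    avgMs: s.totalMs / s.invocations,
  }));
}

// Hypothetical sample data: two instances of "checkout", one of "search".
const records = [
  { functionName: "checkout", instanceId: "a", durationMs: 10 },
  { functionName: "checkout", instanceId: "b", durationMs: 30 },
  { functionName: "search", instanceId: "c", durationMs: 8 },
];
console.log(aggregateByFunction(records));
```

Here the two `checkout` instances collapse into one entry with two invocations, which mirrors the "one service per function" view described above.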

AWS Lambda Monitoring isn’t an afterthought—it’s seamlessly integrated into Dynatrace. The screenshot below shows a Lambda function represented as a service, with all key Lambda metrics collected from CloudWatch mapped to the service.

Figure 4: Lambda service page

In a future blog post, we’ll explain how to interpret Lambda metrics. A key benefit is that Lambda function metrics are part of the larger Dynatrace stack, allowing you to see Lambda operations in the context of the complete application.

Why metrics alone aren’t enough

Lambda functions often serve as a gluing tier or a gateway from on-premises systems to AWS services; they’re almost never standalone. Depending on how resilient the calling system is, a failure in a Lambda function can potentially take down an entire website or application.

To understand how your Lambda services are called and which downstream services they depend on, you need distributed tracing. On top of that, distributed tracing also provides exact metrics for response times along with metadata for every single call.

In Dynatrace, distributed tracing isn’t a bolted-on addition to the product; it’s a core feature of the Dynatrace platform. True to our heritage, we made sure that our new Lambda distributed tracing functionality is as feature rich and detailed as you can find in any of our offerings (see the PurePath view of a Lambda function below).

Figure 5: PurePath view of a Lambda function

Dynatrace automatically detects services and treats Lambda functions just like any other service. This means that Lambda functions show up in all the topology-centered views such as Service flow (shown below), Backtrace, or Smartscape, providing valuable, aggregated insights about how different functions are called.

Figure 6: Service flow of a Lambda function

With regard to cold starts and how they affect performance, ideally you want to know which requests were slowed down because of a cold start or, more generally, how cold starts affect the response time.

Our Lambda code module automatically marks requests that contain a cold start and makes them available for filtering and charting, so you can identify situations where cold starts impact performance (see the invocations and response times below for a Lambda function, filtered by cold starts). It’s also possible to apply Dynatrace anomaly detection to the number of cold starts and be alerted when that number increases abnormally.

Figure 7: Invocations and response times filtered by Cold Starts

How Dynatrace compares to other solutions

When we set out to create the new Lambda extension, we benchmarked other dedicated Lambda monitoring solutions that were already on the market. They didn’t provide anything beyond Lambda or AWS monitoring, and they set a functionality baseline for us. We’re excited that we’ve delivered a differentiated and integrated approach to Lambda monitoring that’s second to none.

Most full-featured monitoring solutions only give you CloudWatch metrics. Some provide in-process agents, but the way they’re deployed is rather complicated. Also, if a competing agent provides distributed tracing, it feels bolted on, and the data provided isn’t integrated with the rest of the product. Our competitors also don’t offer anomaly detection or AI that combines topology, metrics, and traces to detect the root causes of problems. Dynatrace has been a market leader in this space year after year, and the addition of our Lambda offering extends our capabilities.

Dynatrace also delivers a unique capability by providing end-to-end tracing from real user actions in web applications all the way through to Lambda functions; this is something not seen in any other solution.

Lambda functions are often used as back ends for static, single-page applications, and many customers asked for the same level of insights they get when running Dynatrace against a regular web server. Our solution fulfills this requirement now as well.

Figure 8: End-to-end tracing for XHR calls into Lambda functions

What’s next

As you can see, a lot of thought and complex considerations went into our new Lambda extension. In describing this extension so far, we’ve only scratched the surface and haven’t covered Lambda deployment or explained the data collected in detail. You’re probably also interested in the most common use cases we see with our customers. We’ll cover these topics and more in follow-up blog posts over the coming weeks.

The Dynatrace AWS Lambda extension, with all its advanced capabilities, is just a first step—we have lots of great functionality in the pipeline including:

  • Ingestion of AWS Lambda logs
  • Support for additional languages
  • Support for container images deployed on AWS Lambda

Stay tuned for more blog posts and announcements or, if you’re new to Dynatrace, start your free trial today.
