Why I'm spicy about Lambda coldstarts


Big coldstart will try anything. It’s bikeshedding. Git gud. Negligible at scale. Use the AWS SDK v3. Read this blog. Just use provisioned concurrency.

But the truth is, we aren’t holding it wrong and coldstarts aren’t all your fault. In this post, I fail to stay frosty. In fact, I get spicy and make the case that we shouldn’t lay off the pepper until we stop getting blamed for AWS’s coldstart problem.

Warning

The opinions in the links above are all valid. I encourage you to read them and come to your own conclusions. Spiciness is a personal choice. If you've established a TP99 budget and followed the 5 stages of Lambda coldstarts:

Ignorance 🫠
There I fixed it 🔨
Oh fresh hell, there’s more 🕷️
Blogging to dull the pain 🍸
Spiciness 🌶️

and you've actually tried to solve your coldstart problem with the AWS JavaScript SDK v3, only to find AWS is doing things that make it hard, then this post is for you.

What is a cold start?

It’s the occasional 125 - 1500 ms before Lambda starts doing what you want it to. It's when Lambda needs to create a container, load your code, and initialize it to serve the request. Most people don’t notice, but those who pay attention do.
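To see how often this happens to you, a common trick is to use a module-scope flag, since module-level code runs once per new execution environment. This is a minimal sketch, not anything official; the helper name `consumeColdStartFlag` and the log field names are my own assumptions.

```typescript
// Module scope runs once per execution environment, so this timestamp
// and flag are set only when Lambda initializes a fresh instance.
const initStartedAt = Date.now();
let coldStart = true;

// Returns true on the first call in this environment, false afterwards.
// (Hypothetical helper name, for illustration only.)
function consumeColdStartFlag(): boolean {
  const wasCold = coldStart;
  coldStart = false;
  return wasCold;
}

export const handler = async (_event: unknown) => {
  const wasCold = consumeColdStartFlag();
  // One structured log line per invocation makes cold starts easy to count.
  console.log(
    JSON.stringify({ coldStart: wasCold, msSinceInit: Date.now() - initStartedAt }),
  );
  return { statusCode: 200, body: JSON.stringify({ coldStart: wasCold }) };
};
```

Counting the log lines where `coldStart` is true gives a rough cold start rate to hold against your TP99 budget.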

we upgraded from aws sdk v2 to v3 and our cold starts went from p90 of 640ms to ~1900ms 🤯

I added a trace picked at random for illustration pic.twitter.com/hR0dY9LCGc
— boris tane (@boristane) December 20, 2023

According to the shared responsibility model, the Lambda team is responsible for optimizing the runtime and we are responsible for optimizing our own code. Runtime-wise, the Lambda team has worked hard to innovate and reduce the impact of coldstarts. There's Firecracker and how they've built a distributed cache to speed up container loads. Good stuff. Remember, part of the power of Lambda is that it's a compression algorithm for experience. Years of best practices are baked into it.

Great. But when you start to optimize your code, eventually you'll realize that those best practices don't extend to everything built at AWS. When that happens, I think of this Office Space quote: "No way. Why should I change? He's the one that sucks."

When does a coldstart become a problem?

Low-throughput APIs. Certain APIs are low throughput and this is where coldstarts start to sting. Think user signup, password reset and, in some cases, user sign-in if the token lasts all day. Thankfully, your end user absorbing a second or two in these flows doesn't usually matter. Where it does start to matter is when there is a hard timeout. API Gateway Lambda Authorizers have a 10 second timeout, which isn't so bad, but Amazon Cognito triggers have a 5 second timeout, which is more painful. This can lead to retries, general weirdness and novel workarounds.
Bursty APIs with highly cacheable content. The longer the cold start takes, the more instances need to be spun up because every request thinks it needs to warm the cache until the first one finishes. Once the cache is warm, the latency drops back down and the number of instances required to handle the burst also goes down. This is a good use case for provisioned concurrency, if you can't get the cold start down to an acceptable level.
At scale. This one may not be obvious, but the repeated execution of non-optimized code adds up. It requires more instances to handle the same number of requests. Lambda foots the bill on this, but the latency is passed on to us. Hat tip to Filip Pyrek on this one.

What are some of the ways AWS slows down our coldstarts?

aws-cdk/25492 The default bundling with the CDK loads the aws-sdk from disk. Why would I turn on bundling if I didn't want bundling?
aws-js-sdk-v3/5516 The SSO tax. Even though SSO isn't used in Lambda, it is still loaded.
aws-xray-sdk-node/482 The X-Ray SDK doesn't support ESM bundling, which increases your bundle size.
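For the first issue, one workaround is to tell the CDK's `NodejsFunction` construct not to treat the SDK as external, so esbuild bundles it and can tree-shake it. This is a sketch of a workaround, not a fix for the issue itself; the stack name `ApiStack`, the construct id, and the `src/handler.ts` path are hypothetical.

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import { Runtime } from "aws-cdk-lib/aws-lambda";
import { NodejsFunction } from "aws-cdk-lib/aws-lambda-nodejs";
import { Construct } from "constructs";

// Hypothetical stack, for illustration only.
export class ApiStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new NodejsFunction(this, "Handler", {
      entry: "src/handler.ts", // assumed path to your handler
      runtime: Runtime.NODEJS_20_X,
      bundling: {
        // An empty list overrides the default, which marks the AWS SDK
        // as external and leaves it to be loaded from the runtime's disk.
        externalModules: [],
        // Minify to shrink what Lambda has to load and parse.
        minify: true,
      },
    });
  }
}
```

Whether this helps depends on your code, so compare init durations before and after rather than taking it on faith.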

What about provisioned concurrency?

Don't be fooled, provisioned concurrency is protection money. There are cases where it makes sense, but remember you pay $5.50 per provisioned concurrency/month. For the same price you can run an EC2 T3 nano or make 3.5 million more Lambda requests. If you optimize your coldstarts, or the above GitHub issues are fixed, you probably don't even need it. Personally, I'd rather spend the protection money on the $2.99 to remove ads from Prime Video.

There are a couple other gotchas as well. It will add 2-3 minutes to each deployment and you may not be able to immediately turn it on. If you use AWS Organizations like I do, you may need to request a limit increase first. When I tried to enable it, I found the default Lambda concurrency limit for org accounts is 10. You need a concurrency limit of at least 101 to turn it on.

What can AWS do about it?

The Lambda team should meet with the other teams and hash a few of those GitHub issues out. The impact is measurable, the issues are solvable, and fixing them would make a difference for all of us. The best part is we wouldn't need to change anything, just upgrade the SDK/CDK. Also, while I'm spicy, I'll make one more request. Make it easier to test a coldstart. It's too much effort to trigger one today (wait, update an environment variable or push code). If there was a toggle or environment variable we could set to make every request a coldstart, I could test multiple coldstarts and tune my code faster.
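Until then, the environment variable route can at least be scripted. Here is a rough sketch that builds the request for a configuration update with a throwaway variable changed; the variable name `COLD_START_NONCE` and the helper `buildColdStartUpdate` are my own inventions, not anything Lambda defines.

```typescript
// Hypothetical variable name; any otherwise-unused key works.
const NONCE_KEY = "COLD_START_NONCE";

// Mirrors the subset of the UpdateFunctionConfiguration request used here.
interface ColdStartUpdate {
  FunctionName: string;
  Environment: { Variables: Record<string, string> };
}

// Builds an update that changes only the nonce, which makes Lambda
// start fresh execution environments for subsequent invocations.
function buildColdStartUpdate(
  functionName: string,
  currentVariables: Record<string, string> = {},
  nonce: string = Date.now().toString(),
): ColdStartUpdate {
  return {
    FunctionName: functionName,
    Environment: {
      // Spread the existing variables: the update replaces the whole map,
      // so leaving them out would delete them.
      Variables: { ...currentVariables, [NONCE_KEY]: nonce },
    },
  };
}
```

With `@aws-sdk/client-lambda`, you could read the current variables with `GetFunctionConfigurationCommand` and pass the result of this helper to `UpdateFunctionConfigurationCommand`. It works, but it is still more ceremony than a simple toggle would be.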

Conclusion

Spend a little time optimizing your coldstarts, but after that, stay spicy, friends. AWS needs to do their part too.