Logging for scale
How to log Hello World is taught, but how to log something useful isn't. Too often, developers add logging until their code works and then forget about it. In this post I'll teach you how to instrument your code with request logs so that when you have millions of users and production is going sideways, you know exactly what is going on.
Note
Although the concepts in this post apply to any platform, I will focus on using AWS Lambda, the Embedded Metrics Format, Lambda PowerTools and AWS CloudWatch. These technologies pair extremely well together and make logging for scale easy.
Why log?
When something goes sideways, you need to answer questions like:
- What is going on? Is it me or a dependency throwing exceptions?
- Who is hammering my API?
- Which of my dependencies is slow right now?
If you've instrumented logging correctly, these answers come to you.
The two types of logs
There are two types of logs: application logs and request logs.
Application logs contain a timestamped story of a request as it's processed. They are emitted multiple times per request. Request logs summarize a single request. They are emitted only once per request after all processing is complete.
If you've run a Lambda function before, you've seen both types of logs. The application logs tell you what the code is about to do or what it has done with some context:
2023-09-02T11:01:10.005-07:00 INIT_START Runtime Version: nodejs:18.v10 Runtime Version ARN: arn:aws:lambda:us-west-2::runtime:45f492ae7ecdbdc8d85531c46f746fe05a6d2ebdca043c86f79fdb6ae400f8a2
2023-09-02T11:01:11.190-07:00 START RequestId: 31282257-ac1c-4ced-8405-c61866a7b7e4 Version: $LATEST
2023-09-02T11:01:11.787-07:00 END RequestId: 31282257-ac1c-4ced-8405-c61866a7b7e4
Request logs provide a concise summary of each request:
2023-09-02T11:01:11.787-07:00 REPORT RequestId: 31282257-ac1c-4ced-8405-c61866a7b7e4 Duration: 596.56 ms
Billed Duration: 597 ms Memory Size: 256 MB Max Memory Used: 118 MB Init Duration: 1184.65 ms
Only one of these log types helps you quickly answer questions when something goes wrong at scale. Can you guess which one? Probably not yet because the Lambda logs don't have the necessary business context. You can find requests that took a long time, but you don't know who made them or why they failed. You need to add business context to the logs yourself. Let's look at some logs that have business context.
Real world logs
These logs come from a production service that federates users into the AWS console in a role. To federate into the AWS console, the user provides an AWS account and role. The API has the following behavior:
- If the user's token has expired or is missing, they'll be redirected with a 302 to GitHub to login.
- If the user has access to that role, they'll be redirected with a 302 to the AWS console in that role.
- If the user doesn't have access, they'll get a 403 Forbidden error.
The application log
{
  "timestamp": "2023-09-03T18:12:10.820Z",
  "request_id": "6ef98b9b-cb69-43d8-9039-1d68e2461880",
  "level": "INFO",
  "message": "Validating user token"
}
Application logs have four main parts.
- Timestamp
"timestamp": "2023-09-03T18:12:10.820Z"
- Request id
"request_id": "6ef98b9b-cb69-43d8-9039-1d68e2461880"
- Log level
"level": "INFO"
- Message
"message": "Validating user token"
Sometimes application logs include the package, the thread id and the function name that emitted the log. If there is an exception, they should include the stack trace. You use them like breadcrumbs: if a user reports an error or the code dies in the middle of a request, you'll be able to see what happened with some context.
Publishing application logs
To publish application logs, use the Lambda PowerTools Logger utility. This is how you would create a log entry, assuming you've initialized the logger with const logger = new Logger(); outside of your Lambda handler:
logger.info("Validating user token");
To log an error, just include the Error:
logger.error("User token is malformed", error);
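Putting it together, here's a minimal sketch of a handler wired up with the Logger. The serviceName matches the one used for metrics later, and the token-validation placeholder is an assumption; addContext attaches Lambda context (request id, cold start, etc.) to every subsequent entry.
import { Logger } from "@aws-lambda-powertools/logger";

//initialize once, outside the handler, so it is reused across invocations
const logger = new Logger({ serviceName: "api" });

export const handler = async (event, context) => {
  //attach Lambda context (request id, cold start, etc.) to every log entry
  logger.addContext(context);
  logger.info("Validating user token");
  try {
    //...validate the token and process the request here (placeholder)
  } catch (error) {
    logger.error("User token is malformed", error);
    throw error;
  }
};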
This is the last we'll say about application logs: when you are running something at scale, you mainly care about how your service is performing in aggregate, and application logs are not useful in that regard.
The request log
A request log collects the important information about a request into one entry. A request log contains metadata like caller information, return codes, request ids, latencies, stack traces and counters. Here is an example:
Note
This log is in the Embedded Metrics Format (EMF). EMF lets you emit metrics and logs simply by writing to the console, without making a synchronous call to a remote service that would add a dependency and latency.
2023-08-29T20:06:12.884Z f403b91f-e2e7-40ac-86e6-c7319206010a INFO
{
  "_aws": {
    "Timestamp": 1693339572884,
    "CloudWatchMetrics": [
      {
        "Namespace": "speedrun", //the namespace and dimensions of the metrics emitted
        "Dimensions": [
          [
            "service", "route" //route is the API the user is calling
          ]
        ],
        "Metrics": [ //the metric names and types emitted by this log entry
          {
            "Name": "error",
            "Unit": "Count"
          },
          {
            "Name": "fault",
            "Unit": "Count"
          },
          {
            "Name": "latency",
            "Unit": "Milliseconds"
          }
        ]
      }
    ]
  },
  "service": "api",
  "error": 0, //whether there was a user error with the request
  "fault": 0, //whether there was a system error with the request
  "latency": 401, //the latency of the request
  "startTime": 1693339572483, //when the request started in epoch time
  "path": "/v1/federate/051563175502", //the REST API path
  "route": "/federate/:account", //the API route
  "ip": "94.101.18.209", //the user's client IP
  "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/116.0", //the user's user-agent
  "coldStart": false, //whether this was a cold start
  "account": "051563175502", //the account the user is attempting to federate into
  "jti": "d7904dbf-3720-4689-b9b9-4915feaf67bd", //the id of the user's JWT token
  "login": "perpil", //the user's GitHub login
  "role": "speedrun-Administrator", //the role the user is attempting to assume
  "requestId": "f039cb44-1b92-438e-9ef1-3130a18bfe08", //lambda request id
  "version": 21, //lambda version (corresponds to your deployment)
  "xray_trace_id": "1-64fa41e5-27ecc4df60a344964bd4ad73", //xray trace id
  "statusCode": 302 //the return code of the request
}
Now we've got something to work with. Let's break this down and understand what we are logging. This list acts as a prescription for what you should log.
- The API being called
"route": "/federate/:account"
- Who made the request
"login": "perpil", //the users' GitHub login "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/116.0", //The users' user-agent "ip": "94.101.18.209", // the users' client IP
- Important request information
"account": "051563175502", //the account the user is attempting to federate into "role": "speedrun-Administrator", //the role the user is attempting to assume "path": "/v1/federate/051563175502", //the request path
- The request ids of the request
"requestId": "f039cb44-1b92-438e-9ef1-3130a18bfe08", //lambda request id "xray_trace_id": "1-64fa41e5-27ecc4df60a344964bd4ad73", //xray trace id for further digging into the request "jti": "d7904dbf-3720-4689-b9b9-4915feaf67bd", //the id of the users' JWT token
- The start time
"startTime": 1693339572483, //when the request started in epoch time
- The latency of the request
"latency": 401, //the latency of the request
- Whether the request failed due to user error. Emit 0 if no failure, 1 if failure.
"error": 0, //whether there was a user error with the request
- Whether the request resulted in a system error. Emit 0 if no failure, 1 if failure.
"fault": 0, //whether there was a system error with the request
- Important response details
"statusCode": 302 //the return code of the request
- What version of the code you are running; this is useful for comparing behavior across versions.
"version": 21, //lambda version (corresponds to your deployment)
Additionally, if there is an error, set the appropriate field (e.g. "fault": 1) and log:
- The error message
"errorDetails":"TypeError: @ny. is not a function"
- The error type
"errorType": "TypeError"
- The error stack trace
"errorStack": `TypeError: @ny. is not a function at Array.find (<anonymous>) at /var/task/handler.js:398:108 at processTicksAndRejections (node:internal/process/task_queues:96:5) at async /var/task/node_modules/lambda-api/index.js:315:23`
Publishing request logs
To publish request logs, use the Lambda PowerTools Metrics utility. Initialize it outside of your Lambda handler like this, supplying your own namespace and serviceName:
//Powertools for AWS Lambda Metrics utility
import { Metrics, MetricUnits } from "@aws-lambda-powertools/metrics";

const metrics = new Metrics({
  namespace: "speedrun",
  serviceName: "api",
});
Use addDimension to set the API or route in your Lambda handler; this allows you to get error, fault and latency metrics per API:
metrics.addDimension("route", routeName);
As you process the request, add any important context as you encounter it with addMetadata:
metrics.addMetadata("login", token.getUserId());
To add latency, fault and error metrics do something like this in your handler:
let startTime = Date.now();
try {
  //do all request processing here
} finally {
  //extract the errorDetails from the response if it is user facing
  if (response.statusCode >= 400 && response.statusCode < 500) {
    let error = undefined;
    try {
      error = JSON.parse(response.body).error;
    } catch (err) {}
    error = error || response.body;
    metrics.addMetadata('errorDetails', error);
  }
}
//user errors are statusCodes >= 400 and < 500
metrics.addMetric(
  'error',
  MetricUnits.Count,
  response.statusCode >= 400 && response.statusCode < 500 ? 1.0 : 0.0
);
//system errors are statusCodes >= 500
metrics.addMetric(
  'fault',
  MetricUnits.Count,
  response.statusCode >= 500 ? 1.0 : 0.0
);
//add statusCode as metadata
metrics.addMetadata('statusCode', response.statusCode);
//add version as metadata
metrics.addMetadata('version', process.env.AWS_LAMBDA_FUNCTION_VERSION);
const latency = Date.now() - startTime;
//add latency as a metric
metrics.addMetric('latency', MetricUnits.Milliseconds, latency);
//publish the metrics
metrics.publishStoredMetrics();
If you encounter an error, add the details, type and stack trace using addMetadata:
metrics.addMetadata("errorDetails", err.message);
metrics.addMetadata("errorType", err.type);
metrics.addMetadata("errorStack", err.stack);
Now that we know what and how to log, let's use it to alarm and answer our questions.
Creating alarms
Since we are publishing metrics and logs at the same time, we can use the fault metric to create a CloudWatch Alarm on system errors. For example, an alarm like AVG(fault) > 5% notifies us whenever faults exceed 5% of requests for an API.
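The post doesn't prescribe a deployment tool, but as a rough sketch, assuming you deploy with CDK, the alarm could look something like this. The stack wrapper, period and evaluation settings are assumptions; the namespace and dimensions match the Metrics setup above.
import { Duration, Stack, StackProps } from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import { Construct } from "constructs";

export class AlarmsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    //fault is 0 or 1 per request, so its average is the fault rate
    const faultRate = new cloudwatch.Metric({
      namespace: "speedrun",
      metricName: "fault",
      dimensionsMap: { service: "api", route: "/federate/:account" },
      statistic: "Average",
      period: Duration.minutes(5),
    });

    faultRate.createAlarm(this, "FaultRateAlarm", {
      threshold: 0.05, //5% of requests faulted
      evaluationPeriods: 3,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
    });
  }
}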
Answering operational questions
Let's look at how we can answer our operational questions using CloudWatch Logs Insights and our request logs. CloudWatch Logs Insights lets you query your logs using a SQL-like syntax.
Whose fault is it?
To figure out whether it is our code or a dependency, query for faults and order by count.
filter fault=1 |
stats count() as total by errorType |
sort total desc
Once you've got the top error, get the stack traces for it.
filter errorType = 'TypeError' |
display errorStack
Who is hammering my API?
To figure out who is hammering an API:
filter ispresent(login) |
stats count() as rps by bin(1s), login, route |
sort rps desc
This will give you the logins hitting your APIs with the highest requests per second.
Which one of my dependencies is slow?
To figure out which of your dependencies is slow, you need to either use X-Ray or add additional latency metadata around your dependency calls. To add it as metadata, do something like this:
let startTime = Date.now();
try {
  //interact with ddb
} finally {
  metrics.addMetadata("ddbLatency", Date.now() - startTime);
}
Then run a query like this to get the highest latency requests:
sort latency desc
And expand the rows to see which dependency has the highest latency.
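To see it in aggregate rather than per request, a query along these lines (assuming you logged ddbLatency as above) compares your dependency's average latency to overall latency per route:
filter ispresent(ddbLatency) |
stats avg(ddbLatency) as avgDdbLatency, avg(latency) as avgLatency by route |
sort avgDdbLatency desc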
Further reading
The Amazon Builders' Library has an excellent article called "Instrumenting distributed systems for operational visibility" which covers best practices for logging. I've worked with David Yanacek and he is a fountain of knowledge on logging.
Stripe calls what I refer to as request logs "canonical logs". In the post "Canonical log lines: structured logging that’s easy to read" they cover many of the same points.
Boris Tane, who built the observability company Baselime, writes "Why is logging so damn hard" and covers why you need a robust observability platform to get the most from your logs.
Look into building CloudWatch Dashboards for some of the CloudWatch Logs Insights queries above so you can get answers from a dashboard instead of manually running queries during an event.
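For example, continuing the hypothetical CDK stack from the alarm sketch above, you could pin the "whose fault is it" query to a dashboard widget (the log group name is a placeholder):
//inside the same CDK stack as the alarm sketch above
const dashboard = new cloudwatch.Dashboard(this, "OpsDashboard", {
  dashboardName: "speedrun-api-ops",
});

dashboard.addWidgets(
  new cloudwatch.LogQueryWidget({
    title: "Top faults by errorType",
    logGroupNames: ["/aws/lambda/speedrun-api"], //placeholder log group name
    queryLines: [
      "filter fault=1",
      "stats count() as total by errorType",
      "sort total desc",
    ],
    width: 12,
  })
);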
Finally, look into CloudWatch Contributor Insights. It allows you to answer questions like "what are my top errors" and "who are my heaviest users" using the request logs we just built.
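As a sketch, a Contributor Insights rule for "who are my heaviest users" over these request logs looks roughly like this (the log group name is a placeholder):
{
  "Schema": { "Name": "CloudWatchLogRule", "Version": 1 },
  "LogGroupNames": ["/aws/lambda/speedrun-api"], //placeholder log group name
  "LogFormat": "JSON",
  "Contribution": { "Keys": ["$.login"], "Filters": [] },
  "AggregateOn": "Count"
}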
Now go fix your logs.