CloudWatch Logs Insights lets you query your logs interactively, but its real power is unleashed when you need to sift through terabytes of data generated by thousands of Lambda functions.
Let’s see it in action. Imagine you have a Lambda function that processes orders, and you’re seeing a spike in InternalServerError responses. You’ve got hundreds of these functions across multiple accounts and regions. Suppose the function is triggered by an S3 event like this one:
{
  "eventVersion": "1.0",
  "eventSource": "aws:s3",
  "awsRegion": "us-east-1",
  "eventTime": "2023-10-27T10:00:00.123Z",
  "eventName": "ObjectCreated:Put",
  "userIdentity": {
    "principalId": "AROAEXAMPLEID"
  },
  "requestParameters": {
    "sourceIPAddress": "192.168.1.100"
  },
  "responseElements": {
    "x-amz-request-id": "EXAMPLE123456789",
    "x-amz-id-2": "EXAMPLE/EXAMPLE/EXAMPLE"
  },
  "s3": {
    "s3SchemaVersion": "1.0",
    "configurationId": "EXAMPLECONFIGID",
    "bucket": {
      "name": "my-order-processing-bucket",
      "ownerIdentity": {
        "principalId": "EXAMPLEBUCKETOWNERID"
      },
      "arn": "arn:aws:s3:::my-order-processing-bucket"
    },
    "object": {
      "key": "orders/new/12345.json",
      "size": 1024,
      "eTag": "EXAMPLEETAG",
      "sequencer": "EXAMPLESEQUENCER"
    }
  }
}
Here’s a Lambda function’s log output that might correspond to this event:
START RequestId: a1b2c3d4-e5f6-7890-1234-567890abcdef Version: $LATEST
2023-10-27T10:00:01.456Z a1b2c3d4-e5f6-7890-1234-567890abcdef INFO Processing S3 event for bucket: my-order-processing-bucket, object: orders/new/12345.json
2023-10-27T10:00:02.789Z a1b2c3d4-e5f6-7890-1234-567890abcdef INFO Fetching order data from DynamoDB...
2023-10-27T10:00:03.123Z a1b2c3d4-e5f6-7890-1234-567890abcdef ERROR An unexpected error occurred: Database connection timeout.
2023-10-27T10:00:03.124Z a1b2c3d4-e5f6-7890-1234-567890abcdef ERROR Traceback (most recent call last):
  File "/var/task/handler.py", line 55, in lambda_handler
    dynamodb.put_item(Item=order_data)
  File "/opt/python/lib/python3.9/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/lib/python3.9/site-packages/botocore/client.py", line 680, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ProvisionedThroughputExceededException) when calling the PutItem operation: The table is too busy.
END RequestId: a1b2c3d4-e5f6-7890-1234-567890abcdef
REPORT RequestId: a1b2c3d4-e5f6-7890-1234-567890abcdef Duration: 2988.50 ms Billed Duration: 2989 ms Memory Size: 128 MB Max Memory Used: 85 MB
To find all instances of this ProvisionedThroughputExceededException, you’d typically navigate to CloudWatch Logs, select your Lambda function’s log group, and start scrolling. With thousands of functions, that is impractical. CloudWatch Logs Insights changes this.
The core problem Insights solves is making large volumes of log data searchable without needing to pull it all into a separate system. It automatically discovers fields in your logs (for example @timestamp, @message, and @logStream) and lets you run queries directly within CloudWatch shortly after the logs are ingested. You can query across multiple log groups simultaneously, which is crucial for distributed systems like microservices running on Lambda.
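Querying many functions at once starts with knowing their log group names. The following is a minimal sketch, assuming the functions use the default /aws/lambda/ log group prefix and that `logs_client` behaves like a boto3 CloudWatch Logs client; the helper name `lambda_log_groups` is hypothetical.

```python
def lambda_log_groups(logs_client, prefix="/aws/lambda/"):
    """Return the names of all log groups that start with the given prefix.

    logs_client is assumed to behave like boto3.client("logs").
    """
    paginator = logs_client.get_paginator("describe_log_groups")
    names = []
    # describe_log_groups is paginated, so walk every page.
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        names.extend(group["logGroupName"] for group in page["logGroups"])
    return names
```

With real credentials you would call it as `lambda_log_groups(boto3.client("logs", region_name="us-east-1"))` and pass the result to a query.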
Here’s how you’d query for that specific error across all your Lambda log groups in us-east-1:
fields @timestamp, @message
| filter @message like /ProvisionedThroughputExceededException/
| sort @timestamp desc
| limit 100
This query tells CloudWatch:
fields @timestamp, @message: Show me the timestamp and the full log message.
filter @message like /ProvisionedThroughputExceededException/: Only include log lines that contain the string "ProvisionedThroughputExceededException".
sort @timestamp desc: Order the results with the most recent logs first.
limit 100: Show me at most 100 results.
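The same query can also be run programmatically. This is a rough sketch rather than a definitive implementation: it assumes `logs_client` behaves like a boto3 CloudWatch Logs client (using its `start_query` and `get_query_results` calls), and the function name `run_insights_query` and its parameters are made up for illustration.

```python
import time

QUERY = """
fields @timestamp, @message
| filter @message like /ProvisionedThroughputExceededException/
| sort @timestamp desc
| limit 100
"""


def run_insights_query(logs_client, log_groups, query, start, end, poll_seconds=1.0):
    """Start a Logs Insights query and block until it finishes.

    start and end are epoch seconds. Returns each result row as a dict.
    """
    query_id = logs_client.start_query(
        logGroupNames=log_groups,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]

    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status in ("Scheduled", "Running"):
            time.sleep(poll_seconds)  # queries are asynchronous, so poll
            continue
        if status != "Complete":
            raise RuntimeError(f"query ended with status {status}")
        # Each row arrives as a list of {"field": ..., "value": ...} pairs.
        return [
            {cell["field"]: cell["value"] for cell in row}
            for row in response["results"]
        ]
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching a real account.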
The real magic happens when you start aggregating. To see how many times this error has occurred per Lambda function over the last hour:
fields @timestamp, @message
| filter @message like /ProvisionedThroughputExceededException/
| stats count() as errorCount by bin(5m), @log
| sort errorCount desc
| limit 100
This query introduces stats and bin():
filter @message like /ProvisionedThroughputExceededException/: We still filter for the specific error. The filter has to come before stats, because once the aggregation has run, @message is no longer available to later commands.
stats count() as errorCount by bin(5m), @log: This is the aggregation. It counts occurrences (count()) and labels this count errorCount. It then groups these counts into 5-minute intervals (bin(5m)) and by log group (@log), which for Lambda maps to a single function. Grouping by @logStream instead would split the counts further, since a log stream belongs to one execution environment of a function rather than to the function as a whole.
sort errorCount desc: Order the results by the number of errors, showing the functions with the most errors first.
limit 100: Again, limit the output.
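To make the grouping concrete, here is a small local emulation of counting matching lines per 5-minute bucket and per source. It is only an illustration of the idea, not how CloudWatch implements it; the function names and the `(timestamp, source, message)` event shape are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone

BIN_SECONDS = 5 * 60  # corresponds to bin(5m)


def bin_start(ts):
    """Round an aware datetime down to the start of its 5-minute bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BIN_SECONDS, tz=timezone.utc)


def count_by_bin(events, needle):
    """Count events whose message contains needle, per (bucket, source).

    events is an iterable of (timestamp, source, message) tuples.
    """
    counts = Counter()
    for ts, source, message in events:
        if needle in message:  # keep only the matching lines
            counts[(bin_start(ts), source)] += 1
    # most_common() returns the highest counts first, like sorting descending.
    return counts.most_common()
```

Two matching lines at 10:00:03 and 10:04:59 from the same source land in the same 10:00 bucket, while a line at 10:05:00 starts a new one.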
This allows you to quickly pinpoint which functions are experiencing the most throughput issues. You can then drill down into those specific log groups or even individual log streams to investigate further.
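Insights offers a rich query language, including parse for extracting structured data from unstructured logs. For example, if your log lines contain text of the form {"orderId": "...", "status": "..."}, you could do:
fields @timestamp, @message
| parse @message '{"orderId": "*", "status": "*"}' as orderId, status
| filter status = "FAILED"
| stats count() by orderId
This would extract orderId and status from matching log lines and count occurrences of failed orders by their ID. Each * in the pattern captures the text at that position into the corresponding field name. If an entire log event is valid JSON, Insights discovers its fields automatically, so you can filter on status directly without parse.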
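The wildcard extraction can be approximated locally to see what it captures. The sketch below turns a parse-style pattern into a regular expression with one capture group per `*`; it is an approximation for illustration, and the helper names `glob_to_regex` and `parse` are hypothetical, not part of any AWS library.

```python
import re


def glob_to_regex(pattern):
    """Turn a parse-style glob into a regex with one capture group per '*'."""
    literal_parts = (re.escape(piece) for piece in pattern.split("*"))
    # Non-greedy groups stop at the next literal piece of the pattern.
    return re.compile("(.*?)".join(literal_parts))


def parse(message, pattern, names):
    """Return {name: captured text}, or None when the pattern does not match."""
    match = glob_to_regex(pattern).search(message)
    if match is None:
        return None
    return dict(zip(names, match.groups()))
```

Lines that do not match the pattern yield no fields, which is why a later filter on status silently skips them.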
The performance of Insights queries depends heavily on the amount of data scanned. While it’s designed for scale, extremely broad queries over months of data can still take time and incur costs. You’re billed based on the amount of data scanned by your queries. A common pattern is to start with a broad time range and then narrow it down or add more specific filters as you identify patterns.
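Since billing follows the data scanned, it can help to estimate what a query cost from the statistics returned with its results. This is a back-of-the-envelope sketch: the price constant is an assumed rate that you should replace with the current figure for your region, the function name is hypothetical, and treating a gigabyte as 1024³ bytes is also an assumption.

```python
PRICE_PER_GB_USD = 0.005  # assumed rate; check current pricing for your region


def estimate_query_cost(statistics, price_per_gb=PRICE_PER_GB_USD):
    """Estimate the cost in USD from a query's 'statistics' dict.

    Expects the 'bytesScanned' key that get_query_results returns.
    """
    gigabytes = statistics["bytesScanned"] / 1024 ** 3
    return gigabytes * price_per_gb
```

Comparing the estimate before and after adding a filter or shortening the time range shows how much each refinement saves.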
When you start seeing InternalServerError in your Lambda console, don’t just look at the single invocation. Use CloudWatch Logs Insights to query across all your functions. The most common reason for a surge in generic errors like InternalServerError or ProvisionedThroughputExceededException is not a code bug in one function, but a downstream dependency (like DynamoDB, RDS, or an external API) that is being overwhelmed or is experiencing its own issues. Insights lets you see this widespread impact immediately.
The next logical step after identifying a specific error pattern is to track it over time, either by creating a CloudWatch metric filter that turns matching log lines into a metric, or by adding your Insights query to a CloudWatch dashboard.
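As one possible way to set up such a metric filter, the sketch below uses the `put_metric_filter` call of a boto3-style CloudWatch Logs client. The filter name, metric name, and namespace are placeholders chosen for illustration, as is the function name.

```python
def create_throttle_metric_filter(logs_client, log_group):
    """Emit a metric data point for each log line that mentions the throttling error.

    logs_client is assumed to behave like boto3.client("logs").
    """
    logs_client.put_metric_filter(
        logGroupName=log_group,
        filterName="ddb-throughput-exceeded",  # placeholder name
        # A single unquoted term matches events containing that term.
        filterPattern="ProvisionedThroughputExceededException",
        metricTransformations=[
            {
                "metricName": "ThroughputExceededCount",  # placeholder name
                "metricNamespace": "OrderProcessing",  # placeholder namespace
                "metricValue": "1",  # add 1 per matching log event
                "defaultValue": 0,  # report 0 when nothing matches
            }
        ],
    )
```

A metric filter applies to one log group, so you would call this once per function you want to track.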