For an SQS-triggered Lambda function, the Dead Letter Queue (DLQ) isn’t something Lambda writes failed invocations to; it’s a queue where SQS parks messages that were received a configured number of times without ever being processed successfully.
Let’s see it in action. Imagine you have a Lambda function that’s supposed to process SQS messages.
{
  "Records": [
    {
      "messageId": "abcdef12-3456-7890-abcd-ef1234567890",
      "receiptHandle": "some-receipt-handle",
      "body": "{\"orderId\": \"12345\", \"status\": \"processing\"}",
      "attributes": {
        "ApproximateReceiveCount": "5",
        "SentTimestamp": "1678886400000",
        "SenderId": "...",
        "ApproximateFirstReceiveTimestamp": "1678886405000"
      },
      "messageAttributes": {},
      "md5OfBody": "...",
      "eventSource": "aws:sqs",
      "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:MyProcessingQueue",
      "awsRegion": "us-east-1"
    }
  ]
}
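If your Lambda function consistently fails to process this message (e.g., it crashes, times out, or returns an error), the message becomes visible in the queue again once its visibility timeout expires, and it is delivered again. SQS has no built-in retry limit: without a redrive policy, the message keeps being redelivered until the queue’s retention period expires (4 days by default, configurable up to 14 days). With a redrive policy, you choose a maxReceiveCount between 1 and 1,000, and once a message has been received more than that many times without being deleted, SQS moves it to the configured Dead Letter Queue.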
Here’s the typical setup:
- Source Queue: An SQS queue (e.g., MyProcessingQueue) that receives messages.
- Lambda Function: A Lambda function configured to be triggered by the source queue.
- DLQ: Another SQS queue (e.g., MyProcessingDLQ) where undeliverable messages from the source queue will be sent.
You configure the DLQ on the source SQS queue, not directly on the Lambda function. The SQS queue’s RedrivePolicy is what specifies the DLQ and the maximum receive count before redrive.
{
  "redrivePolicy": {
    "maxReceiveCount": 5,
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:MyProcessingDLQ"
  }
}
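As a rough illustration of applying that policy programmatically, here is a minimal Python sketch. It assumes a boto3 SQS client is passed in; the helper names `build_redrive_attributes` and `attach_dlq` are made up for this example. The one detail worth noticing is that the SQS API takes the RedrivePolicy attribute as a JSON-encoded string rather than a nested object.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count):
    """Build the Attributes mapping for an SQS SetQueueAttributes call."""
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("maxReceiveCount must be between 1 and 1000")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the whole policy serialized as a JSON string.
    return {"RedrivePolicy": json.dumps(policy)}


def attach_dlq(sqs_client, source_queue_url, dlq_arn, max_receive_count=5):
    """Attach the DLQ to the source queue (hypothetical helper)."""
    sqs_client.set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes=build_redrive_attributes(dlq_arn, max_receive_count),
    )
```

With real credentials this would be called as `attach_dlq(boto3.client("sqs"), source_queue_url, dlq_arn)`, where the queue URL and ARN are your own.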
Lambda polls MyProcessingQueue and invokes your function with the messages it receives. If the invocation fails, Lambda doesn’t delete those messages from MyProcessingQueue, so they become visible again once the visibility timeout expires and can be received again. Each receive increments the message’s ApproximateReceiveCount. Once a message has exceeded the maxReceiveCount (5 in the example above), SQS automatically moves it to MyProcessingDLQ.
The primary problem this solves is preventing an infinite loop of failed processing and ensuring that problematic messages don’t get lost forever. Instead, they are isolated in the DLQ, allowing you to inspect them, understand why they failed, and potentially reprocess them manually or by fixing the underlying issue and replaying them.
The "event" that ends up in the DLQ is the original SQS message: its body along with its attributes. You can then inspect MyProcessingDLQ manually, or set up another process that consumes it, to see what failed.
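One way to replay messages after fixing the underlying issue is sketched below. It assumes a boto3 SQS client and queue URLs supplied by the caller; `redrive_messages` is a hypothetical helper name. It copies only the message body and does not carry message attributes over, so treat it as a starting point rather than a complete tool.

```python
def redrive_messages(sqs_client, dlq_url, source_queue_url, max_messages=10):
    """Move one batch of messages from the DLQ back to the source queue.

    Returns the number of messages moved.
    """
    response = sqs_client.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=min(max_messages, 10),  # one receive returns at most 10
        WaitTimeSeconds=1,
    )
    moved = 0
    for message in response.get("Messages", []):
        # Resend the original body to the source queue.
        sqs_client.send_message(
            QueueUrl=source_queue_url,
            MessageBody=message["Body"],
        )
        # Delete from the DLQ only after the resend has succeeded.
        sqs_client.delete_message(
            QueueUrl=dlq_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        moved += 1
    return moved
```

Calling it in a loop until it returns 0 would drain the DLQ. A replayed message starts with a fresh receive count on the source queue.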
The key levers you control are:
- maxReceiveCount on the source SQS queue: How many times SQS will deliver the message to Lambda before giving up and sending it to the DLQ. This is crucial. Too low, and transient Lambda failures might send messages to the DLQ prematurely. Too high, and messages might be retried for too long, potentially impacting downstream systems or running up unnecessary SQS and Lambda costs.
- The DLQ itself: It’s just another SQS queue. You can configure its visibility timeout, retention period, and even its own DLQ if you want to go really deep.
- Lambda’s error handling: While the DLQ is an SQS feature, your Lambda function’s behavior dictates when messages become undeliverable. If your Lambda function throws an unhandled exception, or if it times out, Lambda will not delete the message from the source queue, allowing SQS to manage retries and eventual redrive.
When Lambda is configured as an event source for SQS, it polls messages. If your Lambda function successfully processes a message (i.e., it returns without error or exception), Lambda will then signal SQS to delete that message from the source queue. If your Lambda function fails (throws an error), Lambda does not signal SQS to delete the message. SQS then makes the message visible again after its visibility timeout expires, and it’s available for another receive attempt. This retry cycle continues until the maxReceiveCount is hit, at which point SQS moves the message to the DLQ.
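The success and failure paths described above can be sketched as a minimal Python handler. `process_order` is a hypothetical stand-in for real business logic, and the sketch assumes the default behavior in which an unhandled exception fails the whole batch, so every message in that batch returns to the queue.

```python
import json


def process_order(order):
    """Hypothetical business logic; raises on a malformed order."""
    if "orderId" not in order:
        raise ValueError("order is missing orderId")


def handler(event, context):
    """Process each SQS record; any exception fails the invocation."""
    for record in event["Records"]:
        order = json.loads(record["body"])
        # An exception here propagates out of the handler, so Lambda does
        # not delete the messages and SQS will make them visible again.
        process_order(order)
    # Returning normally tells Lambda the batch succeeded; it then
    # deletes the messages from the source queue.
    return {"processed": len(event["Records"])}
```

Catching the exception and returning normally would count as success, and the message would be deleted without ever reaching the DLQ.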
The real magic is that the deadLetterTargetArn is configured on the source SQS queue’s redrive policy, not directly on the Lambda function. Lambda’s role is simply to consume messages from the source queue. If its consumption (execution) fails repeatedly for a given message, SQS, observing the repeated failed deliveries and the increasing ApproximateReceiveCount, eventually enforces the redrive policy.
Many people think you configure the DLQ on the Lambda function itself. This is incorrect for SQS event sources. A function-level DLQ applies only to asynchronous invocations, such as those triggered by SNS or S3; synchronous callers like API Gateway simply receive the error. For asynchronous invocations, if the function still fails after Lambda’s own retries, Lambda sends the original event payload to the Lambda-configured DLQ. For SQS, the DLQ is managed by SQS based on delivery attempts to Lambda.
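For contrast, a function-level DLQ for asynchronous invocations is set on the function’s configuration. The sketch below assumes a boto3 Lambda client is passed in; `set_function_dlq` is a hypothetical helper name, and the target ARN may point at an SQS queue or an SNS topic.

```python
def set_function_dlq(lambda_client, function_name, target_arn):
    """Point a function's async-invocation DLQ at the given ARN."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        # Used only for asynchronous invocations, not for SQS event sources.
        DeadLetterConfig={"TargetArn": target_arn},
    )
```

With real credentials this would be called as `set_function_dlq(boto3.client("lambda"), function_name, target_arn)`.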
The next error you’ll hit is usually a permissions one. The Lambda execution role needs sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:GetQueueAttributes on the source queue, or the event source mapping can’t poll it. It doesn’t need sqs:SendMessage on the DLQ for this setup, because SQS, not Lambda, moves the message there. That permission matters only when the DLQ is configured on the function itself for asynchronous invocations.
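As a sketch of the permissions an SQS event source typically needs, the following Python builds an IAM policy document for the execution role. The helper name `source_queue_policy` is made up for this example, and the statement covers only the source queue, since SQS itself performs the move to the DLQ.

```python
import json


def source_queue_policy(source_queue_arn):
    """Build an IAM policy document letting Lambda poll the source queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                ],
                "Resource": source_queue_arn,
            }
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:sqs:us-east-1:123456789012:MyProcessingQueue"
    print(json.dumps(source_queue_policy(arn), indent=2))
```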