AWS X-Ray lets you trace requests as they travel through your distributed applications, giving you a visual map of your system’s performance and pinpointing bottlenecks.
Imagine you have a simple API Gateway endpoint that triggers a Lambda function, which in turn calls another Lambda function. Without tracing, if that second Lambda function is slow, you’d only see the total latency on the API Gateway. With X-Ray, you’d see the breakdown: API Gateway latency, the first Lambda’s execution time, and the second Lambda’s execution time, all laid out in a waterfall.
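Here’s a basic setup. First, enable active tracing on your API Gateway stage. The snippet below is an illustrative configuration rather than a literal API Gateway schema; the part that matters is the trace setting. X-Ray active tracing is offered as a stage-level option on REST APIs, so check that your API type supports it.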
{
  "httpApi": {
    "route": "GET /hello",
    "routeSettings": {
      "authorizationType": "NONE",
      "integration": {
        "integrationType": "AWS_PROXY",
        "integrationUri": "arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/arn:aws:lambda:us-east-1:123456789012:function:MyFirstLambda/invocations",
        "payloadFormatVersion": "2.0"
      }
    },
    "trace": {
      "enabled": true
    }
  }
}
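For a REST API, the tracing flag lives on the stage, and one way to set it from code is boto3's update_stage call. The following is a minimal sketch, not the only way to do it: the helper name enable_stage_tracing and the placeholder values abc123 and prod are assumptions made for illustration.

```python
def enable_stage_tracing(apigw_client, rest_api_id, stage_name):
    """Turn on X-Ray active tracing for one REST API stage.

    apigw_client is expected to behave like boto3.client("apigateway").
    """
    return apigw_client.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=[
            # tracingEnabled is the stage property behind the X-Ray toggle
            {"op": "replace", "path": "/tracingEnabled", "value": "true"},
        ],
    )


# Usage (needs AWS credentials; "abc123" and "prod" are placeholders):
#   import boto3
#   enable_stage_tracing(boto3.client("apigateway"), "abc123", "prod")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.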
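Next, make sure the X-Ray SDK for Python (the aws-xray-sdk package, which you bundle with your deployment package or a layer) is available to any function that records custom subsegments, and set each function’s tracing mode to Active.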
Lambda Function 1 (MyFirstLambda):
import json
import traceback

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3 included) so AWS SDK calls are traced
# and the trace context is propagated to downstream services
patch_all()

lambda_client = boto3.client('lambda')

def lambda_handler(event, context):
    # Lambda creates the function's segment; this adds a custom subsegment to it
    with xray_recorder.in_subsegment('CallSecondLambda') as subsegment:
        response = lambda_client.invoke(
            FunctionName='MySecondLambda',
            Payload=json.dumps({"message": "Hello from Lambda 1"}),
            InvocationType='RequestResponse'
        )
        # Record any errors or metadata
        if response['StatusCode'] >= 300:
            # add_exception takes the exception plus a stack trace
            subsegment.add_exception(
                Exception(f"Lambda 2 returned status {response['StatusCode']}"),
                traceback.extract_stack()
            )
        else:
            subsegment.put_annotation('SecondLambdaStatus', str(response['StatusCode']))
        # Exceptions raised inside the with block are recorded on the
        # subsegment by the context manager before they propagate
    return {
        'statusCode': 200,
        'body': json.dumps('Successfully invoked Lambda 2!')
    }
Lambda Function 2 (MySecondLambda):
import json
import time

def lambda_handler(event, context):
    # Simulate some work
    time.sleep(2)
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda 2!')
    }
In your Lambda function’s configuration, under "Monitoring and operations tools," ensure "AWS X-Ray tracing" is set to "Active."
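The same setting can be applied from code through the Lambda API's TracingConfig field. This is a minimal sketch under assumptions: the helper name enable_active_tracing is illustrative, and it takes any object that behaves like boto3.client("lambda"). Each function's execution role also needs permission to send trace data (the AWSXRayDaemonWriteAccess managed policy covers this).

```python
def enable_active_tracing(lambda_client, function_names):
    """Set the tracing mode to Active on each named function.

    Returns the list of function names that were updated.
    """
    updated = []
    for name in function_names:
        lambda_client.update_function_configuration(
            FunctionName=name,
            TracingConfig={"Mode": "Active"},
        )
        updated.append(name)
    return updated


# Usage (needs AWS credentials):
#   import boto3
#   enable_active_tracing(boto3.client("lambda"),
#                         ["MyFirstLambda", "MySecondLambda"])
```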
When a request comes through API Gateway, it initiates a trace. The trace context travels in the X-Amzn-Trace-Id header and is passed down to the first Lambda function automatically. With active tracing enabled, the Lambda service records the segment for the function’s execution, and the X-Ray SDK inside the function adds subsegments to it. When MyFirstLambda invokes MySecondLambda through a patched boto3 client, the SDK propagates the trace context for you. The same patching (for example, calling patch_all() from aws_xray_sdk.core) is what hooks the SDK into boto3, so that calls to other AWS services such as S3, DynamoDB, or another Lambda function get their own subsegments.
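To make the propagated context concrete, here is a small sketch that splits a trace header into its fields. The header value is a sample in the documented Root/Parent/Sampled format, and parse_trace_header is a hypothetical helper, not part of the X-Ray SDK.

```python
def parse_trace_header(header):
    """Split an X-Ray trace header into its key/value fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip fragments that are not key=value pairs
            fields[key] = value
    return fields


# Sample header value; real ones are generated per request
example = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
parsed = parse_trace_header(example)

print(parsed["Root"])            # trace ID shared by every segment in the trace
print(parsed["Parent"])          # ID of the segment that made the call
print(parsed["Sampled"] == "1")  # whether this request is being recorded
```

Inside a Lambda function, the same string is exposed through the _X_AMZN_TRACE_ID environment variable, which is where the SDK picks it up.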
The magic happens when you start adding custom subsegments like CallSecondLambda in the example. This allows you to group related work within a single function’s execution. You can also add annotations (key-value pairs for filtering) and metadata (arbitrary JSON for richer detail) to these segments and subsegments.
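To show where annotations and metadata end up, the sketch below builds a dictionary shaped like the subsegment document X-Ray stores. It is an illustration only: in a real function the SDK assembles this for you through put_annotation and put_metadata, and build_subsegment is a hypothetical helper with made-up sample values.

```python
import json
import os
import time


def build_subsegment(name, trace_id, parent_id, start, end,
                     annotations=None, metadata=None):
    """Assemble a dict shaped like an X-Ray subsegment document."""
    for key, value in (annotations or {}).items():
        # Annotations are indexed for search, so values must be simple scalars
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"annotation {key!r} must be a string, number, or boolean")
    doc = {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": trace_id,
        "parent_id": parent_id,
        "type": "subsegment",
        "start_time": start,
        "end_time": end,
    }
    if annotations:
        doc["annotations"] = annotations
    if metadata:
        doc["metadata"] = metadata  # arbitrary JSON, not indexed
    return doc


now = time.time()
doc = build_subsegment(
    "CallSecondLambda",
    trace_id="1-5759e988-bd862e3fe1be46a994272793",  # sample value
    parent_id="53995c3f42cd8ad8",                    # sample value
    start=now,
    end=now + 2.0,
    annotations={"SecondLambdaStatus": "200"},
    metadata={"default": {"payload": {"message": "Hello from Lambda 1"}}},
)
print(json.dumps(doc, indent=2))
```

Because annotations are indexed, a filter expression such as annotation.SecondLambdaStatus = "200" can find this trace in the console, whereas metadata is only visible once you open the trace.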
The most surprising thing about X-Ray tracing is its ability to automatically instrument many AWS SDK calls without per-call code changes, provided the X-Ray SDK has patched the client library (a single patch_all() call at import time is usually enough) and tracing is enabled. You don’t need to wrap every boto3 call in a try-except block with X-Ray SDK calls; it often just works. This automatic instrumentation is crucial for understanding latency in services like S3, DynamoDB, and SQS.
Here’s what a trace might look like in the X-Ray console. You’ll see a service map showing API Gateway calling MyFirstLambda, which then calls MySecondLambda. Each node in the map represents a service, and the lines between them show the requests. Clicking on a node reveals a waterfall view of the trace, detailing the duration of each segment and subsegment, making it easy to spot which function or downstream service is causing delays.
One of the most powerful, yet often overlooked, features is the sampling configuration. By default, X-Ray does not trace every request, which keeps cost and performance overhead down: the default rule records the first request each second plus 5% of any additional requests. You can configure sampling rules so that critical requests are always traced. Rules match on request attributes such as service name, HTTP method, URL path, and host, and the sampling decision is made when the request arrives, so a rule cannot key off the response status. For instance, you can set a rule to trace 100% of requests to a critical endpoint, so that failures there are always captured when you need to debug a production incident.
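As a rough illustration of how a sampling rule behaves, the sketch below defines a rule in the shape accepted by boto3's create_sampling_rule and models one second of traffic under a reservoir plus a fixed rate. The rule name, path, and numbers are hypothetical, and sampled_count is a simplified model rather than X-Ray's actual implementation.

```python
import random

# Hypothetical rule: trace every POST to a critical path
checkout_rule = {
    "RuleName": "checkout-always",
    "Priority": 10,          # lower numbers are evaluated first
    "FixedRate": 1.0,        # fraction sampled once the reservoir is used up
    "ReservoirSize": 5,      # requests per second sampled before the rate applies
    "ServiceName": "*",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "POST",
    "URLPath": "/checkout*",
    "ResourceARN": "*",
    "Version": 1,
}
# To register it (needs AWS credentials):
#   import boto3
#   boto3.client("xray").create_sampling_rule(SamplingRule=checkout_rule)


def sampled_count(requests_in_second, reservoir_size, fixed_rate, rng):
    """Simplified model of how many requests one rule samples in a second."""
    from_reservoir = min(requests_in_second, reservoir_size)
    remainder = requests_in_second - from_reservoir
    from_rate = sum(1 for _ in range(remainder) if rng.random() < fixed_rate)
    return from_reservoir + from_rate


rng = random.Random(0)
# Reservoir of 1 and a 5% rate: only a handful of 100 requests are traced
print(sampled_count(100, 1, 0.05, rng))
# The hypothetical checkout rule traces all of them
print(sampled_count(100, checkout_rule["ReservoirSize"], checkout_rule["FixedRate"], rng))
```

The reservoir guarantees a minimum number of traces per second even at low traffic, while the fixed rate controls how much of the remaining volume is recorded.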
Once you have tracing set up, the next logical step is to implement distributed fault injection to test how your system behaves under failure conditions.