A Lambda circuit breaker doesn’t actually break anything; it’s a pattern of gracefully degrading service when a dependency fails, preventing a cascading failure.
Let’s say you have a Lambda function, OrderProcessor, that needs to call a downstream service, InventoryService, to check stock levels before fulfilling an order. If InventoryService is slow or returns errors, OrderProcessor could get bogged down with requests that are doomed to fail, eventually exhausting its own resources or even impacting other parts of your system.
Here’s OrderProcessor without a circuit breaker, just making a direct call to InventoryService:
import json
import os

import boto3

inventory_client = boto3.client('lambda')
inventory_service_function_name = os.environ.get('INVENTORY_SERVICE_FUNCTION_NAME', 'InventoryService')

def lambda_handler(event, context):
    order_details = event['order']
    try:
        # Direct, unprotected call to downstream service
        response = inventory_client.invoke(
            FunctionName=inventory_service_function_name,
            Payload=json.dumps({'item_id': order_details['item_id']})
        )
        inventory_status = json.loads(response['Payload'].read().decode('utf-8'))
        if inventory_status['stock'] > 0:
            # Proceed with order fulfillment
            print(f"Stock available for {order_details['item_id']}")
            return {'status': 'ORDER_FULFILLED', 'order_id': order_details['order_id']}
        else:
            print(f"Out of stock for {order_details['item_id']}")
            return {'status': 'OUT_OF_STOCK', 'order_id': order_details['order_id']}
    except Exception as e:
        print(f"Error calling inventory service: {e}")
        # This is where things go wrong: we just return an error,
        # but the upstream system might keep retrying, overwhelming us.
        return {'status': 'INVENTORY_CHECK_FAILED', 'order_id': order_details['order_id']}
The problem is that if InventoryService starts failing (e.g., returning 5xx errors or timing out), OrderProcessor will keep making those calls. If OrderProcessor is handling thousands of requests per second, and each call to InventoryService takes 30 seconds to time out, OrderProcessor can quickly exhaust its available concurrency, so that all order processing fails, not just the orders that depend on a failing inventory check.
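To put rough numbers on this, here is a back-of-the-envelope calculation based on Little's law (average in-flight requests = arrival rate × time per request). The request rate, the 200 ms healthy latency, and the concurrency limit of 1,000 are illustrative assumptions, not measurements; check the actual quota for your account.

```python
def required_concurrency(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average in-flight requests = arrival rate * time per request."""
    return requests_per_second * seconds_per_request


# Illustrative assumptions, not values from a real system
REQUESTS_PER_SECOND = 1_000
CONCURRENCY_LIMIT = 1_000  # assumed concurrency quota; yours may differ

healthy = required_concurrency(REQUESTS_PER_SECOND, 0.2)    # each order takes about 200 ms
degraded = required_concurrency(REQUESTS_PER_SECOND, 30.0)  # each order waits 30 s on a timeout

print(f"healthy:  {healthy:,.0f} concurrent executions needed")
print(f"degraded: {degraded:,.0f} concurrent executions needed")
print(f"exceeds assumed limit when degraded: {degraded > CONCURRENCY_LIMIT}")
```

Under these assumptions the same traffic needs roughly 150 times more concurrency once every call waits for the timeout, which is why unrelated orders start getting throttled too.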
A circuit breaker pattern helps here. It sits between OrderProcessor and InventoryService. It monitors calls to InventoryService. If too many calls fail within a certain period, the circuit breaker "opens," and subsequent calls to InventoryService are immediately rejected without actually being sent. This gives InventoryService time to recover and prevents OrderProcessor from being overwhelmed by failed requests.
Here’s how you might implement a circuit breaker using a library like pybreaker within OrderProcessor:
First, install the library in your Lambda deployment package:
pip install pybreaker
Then, modify OrderProcessor:
import json
import os

import boto3
import pybreaker
from botocore.exceptions import ConnectTimeoutError, ReadTimeoutError

inventory_client = boto3.client('lambda')
inventory_service_function_name = os.environ.get('INVENTORY_SERVICE_FUNCTION_NAME', 'InventoryService')

# --- Circuit Breaker Configuration ---

class InventoryUnavailableError(Exception):
    """The call completed, but the downstream service reported that it is unhealthy."""

# Decide whether an exception should count against the breaker
def is_failure(exception):
    # Timeouts, Lambda service errors, and "unavailable" responses count as failures
    if isinstance(exception, (ConnectTimeoutError, ReadTimeoutError,
                              inventory_client.exceptions.ServiceException,
                              InventoryUnavailableError)):
        return True
    # Anything else (e.g., a malformed request) is treated as a caller-side
    # problem: it is still raised, but it does not trip the breaker.
    return False

# pybreaker has no built-in logging listener, so define a small one
class LogListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        print(f"Circuit breaker state change: {old_state.name} -> {new_state.name}")

# Configure the circuit breaker:
# - fail_max: Number of consecutive failures before opening the circuit
# - reset_timeout: Seconds to wait before attempting a half-open state
# - exclude: Exception types or callables that should NOT count as failures
inventory_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=60,  # Try to reset after 60 seconds
    exclude=[lambda exc: not is_failure(exc)],  # Only is_failure() exceptions count
    state_storage=pybreaker.CircuitMemoryStorage(pybreaker.STATE_CLOSED),  # Default, but explicit
    throw_new_error_on_trip=True,  # The call that trips the breaker raises CircuitBreakerError
    listeners=[LogListener()]  # Log state changes
)

# Decorate the function that calls the downstream service
@inventory_breaker
def call_inventory_service(item_id):
    try:
        response = inventory_client.invoke(
            FunctionName=inventory_service_function_name,
            Payload=json.dumps({'item_id': item_id}),
            InvocationType='RequestResponse',  # Wait for the response (this is the default)
            # Note: invoke() accepts no per-call timeout; there is no
            # FunctionResponseTimeout argument. The timeouts of the underlying
            # HTTP request are set on the client (botocore Config:
            # connect_timeout, read_timeout). If the read timeout elapses,
            # boto3 raises a ReadTimeoutError.
        )
        payload_bytes = response['Payload'].read()
        # A crash or timeout inside the downstream function does not raise here;
        # Lambda reports it through the FunctionError field of the response
        if response.get('FunctionError'):
            raise InventoryUnavailableError(f"Downstream Lambda reported error: {response['FunctionError']}")
        inventory_status = json.loads(payload_bytes.decode('utf-8'))
        # The service can also signal overload in the payload itself
        if inventory_status.get('error') == 'TEMPORARILY_UNAVAILABLE':
            raise InventoryUnavailableError("Inventory service reported TEMPORARILY_UNAVAILABLE")
        return inventory_status
    except Exception as e:
        # Log and re-raise so the pybreaker decorator sees the exception;
        # it counts toward fail_max only if is_failure() returns True
        print(f"Error during downstream Lambda invocation: {e}")
        raise

def lambda_handler(event, context):
    order_details = event['order']
    try:
        # Now, call the decorated function
        inventory_status = call_inventory_service(order_details['item_id'])
        if inventory_status['stock'] > 0:
            print(f"Stock available for {order_details['item_id']}")
            return {'status': 'ORDER_FULFILLED', 'order_id': order_details['order_id']}
        else:
            print(f"Out of stock for {order_details['item_id']}")
            return {'status': 'OUT_OF_STOCK', 'order_id': order_details['order_id']}
    except pybreaker.CircuitBreakerError as e:
        # The circuit is open, we're failing fast.
        print(f"Circuit breaker is open. Downstream service unavailable. Error: {e}")
        return {'status': 'DOWNSTREAM_UNAVAILABLE', 'order_id': order_details['order_id']}
    except Exception as e:
        # Other unexpected errors, or errors that the breaker didn't catch as failures
        print(f"An unexpected error occurred: {e}")
        return {'status': 'PROCESSING_ERROR', 'order_id': order_details['order_id']}
The pybreaker library handles the state transitions. Decorating call_inventory_service wraps it in the breaker, which counts consecutive failures; a successful call resets the count. Once fail_max (here, 5) consecutive calls have failed, the breaker "opens." For the next reset_timeout seconds (here, 60), any further call to call_inventory_service immediately raises pybreaker.CircuitBreakerError without executing the inventory_client.invoke code. After the timeout, the breaker enters a "half-open" state and lets one trial call through. If that call succeeds, the breaker "closes"; if it fails, the breaker "opens" again for another reset_timeout period.
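The transitions described above can be sketched without any library. The following is a minimal, single-threaded illustration of the closed / open / half-open state machine, not pybreaker's actual implementation; the names SimpleBreaker and CircuitOpenError are hypothetical. One deliberate simplification: the call that trips the breaker re-raises the original exception rather than a breaker-specific error.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleBreaker:
    def __init__(self, fail_max=5, reset_timeout=60, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.state = "closed"
        self.fail_count = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: the protected function is never invoked
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half-open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.fail_count += 1
        # A failed trial call reopens immediately; otherwise open at fail_max
        if self.state == "half-open" or self.fail_count >= self.fail_max:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success closes the circuit and resets the consecutive-failure count
        self.fail_count = 0
        self.state = "closed"
```

A caller would wrap the risky operation as breaker.call(check_stock, item_id) and handle CircuitOpenError the same way the handler above handles pybreaker.CircuitBreakerError. A production breaker would also need locking if it is shared between threads.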
The most surprising thing about circuit breakers is that their primary benefit isn’t just handling failures, but preventing them from propagating and causing even greater system instability. They act as a pressure release valve for your services.
Here’s a simplified view of InventoryService that OrderProcessor might be calling:
# InventoryService Lambda Function (simplified)
import json
import time
import random

def lambda_handler(event, context):
    item_id = event.get('item_id')
    if not item_id:
        return {'error': 'MISSING_ITEM_ID'}

    # Simulate intermittent failures
    if random.random() < 0.3:  # 30% chance of failure
        print(f"Simulating failure for item: {item_id}")
        # This could be a timeout, a connection error, or a specific error code
        # For a true network failure, the boto3 invoke might raise an exception.
        # Here, we'll simulate a "service error" that might be returned in the payload.
        return {'error': 'TEMPORARILY_UNAVAILABLE', 'message': 'Service is overloaded'}

    # Simulate slow responses
    if random.random() < 0.2:  # 20% chance of slow response
        sleep_time = random.uniform(10, 25)  # Sleep for 10-25 seconds
        print(f"Simulating slow response for item: {item_id} (sleeping {sleep_time:.2f}s)")
        time.sleep(sleep_time)

    # Simulate successful response
    stock_level = random.randint(0, 100)
    print(f"Stock for {item_id}: {stock_level}")
    return {'item_id': item_id, 'stock': stock_level}
When InventoryService starts failing, OrderProcessor’s call_inventory_service starts raising exceptions, and pybreaker counts them. After 5 consecutive failures, the breaker opens. From then on, OrderProcessor’s lambda_handler catches pybreaker.CircuitBreakerError and returns a DOWNSTREAM_UNAVAILABLE status immediately. This is much better than waiting for boto3.client('lambda').invoke to time out, which can take many seconds per failed request and can push OrderProcessor into its own timeout or its concurrency limits.
InvocationType='RequestResponse' is what makes the call synchronous: invoke waits for the downstream Lambda to complete and returns its payload. It is the default, but stating it explicitly documents the intent. If you used Event (asynchronous) or DryRun (validation only), you wouldn’t get the result needed to check for success or failure in the same invocation.
The FunctionResponseTimeout mentioned in the invoke call’s comments is a subtle point: invoke accepts no such parameter, and it has no per-call timeout argument at all. The timeout that matters belongs to the underlying HTTP request to the Lambda service, and it is configured on the client (botocore’s Config takes connect_timeout and read_timeout). If no response arrives within the read timeout, for example because the downstream function is slow or the network is degraded, boto3 raises botocore.exceptions.ReadTimeoutError. A downstream function that exceeds its own configured timeout is a different case: the Lambda service terminates it, and invoke returns normally with FunctionError set in the response, which is why the code checks that field. Either way, the resulting exception passes through the breaker and can be counted as a failure.
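The is_failure function is where you define what constitutes a "failure" for the circuit breaker. pybreaker only consults it if it is wired into the breaker, for example through the exclude argument, which accepts exception types or callables; an excluded exception is still raised to the caller but does not increment the failure counter. It’s important to be precise here. You might want to distinguish between transient network errors (which should trip the breaker) and application-level errors that are expected but mean a specific request can’t be fulfilled (which might not trip the breaker, or might be handled differently).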
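One way to make that distinction explicit is to classify each invoke result before deciding whether to raise. The sketch below uses the two error codes returned by the simplified InventoryService in this article (TEMPORARILY_UNAVAILABLE and MISSING_ITEM_ID); the function name and the three category labels are hypothetical.

```python
import json


def classify_invoke_result(function_error, payload_bytes):
    """Return 'ok', 'dependency_failure' (should trip), or 'request_error' (should not)."""
    if function_error:
        # Lambda sets FunctionError when the downstream function crashed or timed out
        return "dependency_failure"

    body = json.loads(payload_bytes.decode("utf-8"))
    error = body.get("error")
    if error is None:
        return "ok"
    if error == "TEMPORARILY_UNAVAILABLE":
        # The dependency itself says it is overloaded
        return "dependency_failure"
    # Anything else (e.g. MISSING_ITEM_ID) is a problem with this one request
    return "request_error"
```

Only the dependency_failure category would be turned into an exception that is_failure accepts; a request_error says nothing about the health of InventoryService, so counting it would open the circuit for the wrong reason.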
One critical aspect of circuit breakers, especially in distributed systems like AWS Lambda, is managing their state. pybreaker defaults to in-memory storage (CircuitMemoryStorage), so the breaker’s state (open, closed, half-open) lives inside a single Lambda execution environment. It survives warm invocations of that environment but is lost when the environment is recycled, and each concurrently running environment keeps its own independent failure count. For a Lambda function that is invoked infrequently or runs at low concurrency, this might be acceptable. However, if you need the circuit breaker’s state to be shared across environments and to persist across invocations for the same downstream dependency, you need a shared state mechanism. pybreaker ships a Redis-backed storage (pybreaker.CircuitRedisStorage); DynamoDB or a shared S3 object would require a custom storage class. For example, InventoryService might be down for an hour, and you want every OrderProcessor instance to remember that for the entire hour, not just for the lifetime of one execution environment.
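To illustrate the idea of shared state, here is a minimal sketch in which the breaker keeps nothing in memory and reads its state from an external key-value store on every decision. A plain dict stands in for Redis or DynamoDB; the class name, the record layout, and the method names are all hypothetical and are not part of pybreaker.

```python
import time


class SharedStateBreaker:
    """Breaker whose state lives in an external, dict-like key-value store."""

    def __init__(self, store, key, fail_max=5, reset_timeout=60, clock=time.time):
        self.store = store
        self.key = key  # one record per downstream dependency
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        # Wall-clock time, because the timestamp is compared across processes
        self.clock = clock

    def _load(self):
        return self.store.get(self.key, {"failures": 0, "opened_at": None})

    def allow_request(self):
        state = self._load()
        if state["opened_at"] is None:
            return True
        # Open: allow trial calls only once reset_timeout has elapsed
        return self.clock() - state["opened_at"] >= self.reset_timeout

    def record_failure(self):
        state = self._load()
        state["failures"] += 1
        if state["failures"] >= self.fail_max:
            state["opened_at"] = self.clock()
        self.store[self.key] = state

    def record_success(self):
        self.store[self.key] = {"failures": 0, "opened_at": None}
```

Two breaker instances built on the same store see the same state, which is the property the in-memory default lacks. The sketch glosses over two things a real store would have to address: the read-modify-write in record_failure is not atomic, and after the timeout every instance is allowed a trial call rather than exactly one.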
The next logical step after implementing a circuit breaker is to consider how to handle the DOWNSTREAM_UNAVAILABLE response. This might involve retrying the order later, notifying a human operator, or queueing the request for asynchronous processing once the dependency is restored.
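As one possible shape for that handling, the sketch below retries an order a few times with growing delays and then escalates to a person. The function name, the status strings, the delay schedule, and the enqueue_retry and alert callables are hypothetical; in a real deployment they might be backed by a queue and a notification service.

```python
def handle_unavailable(order, attempt, enqueue_retry, alert, max_attempts=3, base_delay=30):
    """Decide what to do with an order whose inventory check failed fast.

    attempt is the number of retries already made for this order (0 for the first failure).
    """
    if attempt < max_attempts:
        delay = base_delay * (2 ** attempt)  # 30 s, 60 s, 120 s with the defaults
        enqueue_retry(order, delay)
        return {
            "status": "QUEUED_FOR_RETRY",
            "order_id": order["order_id"],
            "retry_in_seconds": delay,
        }
    # Retries exhausted: hand the order to a human instead of looping forever
    alert(f"Order {order['order_id']} still blocked after {attempt} attempts")
    return {"status": "NEEDS_ATTENTION", "order_id": order["order_id"]}
```

The handler would call this from the branch that catches pybreaker.CircuitBreakerError, passing along an attempt counter carried in the event.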