Provisioned Concurrency for AWS Lambda isn’t just about eliminating cold starts; it’s a mechanism to guarantee a predictable execution environment, even when your function is idle.

Let’s see it in action. Imagine you have a critical API endpoint that needs to respond within 50ms, 99% of the time. You’ve written your Lambda function, and it performs well. But sometimes, the first request after a period of inactivity takes seconds, not milliseconds.

Here’s a function that’s not using Provisioned Concurrency:

import time
import json

def lambda_handler(event, context):
    start_time = time.time()
    # Simulate some work
    time.sleep(0.1) 
    end_time = time.time()
    duration = (end_time - start_time) * 1000 # milliseconds

    print(f"Execution duration: {duration:.2f} ms")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Hello from Lambda!',
            'execution_time_ms': f'{duration:.2f}'
        })
    }

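If you invoke this function repeatedly, the duration printed by the handler stays close to 100 ms (the 0.1 s sleep plus a small overhead) on both cold and warm starts, because the timer only runs inside the handler. The cold start penalty shows up elsewhere: in the Init Duration field of the REPORT line in CloudWatch Logs, and in the end-to-end latency the caller sees. On the first request to a new execution environment, that extra latency can range from a few hundred milliseconds to several seconds, depending on the runtime and package size.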
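Because a timer inside the handler cannot observe initialization, a common way to spot a freshly created environment is a module-level flag. The following is a minimal sketch of that pattern, not part of the original function; the names `_is_cold_start` and `_INIT_FINISHED_AT` are illustrative.

```python
import json
import time

# Module-level code runs once per execution environment, during initialization.
_INIT_FINISHED_AT = time.time()
_is_cold_start = True


def lambda_handler(event, context):
    global _is_cold_start
    first_invocation = _is_cold_start
    _is_cold_start = False  # every later invocation in this environment is warm

    return {
        "statusCode": 200,
        "body": json.dumps({
            # True only for the first request handled by this environment
            "first_invocation_in_environment": first_invocation,
            # Large values mean the environment was initialized well in advance
            "seconds_since_init": round(time.time() - _INIT_FINISHED_AT, 3),
        }),
    }
```

The flag only says whether this is the first request an environment has handled. In an environment that was initialized ahead of time, the first request still sets the flag, but `seconds_since_init` will be large because initialization happened long before the request arrived.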

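Now, let’s provision concurrency. Provisioned Concurrency applies to a published version or an alias, so publish a version first if you haven’t already. Then, in the AWS Lambda console, navigate to your function, go to the "Configuration" tab, and select "Concurrency." In the provisioned concurrency section, add a configuration for the version or alias you want to keep warm.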

Let’s say you set "Provisioned concurrency" to 10. This tells AWS to keep 10 execution environments initialized and ready to go at all times for this function.
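The same setting can be applied from code. Below is a minimal sketch using the boto3 Lambda client's `put_provisioned_concurrency_config` call. The function name `my-api-handler` and alias `prod` are placeholders, and the `provision` helper is a hypothetical wrapper, not an AWS API.

```python
def provision(client, function_name, qualifier, count):
    """Request `count` pre-initialized environments for a version or alias.

    `client` is expected to be a boto3 Lambda client (or a test double).
    """
    if qualifier == "$LATEST":
        # Provisioned concurrency needs a published version or an alias.
        raise ValueError("qualifier must be a published version or alias")
    return client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=count,
    )


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   provision(boto3.client("lambda"), "my-api-handler", "prod", 10)
```

Allocation is not instant: the configuration stays in an in-progress state until the environments have been initialized.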

Here’s what the same function might look like when invoked with Provisioned Concurrency active:

import time
import json

def lambda_handler(event, context):
    # The handler code does not change; the *environment* is what's provisioned.
    # To tell the two cases apart at runtime, read the
    # AWS_LAMBDA_INITIALIZATION_TYPE environment variable
    # ("provisioned-concurrency" or "on-demand").
    start_time = time.time()
    # Simulate some work
    time.sleep(0.1) 
    end_time = time.time()
    duration = (end_time - start_time) * 1000 # milliseconds

    print(f"Execution duration: {duration:.2f} ms")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Hello from Lambda!',
            'execution_time_ms': f'{duration:.2f}'
        })
    }

When you now invoke this function through the version or alias that has Provisioned Concurrency, even after a long period of inactivity, the handler still reports roughly 100 ms (the 0.1 s sleep plus a small overhead), and the caller no longer pays an initialization delay on top of it. The 10 provisioned environments are already "warm," meaning the runtime is initialized and your code is loaded. AWS routes incoming requests to these pre-initialized environments first. If you exceed the provisioned concurrency (e.g., 11 requests arrive simultaneously), the 11th request spills over to on-demand concurrency and may experience a cold start. If reserved concurrency caps the function at the provisioned amount, that request is throttled instead.
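The spillover arithmetic can be sketched in a few lines. This is a simplified model, not how Lambda is implemented: it assumes all requests arrive at the same instant, and `account_available` is a hypothetical stand-in for the unreserved concurrency left in the account.

```python
def route_requests(concurrent_requests, provisioned, reserved=None,
                   account_available=1000):
    """Split simultaneous requests into provisioned, on-demand, and throttled.

    `reserved` is the function's reserved concurrency, if any; without it the
    function may draw on `account_available` (an assumed figure).
    """
    cap = reserved if reserved is not None else account_available
    on_provisioned = min(concurrent_requests, provisioned)
    remaining = concurrent_requests - on_provisioned
    on_demand_capacity = max(cap - provisioned, 0)
    on_demand = min(remaining, on_demand_capacity)
    return {
        "provisioned": on_provisioned,       # served by pre-initialized environments
        "on_demand": on_demand,              # may incur a cold start
        "throttled": remaining - on_demand,  # rejected: no capacity left
    }


print(route_requests(11, provisioned=10))
print(route_requests(25, provisioned=10, reserved=20))
```

With 11 simultaneous requests and 10 provisioned environments, one request lands on on-demand capacity. With a reserved limit of 20 and 25 requests, 10 are served on demand and 5 are throttled.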

The problem Provisioned Concurrency solves is the variability in latency introduced by cold starts. When a Lambda function is invoked, if no execution environment is ready, AWS must:

  1. Allocate a new execution environment.
  2. Download your function code.
  3. Start the runtime (e.g., Node.js, Python).
  4. Run your function’s initialization code (outside the handler).
  5. Finally, execute your handler function.

This entire process adds latency, which can be unacceptable for latency-sensitive applications. Provisioned Concurrency performs steps 1-4 ahead of time for a specified number of execution environments, so a request routed to one of them only has to wait for step 5.
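To see why even infrequent cold starts matter for a percentile target, consider a small nearest-rank percentile calculation. The latencies below are made-up placeholders (40 ms warm, 1200 ms cold), chosen only to illustrate the effect; they are not measurements.

```python
def percentile(latencies_ms, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(latencies_ms)
    rank = (len(ordered) * pct + 99) // 100  # ceil(n * pct / 100) in integer math
    return ordered[rank - 1]


def simulated_latencies(total, cold, warm_ms=40, cold_ms=1200):
    """Build `total` latencies, of which `cold` paid the cold start penalty."""
    return [cold_ms] * cold + [warm_ms] * (total - cold)


print(percentile(simulated_latencies(1000, 20), 99))  # 2% cold starts -> 1200
print(percentile(simulated_latencies(1000, 5), 99))   # 0.5% cold starts -> 40
```

In this toy model the 99th percentile is dominated by cold starts as soon as they account for more than 1% of requests, which is why a p99 target leaves so little room for them.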

The key levers you control are:

  • Provisioned concurrency count: This is the fixed number of concurrent executions you want to keep warm. You pay for this capacity whether it’s used or not.
  • Reserved concurrency: This sets the maximum number of concurrent executions for a function. It also sets that capacity aside from the account's shared pool, which reduces what is available to other functions in your account. When reserved concurrency is set, provisioned concurrency cannot exceed it. If you set reserved concurrency to 20 and provisioned concurrency to 10, you can have at most 20 concurrent executions, and 10 of those environments will always be kept warm.
  • Autoscaling: You can use Application Auto Scaling to adjust Provisioned Concurrency automatically, either on a schedule or by tracking a metric such as ProvisionedConcurrencyUtilization. This is crucial for balancing cost and performance, ensuring you have enough warm instances during peak times without paying for excessive idle capacity during off-peak hours.
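As a sketch of the autoscaling lever, the helpers below build the request parameters for Application Auto Scaling's `register_scalable_target` and `put_scaling_policy` calls. The helper names, the policy name, the function and alias names, and the 0.7 utilization target are assumptions chosen for illustration.

```python
def scalable_target_request(function_name, alias, min_capacity, max_capacity):
    """Parameters for register_scalable_target on a function alias."""
    return {
        "ServiceNamespace": "lambda",
        "ResourceId": f"function:{function_name}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def target_tracking_policy_request(function_name, alias, target_utilization=0.7):
    """Parameters for put_scaling_policy that track utilization of the alias."""
    return {
        "PolicyName": f"{function_name}-{alias}-pc-utilization",  # hypothetical name
        "ServiceNamespace": "lambda",
        "ResourceId": f"function:{function_name}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization",
            },
        },
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scalable_target_request("my-api-handler", "prod", 2, 20))
#   aas.put_scaling_policy(**target_tracking_policy_request("my-api-handler", "prod"))
```

Keeping the parameter construction separate from the API calls makes the configuration easy to inspect and test without touching an AWS account.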

Here’s a crucial detail most people miss: Provisioned Concurrency is configured per published version or alias. You cannot provision concurrency for $LATEST, the mutable, unpublished copy of your code, or for an alias that points to it. When you provision concurrency for an alias (e.g., prod), you are asking for 10 environments to be ready for whichever version that alias points to. If you repoint the alias to a new version, the provisioned concurrency follows the alias, and Lambda initializes environments for the new version. Requests must invoke that alias or version to reach the provisioned environments; unqualified invocations go to $LATEST and are served on demand. If you promote versions manually, configure provisioned concurrency on each alias whose traffic needs guaranteed performance.
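A deployment that respects this rule publishes an immutable version and then repoints the alias at it. The sketch below uses the boto3 Lambda client's `publish_version` and `update_alias` calls; the `promote` helper and the names passed to it are hypothetical.

```python
def promote(client, function_name, alias):
    """Publish the current code as a new version and point `alias` at it.

    `client` is expected to be a boto3 Lambda client (or a test double).
    Returns the new version number as a string.
    """
    version = client.publish_version(FunctionName=function_name)["Version"]
    client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=version,
    )
    return version


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   promote(boto3.client("lambda"), "my-api-handler", "prod")
```

Because the provisioned concurrency configuration is attached to the alias, it does not need to be recreated after the alias moves. Checking the configuration's status with `get_provisioned_concurrency_config` before sending latency-sensitive traffic is a reasonable precaution.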

The next concept you’ll grapple with is managing the cost implications of Provisioned Concurrency, especially when combined with autoscaling and fluctuating traffic patterns.
