Lambda and Step Functions are a killer combo for building robust, serverless workflows, but most people think of them as just "calling one Lambda after another." The real magic, and where things get surprisingly powerful, is in how Step Functions manages state, retries, and error handling outside of your Lambda code.

Let’s watch a simple state machine in action. Imagine we’re processing an order:

{
  "Comment": "Order Processing Workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validateOrderFunction",
      "Next": "ProcessPayment",
      "Catch": [
        {
          "ErrorEquals": ["ValidationError"],
          "Next": "NotifyOrderError"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:processPaymentFunction",
      "Next": "ShipOrder",
      "Retry": [
        {
          "ErrorEquals": ["PaymentProcessingError", "ServiceUnavailable"],
          "IntervalSeconds": 5,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["PaymentFailedError"],
          "Next": "NotifyOrderError"
        }
      ]
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:shipOrderFunction",
      "End": true
    },
    "NotifyOrderError": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notifyOrderErrorFunction",
      "End": true
    }
  }
}

When an order comes in, Step Functions triggers ValidateOrder. If ValidateOrder succeeds, it passes its output to ProcessPayment. If ValidateOrder throws a ValidationError, Step Functions catches it and transitions to NotifyOrderError.
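To make the ValidationError path concrete, here is a minimal sketch of what validateOrderFunction might look like, assuming a Python Lambda runtime. For Python functions, Step Functions matches ErrorEquals against the exception's class name. The orderId and items fields are hypothetical, chosen only for illustration.

```python
class ValidationError(Exception):
    """Raised when an order fails validation.

    The class name is what Step Functions compares against ErrorEquals.
    """


def lambda_handler(event, context):
    # 'orderId' and 'items' are assumed field names, not part of the original workflow.
    if not event.get("orderId"):
        raise ValidationError("orderId is required")
    if not event.get("items"):
        raise ValidationError("order must contain at least one item")
    # With no ResultPath set on the state, this return value becomes
    # the input of the next state (ProcessPayment).
    return {"orderId": event["orderId"], "items": event["items"], "valid": True}
```

Note that the handler contains no retry or routing logic. It only raises a named error and leaves the decision about what happens next to the state machine.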

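Notice ProcessPayment has a Retry block. If it fails with a PaymentProcessingError or ServiceUnavailable, Step Functions waits 5 seconds and tries again. If that retry fails, it waits 10 seconds (5 * 2), and after another failure 20 seconds (10 * 2). MaxAttempts counts retries, not total invocations, so MaxAttempts: 3 allows up to four invocations in all. Once the retries are exhausted, Step Functions checks the Catch block. As written, that Catch matches only PaymentFailedError, so an exhausted PaymentProcessingError or ServiceUnavailable fails the execution rather than reaching NotifyOrderError; add those names (or States.ALL) to the Catch if you want them routed there. This resilience is built into the state machine definition, not your Lambda code.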
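The backoff arithmetic is easy to check by hand. The sketch below is a local calculation, not an AWS API; retry_waits is a hypothetical helper. It treats MaxAttempts as the number of retries, which is how the Amazon States Language defines it.

```python
def retry_waits(interval_seconds, max_attempts, backoff_rate):
    """Return the wait (in seconds) before each retry, in order.

    Each wait is the previous one multiplied by backoff_rate,
    starting from interval_seconds.
    """
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Values taken from the ProcessPayment Retry block.
waits = retry_waits(interval_seconds=5, max_attempts=3, backoff_rate=2)
print(waits)           # [5, 10, 20]
print(1 + len(waits))  # 4: the first invocation plus three retries
```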

The mental model here is that Step Functions is the conductor, and your Lambdas are the musicians. The conductor dictates the tempo, decides when to repeat a section, and knows what to do if a musician hits a wrong note. Your Lambdas just focus on playing their part.

You control the flow using the Amazon States Language (ASL), a JSON-based domain-specific language. Key elements include:

  • States: The fundamental building blocks. Task states invoke other AWS services (like Lambda), Choice states make decisions, Parallel states run branches concurrently, Wait states pause execution, and Succeed/Fail states end the execution.
  • Transitions: How the workflow moves from one state to another (Next, End).
  • Input/Output Processing: You can transform the data passed between states using InputPath, OutputPath, and ResultPath, which lets you shape the data without custom logic in your Lambdas. For example, "ResultPath": "$.paymentResult" adds the output of a task to a paymentResult field within the overall state data, rather than overwriting the entire payload.
  • Error Handling: Catch blocks define how to handle specific errors thrown by tasks.
  • Retries: Retry blocks automatically re-execute failed tasks with configurable backoff strategies.
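The ResultPath behavior described in the list can be modeled locally. This is a simplified sketch, not the service's implementation: apply_result_path is a hypothetical helper that only understands plain "$.a.b" paths, and the sample order data is invented.

```python
import copy


def apply_result_path(state_input, result, result_path="$"):
    """Simplified local model of ResultPath for plain '$.a.b' style paths."""
    if result_path is None:
        return state_input          # "ResultPath": null discards the result
    if result_path == "$":
        return result               # default: the result replaces the input
    if not result_path.startswith("$."):
        raise ValueError("only '$' and '$.a.b' paths are supported in this sketch")
    merged = copy.deepcopy(state_input)   # leave the caller's data untouched
    keys = result_path[2:].split(".")
    node = merged
    for key in keys[:-1]:
        node = node.setdefault(key, {})   # create intermediate objects as needed
    node[keys[-1]] = result
    return merged


state_input = {"orderId": "A1", "amount": 42}
task_result = {"status": "CHARGED"}
print(apply_result_path(state_input, task_result, "$.paymentResult"))
# {'orderId': 'A1', 'amount': 42, 'paymentResult': {'status': 'CHARGED'}}
```

With the default path of "$", the same call returns only the task result, which is the overwrite the ResultPath example avoids.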

The thing most people miss is how deeply you can integrate ASL with the state data. You’re not just passing raw JSON. You can use JSONPath expressions within InputPath, OutputPath, Parameters, and ResultSelector to dynamically pull specific fields from the state, construct new JSON objects, and inject parameters into your Lambda invocations. This means your Lambdas can be simpler, receiving only the precise data they need, and your workflow logic handles the complex data manipulation.
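As a rough illustration of how Parameters shapes a Lambda's input, the sketch below mimics the rule that a key ending in ".$" has its value treated as a path into the state input, while other keys pass through as literals. It is a local model under heavy simplification: get_path and resolve_parameters are hypothetical helpers, only plain "$.a.b" paths are handled (no intrinsic functions, context object, or full JSONPath), and the order fields are invented.

```python
def get_path(data, path):
    """Resolve a plain '$.a.b' path; real JSONPath supports far more."""
    if path == "$":
        return data
    node = data
    for key in path[2:].split("."):
        node = node[key]
    return node


def resolve_parameters(parameters, state_input):
    """Build the task payload from a Parameters-style template."""
    resolved = {}
    for key, value in parameters.items():
        if key.endswith(".$"):
            # Key ends in ".$": strip the suffix and look the value up in the input.
            resolved[key[:-2]] = get_path(state_input, value)
        elif isinstance(value, dict):
            resolved[key] = resolve_parameters(value, state_input)
        else:
            resolved[key] = value   # static value, passed through unchanged
    return resolved


parameters = {
    "orderId.$": "$.order.id",
    "amount.$": "$.order.total",
    "currency": "USD",
}
state_input = {
    "order": {"id": "A1", "total": 42, "notes": "gift wrap"},
    "customer": {"id": "C9"},
}
print(resolve_parameters(parameters, state_input))
# {'orderId': 'A1', 'amount': 42, 'currency': 'USD'}
```

The Lambda on the receiving end sees only orderId, amount, and currency. The rest of the state data stays in the workflow.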

The next concept to explore is how to handle long-running processes and coordinate multiple independent workflows using Step Functions’ StartExecution API call and output integration.
