Webhook Design Patterns for Scale

Webhooks, at their core, are just HTTP POST requests sent from one system to another. The magic, and the mess, happens in the details of making them reliable.

Let’s watch a webhook in action. Imagine a simple e-commerce system. When an order is placed, it needs to notify an external shipping service.

// Incoming Order Event (from e-commerce system)
{
  "event_type": "order.created",
  "timestamp": "2023-10-27T10:00:00Z",
  "data": {
    "order_id": "ORD12345",
    "customer_id": "CUST987",
    "items": [
      {"sku": "SKU001", "quantity": 2},
      {"sku": "SKU005", "quantity": 1}
    ],
    "shipping_address": {
      "street": "123 Main St",
      "city": "Anytown",
      "zip": "12345"
    }
  }
}

The e-commerce system (the sender) will POST this JSON payload to a predefined URL managed by the shipping service (the receiver).

# Example POST request using curl
curl -X POST \
  https://shipping.example.com/webhooks/order_events \
  -H 'Content-Type: application/json' \
  -d '{
    "event_type": "order.created",
    "timestamp": "2023-10-27T10:00:00Z",
    "data": {
      "order_id": "ORD12345",
      "customer_id": "CUST987",
      "items": [
        {"sku": "SKU001", "quantity": 2},
        {"sku": "SKU005", "quantity": 1}
      ],
      "shipping_address": {
        "street": "123 Main St",
        "city": "Anytown",
        "zip": "12345"
      }
    }
  }'

The shipping service receives this, processes it (e.g., creates a new shipment record), and, crucially, responds with an HTTP status code. A 2xx status such as 200 OK or 201 Created signals success. Anything else signals failure, whether it is an error status (a 500 Internal Server Error, a 400 Bad Request) or no response at all (a network timeout).
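As a rough sketch of the sender’s side of this exchange, the Python below posts an event and classifies the outcome. The function names post_event and is_delivered, the 5-second timeout, and the choice to treat every 2xx status as success are assumptions made for illustration, not part of any particular system.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def is_delivered(status_code: Optional[int]) -> bool:
    """Treat any 2xx status as success; None means no response arrived."""
    return status_code is not None and 200 <= status_code < 300


def post_event(url: str, event: dict, timeout_seconds: float = 5.0) -> Optional[int]:
    """POST the event as JSON; return the HTTP status, or None on network failure."""
    body = json.dumps(event).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx or 5xx status.
        return error.code
    except OSError:
        # Covers timeouts, DNS failures, and refused connections
        # (urllib.error.URLError is a subclass of OSError).
        return None
```

Returning None for "no response" keeps the two failure modes distinct, which matters later when deciding whether a retry is worthwhile.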

This is the fundamental problem webhooks solve: asynchronous communication between systems without the receiver having to constantly poll the sender for changes. The sender "pushes" data when something happens.

But what happens if the shipping service is down for a minute? Or if the network glitches? The POST request fails. The e-commerce system needs to know this and do something about it. This is where reliability patterns come in.

Signatures: Don’t Get Fooled

The first line of defense is ensuring the webhook payload actually came from where you think it did and hasn’t been tampered with. This is where signatures shine. The sender generates a signature for the payload using a shared secret and includes it in an HTTP header. The receiver then recalculates the signature using the same shared secret and the received payload. If they match, you’re good.

Let’s say the e-commerce system uses HMAC-SHA256.

Sender Side (E-commerce System):

Shared Secret: s3cr3t_k3y_f0r_w3bh00k_s1gn4tur3
Payload: {"event_type": "order.created", ...}

Generate Signature:

echo -n '{"event_type": "order.created", ...}' | openssl dgst -sha256 -hmac 's3cr3t_k3y_f0r_w3bh00k_s1gn4tur3'
# Output: e.g., a6b8c9d0e1f2...

Send Request:

curl -X POST \
  https://shipping.example.com/webhooks/order_events \
  -H 'Content-Type: application/json' \
  -H 'X-Ecomm-Signature: sha256=a6b8c9d0e1f2...' \
  -d '{
    "event_type": "order.created",
    "timestamp": "2023-10-27T10:00:00Z",
    "data": { ... }
  }'

Receiver Side (Shipping Service):

Shared Secret: s3cr3t_k3y_f0r_w3bh00k_s1gn4tur3
Received Payload: {"event_type": "order.created", ...}
Received Signature Header: X-Ecomm-Signature: sha256=a6b8c9d0e1f2...

Recalculate Signature:

import hmac
import hashlib

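secret = b's3cr3t_k3y_f0r_w3bh00k_s1gn4tur3'
payload = b'{"event_type": "order.created", ...}'  # The raw bytes of the request body, exactly as received
header_value = 'sha256=a6b8c9d0e1f2...'  # The X-Ecomm-Signature header
received_signature = header_value.split('=', 1)[1]  # Drop the 'sha256=' prefix

calculated_signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()

# compare_digest runs in constant time, so it does not leak how much of the signature matched
if hmac.compare_digest(calculated_signature, received_signature):
    print("Signature is valid!")
else:
    print("Signature is invalid!")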

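This confirms that the payload was signed by someone holding the shared secret and that it was not modified in transit. It does not stop the payload from being read along the way; that is what HTTPS is for.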
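The two sides can be wrapped into a pair of small helpers. This is a minimal sketch: the function names sign_payload and verify_signature are made up for illustration, and the "sha256=" prefix simply mirrors the X-Ecomm-Signature header format used in the example above.

```python
import hashlib
import hmac

SIGNATURE_PREFIX = "sha256="


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Return the header value the sender would attach to raw_body."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return SIGNATURE_PREFIX + digest


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Recompute the HMAC over the exact bytes received and compare."""
    if not header_value.startswith(SIGNATURE_PREFIX):
        return False
    received = header_value[len(SIGNATURE_PREFIX):]
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Compare as bytes in constant time; encoding first avoids a TypeError
    # if the header happens to contain non-ASCII characters.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Note that verification must run on the raw request bytes. Parsing the JSON and re-serializing it can change whitespace or key order, which changes the digest.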

Retries: The Persistent Messenger

What happens if the shipping service returns a 503 Service Unavailable? The sender must retry. A simple retry isn’t enough; it needs a strategy. Exponential backoff is the standard. Try again after a short delay, then a longer delay, and so on, up to a maximum number of retries or a total time limit.

Sender Side (E-commerce System):

Initial Delay: 1 second
Backoff Factor: 2 (doubles the delay each time)
Max Retries: 5
Max Delay: 60 seconds (a cap on any single wait; not reached with only 5 retries)

If the first attempt fails:

Retry after 1 second.
If that fails, retry after 2 seconds.
If that fails, retry after 4 seconds.
If that fails, retry after 8 seconds.
If that fails, retry after 16 seconds.
If that fails (5 retries total), give up or queue for manual intervention.

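This avoids overwhelming a temporarily struggling service while still giving the delivery several chances to succeed. Events that exhaust their retries need a fallback, such as the manual-intervention queue mentioned above, because a bounded retry policy cannot guarantee delivery on its own.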
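The schedule above can be sketched in a few lines of Python. The names backoff_delays and deliver_with_retries are hypothetical, and send stands in for whatever function performs the POST and reports success. The sleep function is injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, List


def backoff_delays(initial: float = 1.0, factor: float = 2.0,
                   max_retries: int = 5, max_delay: float = 60.0) -> List[float]:
    """Return the wait before each retry, capped at max_delay."""
    return [min(initial * factor ** attempt, max_delay)
            for attempt in range(max_retries)]


def deliver_with_retries(send: Callable[[], bool], delays: List[float],
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try once, then retry after each delay until send() succeeds."""
    if send():
        return True
    for delay in delays:
        sleep(delay)
        if send():
            return True
    # All retries exhausted: the caller should queue the event for follow-up.
    return False
```

With the defaults, backoff_delays() yields waits of 1, 2, 4, 8, and 16 seconds, matching the list above. Many production senders also add random jitter to each delay so that a batch of failed deliveries does not retry in lockstep; that is left out here to keep the sketch short.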

Delivery Guarantees: At Least Once, Not Exactly Once

Webhooks, by their nature, are generally "at least once" delivery. This means a webhook might be delivered more than once. The sender only knows a delivery succeeded if the receiver’s response makes it back. If that response is lost, or arrives after the sender’s timeout, the sender retries even though the receiver has already processed the first attempt.

Example Scenario:

E-commerce system sends order.created for ORD12345.
Shipping service receives it, processes it, and sends back 200 OK.
However, the 200 OK response from the shipping service is lost in transit back to the e-commerce system.
The e-commerce system times out waiting for the acknowledgment and assumes the delivery failed.
The e-commerce system retries sending order.created for ORD12345.
The shipping service receives it again.
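The scenario can be reproduced with a toy simulation. Everything here is hypothetical: NaiveShippingService is a stand-in receiver with no duplicate protection, and the lose_first_response flag fakes the dropped 200 OK.

```python
from typing import List, Optional


class NaiveShippingService:
    """Hypothetical receiver that creates a shipment for every request it sees."""

    def __init__(self) -> None:
        self.shipments: List[str] = []

    def handle(self, order_id: str) -> int:
        self.shipments.append(order_id)
        return 200


def send_with_one_retry(receiver: NaiveShippingService, order_id: str,
                        lose_first_response: bool) -> int:
    """Send once, resend if no acknowledgment arrives; return the attempt count."""
    attempts = 1
    status: Optional[int] = receiver.handle(order_id)
    if lose_first_response:
        # The receiver did its work, but the 200 never reached the sender.
        status = None
    if status is None:
        attempts += 1
        receiver.handle(order_id)  # the retry reaches the receiver a second time
    return attempts


service = NaiveShippingService()
send_with_one_retry(service, "ORD12345", lose_first_response=True)
# service.shipments now holds ORD12345 twice: one order, two shipments.
```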

This is why the receiver must be idempotent. Idempotency means that performing an operation multiple times has the same effect as performing it once. For the shipping service, processing order.created for ORD12345 twice should result in only one shipment being created.

How to achieve idempotency?

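Use a unique identifier from the payload: When processing an event, check if you’ve already processed an event with that same identifier. order_id works here only because each order produces a single order.created event; in general, combine it with event_type or use a dedicated event ID. If the event is a duplicate, skip the work but still return a success status so the sender stops retrying.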
Store processed event IDs: Keep a record (e.g., in a database table or cache) of the IDs of events you’ve successfully processed. Before processing a new event, query this store.
Database INSERT ... ON CONFLICT DO NOTHING: If you’re inserting into a database table with a unique constraint on your event ID, this statement (PostgreSQL and SQLite syntax; other databases have equivalents) skips the duplicate row instead of raising an error.
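These three ideas combine naturally. The sketch below uses Python’s built-in sqlite3 module with an in-memory database; the table names, the event_key format, and the handle_order_created function are all assumptions made for illustration. The ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with the two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE shipments (order_id TEXT NOT NULL)")
    return conn


def handle_order_created(conn: sqlite3.Connection, order_id: str) -> bool:
    """Create a shipment unless this event was already processed.

    Returns True if a shipment was created, False for a duplicate.
    """
    event_key = f"order.created:{order_id}"
    with conn:  # one transaction: the marker and the shipment commit together
        cursor = conn.execute(
            "INSERT INTO processed_events (event_key) VALUES (?) "
            "ON CONFLICT (event_key) DO NOTHING",
            (event_key,),
        )
        if cursor.rowcount == 0:
            # The key already existed, so this delivery is a duplicate.
            return False
        conn.execute("INSERT INTO shipments (order_id) VALUES (?)", (order_id,))
    return True
```

Recording the event key and creating the shipment in the same transaction matters: if they were committed separately, a crash between the two could leave an event marked as processed with no shipment behind it. Either way, the HTTP handler would return a success status to the sender, whether the call returned True or False.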

The signature mechanism, when implemented correctly, lets the receiver confirm that each payload is authentic and unmodified. The retry mechanism provides at-least-once delivery, and idempotency on the receiver ensures that however many times a payload arrives, its effect is applied only once.

The next challenge is managing the volume and failure modes of webhooks at scale, which often leads to building a dedicated webhook dispatch and processing infrastructure.