The most surprising thing about rate limiting is that the "bucket" in Token Bucket and Leaky Bucket is largely a metaphor; the actual implementation often has nothing to do with discrete containers.

Let’s see how these algorithms handle a surge of incoming requests.

Imagine a service that allows only 100 requests per minute.

Token Bucket

  • Concept: A bucket holds tokens. Each token allows one request. Tokens are added to the bucket at a fixed rate. If the bucket is full, new tokens are discarded. When a request arrives, it consumes a token. If no token is available, the request is rejected.
  • Internal Mechanics:
    • tokens: Current number of tokens in the bucket.
    • capacity: Maximum number of tokens the bucket can hold.
    • refill_rate: How many tokens are added per unit of time (e.g., tokens/second).
    • last_refill_time: Timestamp of the last token refill.
  • How it works in action (conceptual):
    • A request comes in.
    • Check if tokens > 0.
    • If yes: decrement tokens, allow request.
    • If no: reject request.
    • Periodically, or on demand before checking tokens:
      • Calculate time elapsed since last_refill_time.
      • Calculate tokens to add: elapsed_time * refill_rate.
      • Add tokens: tokens = min(tokens + tokens_to_add, capacity).
      • Update last_refill_time.
  • Levers you control:
    • capacity: This is the "burstiness" factor. A higher capacity allows for larger bursts of requests, as long as tokens can be replenished.
    • refill_rate: This sets the sustained rate of requests.
  • Example Configuration (conceptual):
    • capacity = 1000 (allows a burst of up to 1000 requests)
    • refill_rate = 10 (tokens per second, meaning 600 per minute)

Leaky Bucket

  • Concept: Requests are added to a queue (the "bucket"). The bucket "leaks" requests at a fixed rate, processing them one by one. If the bucket is full, new requests are rejected.
  • Internal Mechanics:
    • queue: Holds incoming requests.
    • leak_rate: How many requests are processed per unit of time (e.g., requests/second).
    • capacity: Maximum number of requests the queue can hold.
    • last_leak_time: Timestamp of the last request processed.
  • How it works in action (conceptual):
    • A request comes in.
    • Check if queue.size() < capacity.
    • If yes: add request to queue, allow request.
    • If no: reject request.
    • Periodically, or on demand:
      • Calculate time elapsed since last_leak_time.
      • Calculate requests to process: elapsed_time * leak_rate.
      • Process and remove requests from the front of queue up to requests_to_process.
      • Update last_leak_time.
  • Levers you control:
    • leak_rate: This is the only factor determining the output rate. It smooths out traffic regardless of bursts.
    • capacity: Primarily guards against denial-of-service by preventing the queue from growing infinitely.
  • Example Configuration (conceptual):
    • leak_rate = 10 (requests per second, meaning 600 per minute)
    • capacity = 10000 (very large queue to absorb temporary spikes before rejection)

Sliding Window

  • Concept: Unlike the previous two, this algorithm doesn’t use a fixed-size bucket or rate. It tracks requests within a sliding time window. For example, to enforce 100 requests per minute, it counts requests made in the last 60 seconds.
  • Internal Mechanics:
    • window_size: The duration of the time window (e.g., 60 seconds).
    • limit: The maximum number of requests allowed within window_size.
    • timestamps: A list or data structure storing the timestamps of all requests that fall within the current window_size.
  • How it works in action (conceptual):
    • A request comes in at current_time.
    • Remove all timestamps from timestamps that are older than current_time - window_size.
    • If timestamps.size() < limit:
      • Add current_time to timestamps.
      • Allow request.
    • Else:
      • Reject request.
  • Levers you control:
    • window_size: Defines the period over which the rate is measured.
    • limit: The maximum allowed requests within that period.
  • Example Configuration (conceptual):
    • window_size = 60 (seconds)
    • limit = 100 (requests)

The core difference lies in how they handle bursts and smooth traffic. Token Bucket allows bursts by having a capacity for "pre-paid" requests. Leaky Bucket enforces a strict output rate, smoothing traffic aggressively. Sliding Window offers a more precise measurement of recent activity, directly enforcing the limit over the specified lookback period without the conceptual overhead of tokens or queues.

The most common implementation of Sliding Window logs request timestamps and uses a data structure like a Redis sorted set or a circular buffer to efficiently prune old timestamps and count recent ones.

Want structured learning?

Take the full API Architecture course →