Rate Limiter Algorithms: Token Bucket vs. Sliding Window

A rate limiter isn’t just about saying "no" to too many requests; it’s fundamentally about managing the predictability of your system’s load.

Let’s see this in action. Imagine a simple API endpoint /users/{id} that we want to limit to 100 requests per minute per user ID.

Here’s a conceptual Python representation:

import time
import collections

class TokenBucketRateLimiter:
    def __init__(self, capacity, fill_rate):
        self.capacity = capacity
        self.fill_rate = fill_rate
        self.tokens = capacity
        self.last_refill_time = time.time()

    def _refill(self):
        now = time.time()
        time_elapsed = now - self.last_refill_time
        tokens_to_add = time_elapsed * self.fill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill_time = now

    def consume(self, amount=1):
        self._refill()
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

class SlidingWindowRateLimiter:
    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.requests = collections.deque() # Stores timestamps of requests

    def consume(self, amount=1):
        now = time.time()

        # Remove timestamps outside the current window
        while self.requests and self.requests[0] < now - self.window_size:
            self.requests.popleft()

        # Check if adding 'amount' exceeds the limit
        if len(self.requests) + amount <= self.limit:
            # Add timestamp for each request consumed
            for _ in range(amount):
                self.requests.append(now)
            return True
        return False

# Example usage for a user:
user_id = "user123"
# Token Bucket: 100 tokens, refills at 100 tokens/minute (100/60 tokens/sec)
token_bucket = TokenBucketRateLimiter(capacity=100, fill_rate=100/60)
# Sliding Window: 100 requests in 60 seconds
sliding_window = SlidingWindowRateLimiter(limit=100, window_size=60)

def handle_request(user_id):
    # Combined approach: first check token bucket, then sliding window
    if token_bucket.consume() and sliding_window.consume():
        print(f"Request for {user_id} allowed.")
        # Process request...
    else:
        print(f"Request for {user_id} denied (rate limited).")

# Simulate requests
for i in range(105):
    handle_request(user_id)
    time.sleep(0.5) # Simulate time between requests

The core problem rate limiting solves is resource contention under variable load. Without it, a sudden surge of traffic can overwhelm your services, leading to cascading failures. Rate limiting provides a predictable service level by smoothing out these peaks.

The Token Bucket algorithm is like a leaky bucket that replenishes itself. You have a bucket with a fixed capacity. Tokens are added to the bucket at a constant fill_rate. To make a request, you must take a token from the bucket. If the bucket is empty, you have to wait or your request is denied. This algorithm allows for bursts of traffic up to the bucket’s capacity, as long as tokens are available. The key parameters are capacity (maximum burst size) and fill_rate (average request rate).

The Sliding Window algorithm, on the other hand, is more about strict adherence to a rate over a specific time window. It keeps a log of request timestamps. When a new request arrives, it discards all timestamps older than the window_size and then checks if adding the new request would exceed the limit within the remaining window. This is often implemented using a deque or a circular buffer. The parameters are limit (maximum requests) and window_size (the duration of the window).

Combining these two is powerful. The Token Bucket handles bursts and provides a smooth average rate, while the Sliding Window enforces a hard cap within any given recent period, preventing a single large burst from consuming all available tokens and then immediately exceeding the rate again. For example, if a user makes 100 requests in the first second of a minute, the Token Bucket might allow it if capacity is high enough, but the Sliding Window would immediately block subsequent requests for the rest of that minute.

You can implement this using a distributed cache like Redis. For Token Bucket, you’d store the tokens count and last_refill_time for each user ID. For Sliding Window, you’d store a sorted set of request timestamps.

The most common mistake is thinking about rate limiting purely in terms of "requests per second." The real challenge is how to handle the distribution of those requests over time, especially with network latency and distributed systems where clocks aren’t perfectly synchronized.

A subtle but critical aspect of sliding window implementations, especially in distributed systems, is how you handle the "sliding" part. If you simply store timestamps and prune old ones, you might end up with a window that’s slightly larger or smaller than intended due to the discrete nature of request arrival. A more robust approach is to use a fixed-size window that "slides" forward, or a counter for the current window and a counter for the previous window, interpolating based on the current time within the window. This ensures a more consistent enforcement of the rate limit.

The next hurdle you’ll face is designing a strategy for handling rejected requests, such as returning a 429 Too Many Requests status code with appropriate Retry-After headers.