Speculative decoding in Hugging Face (exposed in transformers as "assisted generation") lets a small model "guess" several tokens ahead so that the large model can verify them together, cutting latency by reducing the number of expensive, sequential full-model forward passes.

Let’s see it in action. Imagine we have a simple pipeline for generating text.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The quick brown fox jumps over the lazy"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Standard generation
print("Standard Generation:")
output_standard = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_standard[0], skip_special_tokens=True))

# Speculative decoding (called "assisted generation" in transformers)

# We need a smaller, faster "draft" (assistant) model that shares the main
# model's tokenizer. For demonstration we reuse the same checkpoint, which
# gives no speedup; in a real scenario you'd use a much smaller model.
draft_model = AutoModelForCausalLM.from_pretrained(model_name)

# The number of tokens the draft model proposes per round is read from the
# draft model's own generation config (in recent transformers versions).
# Higher values can be faster but require a good draft model.
draft_model.generation_config.num_assistant_tokens = 5

print("\nSpeculative Decoding Generation:")
output_speculative = model.generate(
    input_ids,
    # Crucially, passing a draft model is what enables speculative decoding
    assistant_model=draft_model,
    do_sample=False,  # Greedy decoding for simplicity
    num_beams=1,      # Beam search is not supported with an assistant model
    max_new_tokens=20,
)
print(tokenizer.decode(output_speculative[0], skip_special_tokens=True))

What’s happening under the hood? Standard autoregressive generation involves a loop: the model takes the current sequence, predicts the next token, appends it, and repeats. This means for every single token generated, you perform a full forward pass through the model, which is computationally expensive.

Speculative decoding introduces a smaller, faster "draft" model. This draft model generates a short run of candidate future tokens (e.g., 5 tokens ahead). The main, larger model then processes the whole run in a single forward pass. Because a causal language model outputs a next-token distribution at every position, that one pass reveals what the main model would have predicted at each drafted position.

Think of it like this: given the prefix "The quick brown fox," the draft model proposes "jumps over the lazy dog." The main model then checks, in one pass, whether each of these proposed tokens ("jumps," "over," "the," "lazy," "dog") is the continuation it would have produced itself. Tokens are accepted from left to right up to the first mismatch. At that point the main model’s own prediction for the position, which the same pass already computed, replaces the rejected token, the rest of the draft is discarded, and the draft model starts a new round from the corrected sequence. If every drafted token is accepted, the pass also yields one extra token for free. Each main-model pass therefore produces at least one token and often several, which drastically reduces the number of expensive full forward passes while, under greedy decoding, producing the same text the main model would have generated on its own.
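The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a minimal illustration under stated assumptions, not how transformers implements it: `target_next` and `draft_next` are hypothetical rule-based "models" over integer tokens, and one "forward pass" of the target is simulated by evaluating it at every drafted position.

```python
from typing import List, Tuple


def target_next(seq: List[int]) -> int:
    """Toy 'large' model (hypothetical): greedy next token from a fixed rule."""
    return (3 * seq[-1] + 1) % 11


def draft_next(seq: List[int]) -> int:
    """Toy 'small' model (hypothetical): agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else target_next(seq)


def greedy_baseline(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Standard autoregressive decoding: one target pass per generated token."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        seq.append(target_next(seq))
        passes += 1
    return seq, passes


def speculative_greedy(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Greedy speculative decoding: draft k tokens, verify them in one target pass."""
    seq, passes = list(prompt), 0
    limit = len(prompt) + n_new
    while len(seq) < limit:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One target pass gives the target's prediction at every position:
        #    after seq, after seq + draft[:1], ..., after seq + draft.
        passes += 1
        preds = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix of the draft that matches the target.
        n_ok = 0
        while n_ok < k and draft[n_ok] == preds[n_ok]:
            n_ok += 1
        # 4. Keep the accepted tokens plus the target's own token at the first
        #    mismatch (or the bonus token if the whole draft was accepted).
        seq.extend(draft[:n_ok] + [preds[n_ok]])
    return seq[:limit], passes


base_out, base_passes = greedy_baseline([1], 20)
spec_out, spec_passes = speculative_greedy([1], 20, k=4)
print("same output:", base_out == spec_out)
print("target passes:", base_passes, "vs", spec_passes)
```

In this toy setup the two decoders produce identical sequences, and the speculative version needs fewer target passes because most drafted tokens are accepted.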

The key levers you control are:

  1. assistant_model: Passing a draft model to model.generate() through this argument is what enables the feature; there is no separate on/off flag.
  2. The draft model itself: You must provide a Hugging Face model instance that is smaller and faster than your main model and that, in the standard setup, uses the same tokenizer. The quality of this draft model is paramount. A draft model that’s too "dumb" will lead to frequent rejections and little speedup. A draft model that’s too complex might not be fast enough to justify its use. Often, a smaller model from the same family or a distilled version of your main model works well.
  3. num_assistant_tokens: This integer, set on the draft model’s generation_config, dictates how many tokens the draft model proposes per round. A higher number means more tokens can be verified in a single main-model pass, which can lead to greater speedups if the draft model is accurate. If it is not, every token drafted after the first rejection is wasted work. Experimentation is key here to find the sweet spot for your specific models and hardware.
  4. do_sample and num_beams: Assisted generation supports greedy decoding and sampling, but not beam search, so keep num_beams=1. Sampling, especially at higher temperatures, introduces more randomness, making it harder for the draft model to predict accurately and typically leading to lower acceptance rates.
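To get a feel for the trade-off in the number of drafted tokens per round, here is a rough back-of-the-envelope sketch. It assumes, as a simplification used in the speculative decoding literature, that each drafted token is accepted independently with the same probability `alpha`; under that assumption the expected number of tokens produced per main-model pass with `k` drafted tokens is (1 - alpha^(k+1)) / (1 - alpha). The function name is hypothetical, and the cost of running the draft model is ignored.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model pass, assuming each of the k
    drafted tokens is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one bonus token from the pass.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.8, 0.95):
    row = ", ".join(
        f"k={k}: {expected_tokens_per_pass(alpha, k):.2f}" for k in (1, 3, 5, 10)
    )
    print(f"alpha={alpha}: {row}")
```

With a low acceptance rate the curve flattens quickly, so drafting more tokens mostly adds wasted draft-model work; with a high acceptance rate, longer drafts keep paying off.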

The most impactful aspect of speculative decoding, and one that is often overlooked, is how acceptance and rejection of draft tokens are decided. The main model does not judge the draft as an all-or-nothing block: tokens are accepted from left to right, and the first rejection discards only the remainder of the draft. With greedy decoding the rule is an exact match: the drafted token at position i is accepted if it is the main model’s most likely token given the prefix. With sampling the rule is probabilistic: a token that the draft model sampled with probability q is accepted with probability min(1, p/q), where p is the main model’s probability for that token, and on rejection a replacement is sampled from the normalized positive part of the difference between the two distributions. This correction is what makes the output follow the main model’s distribution rather than the draft model’s, so the draft model affects speed, not the quality of the output.
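The sampling case can be checked numerically on a toy vocabulary. The sketch below assumes the acceptance rule described in the speculative sampling papers: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized max(0, p - q). All function names are hypothetical. Rather than drawing random samples, it computes the exact distribution of the emitted token for a single draft position and compares it to the main model’s distribution.

```python
from typing import List


def accept_prob(p_x: float, q_x: float) -> float:
    """Probability of keeping a token that the draft sampled with probability q_x > 0."""
    return min(1.0, p_x / q_x)


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection never actually happens.
        return list(p)
    return [d / total for d in diff]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token for one drafted position."""
    # Mass emitted through the "draft sampled x and it was accepted" path.
    accepted = [
        qi * accept_prob(pi, qi) if qi > 0.0 else 0.0 for pi, qi in zip(p, q)
    ]
    # Remaining mass goes through the rejection path and is resampled.
    reject_mass = 1.0 - sum(accepted)
    resid = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, resid)]


p = [0.6, 0.3, 0.1]  # main (target) model distribution over 3 tokens
q = [0.2, 0.5, 0.3]  # draft model distribution over the same tokens
out = output_distribution(p, q)
print([round(x, 6) for x in out])
```

The emitted distribution matches `p` even though the draft distribution `q` is quite different, which is why a weak draft model costs speed but not output quality.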

The next challenge is efficiently selecting the right draft_model for your primary model to maximize this speedup.
