GenerationConfig and its sampling parameters tell a Hugging Face transformers model how to generate text, not what text to generate. They control how the next token is picked, which is where most of the variability in the output comes from.

Let’s see it in action. Imagine we have a simple prompt and want to generate a continuation.

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

# Load a small, fast model for demonstration
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "The quick brown fox jumps over the"

# --- Default Generation ---
print("--- Default Generation ---")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
default_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(default_output[0], skip_special_tokens=True))

# --- Generation with specific config ---
print("\n--- Generation with specific config ---")
# We'll define a GenerationConfig object
generation_config = GenerationConfig(
    max_new_tokens=20,  # Generate 20 new tokens, not a total of 50
    do_sample=True,     # Enable sampling
    temperature=0.7,    # Control randomness
    top_k=50,           # Consider only the top 50 most likely tokens
    top_p=0.9,          # Consider tokens that make up 90% of the probability mass
    repetition_penalty=1.2 # Penalize repeating tokens
)

# Pass the config object to the generate method
custom_output = model.generate(input_ids, generation_config=generation_config)
print(tokenizer.decode(custom_output[0], skip_special_tokens=True))

The default generation uses a greedy approach: always pick the single most probable next token. This is deterministic and often leads to repetitive or bland output. The second example, however, uses GenerationConfig to introduce sampling. We tell it to generate a maximum of 20 new tokens (rather than a total max_length), enable sampling (do_sample=True), and set parameters like temperature, top_k, and top_p to influence how the sampling happens. The repetition_penalty is a common addition to discourage the model from getting stuck in loops.

The core problem these parameters solve is the transition from a model that outputs probabilities for every possible next token to a single, coherent sequence of tokens. A language model, at its heart, predicts the probability distribution of the next token given the preceding sequence. For example, after "The quick brown fox jumps over the", the model might assign probabilities like:

  • lazy (0.3)
  • dog (0.25)
  • moon (0.1)
  • fence (0.05)
  • … and thousands more.

Without sampling, the model would deterministically pick lazy every single time. Sampling introduces randomness, allowing dog or even moon to be chosen, leading to more varied and often more creative outputs.
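The difference between the two strategies can be sketched with the toy distribution above. This is a minimal illustration in plain Python, not what transformers does internally: the probabilities are the made-up numbers from the list, the "thousands more" tokens are left out, and `greedy_pick` and `sample_pick` are hypothetical helper names.

```python
import random

# Toy next-token probabilities from the example above. The remaining
# vocabulary is omitted; random.choices treats the values as relative
# weights, so they do not need to sum to 1.
next_token_probs = {"lazy": 0.30, "dog": 0.25, "moon": 0.10, "fence": 0.05}

def greedy_pick(probs):
    """Return the single most probable token (always the same answer)."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable

greedy_choices = {greedy_pick(next_token_probs) for _ in range(5)}
sampled_choices = [sample_pick(next_token_probs, rng) for _ in range(1000)]

print("greedy picks: ", greedy_choices)
print("sampled counts:", {t: sampled_choices.count(t) for t in next_token_probs})
```

Greedy selection returns lazy on every call, while the sampled counts roughly track the relative weights, so dog and moon show up as well.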

The GenerationConfig object is a convenient way to bundle these settings. You can also pass these parameters directly to the generate method.

  • max_new_tokens: This is a crucial distinction from max_length. max_length sets the total length of the output sequence (prompt + generation), while max_new_tokens sets the number of tokens to generate after the prompt. This is often more intuitive.
  • do_sample: If False (the default), the model uses greedy decoding or beam search. If True, it samples from the probability distribution.
  • temperature: The logits are divided by this value before the softmax. A temperature above 1.0 makes the probability distribution flatter, increasing randomness and the chance of picking less likely tokens. A temperature below 1.0 (e.g., 0.2) makes the distribution sharper, favoring more probable tokens and leading to more focused output. A temperature of exactly 1.0 leaves the distribution unchanged.
  • top_k: Limits sampling to the k most likely next tokens. If k=50, the model will only consider the 50 tokens with the highest probabilities for sampling. This prevents very low-probability, nonsensical tokens from being chosen.
  • top_p (nucleus sampling): Instead of a fixed number of tokens, top_p keeps the smallest set of most likely tokens whose cumulative probability reaches p. For example, if p=0.9, it will consider the smallest set of tokens whose probabilities sum to at least 0.9. This is often preferred over top_k because the number of tokens considered adapts dynamically to the shape of the probability distribution. If one token is overwhelmingly likely, top_p might select only that one; if many tokens are nearly equally likely, it will select more.
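  • repetition_penalty: A value greater than 1.0 lowers the logits of tokens that have already appeared in the sequence, which for a decoder-only model such as gpt2 includes the prompt. In the transformers implementation a positive logit is divided by the penalty and a negative logit is multiplied by it, so the token always becomes less likely. Because the penalty acts on logits rather than probabilities, a value of 1.2 does not mean a flat 20% drop in probability; the actual reduction depends on the rest of the distribution.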
  • num_beams: If num_beams > 1 and do_sample=False, the model uses beam search. Beam search keeps track of the num_beams most likely sequences at each step, exploring multiple paths rather than just one (greedy). It often produces more coherent text than greedy decoding but is slower. If do_sample=True and num_beams > 1, num_beams is not ignored: transformers runs beam-search multinomial sampling, which samples candidates for each beam instead of always taking the most probable ones.
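The temperature, top_k, and top_p descriptions above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the library's implementation: it works on a four-token vocabulary with made-up logits, filters probabilities rather than masking logits, and the function names `softmax`, `top_k_filter`, and `top_p_filter` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero every index not in `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

logits = [2.0, 1.5, 0.5, -1.0]  # made-up scores for four candidate tokens

base = softmax(logits)                    # temperature 1.0: unchanged
sharp = softmax(logits, temperature=0.2)  # low temperature: peaked
flat = softmax(logits, temperature=2.0)   # high temperature: flatter

print("top probability (sharp, base, flat):", max(sharp), max(base), max(flat))
print("top_k=2  :", top_k_filter(base, k=2))
print("top_p=0.9:", top_p_filter(base, p=0.9))
```

With these logits, top_k=2 always keeps exactly two tokens, while top_p=0.9 keeps three because the two most likely tokens together fall short of 0.9. That difference is the adaptive behavior described for top_p.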

You can combine these parameters. For instance, do_sample=True, temperature=0.7, top_k=50, and top_p=0.9 is a very common and effective combination for generating creative yet coherent text.
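The repetition_penalty entry in the list above is the one parameter that changes scores based on what has already been generated, so a separate sketch may help. The snippet assumes the divide-if-positive, multiply-if-negative rule and uses made-up logits indexed by token id; `apply_repetition_penalty` is a hypothetical helper, not a transformers function.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of `logits` with already-seen tokens made less likely.

    Positive scores are divided by the penalty and non-positive scores are
    multiplied by it, so a penalty above 1.0 always pushes the score down.
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [3.0, 1.0, -2.0, 0.5]  # made-up scores, indexed by token id
seen = [0, 2, 2]                # token ids that already appear in the sequence

penalized = apply_repetition_penalty(logits, seen, penalty=1.2)
print(penalized)
```

Tokens 0 and 2 end up with lower scores, while tokens 1 and 3 are untouched. Token 2 appears twice in `seen` but is penalized only once, since this sketch does not scale the penalty with the number of occurrences.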

The next conceptual hurdle is understanding how these sampling strategies interact with beam search, and when to use one over the other.
