Training a reward model from human preferences is a surprisingly effective way to align large language models with desired behaviors, even when those behaviors are hard to define algorithmically.
Let’s see this in action with a concrete example. Imagine we want a model that, given a prompt, generates a helpful and concise answer. We’ll use Hugging Face’s trl library, which simplifies this process.
First, we need some data. This data consists of prompts, and for each prompt, multiple responses generated by a base language model. Crucially, we also have human preferences indicating which response is "better."
from datasets import load_dataset
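# Load a sample dataset of human preference pairs
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1000]")  # Using a subset for demonstration
# Each record holds a preferred and a dispreferred version of the same dialogue.
# hh-rlhf has no separate "prompt" field: the prompt is embedded at the start
# of both strings. An illustrative record (not an actual row) looks like this:
# {
#   "chosen": "\n\nHuman: Write a short story about a brave knight.\n\nAssistant: Sir Reginald, a knight of renowned valor, faced the dragon...",
#   "rejected": "\n\nHuman: Write a short story about a brave knight.\n\nAssistant: Once upon a time, there was a knight named Bob. He was not very brave..."
# }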
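If a dataset stores the whole dialogue inside each of the chosen and rejected strings, as Anthropic/hh-rlhf does, it can help to recover the shared prompt when inspecting a pair. Here’s a minimal sketch of that, assuming the hh-rlhf convention of marking turns with "\n\nHuman:" and "\n\nAssistant:". The helper split_prompt_and_response and the example record are hypothetical, made up for illustration.

```python
def split_prompt_and_response(dialogue: str, marker: str = "\n\nAssistant:") -> tuple:
    """Split a dialogue string at its final assistant turn.

    Returns (prompt, response), where the prompt keeps the trailing marker.
    """
    idx = dialogue.rfind(marker)
    if idx == -1:
        raise ValueError("no assistant turn found in dialogue")
    cut = idx + len(marker)
    return dialogue[:cut], dialogue[cut:].strip()


# Hypothetical record in the chosen/rejected layout (not real dataset content)
example = {
    "chosen": "\n\nHuman: Name a primary color.\n\nAssistant: Red is a primary color.",
    "rejected": "\n\nHuman: Name a primary color.\n\nAssistant: I don't know.",
}

prompt, chosen_response = split_prompt_and_response(example["chosen"])
rejected_prompt, rejected_response = split_prompt_and_response(example["rejected"])

print(repr(prompt))
print(repr(chosen_response))
print(repr(rejected_response))
```

Both strings should yield the same prompt; if they don’t, the pair is probably malformed and worth a closer look before training.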
The core idea is to train a model (the reward model) to predict which response a human would prefer. It learns this by assigning each response a scalar score and pushing the score of the "chosen" response above the score of the "rejected" one, which means widening the gap between the two scores, not shrinking it.
Here’s how we set up the training:
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load a pre-trained model and tokenizer to serve as our reward model
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)
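# Add a padding token if the tokenizer doesn't define one
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
# Keep the model config in sync with the tokenizer's padding token
model.config.pad_token_id = tokenizer.pad_token_id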
# Define the training arguments
training_args = RewardConfig(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    learning_rate=1e-4,
    evaluation_strategy="no",  # No evaluation for this simple example (newer transformers releases call this eval_strategy)
    logging_steps=10,
    output_dir="./reward_model_output",
    report_to="none",  # Disable reporting for simplicity
)
# Initialize the RewardTrainer
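trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
    args=training_args,
    train_dataset=dataset,  # some TRL versions expect the pairs to be tokenized beforehand
)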
# Start training!
trainer.train()
The RewardTrainer takes care of the loss calculation. It runs the chosen and the rejected sequence (each one a prompt plus a response) through the reward model in separate forward passes, gets a scalar score for each, and optimizes the model so that the chosen score ends up higher than the rejected one. The loss is typically a pairwise logistic loss, -log σ(r_chosen - r_rejected), where σ is the sigmoid function; minimizing it maximizes the modeled probability that the chosen response is preferred.
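To make the loss concrete, here’s a small sketch of a pairwise logistic loss on two scalar scores, using only the standard library. It assumes the common -log(sigmoid(chosen - rejected)) form; the exact loss a given TRL version uses may add extra terms. The function names are hypothetical.

```python
import math


def pairwise_logistic_loss(chosen_score: float, rejected_score: float) -> float:
    """Return -log(sigmoid(chosen_score - rejected_score)).

    Written as softplus(-margin) and split by sign so exp() never overflows.
    """
    margin = chosen_score - rejected_score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(chosen_score: float, rejected_score: float) -> float:
    """Modeled probability that the chosen response is preferred."""
    return math.exp(-pairwise_logistic_loss(chosen_score, rejected_score))


loss_tied = pairwise_logistic_loss(0.0, 0.0)   # scores equal: model is undecided
loss_right = pairwise_logistic_loss(2.0, 0.0)  # chosen scored higher: small loss
loss_wrong = pairwise_logistic_loss(0.0, 2.0)  # rejected scored higher: large loss

print(loss_right, loss_tied, loss_wrong)
print(preference_probability(2.0, 0.0))
```

Notice that the loss only depends on the difference between the two scores, so the reward model is free to shift all of its scores by a constant without changing anything.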
The reward model itself is usually a standard transformer model (like distilbert-base-uncased in this example) with a linear layer on top that outputs a single scalar value. This scalar represents the "goodness" or "preference score" of a given text. During training, each prompt-plus-response sequence passes through the model and gets one score, and the trainer compares the scores of the chosen and rejected responses for the same prompt.
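One simple way to check whether those comparisons are coming out right is pairwise accuracy: the fraction of pairs where the chosen response gets the higher score. Here’s a minimal sketch, assuming you already have the scores as plain floats. The function name and the numbers are made up for illustration, and counting ties as misses is a design choice, not a rule.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    Ties count as incorrect, since the model failed to prefer the chosen one.
    """
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Hypothetical scores: (chosen, rejected) for four prompts
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
accuracy = pairwise_accuracy(pairs)
print(accuracy)
```

An accuracy near 0.5 means the model is doing no better than a coin flip on these pairs.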
The mental model here is that you’re teaching a sophisticated classifier to understand nuanced human judgment. Instead of explicit rules, you’re providing examples of what good looks like. The model learns to generalize this concept of "goodness" across new prompts and responses.
A key detail often overlooked is how the prompt is handled. The reward model never sees a response on its own: the prompt and the response are joined into a single input sequence, so the score reflects the response in the context of that prompt. How the two are joined depends on the dataset format and the tokenizer (in hh-rlhf, for instance, the dialogue is already part of both the chosen and rejected strings), so inspect the tokenized inputs rather than assuming a particular separator token such as </s>. This contextual awareness is vital for generating relevant and helpful outputs.
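Here’s a small sketch of what building those inputs can look like for one preference pair. The helper names and the plain-space separator are assumptions for illustration; a real pipeline would follow whatever convention its tokenizer or chat template defines.

```python
def build_reward_input(prompt: str, response: str, separator: str = " ") -> str:
    """Join a prompt and a response into one string for the reward model."""
    return f"{prompt.rstrip()}{separator}{response.lstrip()}"


def build_pair_inputs(prompt: str, chosen: str, rejected: str) -> dict:
    """Build both inputs for one pair; they share the same prompt prefix."""
    return {
        "chosen_text": build_reward_input(prompt, chosen),
        "rejected_text": build_reward_input(prompt, rejected),
    }


pair = build_pair_inputs(
    prompt="Question: What is the capital of France?",
    chosen="Paris.",
    rejected="France is a country in Europe.",
)
print(pair["chosen_text"])
print(pair["rejected_text"])
```

Because both inputs start with the same prompt, any difference in score has to come from the response.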
After training, this reward model can be used in reinforcement learning loops, like Proximal Policy Optimization (PPO), to fine-tune a language model. The language model generates responses, the reward model scores them, and the PPO algorithm updates the language model to produce responses that consistently get high scores from the reward model.
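To get a feel for that generate, score, update loop, here’s a deliberately tiny sketch. It is not PPO: it uses a plain policy-gradient update (no clipping, no value function) on a toy "policy" that just chooses among three fixed responses. The toy_reward function is a hypothetical stand-in for a trained reward model and simply prefers shorter answers.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(response: str) -> float:
    """Hypothetical reward model stand-in: fewer words means higher reward."""
    return -float(len(response.split()))


candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "Well, that is an interesting question, and the answer is probably Paris.",
]
rewards = [toy_reward(c) for c in candidates]

logits = [0.0, 0.0, 0.0]  # the toy policy starts out uniform
initial_probs = softmax(logits)
learning_rate = 0.1

for _ in range(50):
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # Exact gradient of the expected reward w.r.t. logit i: p_i * (r_i - E[r])
    logits = [
        logit + learning_rate * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]

final_probs = softmax(logits)
print(initial_probs)
print(final_probs)
```

The probability mass shifts toward the response the reward function likes best, which is the same pressure a real RL loop puts on a language model, just at a vastly larger scale.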
The next step you’ll likely encounter is integrating this trained reward model into a reinforcement learning pipeline to fine-tune your actual language generation model.