Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are not just different flavors of fine-tuning; they represent a fundamental shift in how we imbue large language models with human-aligned values, moving from describing desired behavior to learning it through direct comparison.

Let’s see this in action. Imagine we have a base LLM that’s good at writing, but we want it to be more helpful and less likely to generate toxic content. We’ve collected some preference data: for a given prompt, we have a "chosen" response (what we want) and a "rejected" response (what we don’t want).

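Here’s how you’d set up a DPO training run using Hugging Face’s trl library. We’ll start with an SFTTrainer for initial Supervised Fine-Tuning (SFT) and then move on to DPOTrainer. Argument names differ between trl releases, so treat the calls below as matching one version of the API and check them against the version you have installed.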

First, the essentials:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, DPOTrainer
import torch

# Load your preference dataset
# This dataset should have columns like 'prompt', 'chosen', 'rejected'
dataset = load_dataset("your_preference_dataset", split="train")

# Load your base model and tokenizer
model_name = "gpt2" # Or any other causal LM
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Common practice for GPT-like models

# Define SFT training arguments (optional but recommended for a good starting point)
sft_args = TrainingArguments(
    output_dir="./sft_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=50,
    push_to_hub=False,
)

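# Build a 'text' column for SFT by joining each prompt with its chosen response.
# (Assumption: the preference dataset has no ready-made 'text' field.)
sft_dataset = dataset.map(lambda ex: {"text": ex["prompt"] + "\n" + ex["chosen"]})

# Initialize SFT trainer
sft_trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=sft_args,
    train_dataset=sft_dataset,
    dataset_text_field="text", # The column created above
    max_seq_length=512,
)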

# Perform SFT
print("Starting Supervised Fine-Tuning...")
sft_trainer.train()
print("SFT complete. Saving model...")
sft_trainer.save_model("./sft_model")

# Now, prepare for DPO
# For DPO, we need a dataset with 'prompt', 'chosen', and 'rejected' columns.
# If your dataset is already in this format, you can use it directly.
# Otherwise, you might need to preprocess it.
# Let's assume 'dataset' has these columns.

# Define DPO training arguments
dpo_args = TrainingArguments(
    output_dir="./dpo_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-5, # Typically lower for DPO
    num_train_epochs=3,
    logging_steps=10,
    save_steps=50,
    push_to_hub=False,
    remove_unused_columns=False, # Important for DPO
)

# Initialize DPOTrainer
# Note: We load the SFT-tuned model for DPO.
model = AutoModelForCausalLM.from_pretrained("./sft_model", torch_dtype=torch.bfloat16)

dpo_trainer = DPOTrainer(
    model,
    ref_model=None, # If None, the trainer creates a frozen copy of model to use as the reference
    args=dpo_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=512,
    # The dataset must expose 'prompt', 'chosen', and 'rejected' columns.
    # If yours are named differently, rename them before building the trainer.
    # Example if your dataset has 'instruction', 'response_positive', 'response_negative':
    # dataset = dataset.rename_columns({"instruction": "prompt", "response_positive": "chosen", "response_negative": "rejected"})
)

# Perform DPO
print("Starting Direct Preference Optimization...")
dpo_trainer.train()
print("DPO complete. Saving final model...")
dpo_trainer.save_model("./dpo_final_model")

The core idea behind RLHF and DPO is to align the LLM’s output with human preferences. Instead of just predicting the next token based on a large corpus, the model learns to predict responses that humans prefer.

RLHF typically involves three stages:

  1. Supervised Fine-Tuning (SFT): Train a base LLM on a dataset of high-quality prompt-response pairs. This gives the model a good starting point for generating coherent text.
  2. Reward Model (RM) Training: Train a separate model, often initialized from the SFT model with a scalar output head, to predict a scalar reward for a given prompt-response pair. This RM is trained on human preference data (pairwise judgments of which response is better).
  3. Reinforcement Learning (RL) Fine-Tuning: Use the RM as a reward function to fine-tune the SFT model using RL algorithms like PPO (Proximal Policy Optimization). The LLM is treated as an agent that generates responses, and it’s rewarded for generating responses that the RM scores highly, usually with a KL penalty that keeps it close to the SFT model.
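
The text does not say which loss the reward model in stage 2 is trained with. A common choice, assumed here, is a Bradley-Terry style pairwise loss: the RM scores both responses and is penalized when the rejected one scores higher. The function name and the plain-float inputs are illustrative only; a real implementation would work on batched tensors.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the RM ranks the chosen response further above the rejected one.
print(pairwise_reward_loss(2.0, 0.0) < pairwise_reward_loss(0.0, 2.0))
```

When both responses get the same score the loss is log 2, the value for a model that cannot tell the pair apart.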

DPO streamlines this. It skips explicit reward model training and the RL loop, and optimizes the LLM directly on the preference data by recasting the RL objective as a classification loss over preference pairs. For each triple (prompt, chosen_response, rejected_response), DPO measures how far the current model has moved the log-probability of each response relative to a frozen reference model (usually the initial SFT model). The loss is a logistic (binary cross-entropy) loss on the difference between the chosen and rejected log-ratios, scaled by a coefficient beta. Minimizing it pushes the model to favor chosen_response over rejected_response more strongly than the reference does, while beta controls how far the policy is allowed to drift from the reference. The trl library’s DPOTrainer implements this loss.
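
The per-pair loss can be sketched in a few lines of plain Python. This is a minimal illustration of the published DPO objective, not trl’s implementation: the inputs are assumed to be summed log-probabilities of each full response, and beta=0.1 is only an illustrative value for the scaling coefficient.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # How much the policy has shifted each response relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Logistic loss on the scaled difference of the two log-ratios.
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# A policy identical to the reference sits at log(2); favoring 'chosen' lowers the loss.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Only differences from the reference enter the loss, so a response the reference already finds likely earns no credit unless the policy raises its probability further.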

DPOTrainer takes your dataset, which must contain at least the prompt, chosen, and rejected keys. The max_length parameter caps the tokenized length of each prompt-plus-response sequence. The ref_model is crucial: it is a frozen copy of the model as it stood before DPO started, and when ref_model=None the trainer creates that copy itself. The DPO loss compares the log-probabilities the current model assigns to the chosen and rejected responses against those assigned by the ref_model. This comparison acts as an implicit reward signal, guiding the model toward preferred outputs without a separate reward model.
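
Before training, it can help to check that each record really has the three keys in usable form. The helper below is a hypothetical pre-flight check, not part of trl, and it assumes plain-string fields; conversational datasets that store lists of messages would need a different check.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def check_preference_record(record: dict) -> list:
    """Return a list of problems found in one preference record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems

example = {
    "prompt": "Explain what a reference model is in DPO.",
    "chosen": "It is a frozen copy of the model taken before DPO training starts.",
    "rejected": "No idea.",
}
print(check_preference_record(example))  # []
```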

Setting remove_unused_columns=False in TrainingArguments matters because the transformers Trainer, by default, drops every dataset column whose name does not match an argument of the model’s forward() method. The columns DPOTrainer relies on (prompt, chosen, rejected, and the tokenized fields derived from them) are not forward() arguments, so they would be discarded before the DPO loss could use them.
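
To see why those columns are at risk, here is a simplified imitation of signature-based column pruning. It is a sketch of the idea only, not the transformers code: forward is a hypothetical stand-in for a model’s forward() method, and prune_columns is an illustrative name.

```python
import inspect

def forward(input_ids, attention_mask=None, labels=None):
    """Hypothetical stand-in for a causal LM's forward() method."""
    return None

def prune_columns(column_names, forward_fn):
    """Keep only the columns whose names match a parameter of forward_fn."""
    accepted = set(inspect.signature(forward_fn).parameters)
    return [name for name in column_names if name in accepted]

columns = ["prompt", "chosen", "rejected", "input_ids", "attention_mask"]
print(prune_columns(columns, forward))  # ['input_ids', 'attention_mask']
```

The three preference columns disappear because nothing in the signature mentions them, which is the behavior remove_unused_columns=False switches off.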

The most surprising thing about DPO is how it turns a complex, multi-stage RL problem into a single classification-style objective. The loss raises the log-probability of the chosen response relative to the rejected one, with both measured against the reference model, and in doing so it implicitly optimizes a reward function without ever training one explicitly. With no reward model to fit and no RL rollouts to sample, training is simpler and typically more stable than PPO-based RLHF.
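
The implicit reward can be read straight off the log-probabilities. A minimal sketch, assuming the standard DPO definition in which a response’s reward is beta times its policy-to-reference log-ratio; the function names, the tuple layout, and the numbers are illustrative, not measurements.

```python
def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def reward_accuracy(pairs, beta: float = 0.1) -> float:
    """Fraction of pairs where the chosen response gets the higher implicit reward.

    Each pair is (policy_chosen, ref_chosen, policy_rejected, ref_rejected),
    all summed log-probabilities.
    """
    correct = sum(
        1
        for pc, rc, pr, rr in pairs
        if implicit_reward(pc, rc, beta) > implicit_reward(pr, rr, beta)
    )
    return correct / len(pairs)

# Made-up log-probabilities for two preference pairs.
pairs = [
    (-10.0, -12.0, -15.0, -14.0),  # chosen gained, rejected lost: ranked correctly
    (-9.0, -8.0, -7.0, -9.0),      # rejected gained more: ranked incorrectly
]
print(reward_accuracy(pairs))  # 0.5
```

Tracking this accuracy on held-out pairs is one inexpensive way to check that the policy is learning the preferences rather than merely drifting from the reference.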

The next concept you’ll likely encounter is evaluating the quality of your aligned model, which involves designing robust metrics and human evaluation protocols to assess not just helpfulness and harmlessness but also nuances like creativity, factual accuracy, and adherence to specific stylistic guidelines.
