Reinforcement Learning from Human Feedback (RLHF) is notoriously complex, but a newer technique called Direct Preference Optimization (DPO) can achieve comparable results with far simpler mechanics.

Let’s see DPO in action. Imagine we have a dataset of prompts, and for each prompt a pair of responses in which one has been marked as preferred over the other. This preference data is the core of DPO.
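As a rough illustration, a single preference record could be represented like this. The field names (prompt, chosen, rejected) and the example text are assumptions made up for this sketch, not a required schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower


# A one-record toy dataset; real datasets hold many such pairs.
preference_data = [
    PreferencePair(
        prompt="Explain what a hash table is.",
        chosen="A hash table stores key-value pairs and uses a hash function to locate each key.",
        rejected="It is a table.",
    ),
]
```

Note that there is no numeric score anywhere in the record: the only supervision is which of the two responses was preferred.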

Here’s a simplified view of the DPO process. We start with a pre-trained LLM (let’s call it policy_model). We also have a reference model (reference_model), which is typically a frozen copy of policy_model taken before DPO training begins. The goal is to adjust policy_model so it’s more likely to generate responses that are preferred in our dataset.

The magic of DPO lies in its loss function. Instead of needing a separate reward model trained on human preferences (the complex part of RLHF), DPO directly optimizes the policy model using the preference pairs. The loss function encourages the policy_model to assign a higher probability to the preferred response and a lower probability to the dispreferred response, relative to the reference_model.

The core idea is that instead of learning a separate reward function, we can directly learn a policy that implicitly optimizes for human preferences. The reference_model acts as a control, preventing the policy_model from deviating too drastically from the original language modeling capabilities. This is crucial; we don’t want to lose the model’s general language understanding while aligning it.

Here’s a snippet of what the DPO loss might look like conceptually:

import torch.nn.functional as F

def dpo_loss(policy_model, reference_model, chosen_response, rejected_response, prompt, beta=0.1):
    # log_probability is a placeholder assumed to return a tensor: log p(response | prompt)
    log_prob_chosen = policy_model.log_probability(prompt, chosen_response)
    log_prob_rejected = policy_model.log_probability(prompt, rejected_response)
    # Same quantities under the frozen reference model
    ref_log_prob_chosen = reference_model.log_probability(prompt, chosen_response)
    ref_log_prob_rejected = reference_model.log_probability(prompt, rejected_response)

    # How much more (or less) the policy favors each response than the reference does
    chosen_logratio = log_prob_chosen - ref_log_prob_chosen
    rejected_logratio = log_prob_rejected - ref_log_prob_rejected
    # DPO loss: -log sigmoid(beta * margin); beta=0.1 is only an illustrative default
    loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
    return loss

The log-sigmoid of the scaled margin is the heart of the loss. The margin compares how much more the policy favors the chosen response than the reference does against the same quantity for the rejected response, and F.logsigmoid computes log(sigmoid(x)) in a numerically stable way. The beta parameter controls the strength of the implicit KL divergence penalty, balancing alignment with preserving the original model’s capabilities.
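The log_probability calls above are placeholders. A minimal sketch of what such a helper might compute, assuming we already have the model's per-step logits for the response tokens: the sequence log-probability is the sum of the log-softmax values at the tokens that were actually generated. The function names and the toy numbers here are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one step's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_prob(step_logits, response_token_ids):
    """Sum of log p(token_t | prompt, earlier tokens) over the response tokens."""
    total = 0.0
    for logits, token_id in zip(step_logits, response_token_ids):
        total += log_softmax(logits)[token_id]
    return total


# Toy setup: vocabulary of 3 tokens, response of 2 tokens.
step_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
seq_lp = sequence_log_prob(step_logits, [0, 2])
print(seq_lp)
```

Because log-probabilities are summed over tokens, longer responses tend to have more negative totals, which is one reason the loss compares each response against the reference model rather than looking at raw likelihoods.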

The training loop then uses this loss to update the policy_model via standard gradient descent. This bypasses the need for sampling from the policy, estimating rewards, and then performing reinforcement learning updates, which are the hallmarks of RLHF and introduce instability and hyperparameter tuning challenges.
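To make the "plain gradient descent" point concrete, here is a toy sketch, not a realistic training setup. It assumes a single prompt with a fixed set of three candidate responses and a softmax policy parameterized directly by one logit per response; the gradient is written out by hand so the example needs no autograd library. All names and hyperparameter values are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def train_toy_dpo(ref_logits, pairs, beta=0.1, lr=0.5, steps=200):
    """Gradient descent on the DPO loss for a softmax policy over a fixed
    candidate set. `pairs` holds (chosen, rejected) indices into that set."""
    theta = list(ref_logits)  # the policy starts as a copy of the reference
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for chosen, rejected in pairs:
            # For a softmax policy the normalizer cancels in the difference, so
            # log pi(chosen) - log pi(rejected) = theta[chosen] - theta[rejected].
            policy_gap = theta[chosen] - theta[rejected]
            ref_gap = ref_logits[chosen] - ref_logits[rejected]
            margin = beta * (policy_gap - ref_gap)
            # d/d(margin) of -log(sigmoid(margin)) is -sigmoid(-margin)
            weight = beta * sigmoid(-margin)
            grad[chosen] -= weight
            grad[rejected] += weight
        theta = [t - lr * g / len(pairs) for t, g in zip(theta, grad)]
    return theta


ref_logits = [0.0, 0.0, 0.0]   # reference is uniform over the 3 candidates
pairs = [(0, 1), (0, 2)]       # candidate 0 is preferred over both others
policy_logits = train_toy_dpo(ref_logits, pairs)
print(softmax(ref_logits))
print(softmax(policy_logits))
```

Nothing in the loop samples from the policy or queries a reward model: each step only evaluates the policy and the reference on the fixed preference pairs and moves the parameters along the loss gradient.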

The problem DPO solves is the complexity and cost of the RLHF pipeline itself. Classic RLHF typically starts from the same kind of pairwise "preferred vs. dispreferred" labels, but it first fits a separate reward model to them and then runs a reinforcement learning algorithm such as PPO against that reward model. DPO trains on the preference pairs directly, so there is no reward model to train and no RL loop to tune.

The exact levers you control are the preference dataset itself and the beta hyperparameter in the DPO loss. A higher beta means you’re more constrained by the reference model, leading to less aggressive alignment but better preservation of general capabilities. A lower beta allows for more aggressive alignment, potentially leading to more "aligned" but less coherent responses if pushed too far.
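One way to get a feel for beta is a small calculation. Suppose, as an assumption of this sketch, that annotators prefer the chosen response with probability p. For a single pair, the expected DPO loss is then minimized where sigmoid(beta * gap) = p, with gap being the chosen-minus-rejected difference in policy-versus-reference log-ratios. Solving gives gap = logit(p) / beta, so a larger beta means the loss is satisfied with a smaller move away from the reference.

```python
import math


def optimal_log_ratio_gap(p_chosen_preferred, beta):
    """Gap that minimizes the expected DPO loss for one pair:
    sigmoid(beta * gap) = p  =>  gap = logit(p) / beta."""
    logit = math.log(p_chosen_preferred / (1.0 - p_chosen_preferred))
    return logit / beta


# Same preference strength (p = 0.8), three illustrative beta values.
for beta in (0.05, 0.1, 0.5):
    gap = optimal_log_ratio_gap(0.8, beta)
    print(f"beta={beta}: loss-minimizing gap = {gap:.2f} nats")
```

This is a single-pair idealization; in practice the effect of beta also depends on the dataset, the learning rate, and how long training runs.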

What most people don’t realize is that DPO is not just a shortcut. Under certain assumptions (a Bradley-Terry model of preferences and a KL-constrained reward maximization objective), its loss is derived from the same objective that RLHF optimizes. By optimizing the policy directly, it sidesteps the separate reward-model fitting step and the errors that an imperfect reward model can introduce.
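For readers who want the steps, the relationship can be sketched as follows. The notation is an assumption of this sketch: x is a prompt, y_w and y_l are the preferred and dispreferred responses, r is a reward function, Z(x) is a normalizing constant, and sigma is the logistic function.

```latex
\begin{align}
% 1. KL-constrained reward maximization (the RLHF objective)
\max_{\pi}\;& \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
% 2. Its optimal policy
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( r(x, y) / \beta \big) \\
% 3. Solve for the reward in terms of the policy
r(x, y) &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
% 4. Bradley-Terry preference model; Z(x) cancels in the difference
p(y_w \succ y_l \mid x) &= \sigma\big( r(x, y_w) - r(x, y_l) \big) \\
% 5. Negative log-likelihood of the preferences gives the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) &= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
\end{align}
```

The key step is the fourth one: because the Bradley-Terry model only depends on the difference of two rewards for the same prompt, the intractable term log Z(x) drops out, leaving a loss that involves only policy and reference log-probabilities.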

The next hurdle you’ll face is understanding how to effectively sample from your policy model during inference to leverage the learned preferences.
