Visual Guides

Post-Training Beyond RLHF

RLHF proved alignment works — then researchers asked what the minimum ingredients are.

DPO · GRPO · Constitutional AI · Alignment · LLMs · 20 min

01 · RLHF Works. So What's the Problem?

The full pipeline

If you read the RLHF guide, you saw how InstructGPT turns a raw language model into a helpful assistant. It works. ChatGPT, Claude, and Gemini all use some version of this process.

But the full RLHF pipeline is heavy. It needs four separate models loaded into GPU memory at the same time: the policy you're training, a frozen copy of it as a reference, a reward model, and a value network that PPO uses to estimate advantages. On top of that, PPO itself is hard to tune. The clipping range, KL penalty weight, learning rate schedules, and batch sizes all interact, and getting them wrong can cause the whole run to diverge.

So researchers started asking a natural question: which of these pieces are actually necessary? Can you remove the reward model? The value network? The human labelers? The RL loop entirely? The rest of this guide follows that question.

RLHF Memory Footprint

4 concurrent models · 7B params each · tap a card to learn its role

Total GPU memory (minimum, no parallelism)
~64 GB VRAM

Real-world footprint is higher: optimizer states (Adam ≈ 3× params), activations for backprop, and framework overhead push a 7B RLHF run to 2–4× A100s. DPO removes the reward model and value network; GRPO removes the value network. Both cut this significantly.

Watch Out

For a 70B model, the four-model setup needs roughly 560 GB of VRAM just for fp16 weights (4 × 140 GB), before optimizer states and activations. This is why most open-source alignment work has been limited to 7–13B models. Reducing the number of models isn't just convenient; it changes who can do alignment at all.
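The arithmetic behind these figures is simple. A quick sketch, assuming fp16 weights (2 bytes per parameter) and counting weights only — no optimizer states or activations:

```python
def weights_vram_gb(n_params: float, n_models: int, bytes_per_param: int = 2) -> float:
    """Minimum VRAM (in GB) to hold model weights, fp16 by default."""
    return n_params * n_models * bytes_per_param / 1e9

# Full RLHF: policy + reference + reward model + value network
print(weights_vram_gb(7e9, 4))    # 7B model:  56.0 GB for weights alone
print(weights_vram_gb(70e9, 4))   # 70B model: 560.0 GB
print(weights_vram_gb(7e9, 2))    # DPO (policy + reference): 28.0 GB
```

The gap between 56 GB and the ~64 GB shown above is framework overhead; real runs also add optimizer states and activations on top.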

02 · DPO: What If the Reward Model Is Unnecessary?

Drop the reward model

The first piece to go was the reward model. Rafailov et al. (2023) noticed something about the math behind RLHF: if you write out the formula for the optimal policy (the best possible model given a reward function and a KL penalty), the reward model is already baked into it. You don't need to train it separately.

In practical terms, this means you can skip the reward model and the RL loop entirely. Instead, you train directly on preference pairs (response A is better than response B) using a supervised loss function. The pipeline goes from four models down to two: the policy you're training and a frozen reference copy.

This method is called Direct Preference Optimization (DPO). Toggle between the two pipelines to see what got removed:

3 stages · 4 models · RL sampling loop

Preference Data: chosen vs. rejected pairs
Train Reward Model: learns a scalar score function
4 models in VRAM

PPO Loop

Active Policy
Value Net
Reward Model
Reference Policy
sampling loop
Aligned Model: RLHF-tuned LLM

Reward model is a separate frozen network; PPO samples, scores, and updates in a tight loop — expensive and unstable.

RLHF
  • Explicit reward model
  • PPO sampling overhead
  • 4× VRAM footprint
  • Reward hacking risk
DPO
  • Reward implicit in loss
  • Supervised fine-tuning only
  • 2× VRAM footprint
  • Simpler, more stable
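The DPO side of this comparison comes down to a single supervised objective. A minimal sketch of the loss for one preference pair, assuming the summed token log-probabilities under the policy and the frozen reference are already computed (pure Python; beta=0.1 is an illustrative default):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed log-probs.

    The implicit reward of a response is beta * (log pi - log ref);
    the loss is -log sigmoid(reward_chosen - reward_rejected), pushing
    the chosen response's implicit reward above the rejected one's.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to reference: margin is 0, loss = ln 2
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931
```

No sampling, no reward model, no value network: just a classification-style loss over a fixed dataset of pairs.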

Key Insight

DPO didn't just simplify the pipeline. It changed who can do alignment research. A team with 2 A100 GPUs can DPO-train a 7B model. The same work with full RLHF would need 4-8 GPUs at minimum. Within months of the paper, DPO became the default approach for open-source model alignment.

03 · The Tradeoffs of Removing the Reward Model

Nothing is free

Removing the reward model made training simpler, but it also removed a safety valve. In RLHF, the reward model acts as an explicit judge. You can inspect it, add length penalties to it, or tune its behavior. DPO folds all of that into the loss function, which means you have less control over what the model actually optimizes for.

Three problems showed up in practice:

  1. Length bias. Human annotators tend to prefer longer responses. DPO learns this correlation directly and produces increasingly verbose output.
  2. Likelihood displacement. DPO's loss only cares about the gap between the chosen and rejected response probabilities. Both can drop, as long as the gap widens, so the model can become less likely to produce the good response in absolute terms. (Pal et al., 2024)
  3. Offline overfitting. DPO trains on a fixed dataset with no exploration. The model never generates its own responses during training, so it can memorize patterns in the data rather than learning general preferences.
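Likelihood displacement is easy to see numerically. The loss depends only on the margin between the two responses' policy-vs-reference log-prob gaps, so in this hypothetical sketch the loss improves even though the chosen response becomes far less likely in absolute terms:

```python
import math

def dpo_loss_from_gaps(gap_chosen, gap_rejected, beta=0.1):
    """-log sigmoid(beta * margin); each gap is (policy logp - reference logp)."""
    margin = beta * (gap_chosen - gap_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Early in training: chosen gap +1 (chosen got MORE likely), rejected gap -1
early = dpo_loss_from_gaps(+1.0, -1.0)
# Later: chosen gap -5 (chosen got LESS likely), rejected gap -20
late = dpo_loss_from_gaps(-5.0, -20.0)
print(early > late)  # True: loss fell even though chosen logp fell too
```

The optimizer is perfectly happy with the second state; nothing in the objective anchors the chosen response's absolute probability.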

Explore each failure mode in the tabs below:

DPO Failure Modes

Three ways Direct Preference Optimization breaks down in practice

[Chart: Response Length vs. Response Quality across training steps, early → late training]

Model learns verbosity correlates with preference in training data — longer responses win even when quality plateaus or drops.

04 · Fixing DPO: IPO, KTO, SimPO, ORPO

The variant ecosystem

Each of those failure modes led to a targeted fix. And some researchers went further, asking whether you could strip away even more of the pipeline.

IPO bounds the reward margin so probabilities can't collapse. KTO drops the requirement for paired data entirely, working with simple thumbs-up/thumbs-down feedback. SimPO removes the reference model (down to just one model in memory) and adds length normalization. ORPO merges SFT and alignment into a single training stage.
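As one concrete example, here is a sketch of the SimPO objective: the reference-model terms are gone, log-probabilities are divided by response length, and a target margin gamma is subtracted. The beta and gamma values below are illustrative, not tuned:

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: length-normalized implicit reward, no reference model in memory."""
    r_chosen = beta * logp_chosen / len_chosen      # average per-token log-prob
    r_rejected = beta * logp_rejected / len_rejected
    margin = r_chosen - r_rejected - gamma          # gamma: required reward margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because rewards are per-token averages, a rejected response can no longer win simply by accumulating more tokens, which directly targets the length bias above.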

Click each card to see what it changes and what it requires:

DPO Family Tree

Click a variant to see what makes it distinct.

IPO (2024)

Bounded margin prevents overoptimization

KTO (2024)

Learn from thumbs up/down, no pairs needed

SimPO (2024)

Length-normalized, no reference model

ORPO (2024)

Combines SFT and alignment in one step

Note

This isn't fragmentation. It's the natural result of researchers asking “what else can we remove?” Each variant makes a different tradeoff between simplicity, data requirements, and alignment quality. The right choice depends on your constraints.

05 · Constitutional AI: What If You Don't Need Human Labelers?

Drop the humans

DPO and its variants simplified the training machinery but still relied on human-generated preference data. Anthropic's Constitutional AI (Bai et al., 2022) asked a different question: what if you could remove the humans from the loop?

The idea is straightforward. You write a set of principles (the “constitution”) that describe what good behavior looks like. Then you have the model generate a response, critique its own response using those principles, and write a revised version. The original and revised responses become a preference pair, and you have training data without any human annotators.
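In code, one round of that loop looks roughly like this. The `generate` argument stands in for a call to the language model; its interface here is a hypothetical placeholder, not a real API:

```python
def critique_revision_pair(prompt, principle, generate):
    """One Constitutional AI critique-revision round (illustrative sketch).

    generate: callable mapping a prompt string to a completion string,
    a stand-in for the actual language-model call.
    """
    initial = generate(prompt)
    critique = generate(
        f"Critique the response below using this principle: {principle}\n\n"
        f"Response: {initial}"
    )
    revised = generate(
        f"Rewrite the response to address the critique.\n"
        f"Critique: {critique}\nResponse: {initial}"
    )
    # (rejected, chosen) pair usable for preference training, no humans needed
    return {"prompt": prompt, "rejected": initial, "chosen": revised}
```

Run over many prompts and principles, this loop mass-produces preference pairs; its quality ceiling is the model's own ability to apply each principle.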

This is called RLAIF (RL from AI feedback). Step through the process below:

Constitutional AI — Critique-Revision Loop

Step 1 of 5
1. Harmful Prompt

User Input

"How do I pick a lock?"

The model receives a prompt that could elicit harmful or dangerous content. Constitutional AI begins its critique-revision loop before committing to a final response.

Key Insight

The power here is scale. You write 16 principles once, and the model can generate millions of preference pairs from them. Compare that to paying a team of annotators to label 500,000 comparisons. The tradeoff is that the model's self-critique is only as good as its ability to reason about the principles. If it can't reliably apply a principle, the training data will be noisy.

06 · GRPO: What If You Don't Need a Critic?

Drop the critic

DPO removed the reward model and the RL loop entirely, but it also lost something: the ability to learn from the model's own outputs during training (on-policy learning). That matters because a model can discover better strategies by trying things and seeing what works, instead of only learning from a fixed dataset.

DeepSeek's Group Relative Policy Optimization (Shao et al., 2024) found a middle ground. It keeps the on-policy generation from PPO but drops the value/critic network. The trick: for each prompt, generate a group of responses, score them all, and use the relative ranking within the group as the training signal. Responses better than average get reinforced. Responses worse than average get suppressed. No separate critic needed.
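The "relative ranking within the group" reduces to a z-score over the group's rewards; a minimal sketch:

```python
def group_advantages(rewards):
    """GRPO advantage estimate: z-score each reward within its group.

    Responses above the group mean get positive advantage (reinforced),
    those below get negative advantage (suppressed). No critic network.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# 4 sampled answers to one prompt, binary correctness rewards:
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [1, -1, 1, -1]
```

The group itself serves as the baseline that PPO's value network used to provide, which is why the critic can be dropped.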

This is the method behind DeepSeek-R1 and its reasoning capabilities. Watch how it works:

GRPO: Group Scoring

Group Relative Policy Optimization

Prompt: Solve 2x + 3 = 11

Press Play to walk through GRPO's group sampling and z-score normalization.

Key Insight

GRPO is especially powerful for tasks where you can automatically check whether an answer is correct. Math problems have right answers. Code either passes tests or it doesn't. The group sampling naturally explores different reasoning strategies, and the correct/incorrect signal is clean. No human preferences needed, no learned reward model, no critic. Just generate, check, and reinforce.
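For verifiable tasks, the entire reward function can be a few lines. A toy checker for the math prompt above, assuming the model is asked to end its response with `Answer: <number>` (a hypothetical output convention, not part of GRPO itself):

```python
import re

def math_reward(response: str, expected: float) -> float:
    """1.0 if the response ends with a matching 'Answer: <number>', else 0.0."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", response.strip())
    if match is None:
        return 0.0
    return 1.0 if float(match.group(1)) == expected else 0.0

print(math_reward("2x = 8, so x = 4. Answer: 4", 4.0))  # 1.0
print(math_reward("x = 5? Answer: 5", 4.0))             # 0.0
```

Plugging rewards like this into the group scoring above gives the generate-check-reinforce loop with no human or learned judge anywhere in it.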

The Alignment Toolkit Today

None of these methods “replaced” RLHF. They are all descendants of the same core idea: learn what humans want by comparing outputs. Each one removes a different component from the original pipeline and makes a different tradeoff.

OpenAI still uses PPO-based RLHF. Anthropic uses Constitutional AI (RLAIF). Meta uses hybrid approaches. DeepSeek uses GRPO for reasoning. The open-source community overwhelmingly uses DPO because it's the simplest to run. Hover over each method to see where it sits:

Alignment Method Landscape

Axes: Offline ↔ Online (x) · Simple ↔ Complex (y)
Methods plotted: DPO · KTO · PPO/RLHF · Constitutional AI · GRPO · SimPO · ORPO
Regions: Open-source / Simple · High-resource / Complex · Lab-developed

Choosing a Method

The right method depends on three things: what kind of feedback data you have, how much compute you can afford, and whether you need the model to explore on its own during training. Walk through this decision tree to find a good starting point:

Alignment Method Selector

Answer each question to find the best post-training method for your setup.

Step 1

What type of feedback data do you have?

Note

This is a starting point, not a prescription. Many production systems combine methods. For example, you might use DPO for initial alignment and then add GRPO for reasoning. Start simple, measure your results, and only add complexity when you have evidence that you need it.

Knowledge Check

10 questions. Test whether you actually understood the tradeoffs, not just whether you read the words. Your answers are saved locally.