Reinforcement Learning from Human Feedback
RLHF Skills Guide 2026
RLHF and its successors (DPO, GRPO, RLAIF) are the core techniques for aligning LLMs with human preferences. This guide covers the full pipeline — from reward model training and PPO to the simpler DPO and Constitutional AI approaches — with practical guidance on when to use each.
Why Alignment Matters
Pre-trained language models learn to predict the next token from internet text — they are not trained to be helpful, honest, or harmless. A raw pre-trained model will happily generate detailed instructions for harmful activities, agree with false premises, or produce misleading content, because these patterns exist in web data.
Alignment techniques teach models to produce outputs that humans prefer — responses that are helpful, safe, factually accurate, and follow the user's actual intent. The canonical alignment pipeline for modern LLMs has converged around three phases: supervised fine-tuning (SFT) to establish instruction-following capability, preference learning (RLHF, DPO, or a variant) to reinforce human-preferred behaviour, and ongoing safety evaluations and red-teaming to identify gaps.
Understanding alignment is increasingly important for UK AI engineers, not just researchers. Engineers building applications on top of LLMs need to understand what alignment guarantees (and what it doesn't), and engineers at LLM companies contribute to alignment pipelines as core product development.
Alignment Methods Comparison
| Method | Description | Compute | Data Required | Quality |
|---|---|---|---|---|
| SFT only | Supervised fine-tuning on instruction-response pairs. Fast and simple; no preference data needed, but limited alignment quality. | Low | Instruction pairs | Baseline |
| DPO | Supervised loss on preference pairs; no reward model or RL. Most commonly used alignment technique today. | Low-Medium | Preference pairs | Excellent |
| RLHF (PPO) | Full three-stage pipeline: SFT → reward model → PPO optimisation. Most flexible but most complex. | Very High | Preference pairs + RM training | Excellent |
| RLHF (GRPO) | PPO variant without a value function. Used for reasoning-focused training (math, code). Less memory than PPO. | High | Process or outcome rewards | Excellent for reasoning |
| RLAIF / CAI | An AI model provides feedback instead of humans. Scales well; reduces human annotation cost. | High | AI-labelled preferences | Very good |
| ORPO | Combines SFT and preference learning in a single stage. No reference model needed. Efficient. | Low-Medium | Preference pairs | Good |
The RLHF Pipeline in Depth
Stage 1: Supervised Fine-Tuning (SFT)
SFT fine-tunes a pre-trained base model on a dataset of high-quality (instruction, response) pairs. The loss is standard cross-entropy on the response tokens (prompt tokens are masked). The SFT model learns to follow instructions and establishes the distribution that Stage 3 will optimise around. The quality and diversity of the SFT dataset matter enormously: 1,000 diverse, high-quality pairs often outperform 10,000 noisy pairs.
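As a concrete illustration, here is a minimal sketch of the masked SFT loss in PyTorch. The tensor names and the `prompt_lengths` bookkeeping are assumptions made for this example, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, prompt_lengths):
    """Cross-entropy on response tokens only; prompt tokens are masked out.

    logits:         (batch, seq_len, vocab) model outputs
    labels:         (batch, seq_len) token ids of prompt + response
    prompt_lengths: (batch,) number of prompt tokens in each example (assumed)
    """
    # Shift so each position predicts the next token.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:].clone()

    # Mark prompt positions with -100 so cross_entropy ignores them.
    positions = torch.arange(targets.size(1), device=targets.device)
    prompt_mask = positions[None, :] < (prompt_lengths[:, None] - 1)
    targets[prompt_mask] = -100

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```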
Stage 2: Reward Model Training
The reward model (RM) is typically a language model fine-tuned on preference data. Architecture: take the SFT model, replace the language model head with a scalar regression head (single output neuron). Train on (prompt, chosen, rejected) triplets using the Bradley-Terry preference model: the loss maximises log σ(r(chosen) − r(rejected)), where r is the reward model's scalar score. At convergence, the RM assigns higher scores to responses humans prefer. RM quality is critical — reward hacking occurs when the policy learns to exploit RM weaknesses, generating responses that score highly but are not actually preferred by humans.
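The Bradley-Terry loss itself is only a few lines. The sketch below assumes the reward model's regression head has already produced scalar scores for the chosen and rejected responses; the function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: maximise log sigmoid(r_chosen - r_rejected).

    r_chosen, r_rejected: (batch,) scalar scores from the reward model's
    regression head for the preferred and rejected responses.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage: scores for a batch of (prompt, chosen, rejected) triplets.
r_chosen = torch.tensor([1.8, 0.4, 2.1])
r_rejected = torch.tensor([0.9, 0.7, -0.3])
loss = reward_model_loss(r_chosen, r_rejected)
```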
Stage 3: PPO Optimisation
PPO updates the language model (the "policy") to maximise the reward model's score, subject to a KL divergence penalty from the reference model (the SFT checkpoint): L = r_RM(response) − β KL(π_θ || π_ref). The KL penalty prevents the policy from drifting far from the SFT distribution, which would cause mode collapse and incoherent text. β (typically 0.01–0.1) controls this trade-off. The value function (critic) is trained jointly, requiring ~2× the compute of the policy alone.
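A minimal sketch of the KL-penalised reward signal is shown below, assuming the RM score and per-token log-probabilities have already been computed. Production frameworks typically distribute the KL penalty per token rather than summing it per sequence, but the idea is the same.

```python
import torch

def kl_penalised_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.05):
    """Reward signal fed to PPO: RM score minus a KL penalty to the reference.

    rm_score:        scalar reward model score for the full response
    logprobs_policy: (response_len,) per-token log-probs under the current policy
    logprobs_ref:    (response_len,) per-token log-probs under the frozen SFT model
    beta:            KL coefficient (typically 0.01-0.1)
    """
    # Sample-based estimate of KL(policy || reference) over the response tokens.
    kl = (logprobs_policy - logprobs_ref).sum()
    return rm_score - beta * kl
```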
GRPO and Reasoning-Focused Training
GRPO (Group Relative Policy Optimization) was introduced by DeepSeek and used to train DeepSeek-R1, which demonstrated that reasoning ability could be substantially improved through RL even without supervised chain-of-thought data.
The GRPO objective: for each prompt, sample G responses { y₁, ..., y_G } from the current policy, compute rewards { r₁, ..., r_G }, normalise to advantages by subtracting the group mean and dividing by the group standard deviation: A_i = (r_i − mean(r)) / std(r). The policy gradient is: L_GRPO = − E[A_i log π_θ(y_i|x)] + β KL(π_θ || π_ref).
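The group-relative advantage computation is the distinctive step, and it reduces to a few lines. The sketch below uses illustrative values for a group of four attempts at one problem scored with a binary outcome reward.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantages: normalise rewards within the group of G responses
    sampled for the same prompt.

    rewards: (G,) rewards for the G sampled responses
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative: 4 attempts at one math problem, reward 1.0 if the answer is correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)  # approx. [0.87, -0.87, -0.87, 0.87]
```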
Why GRPO works for reasoning: with outcome-based rewards (correct/incorrect for math or coding problems), generating multiple attempts per problem and using relative reward signals is natural. The model learns to produce correct reasoning chains without requiring explicit supervision on intermediate steps. Combined with a format reward (requiring CoT before the final answer), this produces models that reason transparently.
Implementation: HuggingFace TRL provides a GRPOTrainer that implements GRPO with configurable reward functions. Any Python callable can be supplied as a reward function, enabling domain-specific rewards beyond simple correct/incorrect (e.g. code execution results, citation accuracy); see the sketch below.
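The sketch below shows what such a reward function might look like for a math task, combining a small format reward with an outcome reward. The `completions` and `answer` argument names, the plain-text completion format, and the `<answer>` tag convention are assumptions for this example; check the reward-function signature expected by your installed TRL version.

```python
import re

def math_reward(completions, answer, **kwargs):
    """Hypothetical GRPO reward function: format bonus + outcome reward.

    completions: list of generated responses (assumed plain text, not chat format)
    answer:      list of ground-truth answers from a dataset column (assumed name)
    Returns one float per completion.
    """
    rewards = []
    for completion, gold in zip(completions, answer):
        score = 0.0
        # Format reward: require the final answer inside <answer>...</answer> tags.
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match:
            score += 0.2
            # Outcome reward: compare the extracted answer to the reference.
            if match.group(1).strip() == str(gold).strip():
                score += 1.0
        rewards.append(score)
    return rewards
```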
Frequently Asked Questions
What is RLHF and why is it used for alignment?
RLHF trains LLMs to follow human preferences. Three stages: (1) SFT on instruction-response pairs. (2) Reward model training on (prompt, chosen, rejected) preference pairs. (3) PPO/GRPO to maximise reward while a KL penalty prevents drifting from the reference (SFT) model. Produces helpful, harmless, honest models from a capable but misaligned base model.
How does GRPO differ from PPO?
GRPO (DeepSeek 2024) eliminates the value function (critic) network. Instead, it generates multiple responses per prompt, computes rewards, normalises within the group (subtract mean/std), and uses these as advantages. No separate critic — saves ~50% memory. Used in DeepSeek-R1. Ideal for reasoning tasks where generating multiple solutions and rewarding correct ones is natural.
What is Constitutional AI (RLAIF)?
Constitutional AI (Anthropic) replaces human preference labellers with AI feedback. Stage 1: SL-CAI — model self-critiques and revises responses guided by written principles (the constitution). Stage 2: RLAIF — AI model labels preferences between response pairs per the constitution; policy trained with PPO against this AI reward model. Scales better than human-labelled RLHF.
What is DPO and how does it relate to RLHF?
DPO (Direct Preference Optimization) reformulates the RLHF objective as a supervised loss on (chosen, rejected) pairs without a reward model or PPO. The loss directly increases log-probability of chosen relative to the reference model and decreases rejected relative to the reference. Simpler, more stable than PPO-RLHF. Most teams now use DPO instead of RLHF for alignment — see also IPO, SimPO, ORPO as further DPO variants.
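For reference, the DPO loss can be sketched in a few lines of PyTorch, assuming the summed response log-probabilities under the policy and the frozen reference model are already available; names and the β default are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on a batch of preference pairs.

    logp_*:     (batch,) summed log-probs of the response under the policy
    ref_logp_*: (batch,) summed log-probs under the frozen reference (SFT) model
    beta:       temperature controlling how strongly preferences reshape the policy
    """
    # Implicit reward margins relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```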
How do you collect preference data for reward model training?
Side-by-side human comparison (annotators choose the better of two responses, via vendors such as Scale AI or Surge AI, or internal teams). Best-of-N sampling (generate N responses; use the top and bottom as chosen/rejected). Red-teaming. For quality: measure inter-annotator agreement (Cohen's kappa), write clear annotation guidelines, and watch for annotation biases such as length bias (longer rated as better) and sycophantic preferences.