Reinforcement Learning from Human Feedback
RLHF Skills Guide 2026
RLHF and its successors (DPO, GRPO, RLAIF) are the core techniques for aligning LLMs with human preferences. This guide covers the full pipeline — from reward model training and PPO to the simpler DPO and Constitutional AI approaches — with practical guidance on when to use each.
Why Alignment Matters
Pre-trained language models learn to predict the next token from internet text — they are not trained to be helpful, honest, or harmless. A raw pre-trained model will happily generate detailed instructions for harmful activities, agree with false premises, or produce misleading content, because these patterns exist in web data.
Alignment techniques teach models to produce outputs that humans prefer — responses that are helpful, safe, factually accurate, and follow the user's actual intent. The canonical alignment pipeline for modern LLMs has converged around three phases: supervised fine-tuning (SFT) to establish instruction-following capability, preference learning (RLHF, DPO, or a variant) to reinforce human-preferred behaviour, and ongoing safety evaluations and red-teaming to identify gaps.
Understanding alignment is increasingly important for UK AI engineers, not just researchers. Engineers building applications on top of LLMs need to understand what alignment guarantees (and what it doesn't), and engineers at LLM companies contribute to alignment pipelines as core product development.
Alignment Methods Comparison
| Method | Description | Compute | Data Required | Quality |
|---|---|---|---|---|
| SFT only | Supervised fine-tuning on instruction-response pairs. Fast and simple; no preference data needed, but limited alignment quality. | Low | Instruction pairs | Baseline |
| DPO | Supervised loss on preference pairs; no reward model or RL. Most commonly used alignment technique today. | Low-Medium | Preference pairs | Excellent |
| RLHF (PPO) | Full three-stage pipeline: SFT → reward model → PPO optimisation. Most flexible but most complex. | Very High | Preference pairs + RM training | Excellent |
| RLHF (GRPO) | PPO variant without a value function. Used for reasoning-focused training (math, code). Less memory than PPO. | High | Process or outcome rewards | Excellent for reasoning |
| RLAIF / CAI | An AI model provides feedback instead of humans. Scales well; reduces human annotation cost. | High | AI-labelled preferences | Very good |
| ORPO | Combines SFT and preference learning in a single stage. No reference model needed. Efficient. | Low-Medium | Preference pairs | Good |
The RLHF Pipeline in Depth
Stage 1: Supervised Fine-Tuning (SFT)
SFT fine-tunes a pre-trained base model on a dataset of high-quality (instruction, response) pairs. The loss is standard cross-entropy on the response tokens (prompt tokens are masked). The SFT model learns to follow instructions and establishes the distribution that Stage 3 will optimise around. The quality and diversity of the SFT dataset matter enormously: 1,000 diverse, high-quality pairs often outperform 10,000 noisy pairs.
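As a concrete illustration, here is a minimal sketch of the masked SFT loss in PyTorch. The tensor names and the `prompt_lengths` bookkeeping are assumptions made for this example, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, prompt_lengths):
    """Cross-entropy on response tokens only; prompt tokens are masked out.

    logits:         (batch, seq_len, vocab) model outputs
    labels:         (batch, seq_len) token ids of prompt + response
    prompt_lengths: (batch,) number of prompt tokens in each example (assumed)
    """
    # Shift so each position predicts the next token.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:].clone()

    # Mark prompt positions with -100 so cross_entropy ignores them.
    positions = torch.arange(targets.size(1), device=targets.device)
    prompt_mask = positions[None, :] < (prompt_lengths[:, None] - 1)
    targets[prompt_mask] = -100

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```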
Stage 2: Reward Model Training
The reward model (RM) is typically a language model fine-tuned on preference data. Architecture: take the SFT model, replace the language model head with a scalar regression head (single output neuron). Train on (prompt, chosen, rejected) triplets using the Bradley-Terry preference model: the loss maximises log σ(r(chosen) − r(rejected)), where r is the reward model's scalar score. At convergence, the RM assigns higher scores to responses humans prefer. RM quality is critical — reward hacking occurs when the policy learns to exploit RM weaknesses, generating responses that score highly but are not actually preferred by humans.
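The Bradley-Terry loss itself is only a few lines. The sketch below assumes the reward model's regression head has already produced scalar scores for the chosen and rejected responses; the function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: maximise log sigmoid(r_chosen - r_rejected).

    r_chosen, r_rejected: (batch,) scalar scores from the reward model's
    regression head for the preferred and rejected responses.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage: scores for a batch of (prompt, chosen, rejected) triplets.
r_chosen = torch.tensor([1.8, 0.4, 2.1])
r_rejected = torch.tensor([0.9, 0.7, -0.3])
loss = reward_model_loss(r_chosen, r_rejected)
```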
Stage 3: PPO Optimisation
PPO updates the language model (the "policy") to maximise the reward model's score, subject to a KL divergence penalty from the reference model (the SFT checkpoint): L = r_RM(response) − β KL(π_θ || π_ref). The KL penalty prevents the policy from drifting far from the SFT distribution, which would cause mode collapse and incoherent text. β (typically 0.01–0.1) controls this trade-off. The value function (critic) is trained jointly, requiring ~2× the compute of the policy alone.
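A minimal sketch of the KL-penalised reward signal is shown below, assuming the RM score and per-token log-probabilities have already been computed. Production frameworks typically distribute the KL penalty per token rather than summing it per sequence, but the idea is the same.

```python
import torch

def kl_penalised_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.05):
    """Reward signal fed to PPO: RM score minus a KL penalty to the reference.

    rm_score:        scalar reward model score for the full response
    logprobs_policy: (response_len,) per-token log-probs under the current policy
    logprobs_ref:    (response_len,) per-token log-probs under the frozen SFT model
    beta:            KL coefficient (typically 0.01-0.1)
    """
    # Sample-based estimate of KL(policy || reference) over the response tokens.
    kl = (logprobs_policy - logprobs_ref).sum()
    return rm_score - beta * kl
```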
GRPO and Reasoning-Focused Training
GRPO (Group Relative Policy Optimization) was introduced by DeepSeek and used to train DeepSeek-R1, which demonstrated that reasoning ability could be substantially improved through RL even without supervised chain-of-thought data.
The GRPO objective: for each prompt, sample G responses { y₁, ..., y_G } from the current policy, compute rewards { r₁, ..., r_G }, normalise to advantages by subtracting the group mean and dividing by the group standard deviation: A_i = (r_i − mean(r)) / std(r). The policy gradient is: L_GRPO = − E[A_i log π_θ(y_i|x)] + β KL(π_θ || π_ref).
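The group-relative advantage computation is the distinctive step, and it reduces to a few lines. The sketch below uses illustrative values for a group of four attempts at one problem scored with a binary outcome reward.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantages: normalise rewards within the group of G responses
    sampled for the same prompt.

    rewards: (G,) rewards for the G sampled responses
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative: 4 attempts at one math problem, reward 1.0 if the answer is correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)  # approx. [0.87, -0.87, -0.87, 0.87]
```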
Why GRPO works for reasoning: with outcome-based rewards (correct/incorrect for math or coding problems), generating multiple attempts per problem and using relative reward signals is natural. The model learns to produce correct reasoning chains without requiring explicit supervision on intermediate steps. Combined with a format reward (requiring CoT before the final answer), this produces models that reason transparently.
Implementation: HuggingFace TRL provides a GRPOTrainer that implements GRPO with configurable reward functions. Any Python callable can be supplied as a reward function, enabling domain-specific rewards beyond simple correct/incorrect (e.g. code execution results, citation accuracy); see the sketch below.
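The sketch below shows what such a reward function might look like for a math task, combining a small format reward with an outcome reward. The `completions` and `answer` argument names, the plain-text completion format, and the `<answer>` tag convention are assumptions for this example; check the reward-function signature expected by your installed TRL version.

```python
import re

def math_reward(completions, answer, **kwargs):
    """Hypothetical GRPO reward function: format bonus + outcome reward.

    completions: list of generated responses (assumed plain text, not chat format)
    answer:      list of ground-truth answers from a dataset column (assumed name)
    Returns one float per completion.
    """
    rewards = []
    for completion, gold in zip(completions, answer):
        score = 0.0
        # Format reward: require the final answer inside <answer>...</answer> tags.
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match:
            score += 0.2
            # Outcome reward: compare the extracted answer to the reference.
            if match.group(1).strip() == str(gold).strip():
                score += 1.0
        rewards.append(score)
    return rewards
```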
Frequently Asked Questions
What is RLHF and why is it used for alignment?
RLHF trains LLMs to follow human preferences. Three stages: (1) SFT on instruction-response pairs. (2) Reward model training on (prompt, chosen, rejected) preference pairs. (3) PPO/GRPO to maximise reward while a KL penalty prevents drifting from the reference (SFT) model. Produces helpful, harmless, honest models from a capable but misaligned base model.
How does GRPO differ from PPO?
GRPO (DeepSeek 2024) eliminates the value function (critic) network. Instead, it generates multiple responses per prompt, computes rewards, normalises within the group (subtract mean/std), and uses these as advantages. No separate critic — saves ~50% memory. Used in DeepSeek-R1. Ideal for reasoning tasks where generating multiple solutions and rewarding correct ones is natural.
What is Constitutional AI (RLAIF)?
Constitutional AI (Anthropic) replaces human preference labellers with AI feedback. Stage 1: SL-CAI — model self-critiques and revises responses guided by written principles (the constitution). Stage 2: RLAIF — AI model labels preferences between response pairs per the constitution; policy trained with PPO against this AI reward model. Scales better than human-labelled RLHF.
What is DPO and how does it relate to RLHF?
DPO (Direct Preference Optimization) reformulates the RLHF objective as a supervised loss on (chosen, rejected) pairs without a reward model or PPO. The loss directly increases log-probability of chosen relative to the reference model and decreases rejected relative to the reference. Simpler, more stable than PPO-RLHF. Most teams now use DPO instead of RLHF for alignment — see also IPO, SimPO, ORPO as further DPO variants.
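For reference, the DPO loss can be sketched in a few lines of PyTorch, assuming the summed response log-probabilities under the policy and the frozen reference model are already available; names and the β default are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on a batch of preference pairs.

    logp_*:     (batch,) summed log-probs of the response under the policy
    ref_logp_*: (batch,) summed log-probs under the frozen reference (SFT) model
    beta:       temperature controlling how strongly preferences reshape the policy
    """
    # Implicit reward margins relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```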
How do you collect preference data for reward model training?
Side-by-side human comparison (annotators choose the better of two responses, via vendors such as Scale AI or Surge AI, or internal teams). Best-of-N sampling (generate N responses; use the top and bottom as chosen/rejected). Red-teaming. For quality: measure inter-annotator agreement (Cohen's kappa), write clear annotation guidelines, and watch for annotation biases such as length bias (longer rated as better) and sycophantic preferences.