When to Fine-tune vs When to Use RAG or Prompting
Fine-tuning is not always the right answer. Understanding when to use it — and when simpler approaches suffice — is a mark of engineering judgement that interviewers specifically assess.
- Use prompting first — For many tasks, a well-crafted system prompt with few-shot examples achieves 80–90% of what fine-tuning would achieve, with zero training cost and full flexibility to iterate.
- Use RAG for knowledge — If the task requires access to specific facts, documents, or data that changes over time, Retrieval-Augmented Generation is almost always preferable to fine-tuning for knowledge injection. Fine-tuning does not reliably encode factual knowledge — it encodes style and behaviour.
- Fine-tune for style and behaviour — When you need the model to adopt a consistent tone, output format, or domain-specific vocabulary reliably across thousands of calls, fine-tuning is the right tool.
- Fine-tune for latency/cost — A fine-tuned smaller model (e.g., 7B or 8B) often matches the quality of a prompted larger model on a specific task while being 5–10× cheaper to serve.
- Fine-tune for safety and alignment — DPO and RLHF are used to make models follow instructions, refuse harmful requests, and provide accurate responses. This is where alignment research meets applied ML engineering.
Parameter-Efficient Fine-tuning (PEFT) Methods
| Method | Trainable Params | Memory | Best For |
|---|---|---|---|
| LoRA | ~1–2% of base | Moderate | General fine-tuning, adapters for multiple tasks |
| QLoRA | ~1–2% of base | Low (4-bit base) | Fine-tuning on consumer GPUs |
| Full Fine-tuning | 100% | Very High | Domain adaptation with large data |
| Prefix Tuning | <1% | Low | Task-specific adaptation, inference efficiency |
| DPO | Full or PEFT | Moderate | Preference alignment, safety |
LoRA in depth: LoRA (Hu et al., 2021) decomposes the weight update ΔW as a product of two low-rank matrices: ΔW = BA, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k}, with rank r ≪ min(d, k). B is initialised to zero and A with a Gaussian; this ensures ΔW = 0 at initialisation, so the model's output is unchanged before training starts. The update is scaled by a factor α/r. Only A and B are trained; the base weights W are frozen. At inference, the adapter can be merged into the base weights: W' = W + (α/r)BA, eliminating any inference overhead. LoRA is typically applied to the query and value projection matrices in each attention layer; applying it to all four attention projections (Q, K, V, O) and the FFN layers improves quality further.
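A minimal PyTorch sketch of the idea (not the PEFT library's implementation; the shapes and the α/r scaling follow the definitions above):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a LoRA update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze W
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so ΔW = 0 at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x @ A.T @ B.T is the low-rank update (BA)x in row-vector convention
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Merging for inference amounts to adding scaling * (B @ A) into the frozen weight matrix, which is why a merged LoRA model runs at exactly the base model's speed.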
The SFT Pipeline in Practice
Supervised Fine-Tuning (SFT) on instruction-response pairs is the foundational fine-tuning step. It teaches the model to follow instructions in a specific format and style.
Step 1: Dataset preparation
- Data format: JSONL with either Alpaca (instruction/input/output) or ShareGPT (conversations array) structure.
- Data quality matters far more than quantity. 1,000 high-quality, diverse instruction-response pairs typically outperform 10,000 noisy ones. Deduplication and filtering for response quality are critical.
- Apply the model's chat template (tokeniser.apply_chat_template) to format conversations consistently, including system prompts, special tokens (e.g., <|im_start|>, <|im_end|> for ChatML format), and appropriate EOS tokens.
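A quick way to sanity-check the rendered template (the checkpoint name is illustrative; any chat model that ships a chat template works):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative ChatML-style model

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarise the report in three bullet points."},
    {"role": "assistant", "content": "- ...\n- ...\n- ..."},
]

# Renders the conversation with the model's special tokens (<|im_start|>/<|im_end|>
# for ChatML-style models) and the correct EOS token.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```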
Step 2: Training with HuggingFace TRL's SFTTrainer
- SFTTrainer handles dataset processing, data collation with completion-only masking (computing loss only on assistant tokens, not prompt tokens), and LoRA setup via PEFT.
- Key hyperparameters: learning rate (typically 1e-4 to 2e-4 with LoRA, 1e-5 to 5e-5 for full FT), cosine LR schedule with linear warmup (10% of total steps), batch size adjusted for gradient accumulation, max sequence length.
- Monitor training loss and validation loss. A widening gap between them indicates overfitting, which is particularly common with small datasets (<1,000 examples).
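Putting these hyperparameters together, a minimal setup might look like the sketch below (assuming a recent TRL release where SFTConfig exists; argument names vary slightly between versions, and the dataset path and model name are illustrative):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# train.jsonl is assumed to hold conversations in a format SFTTrainer understands,
# e.g. a "messages" column of chat turns.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="sft-output",
    learning_rate=2e-4,               # LoRA-range learning rate
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                 # ~10% linear warmup
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size of 16
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative base checkpoint
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```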
Step 3: Evaluation
- Qualitative review: sample 50–100 outputs and manually assess instruction following, factual accuracy, and format compliance.
- Task-specific automated metrics: F1 for extraction tasks, ROUGE for summarisation, exact match for structured outputs.
- Regression testing: maintain a benchmark set of known-good prompts and check that fine-tuning has not degraded them.
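A regression harness can be as small as the sketch below, where generate is a hypothetical wrapper around your fine-tuned model's inference and benchmark.jsonl holds the known-good prompts with expected outputs:

```python
import json

def exact_match_rate(generate, path: str = "benchmark.jsonl") -> float:
    """Fraction of benchmark prompts whose output still exactly matches the expected answer."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    hits = sum(generate(r["prompt"]).strip() == r["expected"].strip() for r in records)
    return hits / len(records)
```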
DPO: Direct Preference Optimization
DPO (Rafailov et al., 2023) reformulates the RLHF objective to train directly on preference pairs without a separate reward model or RL training loop. It has largely superseded RLHF for alignment tasks at most companies, due to its simplicity and stability.
The DPO loss is: L_DPO(π_θ) = −E_{(x, y_w, y_l)}[log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)))]
Where y_w is the preferred (chosen) response, y_l is the dispreferred (rejected) response, π_ref is the frozen reference model (typically the SFT model), and β controls the KL penalty strength (typically 0.1–0.5).
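Once sequence-level log-probabilities are available for each response, the loss is a few lines; the sketch below is a minimal illustration with made-up argument names, not TRL's internal implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is a tensor of summed log p(y|x), one entry per preference pair."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log(π_θ(y_w|x)/π_ref(y_w|x))
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log(π_θ(y_l|x)/π_ref(y_l|x))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```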
Practical DPO tips:
- Always start DPO from an SFT checkpoint — DPO on a base (non-instruction-tuned) model performs poorly.
- Data quality is critical: chosen responses must be clearly better than rejected for stable training. Ambiguous pairs destabilise convergence.
- TRL's DPOTrainer handles the reference model and loss computation; it supports LoRA for memory efficiency (see the sketch after this list).
- Variants: IPO (Identity Preference Optimization) addresses overfitting in DPO; SimPO removes the reference model for further simplification; ORPO combines SFT and preference learning in a single stage.
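A minimal DPOTrainer sketch, assuming a recent TRL release (the exact argument set has shifted across versions) and illustrative checkpoint and file names:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# prefs.jsonl is assumed to hold {"prompt": ..., "chosen": ..., "rejected": ...} records.
dataset = load_dataset("json", data_files="prefs.jsonl", split="train")

args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                          # KL penalty strength
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model="my-org/my-sft-checkpoint",  # illustrative: always start DPO from an SFT model
    args=args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```

With a peft_config and no explicit ref_model, recent TRL versions reuse the base weights (adapters disabled) as the reference, which keeps memory usage down.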
Libraries and Tools
- HuggingFace TRL — The standard library for SFT, DPO, PPO, and GRPO. Integrates directly with HuggingFace Transformers, PEFT, and Accelerate. SFTTrainer and DPOTrainer are the most used classes.
- Axolotl — YAML-configured fine-tuning framework built on TRL. Handles multi-GPU, DeepSpeed ZeRO, Flash Attention integration, and many data formats out of the box. Popular for rapid experimentation without writing boilerplate.
- bitsandbytes — Provides NF4 and INT8 quantisation for QLoRA. Must be installed alongside PEFT for QLoRA fine-tuning.
- PEFT — HuggingFace library providing LoRA, Prefix Tuning, IA³, and other PEFT methods. get_peft_model() wraps any HuggingFace model with LoRA adapters.
- Unsloth — Optimised fine-tuning kernels that reduce memory usage and increase throughput for popular model families (Llama, Mistral, Phi, Gemma). Significant speed improvements over vanilla TRL on consumer hardware.
Common Fine-tuning Mistakes
- Training for too many epochs on small datasets — causes catastrophic overfitting to the fine-tuning distribution
- Not applying the correct chat template — each model family has a different special token structure; mismatches degrade performance significantly
- Computing loss on prompt tokens — use completion-only masking (SFTTrainer's DataCollatorForCompletionOnlyLM)
- Skipping evaluation on held-out data — training loss alone is not a reliable signal
- Using base models instead of instruct checkpoints as the starting point for DPO
Frequently Asked Questions
What is the difference between LoRA and QLoRA?
LoRA injects trainable low-rank matrices (B of shape d×r and A of shape r×k) into attention weight matrices. Only these small matrices are trained. QLoRA extends this by first quantising the base model to 4-bit NF4 using bitsandbytes, then applying LoRA adapters in bfloat16 — reducing memory to the point where a 7B model can be fine-tuned on a single 24GB GPU.
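A sketch of the QLoRA loading path (the checkpoint name is illustrative; in practice you would typically also call peft's prepare_model_for_kbit_training before attaching adapters):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantise the frozen base weights to 4-bit NF4; the LoRA adapters stay in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
model.print_trainable_parameters()  # reports the small trainable fraction
```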
What is the difference between SFT, RLHF, and DPO?
SFT trains on (instruction, response) pairs via cross-entropy loss. RLHF adds a reward model trained on human preference data, then uses PPO/GRPO to optimise the policy against the reward with a KL penalty from the reference model. DPO reformulates the RLHF objective as a supervised loss on (chosen, rejected) pairs, bypassing the reward model entirely. DPO is simpler and often comparable to RLHF.
How much compute do I need to fine-tune an LLM?
With QLoRA: 7B model fits on a single RTX 3090/4090 (24GB). 13B needs an A100 40GB. Full fine-tuning of a 7B model requires ~2–3 A100 80GB GPUs in bfloat16. UK companies typically use AWS or GCP spot instances for batch fine-tuning jobs.
How do you evaluate a fine-tuned LLM?
General: MT-Bench, AlpacaEval, MMLU, HellaSwag, TruthfulQA. Perplexity on held-out data. For production: task-specific evaluation on your actual inputs, LLM-as-judge for scalable grading. General benchmarks rarely correlate with task-specific performance.
What data format is used for SFT fine-tuning?
Alpaca-style JSON (instruction, input, output fields) for single-turn. ShareGPT-style (conversations with human/assistant turns) for multi-turn. Data stored as JSONL for efficient streaming. Both are natively supported by TRL's SFTTrainer and Axolotl.