
    Fine-tuning Large Language Models
    The 2026 Skills Guide

    Fine-tuning LLMs is now a core skill for UK AI engineers. This guide covers LoRA, QLoRA, supervised fine-tuning (SFT), DPO, and RLHF — with clear explanations of when to use each, what compute you need, and how to evaluate results.

    When to Fine-tune vs When to Use RAG or Prompting

    Fine-tuning is not always the right answer. Understanding when to use it — and when simpler approaches suffice — is a mark of engineering judgement that interviewers specifically assess.

    • Use prompting first — For many tasks, a well-crafted system prompt with few-shot examples achieves 80–90% of what fine-tuning would achieve, with zero training cost and full flexibility to iterate.
    • Use RAG for knowledge — If the task requires access to specific facts, documents, or data that changes over time, Retrieval-Augmented Generation is almost always preferable to fine-tuning for knowledge injection. Fine-tuning does not reliably encode factual knowledge — it encodes style and behaviour.
    • Fine-tune for style and behaviour — When you need the model to adopt a consistent tone, output format, or domain-specific vocabulary reliably across thousands of calls, fine-tuning is the right tool.
    • Fine-tune for latency/cost — A fine-tuned smaller model (e.g., 7B or 8B) often matches the quality of a prompted larger model on a specific task while being 5–10× cheaper to serve.
    • Fine-tune for safety and alignment — DPO and RLHF are used to make models follow instructions, refuse harmful requests, and provide accurate responses. This is where alignment research meets applied ML engineering.

    Parameter-Efficient Fine-tuning (PEFT) Methods

    | Method | Trainable Params | Memory | Best For |
    | --- | --- | --- | --- |
    | LoRA | ~1–2% of base | Moderate | General fine-tuning, adapters for multiple tasks |
    | QLoRA | ~1–2% of base | Low (4-bit base) | Fine-tuning on consumer GPUs |
    | Full Fine-tuning | 100% | Very High | Domain adaptation with large data |
    | Prefix Tuning | <1% | Low | Task-specific adaptation, inference efficiency |
    | DPO | Full or PEFT | Moderate | Preference alignment, safety |

    LoRA in depth: LoRA (Hu et al., 2021) decomposes the weight update ΔW as a product of two low-rank matrices: ΔW = BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r ≪ min(d, k). B is initialised to zero and A with a Gaussian, so the update is zero at initialisation and the model's output is unchanged. A scaling factor α/r is applied to the update. Only A and B are trained; the base weights W are frozen. At inference, the adapter can be merged into the base weights: W' = W + (α/r)BA, eliminating any inference overhead. LoRA is typically applied to the query and value projection matrices in each attention layer; applying it to all four attention projections (Q, K, V, O) and the FFN layers improves quality further.
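
    As a minimal sketch (the model id and rank values are illustrative, not recommendations), wrapping a HuggingFace model with LoRA adapters via PEFT looks like this:

    ```python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; any HuggingFace causal LM works the same way.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    config = LoraConfig(
        r=16,                                 # rank of the low-rank matrices A and B
        lora_alpha=32,                        # effective scale on the update is alpha/r = 2
        target_modules=["q_proj", "v_proj"],  # query and value projections, per the text above
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically ~1-2% of base parameters
    ```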

    The SFT Pipeline in Practice

    Supervised Fine-Tuning (SFT) on instruction-response pairs is the foundational fine-tuning step. It teaches the model to follow instructions in a specific format and style.

    Step 1: Dataset preparation

    • Data format: JSONL with either Alpaca (instruction/input/output) or ShareGPT (conversations array) structure.
    • Data quality matters far more than quantity. 1,000 high-quality, diverse instruction-response pairs typically outperform 10,000 noisy ones. Deduplication and filtering for response quality are critical.
    • Apply the model's chat template (tokenizer.apply_chat_template) to format conversations consistently, including system prompts, special tokens (e.g., <|im_start|>, <|im_end|> for ChatML format), and appropriate EOS tokens, as sketched below.
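
    A minimal sketch of chat-template formatting with a Transformers tokeniser (the model id is illustrative):

    ```python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative model id

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the quarterly report in three bullet points."},
        {"role": "assistant", "content": "• Revenue grew.\n• Costs fell.\n• Guidance was raised."},
    ]

    # Renders the conversation with the model's own special tokens
    # (e.g. <|im_start|>/<|im_end|> for ChatML-style templates) and EOS markers.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    print(text)
    ```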

    Step 2: Training with HuggingFace TRL's SFTTrainer

    • SFTTrainer handles dataset processing, data collation with completion-only masking (computing loss only on assistant tokens, not prompt tokens), and LoRA setup via PEFT.
    • Key hyperparameters: learning rate (typically 1e-4 to 2e-4 with LoRA, 1e-5 to 5e-5 for full fine-tuning), cosine LR schedule with linear warmup (10% of total steps), effective batch size set via gradient accumulation, and max sequence length (see the sketch after this list).
    • Monitor training loss and validation loss. A gap between them indicates overfitting, particularly common with small datasets (<1,000 examples).
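
    A minimal SFT run with TRL might look like the following; the model id and data file are placeholders, and exact config field names (e.g., max_seq_length) vary across TRL versions:

    ```python
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    # Placeholder JSONL file of instruction-response pairs.
    dataset = load_dataset("json", data_files="train.jsonl", split="train")

    args = SFTConfig(
        output_dir="sft-out",
        learning_rate=2e-4,               # typical for LoRA, per the list above
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,                 # 10% linear warmup
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,    # effective batch size of 16
        num_train_epochs=1,
        max_seq_length=2048,              # field name varies across TRL versions
    )

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; TRL also accepts a loaded model
        args=args,
        train_dataset=dataset,
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )
    trainer.train()
    ```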

    Step 3: Evaluation

    • Qualitative review: sample 50–100 outputs and manually assess instruction following, factual accuracy, and format compliance.
    • Task-specific automated metrics: F1 for extraction tasks, ROUGE for summarisation, exact match for structured outputs.
    • Regression testing: maintain a benchmark set of known-good prompts and check that fine-tuning has not degraded them; a minimal harness is sketched below.
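
    One way to automate the regression check is a small exact-match harness over a fixed prompt file; the file name and the generate() helper below are hypothetical placeholders:

    ```python
    import json

    def generate(prompt: str) -> str:
        """Hypothetical wrapper around the fine-tuned model's inference call."""
        raise NotImplementedError

    # benchmark.jsonl (hypothetical): one {"prompt": ..., "expected": ...} object per line.
    with open("benchmark.jsonl") as f:
        cases = [json.loads(line) for line in f]

    failures = [c["prompt"] for c in cases
                if generate(c["prompt"]).strip() != c["expected"].strip()]  # swap in F1/ROUGE as needed

    print(f"{len(cases) - len(failures)}/{len(cases)} regression prompts passed")
    ```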

    DPO: Direct Preference Optimization

    DPO (Rafailov et al., 2023) reformulates the RLHF objective to train directly on preference pairs without a separate reward model or RL training loop. It has largely superseded RLHF for alignment tasks at most companies, due to its simplicity and stability.

    The DPO loss is:

    L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l) ∼ D} [ log σ( β log(π_θ(y_w|x) / π_ref(y_w|x)) − β log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

    where y_w is the preferred (chosen) response, y_l is the dispreferred (rejected) response, π_ref is the frozen reference model (typically the SFT model), and β controls the KL penalty strength (typically 0.1–0.5).
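
    For concreteness, here is a minimal PyTorch sketch of this loss computed from per-sequence log-probabilities; it mirrors the formula above and is not TRL's internal implementation:

    ```python
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Each argument is a tensor of summed per-sequence log-probs, shape (batch,)."""
        chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
        rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
        # -log sigma(beta * (chosen - rejected)), averaged over the batch
        return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
    ```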

    Practical DPO tips:

    • Always start DPO from an SFT checkpoint — DPO on a base (non-instruction-tuned) model performs poorly.
    • Data quality is critical: chosen responses must be clearly better than rejected for stable training. Ambiguous pairs destabilise convergence.
    • TRL's DPOTrainer handles the reference model and loss computation; it supports LoRA for memory efficiency (see the sketch after this list).
    • Variants: IPO (Identity Preference Optimization) addresses overfitting in DPO; SimPO removes the reference model for further simplification; ORPO combines SFT and preference learning in a single stage.
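
    A minimal DPOTrainer sketch; the checkpoint id and data file are placeholders, and exact config fields vary across TRL versions:

    ```python
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import DPOConfig, DPOTrainer

    # Placeholder preference dataset with "prompt", "chosen" and "rejected" columns.
    dataset = load_dataset("json", data_files="prefs.jsonl", split="train")

    args = DPOConfig(
        output_dir="dpo-out",
        beta=0.1,             # KL penalty strength from the loss above
        learning_rate=5e-6,   # DPO typically uses much lower LRs than SFT
    )

    trainer = DPOTrainer(
        model="your-org/sft-checkpoint",  # placeholder: always start from the SFT model
        ref_model=None,                   # with a PEFT config, the frozen base acts as the reference
        args=args,
        train_dataset=dataset,
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )
    trainer.train()
    ```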

    Libraries and Tools

    • HuggingFace TRL — The standard library for SFT, DPO, PPO, and GRPO. Integrates directly with HuggingFace Transformers, PEFT, and Accelerate. SFTTrainer and DPOTrainer are the most used classes.
    • Axolotl — YAML-configured fine-tuning framework built on TRL. Handles multi-GPU, DeepSpeed ZeRO, Flash Attention integration, and many data formats out of the box. Popular for rapid experimentation without writing boilerplate.
    • bitsandbytes — Provides NF4 and INT8 quantisation for QLoRA. Must be installed alongside PEFT for QLoRA fine-tuning; a loading sketch follows this list.
    • PEFT — HuggingFace library providing LoRA, Prefix Tuning, IA³, and other PEFT methods. get_peft_model() wraps any HuggingFace model with LoRA adapters.
    • Unsloth — Optimised fine-tuning kernels that reduce memory usage and increase throughput for popular model families (Llama, Mistral, Phi, Gemma). Significant speed improvements over vanilla TRL on consumer hardware.
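
    A minimal QLoRA loading sketch with bitsandbytes NF4 quantisation (the model id is illustrative):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import prepare_model_for_kbit_training

    # NF4 quantisation config for QLoRA.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.3",       # illustrative model id
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)  # casts layer norms etc. for stable 4-bit training
    ```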

    Common Fine-tuning Mistakes

    • Training for too many epochs on small datasets — causes catastrophic overfitting to the fine-tuning distribution
    • Not applying the correct chat template — each model family has a different special token structure; mismatches degrade performance significantly
    • Computing loss on prompt tokens — use completion-only masking (SFTTrainer's DataCollatorForCompletionOnlyLM)
    • Skipping evaluation on held-out data — training loss alone is not a reliable signal
    • Using base models instead of instruct checkpoints as the starting point for DPO

    Frequently Asked Questions

    What is the difference between LoRA and QLoRA?

    LoRA injects trainable low-rank matrices (B of shape d×r and A of shape r×k) into attention weight matrices. Only these small matrices are trained. QLoRA extends this by first quantising the base model to 4-bit NF4 using bitsandbytes, then applying LoRA adapters in bfloat16 — reducing memory to the point where a 7B model can be fine-tuned on a single 24GB GPU.

    What is the difference between SFT, RLHF, and DPO?

    SFT trains on (instruction, response) pairs via cross-entropy loss. RLHF adds a reward model trained on human preference data, then uses PPO/GRPO to optimise the policy against the reward with a KL penalty from the reference model. DPO reformulates the RLHF objective as a supervised loss on (chosen, rejected) pairs, bypassing the reward model entirely. DPO is simpler and often comparable to RLHF.

    How much compute do I need to fine-tune an LLM?

    With QLoRA: 7B model fits on a single RTX 3090/4090 (24GB). 13B needs an A100 40GB. Full fine-tuning of a 7B model requires ~2–3 A100 80GB GPUs in bfloat16. UK companies typically use AWS or GCP spot instances for batch fine-tuning jobs.

    How do you evaluate a fine-tuned LLM?

    General: MT-Bench, AlpacaEval, MMLU, HellaSwag, TruthfulQA. Perplexity on held-out data. For production: task-specific evaluation on your actual inputs, LLM-as-judge for scalable grading. General benchmarks rarely correlate with task-specific performance.

    What data format is used for SFT fine-tuning?

    Alpaca-style JSON (instruction, input, output fields) for single-turn. ShareGPT-style (conversations with human/assistant turns) for multi-turn. Data stored as JSONL for efficient streaming. Both are natively supported by TRL's SFTTrainer and Axolotl.
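
    For illustration, a single line of each format might look like this (field values invented):

    ```json
    {"instruction": "Classify the sentiment.", "input": "The service was excellent.", "output": "positive"}
    {"conversations": [{"from": "human", "value": "What is LoRA?"}, {"from": "gpt", "value": "LoRA is a parameter-efficient fine-tuning method that..."}]}
    ```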

    Browse LLM Engineer Jobs in the UK

    Find LLM engineering and fine-tuning roles at UK AI companies.

    Quick Facts

    Demand level: Very High
    Difficulty: Advanced
    Time to proficiency: 6–12 months
    Salary premium: +£15,000–£30,000

    Key Tools

    TRL
    PEFT
    LoRA
    QLoRA
    DPO
    Axolotl
    Unsloth
    bitsandbytes