
    Fine-tuning Large Language Models
    The 2026 Skills Guide

    Fine-tuning LLMs is now a core skill for UK AI engineers. This guide covers LoRA, QLoRA, supervised fine-tuning (SFT), DPO, and RLHF — with clear explanations of when to use each, what compute you need, and how to evaluate results.

    When to Fine-tune vs When to Use RAG or Prompting

    Fine-tuning is not always the right answer. Understanding when to use it — and when simpler approaches suffice — is a mark of engineering judgement that interviewers specifically assess.

    • Use prompting first — For many tasks, a well-crafted system prompt with few-shot examples achieves 80–90% of what fine-tuning would achieve, with zero training cost and full flexibility to iterate.
    • Use RAG for knowledge — If the task requires access to specific facts, documents, or data that changes over time, Retrieval-Augmented Generation is almost always preferable to fine-tuning for knowledge injection. Fine-tuning does not reliably encode factual knowledge — it encodes style and behaviour.
    • Fine-tune for style and behaviour — When you need the model to adopt a consistent tone, output format, or domain-specific vocabulary reliably across thousands of calls, fine-tuning is the right tool.
    • Fine-tune for latency/cost — A fine-tuned smaller model (e.g., 7B or 8B) often matches the quality of a prompted larger model on a specific task while being 5–10× cheaper to serve.
    • Fine-tune for safety and alignment — DPO and RLHF are used to make models follow instructions, refuse harmful requests, and provide accurate responses. This is where alignment research meets applied ML engineering.

    Parameter-Efficient Fine-tuning (PEFT) Methods

    | Method | Trainable Params | Memory | Best For |
    | --- | --- | --- | --- |
    | LoRA | ~1–2% of base | Moderate | General fine-tuning, adapters for multiple tasks |
    | QLoRA | ~1–2% of base | Low (4-bit base) | Fine-tuning on consumer GPUs |
    | Full Fine-tuning | 100% | Very High | Domain adaptation with large data |
    | Prefix Tuning | <1% | Low | Task-specific adaptation, inference efficiency |
    | DPO | Full or PEFT | Moderate | Preference alignment, safety |

    LoRA in depth: LoRA (Hu et al., 2021) decomposes the weight update ΔW as a product of two low-rank matrices: ΔW = BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with rank r ≪ min(d, k). B is initialised to zero and A with a Gaussian, so the update is zero at initialisation and the model's output is unchanged. A scaling factor α/r is applied to the update. Only A and B are trained; the base weights W are frozen. At inference, the adapter can be merged into the base weights: W' = W + (α/r)BA, eliminating any inference overhead. LoRA is typically applied to the query and value projection matrices in each attention layer; applying it to all four attention projections (Q, K, V, O) and the FFN layers improves quality further.
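
    As a minimal sketch (the model id and rank values are illustrative, not recommendations), wrapping a HuggingFace model with LoRA adapters via PEFT looks like this:

    ```python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; any HuggingFace causal LM works the same way.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    config = LoraConfig(
        r=16,                                 # rank of the low-rank matrices A and B
        lora_alpha=32,                        # effective scale on the update is alpha/r = 2
        target_modules=["q_proj", "v_proj"],  # query and value projections, per the text above
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically ~1-2% of base parameters
    ```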

    The SFT Pipeline in Practice

    Supervised Fine-Tuning (SFT) on instruction-response pairs is the foundational fine-tuning step. It teaches the model to follow instructions in a specific format and style.

    Step 1: Dataset preparation

    • Data format: JSONL with either Alpaca (instruction/input/output) or ShareGPT (conversations array) structure.
    • Data quality matters far more than quantity. 1,000 high-quality, diverse instruction-response pairs typically outperform 10,000 noisy ones. Deduplication and filtering for response quality are critical.
    • Apply the model's chat template (tokenizer.apply_chat_template) to format conversations consistently, including system prompts, special tokens (e.g., <|im_start|>, <|im_end|> for ChatML format), and appropriate EOS tokens, as sketched below.
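
    A minimal sketch of chat-template formatting with a Transformers tokeniser (the model id is illustrative):

    ```python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative model id

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the quarterly report in three bullet points."},
        {"role": "assistant", "content": "• Revenue grew.\n• Costs fell.\n• Guidance was raised."},
    ]

    # Renders the conversation with the model's own special tokens
    # (e.g. <|im_start|>/<|im_end|> for ChatML-style templates) and EOS markers.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    print(text)
    ```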

    Step 2: Training with HuggingFace TRL's SFTTrainer

    • SFTTrainer handles dataset processing, data collation with completion-only masking (computing loss only on assistant tokens, not prompt tokens), and LoRA setup via PEFT.
    • Key hyperparameters: learning rate (typically 1e-4 to 2e-4 with LoRA, 1e-5 to 5e-5 for full fine-tuning), cosine LR schedule with linear warmup (10% of total steps), effective batch size set via gradient accumulation, and max sequence length (see the sketch after this list).
    • Monitor training loss and validation loss. A gap between them indicates overfitting, particularly common with small datasets (<1,000 examples).
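
    A minimal SFT run with TRL might look like the following; the model id and data file are placeholders, and exact config field names (e.g., max_seq_length) vary across TRL versions:

    ```python
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    # Placeholder JSONL file of instruction-response pairs.
    dataset = load_dataset("json", data_files="train.jsonl", split="train")

    args = SFTConfig(
        output_dir="sft-out",
        learning_rate=2e-4,               # typical for LoRA, per the list above
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,                 # 10% linear warmup
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,    # effective batch size of 16
        num_train_epochs=1,
        max_seq_length=2048,              # field name varies across TRL versions
    )

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; TRL also accepts a loaded model
        args=args,
        train_dataset=dataset,
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )
    trainer.train()
    ```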

    Step 3: Evaluation

    • Qualitative review: sample 50–100 outputs and manually assess instruction following, factual accuracy, and format compliance.
    • Task-specific automated metrics: F1 for extraction tasks, ROUGE for summarisation, exact match for structured outputs.
    • Regression testing: maintain a benchmark set of known-good prompts and check that fine-tuning has not degraded them; a minimal harness is sketched below.
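
    One way to automate the regression check is a small exact-match harness over a fixed prompt file; the file name and the generate() helper below are hypothetical placeholders:

    ```python
    import json

    def generate(prompt: str) -> str:
        """Hypothetical wrapper around the fine-tuned model's inference call."""
        raise NotImplementedError

    # benchmark.jsonl (hypothetical): one {"prompt": ..., "expected": ...} object per line.
    with open("benchmark.jsonl") as f:
        cases = [json.loads(line) for line in f]

    failures = [c["prompt"] for c in cases
                if generate(c["prompt"]).strip() != c["expected"].strip()]  # swap in F1/ROUGE as needed

    print(f"{len(cases) - len(failures)}/{len(cases)} regression prompts passed")
    ```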

    DPO: Direct Preference Optimization

    DPO (Rafailov et al., 2023) reformulates the RLHF objective to train directly on preference pairs without a separate reward model or RL training loop. It has largely superseded RLHF for alignment tasks at most companies, due to its simplicity and stability.

    The DPO loss is:

    L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l) ∼ D} [ log σ( β log(π_θ(y_w|x) / π_ref(y_w|x)) − β log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

    where y_w is the preferred (chosen) response, y_l is the dispreferred (rejected) response, π_ref is the frozen reference model (typically the SFT model), and β controls the KL penalty strength (typically 0.1–0.5).
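
    For concreteness, here is a minimal PyTorch sketch of this loss computed from per-sequence log-probabilities; it mirrors the formula above and is not TRL's internal implementation:

    ```python
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Each argument is a tensor of summed per-sequence log-probs, shape (batch,)."""
        chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
        rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
        # -log sigma(beta * (chosen - rejected)), averaged over the batch
        return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
    ```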

    Practical DPO tips:

    • Always start DPO from an SFT checkpoint — DPO on a base (non-instruction-tuned) model performs poorly.
    • Data quality is critical: chosen responses must be clearly better than rejected for stable training. Ambiguous pairs destabilise convergence.
    • TRL's DPOTrainer handles the reference model and loss computation; it supports LoRA for memory efficiency (see the sketch after this list).
    • Variants: IPO (Identity Preference Optimization) addresses overfitting in DPO; SimPO removes the reference model for further simplification; ORPO combines SFT and preference learning in a single stage.
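
    A minimal DPOTrainer sketch; the checkpoint id and data file are placeholders, and exact config fields vary across TRL versions:

    ```python
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import DPOConfig, DPOTrainer

    # Placeholder preference dataset with "prompt", "chosen" and "rejected" columns.
    dataset = load_dataset("json", data_files="prefs.jsonl", split="train")

    args = DPOConfig(
        output_dir="dpo-out",
        beta=0.1,             # KL penalty strength from the loss above
        learning_rate=5e-6,   # DPO typically uses much lower LRs than SFT
    )

    trainer = DPOTrainer(
        model="your-org/sft-checkpoint",  # placeholder: always start from the SFT model
        ref_model=None,                   # with a PEFT config, the frozen base acts as the reference
        args=args,
        train_dataset=dataset,
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )
    trainer.train()
    ```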

    Libraries and Tools

    • HuggingFace TRL — The standard library for SFT, DPO, PPO, and GRPO. Integrates directly with HuggingFace Transformers, PEFT, and Accelerate. SFTTrainer and DPOTrainer are the most used classes.
    • Axolotl — YAML-configured fine-tuning framework built on TRL. Handles multi-GPU, DeepSpeed ZeRO, Flash Attention integration, and many data formats out of the box. Popular for rapid experimentation without writing boilerplate.
    • bitsandbytes — Provides NF4 and INT8 quantisation for QLoRA. Must be installed alongside PEFT for QLoRA fine-tuning; a loading sketch follows this list.
    • PEFT — HuggingFace library providing LoRA, Prefix Tuning, IA³, and other PEFT methods. get_peft_model() wraps any HuggingFace model with LoRA adapters.
    • Unsloth — Optimised fine-tuning kernels that reduce memory usage and increase throughput for popular model families (Llama, Mistral, Phi, Gemma). Significant speed improvements over vanilla TRL on consumer hardware.
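
    A minimal QLoRA loading sketch with bitsandbytes NF4 quantisation (the model id is illustrative):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import prepare_model_for_kbit_training

    # NF4 quantisation config for QLoRA.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.3",       # illustrative model id
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)  # casts layer norms etc. for stable 4-bit training
    ```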

    Common Fine-tuning Mistakes

    • Training for too many epochs on small datasets — causes catastrophic overfitting to the fine-tuning distribution
    • Not applying the correct chat template — each model family has a different special token structure; mismatches degrade performance significantly
    • Computing loss on prompt tokens — use completion-only masking (SFTTrainer's DataCollatorForCompletionOnlyLM)
    • Skipping evaluation on held-out data — training loss alone is not a reliable signal
    • Using base models instead of instruct checkpoints as the starting point for DPO

    Frequently Asked Questions

    What is the difference between LoRA and QLoRA?

    LoRA injects trainable low-rank matrices (B of shape d×r and A of shape r×k) into attention weight matrices. Only these small matrices are trained. QLoRA extends this by first quantising the base model to 4-bit NF4 using bitsandbytes, then applying LoRA adapters in bfloat16 — reducing memory to the point where a 7B model can be fine-tuned on a single 24GB GPU.

    What is the difference between SFT, RLHF, and DPO?

    SFT trains on (instruction, response) pairs via cross-entropy loss. RLHF adds a reward model trained on human preference data, then uses PPO/GRPO to optimise the policy against the reward with a KL penalty from the reference model. DPO reformulates the RLHF objective as a supervised loss on (chosen, rejected) pairs, bypassing the reward model entirely. DPO is simpler and often comparable to RLHF.

    How much compute do I need to fine-tune an LLM?

    With QLoRA: 7B model fits on a single RTX 3090/4090 (24GB). 13B needs an A100 40GB. Full fine-tuning of a 7B model requires ~2–3 A100 80GB GPUs in bfloat16. UK companies typically use AWS or GCP spot instances for batch fine-tuning jobs.

    How do you evaluate a fine-tuned LLM?

    General: MT-Bench, AlpacaEval, MMLU, HellaSwag, TruthfulQA. Perplexity on held-out data. For production: task-specific evaluation on your actual inputs, LLM-as-judge for scalable grading. General benchmarks rarely correlate with task-specific performance.

    What data format is used for SFT fine-tuning?

    Alpaca-style JSON (instruction, input, output fields) for single-turn. ShareGPT-style (conversations with human/assistant turns) for multi-turn. Data stored as JSONL for efficient streaming. Both are natively supported by TRL's SFTTrainer and Axolotl.
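
    For illustration, a single line of each format might look like this (field values invented):

    ```json
    {"instruction": "Classify the sentiment.", "input": "The service was excellent.", "output": "positive"}
    {"conversations": [{"from": "human", "value": "What is LoRA?"}, {"from": "gpt", "value": "LoRA is a parameter-efficient fine-tuning method that..."}]}
    ```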

    Browse LLM Engineer Jobs in the UK

    Find LLM engineering and fine-tuning roles at UK AI companies.

    Quick Facts

    Demand level: Very High
    Difficulty: Advanced
    Time to proficiency: 6–12 months
    Salary premium: +£15,000–£30,000

    Key Tools

    TRL
    PEFT
    LoRA
    QLoRA
    DPO
    Axolotl
    Unsloth
    bitsandbytes