Interview Prep

    LLM Engineer Interview Questions UK
    Technical & Behavioural Guide 2026

    10 technical questions covering RAG pipelines, fine-tuning, evaluation, and GenAI system design — with strong example answers, behavioural prep, and what to watch for in an employer.

    The Interview Process

    Stage 1: Recruiter screen (30 min)

    Background and fit. Expect a question about a specific LLM project you've shipped, including what you measured and how it performed.

    Stage 2: Technical screen (45–60 min)

    GenAI fundamentals, RAG concepts, and prompt engineering. May include a live coding exercise building a small LLM pipeline.

    Stage 3: LLM system design (45–60 min)

    Design a production GenAI system — often a RAG chatbot, a content generation pipeline, or an agent. Depth on evaluation and monitoring is key.

    Stage 4: Coding or take-home (1–3 hrs)

    Build a working LLM feature. The quality of the eval harness and error handling often differentiates candidates more than raw model performance.

    Stage 5: Behavioural (45 min)

    Questions about shipping GenAI features to real users, managing non-determinism, and cross-functional collaboration.

    Technical Questions

    Write your own answer first, then compare against the example.

    Q1. What are the key design decisions when building a RAG system, and how do you evaluate whether it's working?

    Strong answer

    Key decisions: (1) Chunking strategy — chunk size and overlap affect recall vs. context window usage. Smaller chunks improve precision; larger improve coherence. (2) Embedding model — OpenAI text-embedding-3-small vs. open-source alternatives (e5-large, BGE). Test on your domain. (3) Retrieval — dense (cosine similarity), sparse (BM25), or hybrid. Hybrid often beats either alone. (4) Reranking — a cross-encoder reranker (e.g. Cohere Rerank) improves precision after initial retrieval. Evaluation: use RAGAS to measure faithfulness (does the answer match the retrieved context?), answer relevance (does it address the question?), and context recall (did retrieval surface the right chunks?). Build a golden dataset with ground-truth Q&A pairs and test every pipeline change against it.
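    Decision (1) can be sketched in a few lines. This is a minimal character-based chunker, assuming fixed-size windows; production systems usually chunk on token or sentence boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlapping windows reduce the chance that an answer-bearing
    sentence is cut in half at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

    When tuning, treat `chunk_size` and `overlap` as hyperparameters and measure their effect on context recall against the golden dataset rather than picking them by intuition.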

    Q2. When would you fine-tune an LLM rather than use prompt engineering or RAG?

    Strong answer

    Fine-tuning is justified when: (1) You need consistent domain-specific output style or vocabulary that can't be reliably induced through prompting. (2) You're making many API calls and a fine-tuned smaller model (e.g. fine-tuned GPT-4o Mini) can match a prompted larger model at lower cost and latency. (3) Your use case has a structured output format that a smaller model struggles to follow even with few-shot examples. Avoid fine-tuning to inject knowledge — it's fragile and expensive; use RAG for knowledge grounding. Also avoid fine-tuning when you need the ability to update knowledge frequently. The practical order is: prompt engineering → RAG → fine-tuning.

    Q3. How do you measure and improve the factual accuracy of an LLM system?

    Strong answer

    Factual accuracy is distinct from answer quality. Measure it by: (1) Grounding rate — what percentage of claims in the output can be traced to a retrieved source? (2) Citation accuracy — if the model cites a source, does the cited source actually support the claim? (3) Hallucination rate — use a model-as-judge approach or a dedicated hallucination detector. To improve: increase retrieval quality so the model has correct information to work from; use explicit grounding instructions ('only use information from the provided context'); add a self-consistency pass; validate numerical and date claims programmatically. Monitor in production via sampling + human review.
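    The grounding-rate metric in (1) can be approximated even without a judge model. This is a crude lexical sketch — real systems use an NLI model or an LLM judge — but it illustrates the shape of the measurement: fraction of claims whose content is traceable to the retrieved context.

```python
def grounding_rate(claims: list[str], context: str, threshold: float = 0.5) -> float:
    """Fraction of claims whose content words appear in the retrieved context.

    A lexical proxy only: a claim counts as grounded if at least
    `threshold` of its content words (length > 3) occur in the context.
    """
    context_words = set(context.lower().split())
    grounded = 0
    for claim in claims:
        words = [w for w in claim.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap >= threshold:
            grounded += 1
    return grounded / len(claims) if claims else 0.0
```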

    Q4. Explain the difference between instruction fine-tuning and RLHF. When is each appropriate?

    Strong answer

    Instruction fine-tuning (SFT) trains the model on (instruction, response) pairs to make it follow user instructions reliably. It's supervised and relatively straightforward to implement. RLHF (Reinforcement Learning from Human Feedback) adds a reward model trained on human preference comparisons, then uses RL (typically PPO) to further align the model toward preferred responses. RLHF is more expensive and complex but produces better alignment with subjective human preferences (helpfulness, harmlessness). For most production use cases, SFT with careful data curation is sufficient. RLHF is used primarily by model providers improving foundation models. DPO (Direct Preference Optimization) is a simpler alternative to full RLHF for practitioners.

    Q5. What strategies do you use to manage and reduce LLM token costs at scale?

    Strong answer

    Token cost is a product decision as much as an engineering one. Strategies: (1) Model routing — classify query complexity and route simple queries to a cheap model (Claude Haiku, GPT-4o Mini) and complex ones to a capable model. (2) Prompt compression — remove unnecessary context, use shorter system prompts. (3) Caching — exact-match cache for repeated prompts, semantic cache (GPTCache) for near-duplicates. (4) Batching — group requests where latency allows. (5) Output length control — set max_tokens appropriately and use streaming with early stopping for user-facing interfaces. Track cost per query and per user segment so you know where to optimise.
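    Strategy (1) can be as simple as a heuristic classifier in front of the API call. The model names and marker words below are illustrative assumptions; production routers are often trained classifiers rather than keyword rules.

```python
def route_model(query: str,
                cheap: str = "small-model",
                capable: str = "large-model") -> str:
    """Route a query to a cheap or capable model based on crude complexity signals.

    Signals used here: query length and the presence of reasoning keywords.
    Both thresholds are placeholders to tune against real traffic.
    """
    reasoning_markers = ("why", "compare", "explain", "analyse", "step by step")
    complex_query = (len(query.split()) > 40
                     or any(m in query.lower() for m in reasoning_markers))
    return capable if complex_query else cheap
```

    The pay-off comes from measuring the router: log which tier handled each query and sample outputs from the cheap tier to confirm quality holds.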

    Q6. How would you build an LLM evaluation framework for a summarisation product?

    Strong answer

    Summarisation evaluation needs both reference-based and reference-free metrics. Reference-based: ROUGE-L (n-gram overlap with human summaries) and BERTScore (semantic similarity). Reference-free (preferred in practice): use an LLM judge to evaluate faithfulness (does the summary only contain information from the source?), conciseness (is it appropriately short?), and completeness (are key points covered?). Build a golden dataset of 100–200 (source, good summary) pairs, run evals on every model or prompt change. Add human review for a random sample weekly. Define what 'regression' means numerically — e.g. faithfulness below 0.90 is a blocking issue.
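    The harness described above reduces to a small aggregation loop with a blocking gate. In this sketch `judge` is any callable returning per-metric scores in [0, 1]; in practice it would wrap an LLM-judge call.

```python
def run_eval(dataset, judge, faithfulness_floor: float = 0.90):
    """Aggregate judge scores over a golden dataset of (source, summary) pairs.

    Returns mean scores per metric plus a pass/fail flag gated on the
    faithfulness floor, so a regression can block a release in CI.
    """
    totals: dict[str, list[float]] = {}
    for source, summary in dataset:
        for metric, score in judge(source, summary).items():
            totals.setdefault(metric, []).append(score)
    means = {m: sum(s) / len(s) for m, s in totals.items()}
    passed = means.get("faithfulness", 0.0) >= faithfulness_floor
    return means, passed
```

    Wiring this into CI — failing the build when `passed` is false — is what turns an eval script into an eval framework.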

    Q7. What are vector databases, and how do you choose between them?

    Strong answer

    Vector databases store embedding vectors and support efficient approximate nearest-neighbour (ANN) search. Popular options: Pinecone (managed, easy to start, expensive at scale), Qdrant (open-source, self-hostable, strong filtering), Weaviate (schema-first, multi-modal), pgvector (PostgreSQL extension — great if you already use Postgres). Key selection criteria: (1) Filtering support — can you filter by metadata before or during vector search? (2) Scale — how many vectors and queries per second? (3) Hosting model — do you need on-prem for data privacy? (4) Hybrid search — does it combine dense + sparse retrieval natively? For most early-stage products, pgvector is sufficient and reduces infrastructure complexity.

    Q8. How do you handle long context in LLM applications where documents exceed the model's context window?

    Strong answer

    Several approaches depending on use case: (1) Chunking + RAG — split the document and retrieve only relevant chunks. Best for Q&A over large corpora. (2) Map-reduce — process each chunk independently (map step) and synthesise results (reduce step). Good for summarisation. (3) Sliding window with overlap — process sequential chunks with context overlap for continuity. (4) Recursive summarisation — iteratively summarise large documents before passing to the model. (5) Use models with very long context windows (Claude 200k, Gemini 1M tokens) — but beware that retrieval performance degrades with very long contexts ('lost in the middle' problem). Always benchmark the chosen approach on your specific task.
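    Approaches (2) and (4) share a control flow: summarise chunks, then hierarchically merge the summaries until one result fits the window. A minimal sketch, with `map_fn` and `reduce_fn` as plain callables standing in for LLM calls:

```python
def recursive_reduce(chunks, map_fn, reduce_fn, max_items: int = 5):
    """Map chunks to partial results, then merge groups of up to
    `max_items` partials per round until a single result remains.
    """
    summaries = [map_fn(c) for c in chunks]
    while len(summaries) > 1:
        groups = [summaries[i:i + max_items]
                  for i in range(0, len(summaries), max_items)]
        summaries = [reduce_fn(g) for g in groups]
    return summaries[0]
```

    Note that each merge round is another set of model calls, so latency and cost grow with document size — worth stating explicitly when benchmarking against a long-context model.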

    Q9. Explain how you'd implement a function-calling / tool-use pattern in a production LLM agent.

    Strong answer

    Define tools as structured JSON schemas (name, description, parameters). Pass the schema in the API request. The model returns either a direct response or a tool_call with the function name and arguments. Your code executes the function and returns the result as a tool message. The model then uses the result to form its final response. Production considerations: (1) Validate all tool arguments before execution — don't trust the model's parameter values. (2) Set timeouts and error handling on tool calls. (3) Limit tool permissions — principle of least privilege. (4) Log all tool calls with input/output for debugging. (5) Rate-limit recursive tool calling to prevent loops. Test with adversarial inputs to find jailbreaks that misuse tools.
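    Consideration (1) — validating tool arguments before execution — looks roughly like this. The `call` payload shape is a simplified assumption modelled on common provider APIs ({"name": ..., "arguments": JSON string}); the schema here is a plain name-to-type map rather than full JSON Schema.

```python
import json

def execute_tool_call(tools, call):
    """Validate and dispatch a model-issued tool call.

    `tools` maps tool name -> (param_types, fn). Model-supplied values
    are never trusted: arguments must parse as JSON and match the
    declared parameter types before fn is invoked.
    """
    if call["name"] not in tools:
        return {"error": f"unknown tool: {call['name']}"}
    param_types, fn = tools[call["name"]]
    try:
        args = json.loads(call["arguments"])
    except json.JSONDecodeError:
        return {"error": "malformed arguments"}
    for param, ptype in param_types.items():
        if param not in args or not isinstance(args[param], ptype):
            return {"error": f"invalid argument: {param}"}
    try:
        return {"result": fn(**args)}
    except Exception as exc:  # tool failures become messages, not crashes
        return {"error": str(exc)}
```

    Returning errors as structured messages (rather than raising) lets the model see the failure and retry or apologise, which is usually the behaviour you want in an agent loop.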

    Q10. What is the 'lost in the middle' problem in LLMs and how does it affect system design?

    Strong answer

    Research (Liu et al., 2023) showed that LLMs perform worse when relevant information appears in the middle of a long context compared to the beginning or end. This has practical implications for RAG: retrieved chunks placed in the middle of a long context window may be ignored. Mitigations: (1) Place the most relevant chunks at the start or end of the context. (2) Use shorter, more focused contexts rather than stuffing the full window. (3) Rerank retrieved chunks and order them by relevance descending (highest relevance first). (4) Use models specifically trained for long-context retrieval (e.g. models trained with positional interpolation). Test your specific use case — the effect varies by model and task.
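    Mitigation (1) has a neat implementation: interleave a relevance-sorted chunk list so the strongest chunks land at the start and end of the context, leaving the weakest in the middle. A small sketch:

```python
def order_for_context(chunks_by_relevance: list) -> list:
    """Reorder chunks (most relevant first) so top-ranked chunks sit at
    the context edges and the least relevant end up in the middle.
    """
    head = chunks_by_relevance[0::2]   # ranks 1, 3, 5, ... from the front
    tail = chunks_by_relevance[1::2]   # ranks 2, 4, 6, ... from the back
    return head + tail[::-1]
```

    For five chunks ranked 1–5 this yields [1, 3, 5, 4, 2]: rank 1 opens the context, rank 2 closes it, and rank 5 is buried in the middle where attention is weakest.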

    Behavioural Questions

    Use the STAR format (Situation, Task, Action, Result) and keep answers to 2–3 minutes.

    Describe a GenAI feature you shipped that didn't perform as expected. What did the evaluation miss?

    Shows that you have shipped LLM features to real users and have a mature view of where offline evals diverge from production.

    How do you manage stakeholder expectations when LLM system behaviour is non-deterministic?

    A practical and common challenge. Demonstrate how you set up feedback loops, communicate confidence, and set realistic SLAs.

    Walk me through how you decided on the architecture of a recent LLM system. What alternatives did you consider?

    Tests depth of experience. The strongest candidates can articulate trade-offs they actually evaluated, not just solutions they read about.

    How do you stay on top of the pace of change in the LLM ecosystem without being distracted by every new release?

    Shows information diet and signal/noise filtering. The best engineers experiment with a few things deeply rather than knowing many things shallowly.

    Tell me about a time you had to balance model capability with data privacy or security requirements.

    Common tension in enterprise LLM projects. Show you've thought about where data leaves the company and how you mitigated risks.

    Red Flags to Watch For

    No evaluation framework in place

    If there's no systematic way to measure whether LLM outputs are good, every prompt change is a gamble.

    Using LLMs for everything

    A sign of novelty-driven rather than problem-driven engineering. Ask how they decide when not to use an LLM.

    No guardrails or output validation

    Production LLM systems need to handle refusals, malformed outputs, and prompt injection. If there's no plan for this, reliability will suffer.

    Token costs not tracked

    If the team doesn't monitor cost per request, the product will be unprofitable at scale. This is a unit economics problem.

    Models deployed without versioning

    Model providers update models without notice. Without pinned model versions, prompt behaviour can change in production unexpectedly.

    Preparation Resources

    RAGAS documentation

    RAG evaluation framework — essential to understand before any LLM interview

    LangChain documentation

    Industry-standard LLM orchestration — know the core abstractions

    OpenAI Cookbook

    Practical examples of function calling, RAG, and fine-tuning patterns

    Anthropic Prompt Engineering Guide

    The most thorough prompting guide from a model provider

    LLM Engineer Salary Guide

    Understand compensation before negotiating

    Ready to apply?

    Browse live LLM engineer roles across the UK — updated daily.