Interview Prep

    LLM Engineer Interview Questions UK
    Technical & Behavioural Guide 2026

    10 technical questions covering RAG pipelines, fine-tuning, evaluation, and GenAI system design — with strong example answers, behavioural prep, and what to watch for in an employer.

    The Interview Process

    Stage 1: Recruiter screen (30 min)

    Background and fit. Expect a question about a specific LLM project you've shipped, including what you measured and how it performed.

    Stage 2: Technical screen (45–60 min)

    GenAI fundamentals, RAG concepts, and prompt engineering. May include a live coding exercise building a small LLM pipeline.

    Stage 3: LLM system design (45–60 min)

    Design a production GenAI system — often a RAG chatbot, a content generation pipeline, or an agent. Depth on evaluation and monitoring is key.

    Stage 4: Coding or take-home (1–3 hrs)

    Build a working LLM feature. The quality of the eval harness and error handling often differentiates candidates more than raw model performance.

    Stage 5: Behavioural (45 min)

    Questions about shipping GenAI features to real users, managing non-determinism, and cross-functional collaboration.

    Technical Questions

    Write your own answer first, then compare against the example.

    Q1. What are the key design decisions when building a RAG system, and how do you evaluate whether it's working?

    Strong answer

    Key decisions: (1) Chunking strategy — chunk size and overlap affect recall vs. context window usage. Smaller chunks improve precision; larger improve coherence. (2) Embedding model — OpenAI text-embedding-3-small vs. open-source alternatives (e5-large, BGE). Test on your domain. (3) Retrieval — dense (cosine similarity), sparse (BM25), or hybrid. Hybrid often beats either alone. (4) Reranking — a cross-encoder reranker (e.g. Cohere Rerank) improves precision after initial retrieval. Evaluation: use RAGAS to measure faithfulness (does the answer match the retrieved context?), answer relevance (does it address the question?), and context recall (did retrieval surface the right chunks?). Build a golden dataset with ground-truth Q&A pairs and test every pipeline change against it.
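    Decision (1) can be sketched in a few lines. This is a minimal character-based chunker, assuming fixed-size windows; production systems usually chunk on token or sentence boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlapping windows reduce the chance that an answer-bearing
    sentence is cut in half at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

    When tuning, treat `chunk_size` and `overlap` as hyperparameters and measure their effect on context recall against the golden dataset rather than picking them by intuition.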

    Q2. When would you fine-tune an LLM rather than use prompt engineering or RAG?

    Strong answer

    Fine-tuning is justified when: (1) You need consistent domain-specific output style or vocabulary that can't be reliably induced through prompting. (2) You're making many API calls and a fine-tuned smaller model (e.g. fine-tuned GPT-4o Mini) can match a prompted larger model at lower cost and latency. (3) Your use case has a structured output format that a smaller model struggles to follow even with few-shot examples. Avoid fine-tuning to inject knowledge — it's fragile and expensive; use RAG for knowledge grounding. Also avoid fine-tuning when you need the ability to update knowledge frequently. The practical order is: prompt engineering → RAG → fine-tuning.

    Q3. How do you measure and improve the factual accuracy of an LLM system?

    Strong answer

    Factual accuracy is distinct from answer quality. Measure it by: (1) Grounding rate — what percentage of claims in the output can be traced to a retrieved source? (2) Citation accuracy — if the model cites a source, does the cited source actually support the claim? (3) Hallucination rate — use a model-as-judge approach or a dedicated hallucination detector. To improve: increase retrieval quality so the model has correct information to work from; use explicit grounding instructions ('only use information from the provided context'); add a self-consistency pass; validate numerical and date claims programmatically. Monitor in production via sampling + human review.
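    The grounding-rate metric in (1) can be approximated even without a judge model. This is a crude lexical sketch — real systems use an NLI model or an LLM judge — but it illustrates the shape of the measurement: fraction of claims whose content is traceable to the retrieved context.

```python
def grounding_rate(claims: list[str], context: str, threshold: float = 0.5) -> float:
    """Fraction of claims whose content words appear in the retrieved context.

    A lexical proxy only: a claim counts as grounded if at least
    `threshold` of its content words (length > 3) occur in the context.
    """
    context_words = set(context.lower().split())
    grounded = 0
    for claim in claims:
        words = [w for w in claim.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap >= threshold:
            grounded += 1
    return grounded / len(claims) if claims else 0.0
```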

    Q4. Explain the difference between instruction fine-tuning and RLHF. When is each appropriate?

    Strong answer

    Instruction fine-tuning (SFT) trains the model on (instruction, response) pairs to make it follow user instructions reliably. It's supervised and relatively straightforward to implement. RLHF (Reinforcement Learning from Human Feedback) adds a reward model trained on human preference comparisons, then uses RL (typically PPO) to further align the model toward preferred responses. RLHF is more expensive and complex but produces better alignment with subjective human preferences (helpfulness, harmlessness). For most production use cases, SFT with careful data curation is sufficient. RLHF is used primarily by model providers improving foundation models. DPO (Direct Preference Optimization) is a simpler alternative to full RLHF for practitioners.

    Q5. What strategies do you use to manage and reduce LLM token costs at scale?

    Strong answer

    Token cost is a product decision as much as an engineering one. Strategies: (1) Model routing — classify query complexity and route simple queries to a cheap model (Claude Haiku, GPT-4o Mini) and complex ones to a capable model. (2) Prompt compression — remove unnecessary context, use shorter system prompts. (3) Caching — exact-match cache for repeated prompts, semantic cache (GPTCache) for near-duplicates. (4) Batching — group requests where latency allows. (5) Output length control — set max_tokens appropriately and use streaming with early stopping for user-facing interfaces. Track cost per query and per user segment so you know where to optimise.
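    Strategy (1) can be as simple as a heuristic classifier in front of the API call. The model names and marker words below are illustrative assumptions; production routers are often trained classifiers rather than keyword rules.

```python
def route_model(query: str,
                cheap: str = "small-model",
                capable: str = "large-model") -> str:
    """Route a query to a cheap or capable model based on crude complexity signals.

    Signals used here: query length and the presence of reasoning keywords.
    Both thresholds are placeholders to tune against real traffic.
    """
    reasoning_markers = ("why", "compare", "explain", "analyse", "step by step")
    complex_query = (len(query.split()) > 40
                     or any(m in query.lower() for m in reasoning_markers))
    return capable if complex_query else cheap
```

    The pay-off comes from measuring the router: log which tier handled each query and sample outputs from the cheap tier to confirm quality holds.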

    Q6. How would you build an LLM evaluation framework for a summarisation product?

    Strong answer

    Summarisation evaluation needs both reference-based and reference-free metrics. Reference-based: ROUGE-L (n-gram overlap with human summaries) and BERTScore (semantic similarity). Reference-free (preferred in practice): use an LLM judge to evaluate faithfulness (does the summary only contain information from the source?), conciseness (is it appropriately short?), and completeness (are key points covered?). Build a golden dataset of 100–200 (source, good summary) pairs, run evals on every model or prompt change. Add human review for a random sample weekly. Define what 'regression' means numerically — e.g. faithfulness below 0.90 is a blocking issue.
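    The harness described above reduces to a small aggregation loop with a blocking gate. In this sketch `judge` is any callable returning per-metric scores in [0, 1]; in practice it would wrap an LLM-judge call.

```python
def run_eval(dataset, judge, faithfulness_floor: float = 0.90):
    """Aggregate judge scores over a golden dataset of (source, summary) pairs.

    Returns mean scores per metric plus a pass/fail flag gated on the
    faithfulness floor, so a regression can block a release in CI.
    """
    totals: dict[str, list[float]] = {}
    for source, summary in dataset:
        for metric, score in judge(source, summary).items():
            totals.setdefault(metric, []).append(score)
    means = {m: sum(s) / len(s) for m, s in totals.items()}
    passed = means.get("faithfulness", 0.0) >= faithfulness_floor
    return means, passed
```

    Wiring this into CI — failing the build when `passed` is false — is what turns an eval script into an eval framework.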

    Q7. What are vector databases, and how do you choose between them?

    Strong answer

    Vector databases store embedding vectors and support efficient approximate nearest-neighbour (ANN) search. Popular options: Pinecone (managed, easy to start, expensive at scale), Qdrant (open-source, self-hostable, strong filtering), Weaviate (schema-first, multi-modal), pgvector (PostgreSQL extension — great if you already use Postgres). Key selection criteria: (1) Filtering support — can you filter by metadata before or during vector search? (2) Scale — how many vectors and queries per second? (3) Hosting model — do you need on-prem for data privacy? (4) Hybrid search — does it combine dense + sparse retrieval natively? For most early-stage products, pgvector is sufficient and reduces infrastructure complexity.

    Q8. How do you handle long context in LLM applications where documents exceed the model's context window?

    Strong answer

    Several approaches depending on use case: (1) Chunking + RAG — split the document and retrieve only relevant chunks. Best for Q&A over large corpora. (2) Map-reduce — process each chunk independently (map step) and synthesise results (reduce step). Good for summarisation. (3) Sliding window with overlap — process sequential chunks with context overlap for continuity. (4) Recursive summarisation — iteratively summarise large documents before passing to the model. (5) Use models with very long context windows (Claude 200k, Gemini 1M tokens) — but beware that retrieval performance degrades with very long contexts ('lost in the middle' problem). Always benchmark the chosen approach on your specific task.
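    Approaches (2) and (4) share a control flow: summarise chunks, then hierarchically merge the summaries until one result fits the window. A minimal sketch, with `map_fn` and `reduce_fn` as plain callables standing in for LLM calls:

```python
def recursive_reduce(chunks, map_fn, reduce_fn, max_items: int = 5):
    """Map chunks to partial results, then merge groups of up to
    `max_items` partials per round until a single result remains.
    """
    summaries = [map_fn(c) for c in chunks]
    while len(summaries) > 1:
        groups = [summaries[i:i + max_items]
                  for i in range(0, len(summaries), max_items)]
        summaries = [reduce_fn(g) for g in groups]
    return summaries[0]
```

    Note that each merge round is another set of model calls, so latency and cost grow with document size — worth stating explicitly when benchmarking against a long-context model.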

    Q9. Explain how you'd implement a function-calling / tool-use pattern in a production LLM agent.

    Strong answer

    Define tools as structured JSON schemas (name, description, parameters). Pass the schema in the API request. The model returns either a direct response or a tool_call with the function name and arguments. Your code executes the function and returns the result as a tool message. The model then uses the result to form its final response. Production considerations: (1) Validate all tool arguments before execution — don't trust the model's parameter values. (2) Set timeouts and error handling on tool calls. (3) Limit tool permissions — principle of least privilege. (4) Log all tool calls with input/output for debugging. (5) Rate-limit recursive tool calling to prevent loops. Test with adversarial inputs to find jailbreaks that misuse tools.
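    Consideration (1) — validating tool arguments before execution — looks roughly like this. The `call` payload shape is a simplified assumption modelled on common provider APIs ({"name": ..., "arguments": JSON string}); the schema here is a plain name-to-type map rather than full JSON Schema.

```python
import json

def execute_tool_call(tools, call):
    """Validate and dispatch a model-issued tool call.

    `tools` maps tool name -> (param_types, fn). Model-supplied values
    are never trusted: arguments must parse as JSON and match the
    declared parameter types before fn is invoked.
    """
    if call["name"] not in tools:
        return {"error": f"unknown tool: {call['name']}"}
    param_types, fn = tools[call["name"]]
    try:
        args = json.loads(call["arguments"])
    except json.JSONDecodeError:
        return {"error": "malformed arguments"}
    for param, ptype in param_types.items():
        if param not in args or not isinstance(args[param], ptype):
            return {"error": f"invalid argument: {param}"}
    try:
        return {"result": fn(**args)}
    except Exception as exc:  # tool failures become messages, not crashes
        return {"error": str(exc)}
```

    Returning errors as structured messages (rather than raising) lets the model see the failure and retry or apologise, which is usually the behaviour you want in an agent loop.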

    Q10. What is the 'lost in the middle' problem in LLMs and how does it affect system design?

    Strong answer

    Research (Liu et al., 2023) showed that LLMs perform worse when relevant information appears in the middle of a long context compared to the beginning or end. This has practical implications for RAG: retrieved chunks placed in the middle of a long context window may be ignored. Mitigations: (1) Place the most relevant chunks at the start or end of the context. (2) Use shorter, more focused contexts rather than stuffing the full window. (3) Rerank retrieved chunks and order them by relevance descending (highest relevance first). (4) Use models specifically trained for long-context retrieval (e.g. models trained with positional interpolation). Test your specific use case — the effect varies by model and task.
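    Mitigation (1) has a neat implementation: interleave a relevance-sorted chunk list so the strongest chunks land at the start and end of the context, leaving the weakest in the middle. A small sketch:

```python
def order_for_context(chunks_by_relevance: list) -> list:
    """Reorder chunks (most relevant first) so top-ranked chunks sit at
    the context edges and the least relevant end up in the middle.
    """
    head = chunks_by_relevance[0::2]   # ranks 1, 3, 5, ... from the front
    tail = chunks_by_relevance[1::2]   # ranks 2, 4, 6, ... from the back
    return head + tail[::-1]
```

    For five chunks ranked 1–5 this yields [1, 3, 5, 4, 2]: rank 1 opens the context, rank 2 closes it, and rank 5 is buried in the middle where attention is weakest.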

    Behavioural Questions

    Use the STAR format (Situation, Task, Action, Result) and keep answers to 2–3 minutes.

    Describe a GenAI feature you shipped that didn't perform as expected. What did the evaluation miss?

    Shows that you have shipped LLM features to real users and have a mature view of where offline evals diverge from production.

    How do you manage stakeholder expectations when LLM system behaviour is non-deterministic?

    A practical and common challenge. Demonstrate how you set up feedback loops, communicate confidence, and set realistic SLAs.

    Walk me through how you decided on the architecture of a recent LLM system. What alternatives did you consider?

    Tests depth of experience. The strongest candidates can articulate trade-offs they actually evaluated, not just solutions they read about.

    How do you stay on top of the pace of change in the LLM ecosystem without being distracted by every new release?

    Shows information diet and signal/noise filtering. The best engineers experiment with a few things deeply rather than knowing many things shallowly.

    Tell me about a time you had to balance model capability with data privacy or security requirements.

    Common tension in enterprise LLM projects. Show you've thought about where data leaves the company and how you mitigated risks.

    Red Flags to Watch For

    No evaluation framework in place

    If there's no systematic way to measure whether LLM outputs are good, every prompt change is a gamble.

    Using LLMs for everything

    A sign of novelty-driven rather than problem-driven engineering. Ask how they decide when not to use an LLM.

    No guardrails or output validation

    Production LLM systems need to handle refusals, malformed outputs, and prompt injection. If there's no plan for this, reliability will suffer.

    Token costs not tracked

    If the team doesn't monitor cost per request, the product will be unprofitable at scale. This is a unit economics problem.

    Models deployed without versioning

    Model providers update models without notice. Without pinned model versions, prompt behaviour can change in production unexpectedly.

    Preparation Resources

    RAGAS documentation

    RAG evaluation framework — essential to understand before any LLM interview

    LangChain documentation

    Industry-standard LLM orchestration — know the core abstractions

    OpenAI Cookbook

    Practical examples of function calling, RAG, and fine-tuning patterns

    Anthropic Prompt Engineering Guide

    The most thorough prompting guide from a model provider

    LLM Engineer Salary Guide

    Understand compensation before negotiating

    Ready to apply?

    Browse live LLM engineer roles across the UK — updated daily.