Interview Prep

    AI Engineer Interview Questions UK
    Technical & Behavioural Guide 2026

    10 technical questions with strong example answers, 5 behavioural questions, a full interview process walkthrough, and the employer red flags worth knowing before you accept an offer.

    The Interview Process

    Stage 1: Recruiter screen (30 min)

    CV walkthrough, motivations, salary expectations, and a light culture check. Prepare a 2-minute summary of your most relevant AI project.

    Stage 2: Technical screen (45–60 min)

    Python fundamentals, basic ML concepts, and possibly one LLM API question. Often asynchronous via a platform like Codility or HackerRank.

    Stage 3: Take-home or live coding (1–3 hrs)

    Build a small AI feature — often a RAG pipeline, an API endpoint wrapping an LLM, or a data processing task. Code clarity and tests matter as much as a working solution.

    Stage 4: System design (45–60 min)

    Design an end-to-end ML or LLM system at production scale. Focus on monitoring, evaluation, and failure modes rather than just the happy path.

    Stage 5: Behavioural / culture (45 min)

    STAR-format questions about past projects, collaboration, and handling failure. Senior roles will also probe engineering judgment and strategic thinking.

    Technical Questions

    Read the question, write your own answer first, then compare against the example response.

    Q1. How would you design a RAG (retrieval-augmented generation) pipeline for a customer support chatbot?

    Strong answer

    Start by describing the full pipeline: document ingestion → chunking strategy (e.g. 512-token chunks with 50-token overlap) → embedding with a model like text-embedding-3-small → storage in a vector database (Pinecone, Qdrant, or pgvector). At query time: embed the user query → retrieve top-k chunks by cosine similarity → pass them as context to the LLM with a structured prompt. Highlight trade-offs: chunk size affects recall vs. context window usage; reranking (e.g. cross-encoder) improves precision but adds latency. Mention evaluation: measure faithfulness and answer relevance with RAGAS or a custom eval harness.
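    If the interviewer asks you to sketch the query-time path in code, a minimal version helps anchor the discussion. The sketch below assumes an OpenAI-style client; the vector_store.search call is a placeholder for whichever vector database client you name, not a specific library API.

    ```python
    # Query-time RAG sketch. Assumes documents were already chunked, embedded,
    # and stored. `vector_store.search` is a stand-in for your vector DB client
    # (Pinecone, Qdrant, pgvector, ...).
    from openai import OpenAI

    client = OpenAI()

    def answer(query: str, vector_store, top_k: int = 5) -> str:
        # Embed the query with the same model used at ingestion time.
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=query
        ).data[0].embedding

        # Retrieve top-k chunks by cosine similarity (placeholder call).
        chunks = vector_store.search(emb, top_k=top_k)
        context = "\n\n".join(chunk.text for chunk in chunks)

        # Pass retrieved chunks as clearly delimited context.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "Answer using only the context below. If the answer is "
                    "not in the context, say you don't know.\n\n"
                    f"<context>\n{context}\n</context>"
                )},
                {"role": "user", "content": query},
            ],
        )
        return resp.choices[0].message.content
    ```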

    Q2. What strategies do you use to reduce LLM latency in a production API?

    Strong answer

    Key strategies: (1) Streaming responses so the user sees tokens as they arrive rather than waiting for the full completion. (2) Caching — semantic caching (e.g. GPTCache) for near-duplicate queries, exact caching for deterministic prompts. (3) Model selection — use a smaller/faster model (e.g. GPT-4o Mini, Claude Haiku) for simple tasks and route complex queries to a more capable model. (4) Prompt compression — trim unnecessary context. (5) Speculative decoding for self-hosted models. Back your answer with concrete latency budgets: state what's acceptable for each use case (e.g. 200ms for autocomplete vs. 3s for summarisation).
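    Streaming is the cheapest win and easy to demonstrate live. A minimal sketch with an OpenAI-style client (model name illustrative):

    ```python
    # Streaming sketch: the user sees tokens as they arrive, so perceived
    # latency drops even when total generation time is unchanged.
    from openai import OpenAI

    client = OpenAI()

    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarise our refund policy."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk carries a finish reason, no content
            print(delta, end="", flush=True)
    ```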

    Q3. Explain how you would evaluate an LLM-based feature before shipping it to production.

    Strong answer

    I'd build a layered evaluation approach. First, offline evals: a curated dataset of golden examples with expected outputs, scored against criteria like factual accuracy, tone, and task completion using both model-based judges (e.g. GPT-4 as evaluator) and rule-based checks. Second, A/B testing in production using metrics that matter to the user (resolution rate for support, click-through for recommendations). Third, ongoing monitoring: logging prompts and completions, tracking refusal rates, latency, and token cost. Avoid shipping without a baseline and at least one human review pass.
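    A sketch of the offline layer shows you've built one rather than just read about it. The dataset fields and the generate/judge callables below are illustrative, not a specific framework:

    ```python
    # Offline eval sketch: score outputs against a golden set with cheap
    # rule-based checks first, then a model-based judge. `generate` is the
    # system under test; `judge` returns a 1-5 quality score.
    import json

    def rule_checks(output: str, case: dict) -> bool:
        # Deterministic checks, e.g. required phrases present in the output.
        return all(p in output for p in case.get("must_contain", []))

    def run_evals(cases: list[dict], generate, judge) -> float:
        passed = 0
        for case in cases:
            output = generate(case["input"])
            ok = (rule_checks(output, case)
                  and judge(case["input"], output, case["expected"]) >= 4)
            passed += ok
        return passed / len(cases)

    # cases = [json.loads(line) for line in open("golden_set.jsonl")]
    # print(f"pass rate: {run_evals(cases, generate, judge):.0%}")
    ```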

    Q4. What is the difference between fine-tuning and prompt engineering, and when would you choose each?

    Strong answer

    Prompt engineering is zero-cost to deploy, fully reversible, and should always be tried first — for behaviour shaping, output formatting, and task decomposition. Fine-tuning is appropriate when: (1) you need consistent stylistic or domain-specific behaviour that can't be reliably achieved via prompting, (2) you need to reduce token cost at scale (a fine-tuned smaller model can outperform a prompted larger one), or (3) you need to reduce prompt length in latency-sensitive applications. Fine-tuning requires labelled examples, compute, and ongoing maintenance — it's a commitment, not a quick fix.
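    If you do reach for fine-tuning, be ready to describe the training data. One example in the chat-style JSONL format used by OpenAI-style fine-tuning APIs (content illustrative):

    ```python
    # One training example in chat-formatted JSONL (one JSON object per line).
    # A fine-tuning run needs many of these; quality and consistency matter
    # more than raw volume.
    import json

    example = {
        "messages": [
            {"role": "system", "content": "You are a concise support agent."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": (
                "Go to Settings > Security > Reset password, then follow "
                "the confirmation email."
            )},
        ]
    }
    with open("train.jsonl", "a") as f:
        f.write(json.dumps(example) + "\n")
    ```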

    Q5. How do you handle hallucinations in a production LLM system?

    Strong answer

    Hallucinations can't be eliminated, only managed. Defence-in-depth: (1) Ground responses in retrieved context (RAG) so the model has a source to cite. (2) Add output validation — structured outputs (JSON mode, function calling) prevent format hallucinations. (3) Self-consistency: run the same query multiple times and compare; flag divergent answers for human review. (4) Confidence signalling in the prompt: 'if you are not confident, say so'. (5) Downstream guardrails: check that cited sources actually exist, validate numerical claims programmatically. Monitor refusal rates and user feedback loops.
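    Self-consistency is easy to sketch and shows you think in terms of routing rather than trusting single outputs. The ask_llm and similarity callables below are placeholders for your model call and a semantic-similarity measure:

    ```python
    # Self-consistency sketch: sample the same query several times and flag
    # divergent answers for human review rather than returning them blindly.
    def answer_or_escalate(query: str, ask_llm, similarity,
                           n: int = 3, threshold: float = 0.85):
        answers = [ask_llm(query, temperature=0.7) for _ in range(n)]
        # Low mutual similarity suggests the model is guessing rather than
        # recalling grounded facts.
        scores = [similarity(a, b)
                  for i, a in enumerate(answers)
                  for b in answers[i + 1:]]
        if min(scores) < threshold:
            return None  # caller routes to a human / fallback response
        return answers[0]
    ```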

    Q6. Walk me through how you'd build a model serving infrastructure for a high-traffic inference endpoint.

    Strong answer

    I'd start with the model server: vLLM or TGI for LLMs (PagedAttention for efficient GPU memory), Triton Inference Server for traditional ML models. Layer a load balancer (NGINX or a cloud ALB) in front of multiple instances. For scaling: horizontal pod autoscaling in Kubernetes triggered by GPU utilisation or queue depth. Add a request queue (Redis or Celery) to absorb traffic spikes without dropping requests. Cache frequent responses at the application layer. Instrument everything: p50/p95/p99 latency, tokens/second, GPU utilisation, error rate. Define SLOs before launching.
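    Interviewers often push on the "instrument everything" claim, so be able to sketch it. A minimal in-process version (in production you'd export to Prometheus or similar rather than computing percentiles in memory):

    ```python
    # Latency instrumentation sketch: time every request, report p50/p95/p99.
    import time
    from statistics import quantiles

    latencies_ms: list[float] = []

    def timed(handler, request):
        start = time.perf_counter()
        try:
            return handler(request)
        finally:
            latencies_ms.append((time.perf_counter() - start) * 1000)

    def report():
        # quantiles(n=100) returns 99 percentile cut points (needs >= 2 samples).
        q = quantiles(latencies_ms, n=100)
        print(f"p50={q[49]:.0f}ms  p95={q[94]:.0f}ms  p99={q[98]:.0f}ms")
    ```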

    Q7. What metrics would you use to monitor an AI feature after deployment?

    Strong answer

    I separate three types: (1) Technical health — latency (p50/p95), error rate, token cost per request, model uptime. (2) Feature performance — task-specific metrics (resolution rate for support, accuracy for classification, BLEU/ROUGE for summarisation). (3) Business impact — downstream KPIs that the feature was meant to move (conversion, engagement, support ticket deflection). Also monitor for data drift: if input distributions shift, performance may degrade silently. Set up alerts on p95 latency and error rate, review feature metrics weekly.
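    One way to make the three layers concrete is a single per-request record that later joins to business KPIs. Field names here are illustrative:

    ```python
    # Per-request metrics record sketch covering technical health and feature
    # performance; business impact comes from joining these to downstream KPIs.
    import json
    import time

    def log_request(request_id: str, latency_ms: float, total_tokens: int,
                    cost_usd: float, task_success: bool | None) -> None:
        record = {
            "ts": time.time(),
            "request_id": request_id,
            "latency_ms": latency_ms,        # technical health
            "total_tokens": total_tokens,    # cost driver
            "cost_usd": cost_usd,
            "task_success": task_success,    # feature performance (if known)
        }
        print(json.dumps(record))  # stand-in for a real metrics sink
    ```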

    Q8. Describe a situation where you had to debug a non-deterministic bug in an AI pipeline. How did you approach it?

    Strong answer

    Non-determinism in LLM pipelines comes from temperature > 0, network failures, or upstream data drift. My approach: first, reproduce the bug with a fixed seed and temperature=0 to isolate whether it's stochastic or deterministic. Log every step with unique request IDs so I can trace the full path of a failing request. Add output schema validation at each stage so I catch malformed outputs early rather than at the final step. If the bug is stochastic, build a regression suite of adversarial inputs and run it on every deploy.
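    Request IDs and per-stage schema validation are worth being able to sketch. A minimal version using Pydantic (the Extraction schema is a hypothetical stage output):

    ```python
    # Tracing sketch: thread one request ID through every stage and validate
    # each stage's output schema, so malformed outputs fail early and failing
    # requests can be replayed later.
    import logging
    import uuid
    from pydantic import BaseModel, ValidationError

    logger = logging.getLogger("pipeline")

    class Extraction(BaseModel):  # hypothetical output of one stage
        customer_id: str
        intent: str

    def run_stage(name: str, fn, payload, request_id: str,
                  schema: type[BaseModel]):
        logger.info("request=%s stage=%s start", request_id, name)
        result = fn(payload)
        try:
            return schema.model_validate(result)  # fail fast on bad shape
        except ValidationError:
            logger.error("request=%s stage=%s bad output: %r",
                         request_id, name, result)
            raise

    # request_id = str(uuid.uuid4())  # generated once, passed to every stage
    ```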

    Q9. How do you approach prompt injection risks in a customer-facing LLM product?

    Strong answer

    Prompt injection occurs when user input overrides the system instructions or pushes the model outside its intended role. Defences: (1) Clearly delimit user input from system instructions (e.g. XML tags, separate message roles). (2) Validate and sanitise inputs — strip special characters that could break prompt structure. (3) Use a secondary safety classifier (e.g. Llama Guard) to screen inputs and outputs. (4) Never put secrets or sensitive internal logic (API keys, business rules) in the system prompt, where they could be extracted. (5) Apply the principle of least privilege: the model should only have access to the tools it actually needs.
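    A sketch of points (1) and (3): keep instructions in the system role, delimit user text, and screen it first. The screen_input stub stands in for a real classifier such as Llama Guard:

    ```python
    # Injection-resistance sketch: user text never shares a role with
    # instructions, is explicitly delimited, and is screened before use.
    def screen_input(text: str) -> bool:
        # Stand-in for a real safety classifier (e.g. Llama Guard); here we
        # just block obvious role-hijacking phrases as an illustration.
        banned = ("ignore previous instructions", "you are now the system")
        return not any(p in text.lower() for p in banned)

    def build_messages(system_prompt: str, user_input: str) -> list[dict]:
        if not screen_input(user_input):
            raise ValueError("input rejected by safety screen")
        return [
            # Instructions live only in the system role; no secrets here.
            {"role": "system", "content": system_prompt},
            # Delimited so user text can't masquerade as instructions.
            {"role": "user",
             "content": f"<user_input>\n{user_input}\n</user_input>"},
        ]
    ```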

    Q10. How do you keep up with the pace of change in AI tooling and research?

    Strong answer

    This question tests learning habits and the ability to filter signal from noise. Name a small set of high-quality sources (the Hugging Face blog, LangChain release notes, specific Substack authors, arXiv Sanity for papers). Describe allocating time each week to read and experiment — ideally building a small proof of concept rather than just reading. Mention participating in the community: writing about experiments, contributing to open source, attending local meetups. Avoid 'I read everything' — the ability to triage and focus is what matters in a fast-moving field.

    Behavioural Questions

    Use the STAR method (Situation, Task, Action, Result) and keep answers to 2–3 minutes.

    Tell me about a time an AI feature you built didn't work as expected in production. What did you do?

    Show you monitor proactively, diagnose systematically, and communicate clearly with stakeholders. Avoid blaming the model.

    Describe how you've worked with non-technical stakeholders to define requirements for an AI system.

    Demonstrate that you translate between technical constraints (what models can/can't do) and business needs. Mention how you set expectations around uncertainty.

    How do you decide when a problem is worth using AI for vs. a simpler rule-based approach?

    Show engineering judgment. AI adds complexity, cost, and unpredictability. A good engineer reaches for it only when the problem genuinely benefits.

    Walk me through a technical decision you pushed back on. What was the outcome?

    Demonstrates confidence, communication, and intellectual honesty. Have two examples ready: one where you were right, and one where you were wrong and learned from it.

    How do you balance moving quickly on AI experiments with maintaining code quality?

    Shows pragmatism. Mention the difference between prototype code (fast, throwaway) and production code (tested, reviewed, documented). Describe how you draw the line.

    Red Flags to Watch For

    Questions to ask — and signals that suggest the role or team may not set you up for success.

    No model monitoring or alerting

    If the team can't describe how they detect when a model degrades in production, you'll be flying blind after every deploy.

    "We'll add evals later"

    Evaluation is not optional — it's how you know if you're making things better or worse. Teams that defer it rarely add it.

    No clear ownership of the AI system

    If it's unclear who is responsible for model quality, latency, and cost, those things won't be managed well.

    AI is being added without a clear user problem

    Technology-led product decisions lead to low-impact work. Ask what metric the AI feature is supposed to move and whether there's a baseline.

    No documentation of prompt changes

    Prompts are code. If they're not versioned and reviewed, regressions will be invisible.
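    If you want to probe this in an interview, it helps to have a lightweight pattern in mind: prompts as files under version control, with a content hash logged per request so regressions trace back to an exact prompt. Paths and names below are illustrative:

    ```python
    # Prompt-versioning sketch: load prompts from version-controlled files and
    # log a short content hash with every request.
    import hashlib
    from pathlib import Path

    def load_prompt(name: str, prompts_dir: str = "prompts") -> tuple[str, str]:
        text = Path(prompts_dir, f"{name}.txt").read_text()
        version = hashlib.sha256(text.encode()).hexdigest()[:12]
        return text, version  # log `version` alongside each request ID
    ```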

    Preparation Resources

    Hugging Face Course

    Free, practical NLP and LLM fundamentals

    fast.ai Practical Deep Learning

    Hands-on approach to building and deploying models

    RAGAS documentation

    RAG evaluation framework — understand before the interview

    LangChain documentation

    Industry-standard LLM orchestration library

    System Design Interview book (Vol. 2)

    Useful for ML system design rounds

    Ready to put this into practice?

    Browse live AI engineer roles across the UK and apply with confidence.