NLP interviews have a specific technical depth that general ML interviews don't always require. Understanding exactly what each stage tests — and what a strong answer looks like — is what converts good candidates into offers.
The NLP Interview Structure
UK NLP engineering interviews typically run 4–5 stages, with the NLP theory interview being the defining stage that separates specialist candidates from generalists. The process is broadly similar to ML engineering interviews with the addition of NLP-specific depth in the technical round.
Stage 1 — Recruiter screen: Background, motivations, expectations. Prepare a clear narrative about your NLP experience and why you're interested in this specific domain.
Stage 2 — Online coding assessment: Python algorithmic problems, medium difficulty. The same preparation applies as for any ML or SWE role — LeetCode medium problems focusing on strings, data structures, and basic algorithms. Sometimes includes a text processing task.
Stage 3 — NLP technical interview: The core round. 60–90 minutes with 1–2 NLP engineers. Covers transformer architecture, NLP task types, evaluation, text preprocessing, and often a live coding task. This is where preparation on the specific NLP topics below pays off.
Stage 4 — Take-home challenge: 3–6 hours. Fine-tune a model, build a pipeline, or evaluate an NLP system on a provided dataset. Quality and clarity of reasoning matter more than state-of-the-art performance.
Stage 5 — Final loop: System design, culture fit, team meeting. System design prompts for NLP are typically large-scale text processing or classification system designs.
Core NLP Theory Questions
Transformer and architecture questions
- Explain the attention mechanism in transformers. Why does it work better than RNNs for long documents?
Strong answer: attention allows the model to relate any position in the sequence to any other position in a single computation, regardless of distance. RNNs suffer from vanishing gradients over long sequences. Self-attention scales as O(n²) in sequence length, which is why there are efforts to reduce this for very long contexts.
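If you're asked to sketch the mechanism in code, a minimal single-head version is enough to show you understand the mechanics. A sketch in NumPy, with no masking, batching, or learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays. Every position attends to every
    # other in one matrix product: no recurrence, so no distance limit.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len): the O(n^2) cost
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V                              # weighted sum of value vectors

x = np.random.randn(6, 8)                           # toy sequence: 6 tokens, d_k = 8
print(scaled_dot_product_attention(x, x, x).shape)  # (6, 8)
```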
- What is the difference between BERT and GPT architectures?
BERT is encoder-only, pre-trained with masked language modelling (bidirectional). GPT is decoder-only, pre-trained with causal language modelling (left-to-right). BERT is better for classification and understanding tasks; the GPT family for generation. BERT fine-tunes well for NLP tasks; instruction-tuned GPT-family models are the basis for modern LLMs.
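One way to make the distinction concrete, using the Hugging Face transformers pipelines (the model choices and prompts are illustrative):

```python
from transformers import pipeline

# Encoder-only (BERT): predicts a masked token using context on both sides.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("NLP interviews test [MASK] knowledge.")[0]["token_str"])

# Decoder-only (GPT): continues text strictly left-to-right.
gen = pipeline("text-generation", model="gpt2")
print(gen("NLP interviews test", max_new_tokens=5)[0]["generated_text"])
```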
- What is tokenisation and what are the trade-offs between different tokenisation strategies?
Character-level: robust to novel words, but produces long sequences. Word-level: efficient but poor OOV handling. Subword (BPE, WordPiece, SentencePiece): a balance between vocabulary size and sequence length. BPE builds its vocabulary by merging frequent character pairs; WordPiece chooses merges that maximise language-model likelihood. Tokenisation choices affect how well models handle rare words, morphologically rich languages, and domain-specific vocabulary.
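A quick way to compare strategies is to run the same rare word through two subword tokenisers; the exact splits depend on each model's learned vocabulary:

```python
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                     # byte-level BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

word = "electroencephalography"   # rare word: both fall back to subword pieces
print(bpe.tokenize(word))
print(wordpiece.tokenize(word))   # WordPiece marks continuation pieces with '##'
```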
- What is the difference between fine-tuning and prompt engineering for adapting a pre-trained model?
Fine-tuning updates model weights for the specific task. Better for well-defined tasks with labelled training data; more reliable performance on specific inputs. Prompt engineering doesn't update weights; it works through in-context examples and instruction design. Faster and cheaper, but less reliable for precise tasks. Parameter-efficient fine-tuning (LoRA, adapters) is a middle ground: it updates a small number of parameters efficiently.
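A minimal LoRA setup with the peft library; the hyperparameters here are illustrative defaults, not tuned values:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# Low-rank adapters on the attention projections; the base weights stay frozen.
config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16,
                    target_modules=["query", "value"], lora_dropout=0.1)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```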
NLP Task and Evaluation Questions
- How would you evaluate a Named Entity Recognition (NER) model?
Precision (of predicted entities, how many are correct), recall (of true entities, how many were found), and F1 score. Important nuance: exact match vs partial match evaluation. Exact match requires the full span to match; partial match gives credit for overlapping spans. For production NER, also consider the entity-type breakdown: a model may be excellent at PERSON but poor at ORG. Always report final results on a held-out test set, not the validation set you tuned on.
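In practice this is commonly done with the seqeval library, which scores at the entity level with exact-span matching and gives the per-type breakdown:

```python
from seqeval.metrics import classification_report, f1_score

y_true = [["B-PER", "I-PER", "O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]  # found PER, missed the ORG entity

print(classification_report(y_true, y_pred))  # precision/recall/F1 per entity type
print(f1_score(y_true, y_pred))               # micro-averaged entity-level F1
```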
- How do you handle class imbalance in text classification?
Oversampling rare classes (with augmentation), undersampling the majority class, a class-weighted loss function, focal loss (concentrates on hard examples), and threshold adjustment. Evaluate with macro-F1 rather than accuracy when classes are imbalanced. For extreme imbalance, consider whether classification is even the right framing: anomaly detection may be better for rare positive events.
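Of these, a class-weighted loss is usually the first thing to try. A PyTorch sketch with illustrative inverse-frequency weights:

```python
import torch
import torch.nn as nn

counts = torch.tensor([9000.0, 800.0, 200.0])    # per-class training counts (illustrative)
weights = counts.sum() / (len(counts) * counts)  # inverse frequency: rare classes weigh more

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(4, 3)                       # (batch, num_classes)
labels = torch.tensor([0, 2, 1, 2])
print(loss_fn(logits, labels))                   # errors on rare classes now dominate the loss
```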
- What are the limitations of BLEU and ROUGE for evaluating generated text?
Both measure n-gram overlap with reference text. BLEU applies a brevity penalty that punishes short outputs; ROUGE is recall-oriented, with variants such as ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence), each reportable as precision, recall, or F1. Limitations: they don't capture semantic similarity (a perfect paraphrase with different words scores low), don't capture factual accuracy, and correlate poorly with human judgment for abstractive summarisation. BERTScore uses contextual embeddings for better semantic alignment. LLM-as-a-judge is increasingly used to evaluate factuality and coherence.
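The paraphrase problem is easy to demonstrate with the Hugging Face evaluate library; the pair below shares meaning but few n-grams:

```python
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

pred = ["The firm reported higher quarterly earnings."]
ref = ["Quarterly profits rose at the company."]  # a paraphrase, little n-gram overlap

print(rouge.compute(predictions=pred, references=ref))  # low n-gram overlap scores
print(bertscore.compute(predictions=pred, references=ref,
                        lang="en")["f1"])                # higher: embeddings match meaning
```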
- What is domain adaptation in NLP and why does it matter?
Models trained on general text corpora (Wikipedia, web) often perform poorly on domain-specific text (medical records, legal documents, financial reports). Domain adaptation involves continuing pre-training on domain text, fine-tuning on domain-specific labelled data, or using a domain-specific vocabulary. Without adaptation, tokenisation may break domain terms oddly and the model may not recognise domain-specific entities or language patterns.
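A skeleton of the continued pre-training route using the Hugging Face Trainer; the corpus path and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# domain_corpus.txt: raw in-domain text, one passage per line (placeholder path).
ds = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-bert", num_train_epochs=1),
    train_dataset=ds,
    # Re-masks 15% of tokens each batch: the same MLM objective, now on domain text.
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```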
Take-Home Challenge Guidance
NLP take-home challenges typically ask you to fine-tune a model on a provided dataset, build a text processing pipeline, or evaluate an existing NLP system. The evaluation criteria are consistent across companies:
- Clean, reproducible code with a clear setup README
- Appropriate choice of model and evaluation metrics, with justification
- Thoughtful preprocessing decisions (how did you handle the text data?)
- Honest assessment of results — what worked, what didn't, what you'd try next
- Not just working code, but clear reasoning about your approach
Common mistakes: using accuracy as the only metric when classes are imbalanced, not including a requirements.txt or Dockerfile, submitting a notebook with hardcoded paths that won't run on another machine, and claiming results without proper test set evaluation.
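The first mistake is worth internalising with numbers. A degenerate classifier that always predicts the majority class looks impressive on accuracy alone (sklearn shown; counts are illustrative):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100            # always predicts the majority class

print(accuracy_score(y_true, y_pred))             # 0.95: looks strong
print(f1_score(y_true, y_pred, average="macro"))  # ~0.49: exposes the failure
```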
See the full NLP Engineer role guide for salary benchmarks, required skills, top UK employers, and career progression.
Frequently Asked Questions
What technical topics do NLP interviews cover?
Transformer architecture, tokenisation, BERT vs GPT, NLP task types (NER, classification, QA), evaluation metrics, fine-tuning vs prompt engineering, text preprocessing, and often LLM patterns for blended roles.
What does a typical NLP take-home look like?
Fine-tune a BERT-family model on a provided dataset, build an NER pipeline, or evaluate an NLP system. 3–6 hours. Evaluated on code quality, methodology, and clear reasoning — not just model performance.
Do NLP interviews include coding questions?
Yes — online assessment with LeetCode-style problems (medium difficulty), plus NLP-specific coding (text processing, tokeniser, evaluation function). Python is expected throughout.
What system design questions come up?
Large-scale document classification systems, medical NLP pipelines, news recommendation, contract analysis at scale. Cover data pipeline, model selection, evaluation, and production monitoring.
How many interview stages?
4–5 stages: recruiter screen, online coding, NLP technical interview, take-home challenge, final loop. Total process: 3–5 weeks at most UK companies.