The Interview Process
Stage 1: Research presentation (30–60 min)
Present your most significant work to a panel. Prepare a 20-minute talk covering motivation, method, results, and limitations. Expect probing questions on every design decision.
Stage 2: Paper discussion (45–60 min)
You'll be asked to discuss a paper — either one of yours or a recent paper from the literature. Demonstrate critical thinking, not just summarisation.
Stage 3: Technical / maths round (45–60 min)
Probability, statistics, linear algebra, and calculus. Expect to derive backpropagation, prove convergence properties, or work through a Bayesian inference problem from first principles.
Stage 4: Coding round (45–60 min)
Implement a model or algorithm from scratch in Python. Quality and correctness matter more than speed. Write clean, testable code and explain your choices.
Stage 5: Research vision (30–45 min)
Where do you think your field is going? What problems are most important? Tests long-range thinking and whether your interests align with the team's agenda.
Technical Questions
Write your own answer first, then compare against the example.
Q1. Walk me through your most significant research contribution. What was the core idea and what evidence supported it?
Strong answer
This is the most important question in any research interview and the one most candidates prepare least rigorously. Structure: (1) Motivation — what problem were you solving and why was it unsolved or inadequately addressed? (2) Hypothesis — what was the core claim? (3) Method — how did you test it? What baselines did you compare against, and why those baselines? (4) Results — what did the evidence show? Be specific about numbers. (5) Limitations — what doesn't your work show? Strong researchers proactively discuss limitations. (6) Impact — has it been cited, replicated, or productionised? Prepare to defend every design decision. Interviewers will probe the hardest choices you made.
Q2. Explain the attention mechanism in transformers. What problem does it solve compared to RNNs?
Strong answer
Attention computes a weighted sum of value vectors, where the weights are determined by the compatibility between a query and a set of keys. Self-attention lets each token attend to every other token in the sequence, computing: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. The scaling by √dₖ prevents the dot products from growing too large in high dimensions, stabilising gradients. Compared to RNNs: (1) Parallelisable — attention has no sequential dependency, enabling efficient training on GPUs. (2) No vanishing gradient over long sequences — attention directly connects any two positions. (3) Attention weights offer a degree of interpretability. The main limitation is complexity quadratic in sequence length, addressed by sparse attention, linear attention, and state space models such as Mamba.
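To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (no masking, no learned projections) — illustrative rather than production code:

```python
import numpy as np

def attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # query-key compatibility, scaled by sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted sum of value vectors

# Self-attention: queries, keys, and values all derive from the same sequence.
x = np.random.randn(5, 16)           # 5 tokens, model dimension 16
out = attention(x, x, x)             # shape (5, 16)
```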
Q3. How do you design an ablation study, and why is it important?
Strong answer
An ablation study systematically removes or modifies components of a system to measure each component's contribution to overall performance. It's essential for scientific validity — without it, you can't distinguish which design choices actually matter from those that are incidental. Design principles: (1) Change one thing at a time. (2) Use the same evaluation protocol for all ablation variants. (3) Report variance across multiple runs, not just the mean. (4) Include a 'full model' and progressively ablate each component. Common pitfall: removing components that interact — ablating them individually understates their combined effect. In competitive ML, interviewers often ask 'which component is doing the work here?' — a good ablation study answers this definitively.
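A hedged sketch of that protocol; `train_and_eval` is a hypothetical function you would supply, and the component names are placeholders. The point is the discipline: one change at a time, an identical evaluation, and variance across seeds:

```python
import statistics

FULL = {"attention": True, "aux_loss": True, "augmentation": True}

# One variant per ablated component, plus the full model.
variants = {"full": {}}
for component in FULL:
    variants[f"no_{component}"] = {component: False}

for name, override in variants.items():
    config = {**FULL, **override}
    # train_and_eval is hypothetical: trains with `config`, returns a metric.
    # Same evaluation protocol for every variant, multiple seeds.
    scores = [train_and_eval(config, seed=s) for s in range(5)]
    print(f"{name:16s} mean={statistics.mean(scores):.3f} "
          f"std={statistics.stdev(scores):.3f}")
```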
Q4. What is the difference between inductive and transductive learning?
Strong answer
Inductive learning learns a general function from training data that can be applied to unseen examples — the standard ML paradigm. The model generalises from the training distribution to an unseen test distribution. Transductive learning uses specific test instances during training to make predictions for those exact instances — it doesn't produce a general function. Semi-supervised methods like label propagation on graphs are transductive. Standard supervised learning is inductive. The distinction matters for evaluation: transductive methods can peek at test instance features (but not labels) during training, which can artificially inflate performance if not accounted for in benchmarks. Transductive variants of few-shot learning similarly make joint predictions over the whole unlabelled query set at inference time.
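The distinction is easy to see in code. A small sketch using scikit-learn, where LabelPropagation is transductive (unlabelled instances, marked -1, participate in fitting) and logistic regression is inductive (a general function over unseen inputs); the data is a toy example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation

X = np.random.randn(100, 2)
y = (X[:, 0] > 0).astype(int)          # toy labels
y_partial = y.copy()
y_partial[20:] = -1                    # -1 marks unlabelled instances

# Transductive: unlabelled points are seen during fitting; predictions
# exist only for those exact instances.
lp = LabelPropagation().fit(X, y_partial)
transductive_preds = lp.transduction_[20:]

# Inductive: a general function fit on labelled data only, applicable
# to any future example.
clf = LogisticRegression().fit(X[:20], y[:20])
inductive_preds = clf.predict(X[20:])
```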
Q5. How would you critique a paper that claims state-of-the-art results on a benchmark?
Strong answer
A rigorous critique covers: (1) Benchmark validity — does performance on this benchmark generalise to real-world utility? Many benchmarks are saturated or poorly correlated with practical value. (2) Baseline selection — are the baselines strong and contemporaneous? Cherry-picked weak baselines inflate apparent gains. (3) Hyperparameter tuning — were the baselines tuned as carefully as the proposed method? (4) Statistical significance — is variance reported? Is the improvement outside the noise floor? (5) Compute budget — if the new method uses 10x more compute, is the comparison fair? (6) Reproducibility — is the code available? Have others reproduced the results? State-of-the-art claims require scrutiny proportional to their magnitude.
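For point (4), a quick check worth sketching in the interview: pair runs by seed and test whether the gain clears the noise floor. The scores below are hypothetical numbers for illustration:

```python
from scipy import stats

# Accuracy across five seeds (hypothetical numbers for illustration).
baseline = [0.841, 0.837, 0.845, 0.839, 0.843]
proposed = [0.848, 0.844, 0.851, 0.842, 0.850]

t, p = stats.ttest_rel(proposed, baseline)   # t-test, paired by seed
gain = sum(proposed) / 5 - sum(baseline) / 5
print(f"mean gain = {gain:.3f}, p = {p:.3f}")
# A leaderboard entry reporting a single run cannot support this test at all.
```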
Q6. Explain backpropagation. What can go wrong and how is it addressed?
Strong answer
Backpropagation applies the chain rule to compute gradients of the loss with respect to all parameters. In the forward pass, activations are computed layer by layer. In the backward pass, the gradient of the loss flows back through each layer, multiplied by each layer's local derivatives (the weight matrices and activation-function derivatives). Problems: (1) Vanishing gradients — gradients become exponentially small in deep networks with saturating activations (sigmoid, tanh). Fixed by: ReLU activations, residual connections, batch normalisation, careful initialisation (He init for ReLU). (2) Exploding gradients — gradients grow exponentially. Fixed by: gradient clipping. (3) Dead neurons (ReLU) — neurons that output 0 for all inputs, contributing no gradient. Fixed by: Leaky ReLU, initialisation choices. Modern architectures (transformers, residual networks) are designed specifically to preserve gradient flow.
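A minimal sketch of both passes by hand — a two-layer ReLU network with squared-error loss, He initialisation, and the gradient clipping mentioned in point (2). NumPy only, shapes kept tiny:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))                    # batch of 4, 3 features
y = rng.standard_normal((4, 1))
W1 = rng.standard_normal((3, 8)) * np.sqrt(2 / 3)  # He init for ReLU
W2 = rng.standard_normal((8, 1)) * np.sqrt(2 / 8)

# Forward pass: activations computed layer by layer.
z1 = x @ W1
h = np.maximum(z1, 0)                              # ReLU
pred = h @ W2
loss = ((pred - y) ** 2).mean()

# Backward pass: the chain rule, one layer at a time.
dpred = 2 * (pred - y) / y.size
dW2 = h.T @ dpred
dh = dpred @ W2.T
dz1 = dh * (z1 > 0)                                # ReLU derivative gates the flow
dW1 = x.T @ dz1

# Clip by global norm to guard against exploding gradients.
norm = np.sqrt((dW1**2).sum() + (dW2**2).sum())
scale = min(1.0, 5.0 / norm)
dW1, dW2 = dW1 * scale, dW2 * scale
```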
Q7. What is the difference between frequentist and Bayesian approaches to probability? When does the distinction matter in ML?
Strong answer
Frequentist probability treats probability as the long-run frequency of events. Parameters are fixed unknowns; data is random. Bayesian probability treats probability as a degree of belief. Parameters are random variables with prior distributions; we update beliefs using Bayes' theorem to get a posterior. In ML, the distinction matters when: (1) Data is scarce — Bayesian methods incorporate prior knowledge to prevent overfitting. (2) Uncertainty quantification is required — Bayesian approaches naturally produce calibrated uncertainty estimates. (3) Online learning — Bayesian methods update incrementally. In practice, most production ML relies on point estimates (maximum likelihood, or MAP, which uses a prior but collapses the posterior to a single point). Fully Bayesian methods are more computationally expensive but increasingly tractable through variational inference and MCMC.
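A small worked example of the Bayesian side: a Beta-Bernoulli model, where the conjugate prior makes the posterior update closed-form. It shows all three points above in a few lines: a prior for scarce data, calibrated uncertainty, and incremental updates:

```python
from scipy import stats

alpha, beta = 2, 2                    # prior: weak belief the coin is fair
observations = [1, 1, 0, 1]           # heads = 1, tails = 0

for obs in observations:              # incremental (online) posterior update
    alpha += obs
    beta += 1 - obs

posterior = stats.beta(alpha, beta)
print(f"posterior mean = {posterior.mean():.3f}")
lo, hi = posterior.interval(0.9)      # calibrated uncertainty, not a point estimate
print(f"90% credible interval = ({lo:.3f}, {hi:.3f})")
# The frequentist MLE is simply 3/4 = 0.75, with no uncertainty attached.
```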
Q8. How do you approach negative results in your research? How do you decide whether to publish them?
Strong answer
Negative results — where the hypothesis was not supported — are scientifically valuable but undervalued in publication culture. They prevent others from pursuing dead ends and improve the scientific record. How to approach them: (1) First, verify the result is genuinely negative, not an implementation bug or insufficient compute. (2) Characterise what specifically didn't work and under what conditions. (3) Ask whether the negative result is informative about the hypothesis — did it fail in an interesting way that reveals something about the problem? Reasons to publish: the negative result is surprising given the literature; others are likely to try the same thing; the experimental rigour is high. Dedicated workshops at major ML venues, such as NeurIPS's 'I Can't Believe It's Not Better', specifically value this work.
Q9. How do you ensure your research code is reproducible?
Strong answer
Reproducibility in research requires more discipline than production engineering because experiments are done quickly and rarely revisited. Minimum standard: (1) Seed all random number generators at experiment start. (2) Pin all dependency versions (requirements.txt or conda env with hashes). (3) Log all hyperparameters to an experiment tracker (W&B, MLflow). (4) Version datasets — store them in a stable location and log the exact dataset version used. (5) Commit your code and log the git commit hash with every experiment. Best practice: write a README that lets someone reproduce your main result from scratch with a single command. Provide this with paper submissions. The ML reproducibility crisis (many SOTA results can't be reproduced) is a real problem; strong labs invest in reproducibility infrastructure.
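A minimal sketch of points (1), (3), and (5), assuming a PyTorch stack (drop the torch lines otherwise); the file name and config keys are placeholders:

```python
import json
import random
import subprocess

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every RNG the experiment touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU and CUDA generators

def log_run(config: dict, path: str = "run.json") -> None:
    """Record the exact code version and hyperparameters for this run."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    with open(path, "w") as f:
        json.dump({"git_commit": commit, **config}, f, indent=2)

set_seed(42)
log_run({"seed": 42, "lr": 3e-4, "dataset_version": "v1.2"})
```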
Behavioural Questions
Use the STAR format (Situation, Task, Action, Result) and keep answers to 2–3 minutes.
Describe a research direction you pursued that turned out to be a dead end. How did you recognise it and what did you do next?
Research involves failure. Shows intellectual honesty, ability to course-correct, and how you manage uncertainty.
How do you decide which research problems are worth working on?
Tests research judgment. The best candidates have a principled view of what makes a problem important, tractable, and novel.
Walk me through how you'd present your research to a non-specialist audience.
Communication is a core research skill. Shows ability to identify the essential insight and explain it without jargon.
How do you collaborate with engineers who need to productionise your research?
Industry research roles require bridging science and engineering. Demonstrates awareness of the implementation gap.
How do you stay current with the pace of publications in your field?
Tests information diet and prioritisation. The best answer shows depth (deeply engaging with a few papers) over breadth (skimming everything).
Red Flags to Watch For
No clear research agenda or direction
If the team can't articulate what questions they're trying to answer and why, you'll work on scattered, low-impact problems.
Publication pressure over scientific rigour
If the incentive is number of papers rather than quality of contribution, the research culture will cut corners on validation and reproducibility.
No path from research to product
In industry roles, research that never influences a product has limited organisational value. Ask how recent research has been used.
No compute budget clarity
Research requires significant compute. If budget allocation is opaque or competitive, experiments will be limited in scope.
Isolation from engineering teams
Research silos produce work that can't be reproduced or scaled. Ask how researchers and engineers collaborate.