Interview Prep

    MLOps Engineer Interview Questions UK
    Technical & Behavioural Guide 2026

    10 technical questions covering CI/CD for ML, Kubernetes, model monitoring, feature stores, and infrastructure — with strong example answers, behavioural prep, and employer red flags.

    The Interview Process

    Stage 1: Recruiter screen (30 min)

    Background, cloud platform experience (AWS/GCP/Azure), and specific MLOps tooling. Prepare to discuss a production ML pipeline you've built end-to-end.

    Stage 2: Infrastructure coding (45–60 min)

    Write Terraform, Kubernetes manifests, or a Dockerfile. Python scripting for pipeline automation is also common. Focus on correctness and production-readiness.

    Stage 3: ML system design (45–60 min)

    Design a training pipeline, feature store, or model serving infrastructure. Cover monitoring, rollback, and cost — interviewers want to see you think beyond the happy path.

    Stage 4: Technical deep-dive (45 min)

    Deep questions on topics from your CV — Kubernetes internals, CI/CD patterns, model drift detection, or specific tooling (MLflow, Kubeflow, Vertex AI).

    Stage 5: Behavioural (45 min)

    Incidents, cross-functional collaboration, and architectural decisions. Senior roles include questions on platform strategy and team enablement.

    Technical Questions

    Write your own answer first, then compare against the example.

    Q1. How would you design a CI/CD pipeline for a machine learning model?

    Strong answer

    An ML CI/CD pipeline differs from software CI/CD because it must version data and models, not just code. Key stages: (1) Trigger — code push, data update, or scheduled retraining. (2) Data validation — Great Expectations or Pandera checks that the training data meets schema and quality expectations. (3) Model training — reproducible training run with logged hyperparameters (MLflow). (4) Model evaluation — compare against the current production model on a holdout set; only promote if the challenger meets defined thresholds. (5) Model registration — push artefact to model registry with metadata. (6) Deployment — canary or shadow deploy to a staging environment, run integration tests, promote to production. Use tools like GitHub Actions, Kubeflow Pipelines, or Metaflow. Everything must be reproducible from a commit hash.
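
    A minimal sketch of the evaluation gate (stage 4), using the MLflow client. The model name, metric key, and uplift threshold are illustrative:

    ```python
    # Evaluation gate: register the challenger only if it clears the current
    # production model's holdout metric. "churn-model" and "holdout_auc" are
    # hypothetical names.
    import mlflow
    from mlflow.tracking import MlflowClient

    client = MlflowClient()

    def evaluate_and_register(run_id: str, challenger_auc: float,
                              model_name: str = "churn-model",
                              min_uplift: float = 0.005) -> bool:
        prod = client.get_latest_versions(model_name, stages=["Production"])
        if prod:
            champion = client.get_run(prod[0].run_id)
            if challenger_auc < champion.data.metrics["holdout_auc"] + min_uplift:
                return False  # challenger does not clear the bar; stop the pipeline
        mlflow.register_model(f"runs:/{run_id}/model", model_name)
        return True
    ```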

    Q2. What is model drift and how do you detect and respond to it?

    Strong answer

    Two types: data drift (the distribution of input features changes) and concept drift (the relationship between inputs and the target changes). Detection: (1) Data drift — monitor statistical properties of incoming features (mean, variance, Kullback-Leibler divergence, Population Stability Index) against training distribution. (2) Concept drift — monitor model output distribution and business metrics (conversion rate, prediction accuracy on a sample). Tools: Evidently AI, WhyLabs, Arize. Response: (1) Alert and investigate — is it genuine drift or a data pipeline issue? (2) Trigger retraining with recent data. (3) Review whether the feature set needs updating. Set up automated retraining pipelines but require human approval before production promotion.
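
    As an illustration, the Population Stability Index for a single numeric feature takes only a few lines; the 0.2 alert threshold is a common rule of thumb, not a universal constant:

    ```python
    # PSI between the training (reference) distribution and live traffic.
    # Values above ~0.2 are commonly treated as significant drift.
    import numpy as np

    def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
        # Bin edges come from the reference sample so both distributions
        # are compared on the same grid.
        edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        ref_frac = np.histogram(reference, edges)[0] / len(reference)
        live_frac = np.histogram(live, edges)[0] / len(live)
        # Clip away empty bins to avoid division by zero and log(0).
        ref_frac = np.clip(ref_frac, 1e-6, None)
        live_frac = np.clip(live_frac, 1e-6, None)
        return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))
    ```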

    Q3. Explain how Kubernetes is used in ML infrastructure. What are pods, deployments, and services?

    Strong answer

    Kubernetes is the standard orchestration layer for scalable ML inference. A Pod is the smallest deployable unit — one or more containers that share networking and storage. A Deployment manages a set of identical pods, handles rolling updates, and maintains a desired replica count. A Service provides a stable endpoint (IP/DNS) for reaching pods, abstracting away pod churn. In ML: model servers run as Deployments (e.g. vLLM, TF Serving), exposed via a Service. For training jobs, use Kubernetes Jobs (run-to-completion) or specialised operators (Kubeflow Training Operator for distributed training). GPU resources are requested via resource limits on pods. Horizontal Pod Autoscalers scale inference replicas based on CPU/GPU utilisation or custom metrics like request queue depth.
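
    To make the three objects concrete, a sketch using the official Kubernetes Python client to stand up a model server; the image, names, and labels are hypothetical:

    ```python
    # A Deployment (3 replicas of a model server, 1 GPU each) exposed
    # through a Service. Equivalent to the usual YAML manifests.
    from kubernetes import client, config

    config.load_kube_config()

    labels = {"app": "churn-model"}
    container = client.V1Container(
        name="model-server",
        image="registry.example.com/churn-model:1.4.2",  # hypothetical image
        ports=[client.V1ContainerPort(container_port=8080)],
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    deployment = client.V1Deployment(
        api_version="apps/v1", kind="Deployment",
        metadata=client.V1ObjectMeta(name="churn-model"),
        spec=client.V1DeploymentSpec(
            replicas=3,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )
    service = client.V1Service(
        api_version="v1", kind="Service",
        metadata=client.V1ObjectMeta(name="churn-model"),
        spec=client.V1ServiceSpec(
            selector=labels,  # stable endpoint in front of pod churn
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )
    client.AppsV1Api().create_namespaced_deployment("default", deployment)
    client.CoreV1Api().create_namespaced_service("default", service)
    ```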

    Q4. What is a feature store and why is it valuable in a production ML system?

    Strong answer

    A feature store is a centralised system for storing, sharing, and serving ML features. It solves the training-serving skew problem: ensuring the features used during training are identical to those computed at serving time. Components: (1) Offline store (e.g. S3, BigQuery) for historical feature values used in training. (2) Online store (e.g. Redis, DynamoDB) for low-latency feature serving at inference time. (3) Feature transformation logic, versioned and shared across teams. Value: prevents duplicate feature computation across teams, enables feature reuse, and provides a single source of truth. Popular options: Feast (open-source), Tecton, Hopsworks. Most valuable at organisations with multiple ML teams building related products.
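
    A short sketch of the offline/online split using Feast; the feature view, feature names, and entity key are illustrative:

    ```python
    # The same feature definitions back both paths, which is what closes
    # the training-serving skew gap.
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # repo with the feature definitions

    feature_refs = ["user_stats:txn_count_7d", "user_stats:avg_basket_value"]

    # Offline store: point-in-time-correct training set (entity_df carries
    # entity keys plus event timestamps).
    # train_df = store.get_historical_features(
    #     entity_df=entity_df, features=feature_refs).to_df()

    # Online store: low-latency lookup for one entity at inference time.
    online = store.get_online_features(
        features=feature_refs,
        entity_rows=[{"user_id": 1234}],
    ).to_dict()
    ```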

    Q5. How do you ensure ML experiment reproducibility?

    Strong answer

    Reproducibility requires versioning all inputs and capturing all configuration. Checklist: (1) Code — git commit hash logged with every experiment. (2) Data — immutable dataset versions, stored with checksums (DVC, Delta Lake, or raw S3 URIs). (3) Hyperparameters — all parameters logged (MLflow, W&B, Comet). (4) Environment — Docker image with pinned dependencies or a requirements.txt with hashes. (5) Random seeds — set for numpy, Python random, and deep learning frameworks. (6) Hardware — note GPU type if results are hardware-sensitive. Experiments that can't be reproduced can't be debugged, compared, or audited — this is a compliance requirement in regulated industries.
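
    A sketch of what capturing that checklist looks like for one training run, with MLflow; the dataset URI and parameter values are illustrative:

    ```python
    # Log code version, data version, seeds, and hyperparameters so the
    # run can be reconstructed from the tracking server alone.
    import random
    import subprocess

    import mlflow
    import numpy as np

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    # plus torch.manual_seed(SEED) or the equivalent for your framework

    with mlflow.start_run():
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
        mlflow.set_tag("git_commit", commit)
        mlflow.log_params({
            "dataset_uri": "s3://datasets/churn/v2024-06-01/",  # immutable version
            "seed": SEED,
            "learning_rate": 3e-4,
        })
        # ... train, evaluate, then log metrics and the model artefact ...
    ```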

    Q6. How would you implement shadow deployment for a new ML model?

    Strong answer

    Shadow deployment (also called dark launch) runs the new model in parallel with the production model without serving its outputs to users. The production model's response is returned to the user; the shadow model's response is logged for offline evaluation. Implementation: add a middleware layer that duplicates incoming requests and asynchronously sends them to the shadow model endpoint. Log both responses with a shared request ID. Compare offline: accuracy, latency, output distribution. Promote the shadow model when it meets defined quality thresholds with no regressions. Risk: doubles inference cost during the shadow period. Mitigate by sampling (e.g. shadow 5% of traffic).
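
    A minimal sketch of the duplicating layer, here as a FastAPI service using httpx; the endpoint URLs and sample rate are illustrative:

    ```python
    # The caller only ever sees the production response; a sampled 5% of
    # requests is mirrored to the shadow model and logged under a shared ID.
    import asyncio
    import logging
    import random
    import uuid

    import httpx
    from fastapi import FastAPI, Request

    app = FastAPI()
    PROD_URL = "http://churn-model-prod/predict"     # hypothetical endpoints
    SHADOW_URL = "http://churn-model-shadow/predict"
    SHADOW_SAMPLE_RATE = 0.05                        # caps the extra cost

    async def mirror(payload: dict, request_id: str) -> None:
        try:
            async with httpx.AsyncClient(timeout=2.0) as client:
                resp = await client.post(SHADOW_URL, json=payload)
            logging.info("shadow %s %s", request_id, resp.json())
        except Exception:
            # A shadow failure must never affect the user-facing path.
            logging.exception("shadow call failed %s", request_id)

    @app.post("/predict")
    async def predict(request: Request) -> dict:
        payload = await request.json()
        request_id = str(uuid.uuid4())  # links the two logged responses
        async with httpx.AsyncClient(timeout=2.0) as client:
            prod = await client.post(PROD_URL, json=payload)
        if random.random() < SHADOW_SAMPLE_RATE:
            asyncio.create_task(mirror(payload, request_id))  # fire-and-forget
        logging.info("prod %s %s", request_id, prod.json())
        return prod.json()
    ```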

    Q7. What is the difference between model monitoring and data monitoring in MLOps?

    Strong answer

    Data monitoring focuses on the inputs to the model: schema validation (expected columns, types), distribution monitoring (feature drift), data freshness (is the pipeline delivering up-to-date data?), and completeness (are there missing values or null rates above threshold?). Model monitoring focuses on the outputs and behaviour: prediction distribution drift, performance metrics (accuracy, precision, recall) on labelled samples, latency, error rates, and business KPIs. Both are necessary. Data issues often manifest as model quality issues — a broken upstream pipeline can cause silent degradation that looks like concept drift. Run data monitors as early as possible in the pipeline to catch issues before they reach the model.
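
    A sketch of an early-pipeline data monitor with Pandera (one of the validation tools mentioned above); column names and thresholds are illustrative:

    ```python
    # Schema, range, and null-rate checks run before data reaches the model.
    import pandas as pd
    import pandera as pa

    schema = pa.DataFrameSchema(
        {
            "user_id": pa.Column(int, nullable=False),
            "txn_count_7d": pa.Column(int, pa.Check.ge(0)),
            "avg_basket_value": pa.Column(
                float, pa.Check.in_range(0, 10_000), nullable=True),
        },
        strict=True,  # reject unexpected columns
    )

    def validate_batch(df: pd.DataFrame, max_null_rate: float = 0.02) -> pd.DataFrame:
        df = schema.validate(df)  # raises SchemaError on violations
        null_rate = df["avg_basket_value"].isna().mean()
        if null_rate > max_null_rate:
            raise ValueError(f"null rate {null_rate:.1%} above {max_null_rate:.0%}")
        return df
    ```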

    Q8. How do you manage model versioning and rollback in production?

    Strong answer

    Model versioning requires: (1) A model registry (MLflow Model Registry, AWS SageMaker Model Registry, Vertex AI Model Registry) that stores model artefacts with version numbers, metadata, and stage labels (Staging, Production, Archived). (2) Immutable artefacts — never overwrite a model version in the registry. (3) Linked deployment records — each production deployment is linked to a specific model version. Rollback procedure: promote the previous registered model version to Production stage; update the serving infrastructure to load that version (via config change or Kubernetes deployment update); verify the rollback is serving correctly before archiving the problematic version. Always test rollback procedures in staging — don't discover they're broken during an incident.
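
    A rollback sketch against the MLflow Model Registry (newer MLflow versions favour aliases over stages; the stage-based API is shown for clarity, and the model name is illustrative):

    ```python
    # Repoint Production at the known-good version; serving layers that
    # resolve "models:/churn-model/Production" pick it up on reload.
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    MODEL_NAME = "churn-model"

    def rollback(bad_version: str, good_version: str) -> None:
        client.transition_model_version_stage(
            name=MODEL_NAME, version=good_version, stage="Production")
        # Keep the bad artefact (immutable, auditable) but out of service.
        client.transition_model_version_stage(
            name=MODEL_NAME, version=bad_version, stage="Archived")
    ```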

    Q9. What is infrastructure as code (IaC) and why does it matter in MLOps?

    Strong answer

    IaC means defining infrastructure (compute clusters, storage, networking, model serving endpoints) in configuration files rather than through manual console actions. Tools: Terraform (provider-agnostic), AWS CloudFormation, Pulumi. In MLOps, IaC ensures: (1) Reproducibility — you can recreate the exact infrastructure from a git repository. (2) Auditability — all infrastructure changes go through code review and are tracked in git. (3) Consistency — staging and production environments are defined by the same templates, reducing 'works in staging, breaks in production' issues. (4) Disaster recovery — the full stack can be rebuilt from code. A team that manages ML infrastructure without IaC accumulates undocumented, unreviewable snowflake infrastructure.
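
    A small IaC sketch with Pulumi in Python (Terraform HCL would be equivalent); the bucket name and tags are illustrative:

    ```python
    # The bucket holding model artefacts is declared in code, so its
    # creation and every later change go through review and git history.
    import pulumi
    import pulumi_aws as aws

    artefacts = aws.s3.Bucket(
        "model-artefacts",
        versioning=aws.s3.BucketVersioningArgs(enabled=True),
        tags={"team": "ml-platform", "env": pulumi.get_stack()},
    )

    pulumi.export("artefact_bucket", artefacts.id)
    ```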

    Q10. How do you approach cost management for a large-scale ML training and inference platform?

    Strong answer

    Cost management is a first-class engineering concern. Training cost: use spot/preemptible instances for batch training (60–80% savings with checkpointing); right-size GPU instances (don't use A100s for small experiments); implement experiment budgets and early stopping. Inference cost: right-size model (distillation, quantisation to reduce compute); horizontal autoscaling to match capacity to traffic; use spot instances for non-latency-sensitive batch inference; implement request caching. Visibility: tag all cloud resources with team/project/experiment identifiers; set up per-team cost dashboards and alerts on anomalous spend. Cost review should be a regular part of ML project planning, not an afterthought.
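
    A back-of-envelope cost comparison of the kind worth doing before any large training run; the prices and overhead factor are illustrative, not quotes:

    ```python
    # On-demand vs checkpointed spot for a single long training run.
    ON_DEMAND_RATE = 32.77   # $/hr, hypothetical 8-GPU instance
    SPOT_DISCOUNT = 0.70     # spot often sits 60-80% below on-demand
    RETRY_OVERHEAD = 1.2     # extra wall-clock from preemptions + restores
    TRAIN_HOURS = 120

    on_demand = ON_DEMAND_RATE * TRAIN_HOURS
    spot = ON_DEMAND_RATE * (1 - SPOT_DISCOUNT) * TRAIN_HOURS * RETRY_OVERHEAD

    print(f"on-demand: ${on_demand:,.0f}")           # ≈ $3,932
    print(f"spot:      ${spot:,.0f}")                # ≈ $1,416
    print(f"saving:    {1 - spot / on_demand:.0%}")  # ≈ 64%
    ```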

    Behavioural Questions

    Use the STAR format (Situation, Task, Action, Result) and keep answers to 2–3 minutes.

    Tell me about an ML system you built that had a production incident. What happened and what did you change afterwards?

    Shows real production experience and a learning mindset. Focus on the systemic change, not just the fix.

    How have you worked with ML scientists or data scientists who had less infrastructure experience?

    MLOps engineers often need to abstract complexity for model builders. Show you can design systems that others can use without becoming infrastructure experts.

    Describe a time you had to significantly refactor an ML pipeline. What drove the decision and how did you manage the risk?

    Tests ability to work with existing systems and migrate carefully. Good answer includes staged migration and testing strategy.

    How do you balance standardisation (one platform, one framework) with the needs of different ML teams?

    MLOps platforms must serve multiple users. Shows your ability to make opinionated decisions while remaining pragmatic about edge cases.

    Walk me through how you set up observability for an ML system from scratch.

    Tests depth of monitoring knowledge. Best answers cover data, model, and infrastructure layers with specific tooling choices and their rationale.

    Red Flags to Watch For

    No model registry or versioning

    Without versioned model artefacts, rollbacks are impossible and auditing is guesswork.

    Training and serving pipelines disconnected

    If the same feature transformations aren't used in training and serving, training-serving skew is inevitable and debugging is extremely hard.

    No alerting on model quality

    If nobody is notified when a model's prediction distribution changes, degradation goes undetected until it causes a business problem.

    Infrastructure managed through the console

    If cloud resources are created manually rather than through IaC, the environment is not reproducible and change management is ungoverned.

    No staging environment for ML

    Deploying models directly to production without a staging validation step means every deployment is a risk.

    Preparation Resources

    MLOps: Machine Learning Engineering in Production (Andrew Ng)

    A comprehensive curriculum covering the production ML lifecycle end-to-end

    Made With ML – MLOps Course

    Practical, code-first MLOps curriculum

    MLflow documentation

    Industry-standard experiment tracking and model registry

    Kubernetes documentation

    Read the Concepts section — understand pods, deployments, and services

    MLOps Engineer Salary Guide

    Understand market compensation before negotiating

    Ready to apply?

    Browse live MLOps engineer roles across the UK — updated daily.