    Interview Prep

    Computer Vision Engineer
    Interview Questions UK 2026

    Alex Morgan

    AI Careers Editor

    May 3, 2026
    12 min read

    UK computer vision interviews have a specific pattern. Understanding what each stage tests — and what a strong answer looks like — is what separates well-prepared candidates from technically capable ones who don't get offers.

    How the Interview Process Is Structured

    Most UK CV engineering interviews follow a 4–5 stage process. The key difference from general ML engineering interviews is the addition of a CV theory interview — a dedicated technical round focused on architecture knowledge, image processing fundamentals, and domain-specific problem solving. This stage is what separates genuine CV engineers from generalist ML candidates.

    Stage 1 — Recruiter screen: 20–30 minutes. Covers background, motivations, and expectations. Prepare a clear narrative about your CV engineering background and why you're interested in the specific role.

    Stage 2 — Online coding assessment: 1–2 hours. LeetCode-style algorithmic problems, typically medium difficulty. Python is expected. Practice sliding window, two-pointer, dynamic programming basics, and graph traversal. Not CV-specific at this stage.

    Stage 3 — Technical interview: 60–90 minutes with 1–2 engineers. Covers Python coding questions, CV architecture questions, and often a short image processing task on a whiteboard or shared screen.

    Stage 4 — Take-home challenge: 4–8 hours. Build a small but complete CV system. Quality over novelty — they want clean code and clear reasoning, not a state-of-the-art result.

    Stage 5 — Final loop: 2–4 hours. System design for CV, culture and team fit interviews, and often a walkthrough of your take-home challenge.

    CV Architecture Questions (Most Commonly Asked)

    Architecture and theory questions

    • Explain how YOLO works. What are the trade-offs vs two-stage detectors like Faster R-CNN?
      Strong answer covers: single-pass architecture, anchor boxes (earlier versions) or anchor-free (YOLOv8+), speed vs accuracy trade-off, why two-stage detectors have higher accuracy for small objects.
    • What's the difference between semantic segmentation, instance segmentation, and panoptic segmentation?
      Strong answer: semantic assigns a class to each pixel; instance differentiates individual objects of the same class; panoptic combines both. Mention Mask R-CNN for instance, DeepLab for semantic.
    • How does ResNet address the vanishing gradient problem? What is a skip connection?
      Strong answer: skip (residual) connections let gradients flow directly through the network during backpropagation, enabling much deeper networks. Each block learns a residual relative to its input rather than a direct mapping.
    • How do Vision Transformers (ViT) differ from CNNs? When would you choose one over the other?
      CNNs have inductive biases (spatial locality, translation invariance) that work well with limited data. ViTs learn global relationships via self-attention and tend to outperform CNNs with large datasets and compute. Choose ViT for large-scale tasks; CNN for data-limited or real-time applications.
    • What is Non-Maximum Suppression (NMS) and why is it needed?
      Object detectors produce multiple overlapping bounding boxes for the same object. NMS keeps the highest-confidence box and suppresses the remaining boxes whose IoU with it exceeds a threshold.
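    Interviewers often ask candidates to sketch NMS from scratch. A minimal pure-Python version (greedy NMS over `(x1, y1, x2, y2)` boxes; function names and the 0.5 default threshold are illustrative, not from any specific library) might look like:

    ```python
    def iou(box_a, box_b):
        """Intersection over Union of two (x1, y1, x2, y2) boxes."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def nms(boxes, scores, iou_threshold=0.5):
        """Greedy NMS: keep the highest-scoring box, drop overlapping rivals, repeat."""
        # indices sorted by descending confidence
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            # discard remaining boxes that overlap the kept box too much
            order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
        return keep
    ```

    Being able to explain why the greedy sort-then-suppress loop works, and where a vectorised implementation (e.g. torchvision's built-in NMS op) would replace it in production, is usually worth more than the code itself.
    
    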

    Image Processing and OpenCV Questions

    • What preprocessing steps would you apply to an image before passing it to a detection model?
      Resize to the model's input size (with letterboxing to preserve aspect ratio), normalise pixel values, convert colour space if needed (OpenCV loads images as BGR; most PyTorch models expect RGB), and apply augmentation during training (flip, rotate, colour jitter, random crop).
    • Explain the difference between morphological dilation and erosion.
      Dilation expands bright regions (adds pixels at boundaries), useful for connecting broken components. Erosion shrinks them, useful for removing noise. Often combined: closing (dilation then erosion) fills holes; opening (erosion then dilation) removes noise.
    • How would you detect edges in an image? What are the trade-offs between Canny and Sobel?
      Sobel computes image gradients in a single convolution pass: fast but sensitive to noise. Canny is a multi-stage algorithm (Gaussian smoothing, Sobel gradients, non-maximum suppression, hysteresis thresholding): more accurate and robust, but slower.
    • What is camera calibration and when do you need it?
      Calibration estimates intrinsic parameters (focal length, principal point, distortion coefficients) and, for multi-camera systems, extrinsic parameters (rotation and translation between cameras). Needed for any application requiring metric 3D reconstruction, stereo depth estimation, or augmented reality.
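    The letterboxing step from the preprocessing answer above is easy to whiteboard. Here is a dependency-light sketch using NumPy nearest-neighbour resampling (a real pipeline would use cv2.resize or torchvision transforms; the function name, the 640-pixel default, and the grey pad value 114 are illustrative conventions, not requirements):

    ```python
    import numpy as np

    def letterbox(image, target=640, pad_value=114):
        """Resize an HxWxC uint8 image so its longer side equals `target`,
        preserving aspect ratio, then pad to a square target x target canvas."""
        h, w = image.shape[:2]
        scale = target / max(h, w)
        new_h, new_w = int(round(h * scale)), int(round(w * scale))
        # nearest-neighbour sampling grid (stand-in for cv2.resize)
        rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
        cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
        resized = image[rows][:, cols]
        # paste onto a padded square canvas, centred
        canvas = np.full((target, target, image.shape[2]), pad_value, dtype=image.dtype)
        top = (target - new_h) // 2
        left = (target - new_w) // 2
        canvas[top:top + new_h, left:left + new_w] = resized
        # BGR -> RGB channel flip and scaling to [0, 1] for a PyTorch-style model
        rgb = canvas[..., ::-1].astype(np.float32) / 255.0
        return canvas, rgb
    ```

    Mentioning that the same scale and padding offsets must be applied in reverse to map predicted boxes back to original image coordinates is a detail interviewers listen for.
    
    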

    Take-Home Challenge: What They're Evaluating

    The take-home challenge is almost always about code quality and reasoning, not achieving state-of-the-art results. Typical tasks: fine-tune a detection model on a provided dataset, implement an image similarity search, or build a preprocessing pipeline for a specific image type.

    What strong submissions include:

    • Clean, reproducible code with a clear README
    • A clear problem formulation (what you're solving and why)
    • Thoughtful evaluation (appropriate metrics, test set evaluation)
    • Discussion of trade-offs and what you'd do differently with more time
    • An explanation of your methodology, not just a working model

    Common mistakes: Overfitting to the training set, using inappropriate metrics without explanation, submitting a notebook without reproducible environment setup, failing to mention limitations.

    System Design for Computer Vision

    The system design round for CV roles tests whether you can think beyond a trained model to a deployed, scalable CV system. Common prompts: "Design a real-time pedestrian detection system for a retail store," "How would you build a quality control system for a manufacturing line," or "Design a visual search system for an e-commerce platform."

    A strong system design answer covers: data ingestion and preprocessing, model selection and trade-offs (latency vs accuracy), serving infrastructure (batch vs real-time, edge vs cloud), monitoring and drift detection, and how to handle model updates without downtime.

    See the full Computer Vision Engineer role guide

    Salary benchmarks, required skills, top UK employers, and career progression.

    Frequently Asked Questions

    What does a CV technical interview cover?

    Python coding, CV architecture knowledge, image processing fundamentals, system design for CV systems, and a take-home challenge. The CV theory round is what distinguishes this from a general ML interview.

    What are the most common CV interview questions?

    How does YOLO work, semantic vs instance segmentation, how ResNet addresses vanishing gradients, NMS, ViT vs CNN trade-offs, camera calibration, and preprocessing pipelines.

    How long is the take-home challenge?

    Typically 4–8 hours. Focus on code quality, clear methodology, and thoughtful evaluation — not state-of-the-art results.

    Do I need to know maths?

    Conceptual understanding, not derivations. Know what convolution computes, why skip connections work, what IoU measures, and the geometry behind camera calibration.

    How many stages is a typical CV interview?

    4–5 stages: recruiter screen, online coding assessment, technical interview, take-home challenge, and final loop. Total process typically takes 3–5 weeks.
