    Computer Vision Skills
    The 2026 Skills Guide

    Computer vision remains one of the most active areas of applied AI in the UK — from autonomous vehicles and medical imaging to retail analytics and manufacturing inspection. This guide covers the architectures, tools, and evaluation practices that UK CV engineers use.

    CV Tasks and Model Landscape

    Image Classification
    Assign one or more class labels to an entire image.
    Key models: ResNet, EfficientNet, ViT, ConvNeXt. Primary metric: Top-1 / Top-5 accuracy.

    Object Detection
    Predict bounding boxes and class labels for each object in an image.
    Key models: YOLOv8/v10/v11, RT-DETR, DINO. Primary metric: mAP@50, mAP@50-95.

    Semantic Segmentation
    Assign a class label to each pixel, with no distinction between object instances.
    Key models: SegFormer, DeepLabv3+, UNet. Primary metric: mIoU (mean IoU across classes).

    Instance Segmentation
    Produce a pixel mask for each individual object instance.
    Key models: Mask R-CNN, YOLOv8-seg, SAM. Primary metric: mask AP (like mAP, but computed on masks).

    Pose Estimation
    Detect keypoints (joints) on human bodies or other articulated objects.
    Key models: YOLOv8-pose, ViTPose, DWPose. Primary metric: OKS (Object Keypoint Similarity), mAP.

    Image Generation
    Generate photorealistic images from text prompts or other conditioning.
    Key models: Stable Diffusion, FLUX, ControlNet. Primary metric: FID, CLIP score, human evaluation.

    CNN Architectures and Their Evolution

    ResNet (He et al., 2015) introduced residual connections (skip connections): the output of a block is F(x) + x, where F(x) is the learned residual and x is the input passed through unchanged. This addressed the vanishing gradient problem in deep networks and made it practical to train networks with 50–150+ layers. ResNet remains widely used as a backbone for downstream tasks (detection, segmentation) and for transfer learning; ResNet-50 and ResNet-101 are the most common variants.
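
    As a rough illustration, a residual block of the "basic" kind could be sketched in PyTorch as below (a minimal sketch, not ResNet's exact bottleneck design, which adds 1×1 convolutions and a projection shortcut when the shape changes):

        import torch
        import torch.nn as nn

        class BasicResidualBlock(nn.Module):
            """Minimal ResNet-style basic block: output = F(x) + x."""

            def __init__(self, channels: int):
                super().__init__()
                # F(x): two 3x3 convolutions with batch norm; padding keeps the spatial size.
                self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
                self.bn1 = nn.BatchNorm2d(channels)
                self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
                self.bn2 = nn.BatchNorm2d(channels)
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                residual = self.relu(self.bn1(self.conv1(x)))
                residual = self.bn2(self.conv2(residual))
                # Skip connection: add the unchanged input, then apply the final ReLU.
                return self.relu(residual + x)

        block = BasicResidualBlock(64)
        out = block(torch.randn(1, 64, 56, 56))   # same shape in and out: (1, 64, 56, 56)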

    EfficientNet (Tan & Le, 2019) scales network width, depth, and input resolution jointly using a compound scaling coefficient. EfficientNetV2 improved training speed and accuracy further. EfficientNet-B0 through B7 trade off accuracy against computational cost. A strong choice when inference latency on CPU/mobile hardware matters.
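
    As a back-of-the-envelope sketch of compound scaling (using the multipliers reported in the original paper, roughly α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution; treat the exact values here as illustrative):

        # Compound scaling sketch: one coefficient phi scales depth, width and
        # resolution together. alpha * beta^2 * gamma^2 ~= 2, so each increment
        # of phi roughly doubles the FLOPs.
        alpha, beta, gamma = 1.2, 1.1, 1.15

        def scaled_factors(phi: int) -> tuple[float, float, float]:
            """Return (depth, width, resolution) multipliers for compound coefficient phi."""
            return alpha ** phi, beta ** phi, gamma ** phi

        for phi in range(4):
            d, w, r = scaled_factors(phi)
            print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")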

    Vision Transformer (Dosovitskiy et al., 2020) applies the transformer encoder directly to sequences of image patches. Common variants are ViT-B/16 (base, 16×16 patches), ViT-L/16, and ViT-H/14. ViT requires large pretraining datasets to match CNN performance, because it lacks the CNN's translation-equivariance prior. DINOv2 (self-supervised ViT pretraining) produces strong general-purpose visual features used for semantic search and classification without fine-tuning.
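
    To make the patch-sequence idea concrete, here is a minimal sketch of ViT-style patch embedding in PyTorch (assuming a 224×224 input, 16×16 patches and 768-dimensional embeddings, as in ViT-B/16); a real ViT also prepends a class token, adds positional embeddings, and runs the sequence through a transformer encoder:

        import torch
        import torch.nn as nn

        image_size, patch_size, embed_dim = 224, 16, 768
        num_patches = (image_size // patch_size) ** 2      # 14 * 14 = 196

        # A strided convolution is the standard trick for "split into patches
        # and linearly project each one".
        patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

        x = torch.randn(1, 3, image_size, image_size)      # one RGB image
        patches = patch_embed(x)                           # (1, 768, 14, 14)
        tokens = patches.flatten(2).transpose(1, 2)        # (1, 196, 768): sequence of patch tokens
        print(tokens.shape)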

    ConvNeXt (Liu et al., 2022) modernises the ResNet architecture with transformer-inspired design choices (depthwise convolutions, inverted bottleneck, GELU activation, LayerNorm) while remaining a pure CNN. It is competitive with ViTs across benchmarks and more efficient at smaller scales.

    YOLO: Production Object Detection

    The YOLO (You Only Look Once) family is the dominant choice for production object detection in the UK. Ultralytics maintains the most widely used implementation:

    • YOLOv8 / YOLO11 — The current production standards. Support detection, segmentation, pose estimation, classification, and oriented bounding boxes (OBB) in a unified API. Train with model = YOLO('yolo11m.pt'); model.train(data='dataset.yaml', epochs=100). Export to ONNX, TensorRT, CoreML, TFLite for deployment.
    • Anchor-free design — Modern YOLO variants (v8 onwards) are anchor-free: they predict bounding box coordinates directly as offsets from grid points, without predefined anchor boxes. This simplifies training and improves generalisation to unusual aspect ratios.
    • IoU (Intersection over Union) — The core bounding box quality metric: IoU = area of intersection / area of union. IoU = 1.0 is a perfect prediction; IoU = 0 means no overlap. During training, CIoU (Complete IoU) loss incorporates overlap, centre-point distance, and aspect-ratio consistency. A predicted box is considered a true positive (matched detection) if its IoU with a ground-truth box exceeds the threshold (typically 0.5 for AP@50).
    • Non-Maximum Suppression (NMS) — Multiple overlapping detections of the same object are suppressed: detections are sorted by confidence, the highest-confidence box is kept, and any remaining box whose IoU with a kept box exceeds the NMS threshold is discarded. The conf_thres (confidence threshold) and iou_thres (NMS IoU threshold) are the key inference parameters affecting the precision/recall trade-off; a minimal sketch of IoU and greedy NMS follows this list.
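
    As a rough sketch of the two mechanics above, in plain NumPy with boxes in [x1, y1, x2, y2] format (production code would use the vectorised implementations built into torchvision or Ultralytics):

        import numpy as np

        def iou(box_a, box_b):
            """IoU of two boxes in [x1, y1, x2, y2] format."""
            x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
            x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
            area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
            return inter / (area_a + area_b - inter + 1e-9)

        def nms(boxes, scores, conf_thres=0.25, iou_thres=0.45):
            """Greedy NMS: drop low-confidence boxes, then suppress overlaps of kept boxes."""
            keep = []
            order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thres]
            while order:
                best = order.pop(0)
                keep.append(best)
                order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thres]
            return keep

        boxes = np.array([[10, 10, 100, 100], [12, 12, 98, 105], [200, 200, 260, 260]], dtype=float)
        scores = np.array([0.9, 0.75, 0.6])
        print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first and is suppressed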

    Frequently Asked Questions

    What is the difference between object detection and image segmentation?

    Detection: bounding boxes + class labels per object. Semantic segmentation: per-pixel class labels (no instance distinction). Instance segmentation: per-pixel masks per individual object instance (Mask R-CNN, SAM). Panoptic segmentation combines both. YOLO for detection; SAM/Mask R-CNN for segmentation.

    What is mAP and why is it the standard detection metric?

    mAP (mean Average Precision) jointly evaluates localisation and classification across all confidence thresholds. Per-class: compute precision-recall curve, calculate area under it (AP). mAP = mean AP across all classes. COCO-style mAP averages across IoU thresholds 0.5–0.95, rewarding precise boxes. AP@50 (lenient), AP@75 (tighter) are common variants.
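
    A minimal sketch of single-class AP, assuming detections have already been matched against ground truth at a fixed IoU threshold, flagged as TP/FP, and sorted by descending confidence (the official COCO evaluation additionally interpolates the precision-recall curve and averages over IoU thresholds 0.5–0.95):

        import numpy as np

        def average_precision(tp_flags, num_gt):
            """AP for one class: area under the precision-recall curve.

            tp_flags: 1/0 per detection (true positive or false positive),
                      ordered by descending confidence.
            num_gt:   number of ground-truth boxes for this class.
            """
            tp = np.cumsum(tp_flags)
            fp = np.cumsum(1 - np.asarray(tp_flags))
            recall = tp / num_gt
            precision = tp / (tp + fp)
            # Accumulate precision over each step up in recall (one step per TP).
            ap, prev_recall = 0.0, 0.0
            for p, r in zip(precision, recall):
                ap += p * (r - prev_recall)
                prev_recall = r
            return ap

        # Example: 5 detections (highest confidence first), 3 ground-truth objects.
        print(average_precision([1, 0, 1, 1, 0], num_gt=3))   # ~0.81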

    When should you use YOLO vs Faster R-CNN?

    YOLO (single-stage): real-time speed (30–150+ FPS), suited to video surveillance, robotics, and mobile deployment. YOLOv8/v10/v11 (Ultralytics) are the production standard. Faster R-CNN (two-stage, RPN + classifier): more accurate, especially on small objects, but much slower. In production UK CV applications YOLO dominates; two-stage detectors appear mainly in precision-critical medical imaging.

    What is the Vision Transformer (ViT) and how does it differ from CNNs?

    ViT divides an image into fixed patches (e.g. 16×16 pixels), flattens each patch to a vector, adds positional embeddings, and processes them as a sequence through a standard transformer encoder. Attention is global from the first layer — every patch attends to every other patch. CNNs build hierarchical local features through convolutional filters with translation equivariance. ViTs generally outperform CNNs at large scale (ImageNet-21k+ pretraining); CNNs are more data-efficient and inductive-bias-rich at smaller scales.

    What is Albumentations and why is it preferred over torchvision transforms?

    Albumentations is a fast image augmentation library with 70+ augmentation types. It operates on NumPy arrays and is significantly faster than torchvision transforms (it uses OpenCV under the hood for many operations). Key advantage: it correctly augments images and their annotations simultaneously: bounding boxes, segmentation masks, and keypoints are transformed consistently with the image. The classic torchvision transforms API does not support annotation-aware augmentation (torchvision's newer v2 transforms add it, but Albumentations remains the production standard for CV training in the UK).
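
    A minimal sketch of annotation-aware augmentation with Albumentations, assuming a hypothetical dataset with YOLO-format (normalised cx, cy, w, h) boxes; the field name class_labels is just the label field chosen here:

        import albumentations as A
        import numpy as np

        # bbox_params tells Albumentations how boxes are encoded so they are
        # transformed together with the image.
        transform = A.Compose(
            [
                A.HorizontalFlip(p=0.5),
                A.RandomBrightnessContrast(p=0.2),
            ],
            bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
        )

        image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder image (H, W, C)
        bboxes = [(0.5, 0.5, 0.2, 0.3)]                   # one YOLO-format box
        class_labels = [0]

        out = transform(image=image, bboxes=bboxes, class_labels=class_labels)
        augmented_image, augmented_boxes = out["image"], out["bboxes"]   # boxes move with the image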

    Browse Computer Vision Jobs

    Find CV engineering roles at UK AI companies in automotive, medical imaging, retail, and more.

    Quick Facts

    Demand level: High
    Difficulty: Advanced
    Time to proficiency: 4–8 months

    Key Tools

    YOLO
    OpenCV
    torchvision
    Albumentations
    SAM
    ViT
    Detectron2
    ONNX