    Computer Vision Skills
    The 2026 Skills Guide

    Computer vision remains one of the most active areas of applied AI in the UK — from autonomous vehicles and medical imaging to retail analytics and manufacturing inspection. This guide covers the architectures, tools, and evaluation practices that UK CV engineers use.

    CV Tasks and Model Landscape

    Image Classification
    Assign one or more class labels to an entire image.
    Key models: ResNet, EfficientNet, ViT, ConvNeXt. Primary metric: Top-1 / Top-5 accuracy.

    Object Detection
    Predict bounding boxes and class labels for each object in an image.
    Key models: YOLOv8/v10/v11, RT-DETR, DINO. Primary metric: mAP@50, mAP@50-95.

    Semantic Segmentation
    Assign a class label to each pixel, with no distinction between object instances.
    Key models: SegFormer, DeepLabv3+, UNet. Primary metric: mIoU (mean IoU across classes).

    Instance Segmentation
    Produce a pixel mask for each individual object instance.
    Key models: Mask R-CNN, YOLOv8-seg, SAM. Primary metric: mask AP (like mAP, but computed on masks).

    Pose Estimation
    Detect keypoints (joints) on human bodies or other articulated objects.
    Key models: YOLOv8-pose, ViTPose, DWPose. Primary metric: OKS (Object Keypoint Similarity), mAP.

    Image Generation
    Generate photorealistic images from text prompts or other conditioning.
    Key models: Stable Diffusion, FLUX, ControlNet. Primary metric: FID, CLIP score, human evaluation.

    CNN Architectures and Their Evolution

    ResNet (He et al., 2015) introduced residual connections (skip connections): the output of a block is F(x) + x, where F(x) is the learned residual and x is the input passed through unchanged. This addressed the vanishing gradient problem in deep networks and made it practical to train networks with 50–150+ layers. ResNet remains widely used as a backbone for downstream tasks (detection, segmentation) and for transfer learning; ResNet-50 and ResNet-101 are the most common variants.
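
    As a rough illustration, a residual block of the "basic" kind could be sketched in PyTorch as below (a minimal sketch, not ResNet's exact bottleneck design, which adds 1×1 convolutions and a projection shortcut when the shape changes):

        import torch
        import torch.nn as nn

        class BasicResidualBlock(nn.Module):
            """Minimal ResNet-style basic block: output = F(x) + x."""

            def __init__(self, channels: int):
                super().__init__()
                # F(x): two 3x3 convolutions with batch norm; padding keeps the spatial size.
                self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
                self.bn1 = nn.BatchNorm2d(channels)
                self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
                self.bn2 = nn.BatchNorm2d(channels)
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                residual = self.relu(self.bn1(self.conv1(x)))
                residual = self.bn2(self.conv2(residual))
                # Skip connection: add the unchanged input, then apply the final ReLU.
                return self.relu(residual + x)

        block = BasicResidualBlock(64)
        out = block(torch.randn(1, 64, 56, 56))   # same shape in and out: (1, 64, 56, 56)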

    EfficientNet (Tan & Le, 2019) scales network width, depth, and input resolution jointly using a compound scaling coefficient. EfficientNetV2 improved training speed and accuracy further. EfficientNet-B0 through B7 trade off accuracy against computational cost. A strong choice when inference latency on CPU/mobile hardware matters.
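
    As a back-of-the-envelope sketch of compound scaling (using the multipliers reported in the original paper, roughly α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution; treat the exact values here as illustrative):

        # Compound scaling sketch: one coefficient phi scales depth, width and
        # resolution together. alpha * beta^2 * gamma^2 ~= 2, so each increment
        # of phi roughly doubles the FLOPs.
        alpha, beta, gamma = 1.2, 1.1, 1.15

        def scaled_factors(phi: int) -> tuple[float, float, float]:
            """Return (depth, width, resolution) multipliers for compound coefficient phi."""
            return alpha ** phi, beta ** phi, gamma ** phi

        for phi in range(4):
            d, w, r = scaled_factors(phi)
            print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")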

    Vision Transformer (Dosovitskiy et al., 2020) applies the transformer encoder directly to sequences of image patches. Common variants are ViT-B/16 (base, 16×16 patches), ViT-L/16, and ViT-H/14. ViT requires large pretraining datasets to match CNN performance, because it lacks the CNN's translation-equivariance prior. DINOv2 (self-supervised ViT pretraining) produces strong general-purpose visual features used for semantic search and classification without fine-tuning.
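
    To make the patch-sequence idea concrete, here is a minimal sketch of ViT-style patch embedding in PyTorch (assuming a 224×224 input, 16×16 patches and 768-dimensional embeddings, as in ViT-B/16); a real ViT also prepends a class token, adds positional embeddings, and runs the sequence through a transformer encoder:

        import torch
        import torch.nn as nn

        image_size, patch_size, embed_dim = 224, 16, 768
        num_patches = (image_size // patch_size) ** 2      # 14 * 14 = 196

        # A strided convolution is the standard trick for "split into patches
        # and linearly project each one".
        patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

        x = torch.randn(1, 3, image_size, image_size)      # one RGB image
        patches = patch_embed(x)                           # (1, 768, 14, 14)
        tokens = patches.flatten(2).transpose(1, 2)        # (1, 196, 768): sequence of patch tokens
        print(tokens.shape)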

    ConvNeXt (Liu et al., 2022) modernises the ResNet architecture with transformer-inspired design choices (depthwise convolutions, inverted bottleneck, GELU activation, LayerNorm) while remaining a pure CNN. It is competitive with ViTs across benchmarks and more efficient at smaller scales.

    YOLO: Production Object Detection

    The YOLO (You Only Look Once) family is the dominant choice for production object detection in the UK. Ultralytics maintains the most widely used implementation:

    • YOLOv8 / YOLO11 — The current production standards. Support detection, segmentation, pose estimation, classification, and oriented bounding boxes (OBB) in a unified API. Train with model = YOLO('yolo11m.pt'); model.train(data='dataset.yaml', epochs=100). Export to ONNX, TensorRT, CoreML, TFLite for deployment.
    • Anchor-free design — Modern YOLO variants (v8 onwards) are anchor-free: they predict bounding box coordinates directly as offsets from grid points, without predefined anchor boxes. This simplifies training and improves generalisation to unusual aspect ratios.
    • IoU (Intersection over Union) — The core bounding box quality metric: IoU = area of intersection / area of union. IoU = 1.0 is a perfect prediction; IoU = 0 means no overlap. During training, CIoU (Complete IoU) loss incorporates overlap, centre-point distance, and aspect-ratio consistency. A predicted box is considered a true positive (matched detection) if its IoU with a ground-truth box exceeds the threshold (typically 0.5 for AP@50).
    • Non-Maximum Suppression (NMS) — Multiple overlapping detections of the same object are suppressed: detections are sorted by confidence, the highest-confidence box is kept, and any remaining box whose IoU with a kept box exceeds the NMS threshold is discarded. The conf_thres (confidence threshold) and iou_thres (NMS IoU threshold) are the key inference parameters affecting the precision/recall trade-off; a minimal sketch of IoU and greedy NMS follows this list.
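
    As a rough sketch of the two mechanics above, in plain NumPy with boxes in [x1, y1, x2, y2] format (production code would use the vectorised implementations built into torchvision or Ultralytics):

        import numpy as np

        def iou(box_a, box_b):
            """IoU of two boxes in [x1, y1, x2, y2] format."""
            x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
            x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
            area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
            return inter / (area_a + area_b - inter + 1e-9)

        def nms(boxes, scores, conf_thres=0.25, iou_thres=0.45):
            """Greedy NMS: drop low-confidence boxes, then suppress overlaps of kept boxes."""
            keep = []
            order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thres]
            while order:
                best = order.pop(0)
                keep.append(best)
                order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thres]
            return keep

        boxes = np.array([[10, 10, 100, 100], [12, 12, 98, 105], [200, 200, 260, 260]], dtype=float)
        scores = np.array([0.9, 0.75, 0.6])
        print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first and is suppressed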

    Frequently Asked Questions

    What is the difference between object detection and image segmentation?

    Detection: bounding boxes + class labels per object. Semantic segmentation: per-pixel class labels (no instance distinction). Instance segmentation: per-pixel masks per individual object instance (Mask R-CNN, SAM). Panoptic segmentation combines both. YOLO for detection; SAM/Mask R-CNN for segmentation.

    What is mAP and why is it the standard detection metric?

    mAP (mean Average Precision) jointly evaluates localisation and classification across all confidence thresholds. Per-class: compute precision-recall curve, calculate area under it (AP). mAP = mean AP across all classes. COCO-style mAP averages across IoU thresholds 0.5–0.95, rewarding precise boxes. AP@50 (lenient), AP@75 (tighter) are common variants.
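
    A minimal sketch of single-class AP, assuming detections have already been matched against ground truth at a fixed IoU threshold, flagged as TP/FP, and sorted by descending confidence (the official COCO evaluation additionally interpolates the precision-recall curve and averages over IoU thresholds 0.5–0.95):

        import numpy as np

        def average_precision(tp_flags, num_gt):
            """AP for one class: area under the precision-recall curve.

            tp_flags: 1/0 per detection (true positive or false positive),
                      ordered by descending confidence.
            num_gt:   number of ground-truth boxes for this class.
            """
            tp = np.cumsum(tp_flags)
            fp = np.cumsum(1 - np.asarray(tp_flags))
            recall = tp / num_gt
            precision = tp / (tp + fp)
            # Accumulate precision over each step up in recall (one step per TP).
            ap, prev_recall = 0.0, 0.0
            for p, r in zip(precision, recall):
                ap += p * (r - prev_recall)
                prev_recall = r
            return ap

        # Example: 5 detections (highest confidence first), 3 ground-truth objects.
        print(average_precision([1, 0, 1, 1, 0], num_gt=3))   # ~0.81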

    When should you use YOLO vs Faster R-CNN?

    YOLO (single-stage): real-time speed (30–150+ FPS), suited to video surveillance, robotics, and mobile deployment. YOLOv8/v10/v11 (Ultralytics) are the production standard. Faster R-CNN (two-stage, RPN + classifier): more accurate, especially on small objects, but much slower. In production UK CV applications YOLO dominates; two-stage detectors appear mainly in precision-critical medical imaging.

    What is the Vision Transformer (ViT) and how does it differ from CNNs?

    ViT divides an image into fixed patches (e.g. 16×16 pixels), flattens each patch to a vector, adds positional embeddings, and processes them as a sequence through a standard transformer encoder. Attention is global from the first layer — every patch attends to every other patch. CNNs build hierarchical local features through convolutional filters with translation equivariance. ViTs generally outperform CNNs at large scale (ImageNet-21k+ pretraining); CNNs are more data-efficient and inductive-bias-rich at smaller scales.

    What is Albumentations and why is it preferred over torchvision transforms?

    Albumentations is a fast image augmentation library with 70+ augmentation types. It operates on NumPy arrays and is significantly faster than torchvision transforms (it uses OpenCV under the hood for many operations). Key advantage: it correctly augments images and their annotations simultaneously: bounding boxes, segmentation masks, and keypoints are transformed consistently with the image. The classic torchvision transforms API does not support annotation-aware augmentation (torchvision's newer v2 transforms add it, but Albumentations remains the production standard for CV training in the UK).
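
    A minimal sketch of annotation-aware augmentation with Albumentations, assuming a hypothetical dataset with YOLO-format (normalised cx, cy, w, h) boxes; the field name class_labels is just the label field chosen here:

        import albumentations as A
        import numpy as np

        # bbox_params tells Albumentations how boxes are encoded so they are
        # transformed together with the image.
        transform = A.Compose(
            [
                A.HorizontalFlip(p=0.5),
                A.RandomBrightnessContrast(p=0.2),
            ],
            bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
        )

        image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder image (H, W, C)
        bboxes = [(0.5, 0.5, 0.2, 0.3)]                   # one YOLO-format box
        class_labels = [0]

        out = transform(image=image, bboxes=bboxes, class_labels=class_labels)
        augmented_image, augmented_boxes = out["image"], out["bboxes"]   # boxes move with the image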

    Browse Computer Vision Jobs

    Find CV engineering roles at UK AI companies in automotive, medical imaging, retail, and more.

    Quick Facts

    Demand level: High
    Difficulty: Advanced
    Time to proficiency: 4–8 months

    Key Tools

    YOLO
    OpenCV
    torchvision
    Albumentations
    SAM
    ViT
    Detectron2
    ONNX