Object Detection
Object detection goes beyond classification. Instead of saying "this image contains a cat," it tells you WHERE the cat is, drawing a bounding box around it and labeling it. It's the technology behind self-driving cars identifying pedestrians, security cameras detecting intruders, and your phone's face detection.
How Object Detection Works
The general pipeline: scan the image, propose regions where objects might be, classify each region, and refine the bounding boxes. Different methods do this in different ways.
Detection Pipeline
    Input Image
         │
         ▼
    Feature Extraction (CNN backbone)
         │
         ▼
    Region Proposals / Anchor Boxes
         │
         ▼
    Classification: What is it?
    Regression: Where is it? (bounding box)
         │
         ▼
    Non-Maximum Suppression (remove duplicates)
         │
         ▼
    Final detections with confidence scores
Two-Stage Detectors: R-CNN Family
These detectors work in two stages: first propose regions, then classify them. Accurate but slower.
R-CNN (2014): The original. Uses selective search to propose ~2000 regions, extracts CNN features from each, then classifies them with SVMs. Painfully slow: about 47 seconds per image at test time.
Fast R-CNN (2015): Processes the entire image once with a CNN, then projects region proposals onto the feature map and pools a fixed-size feature for each one (ROI pooling). Much faster.
Faster R-CNN (2015): Replaces selective search with a Region Proposal Network (RPN) that shares the backbone features. The whole pipeline becomes end-to-end trainable. Long the reference point for detection accuracy.
R-CNN Evolution
    R-CNN:        Image → 2000 crops → CNN → SVM     (slow, about 47 s/image)
    Fast R-CNN:   Image → CNN → ROI pool → FC        (faster, about 2 s/image)
    Faster R-CNN: Image → CNN → RPN → Detect         (fast, about 0.2 s/image)
    Accuracy: R-CNN < Fast R-CNN < Faster R-CNN
    Speed:    R-CNN < Fast R-CNN < Faster R-CNN
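The "ROI pool" step that Fast R-CNN introduced can be sketched in plain Python. This is a simplified, single-channel illustration rather than the paper's implementation: the function name `roi_max_pool`, the backbone stride of 16, and the 2x2 output size are assumptions chosen for the example, the box is assumed to lie inside the image, and real code would use a tensor library.

```python
def roi_max_pool(feature_map, box_xyxy, stride=16, out_size=2):
    """Max-pool the part of a 2-D feature map covered by an image-space box.

    feature_map: list of rows (H x W), a single channel for simplicity.
    box_xyxy:    (x1, y1, x2, y2) in image pixels.
    stride:      assumed total downsampling factor of the backbone.
    """
    # Project the box from image coordinates onto the feature map.
    x1, y1, x2, y2 = (int(round(v / stride)) for v in box_xyxy)
    # Keep at least one feature cell in each direction.
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)
    region = [row[x1:x2] for row in feature_map[y1:y2]]
    h, w = len(region), len(region[0])

    pooled = []
    for i in range(out_size):
        # Integer bin edges; every bin spans at least one cell.
        y_lo = i * h // out_size
        y_hi = max((i + 1) * h // out_size, y_lo + 1)
        out_row = []
        for j in range(out_size):
            x_lo = j * w // out_size
            x_hi = max((j + 1) * w // out_size, x_lo + 1)
            out_row.append(max(region[y][x]
                               for y in range(y_lo, y_hi)
                               for x in range(x_lo, x_hi)))
        pooled.append(out_row)
    return pooled


# Toy 8x8 feature map whose value encodes its position (row * 8 + col).
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]
pooled = roi_max_pool(feature_map, (32, 32, 96, 96))
print(pooled)
```

Whatever the size of the proposal, the output is always `out_size` x `out_size`, which is what lets one shared feature map feed fixed-size fully connected layers.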
One-Stage Detectors: YOLO Family
YOLO (You Only Look Once) skips the region proposal step. It divides the image into a grid and predicts bounding boxes and classes for every grid cell simultaneously. Much faster, with impressive accuracy.
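The key insight: treat detection as a single regression problem, going straight from image pixels to box coordinates and class probabilities, instead of running a classifier over many proposed regions.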
YOLO Grid Approach
    ┌───┬───┬───┬───┬───┬───┬───┐
    │   │   │   │   │   │   │   │
    ├───┼───┼───┼───┼───┼───┼───┤
    │   │   │cat│   │   │   │   │
    ├───┼───┼───┼───┼───┼───┼───┤
    │   │   │   │   │   │obj│   │
    ├───┼───┼───┼───┼───┼───┼───┤
    │   │   │   │   │   │   │   │
    └───┴───┴───┴───┴───┴───┴───┘
    (cat, obj = objects falling in those cells)
    Each cell predicts:
    - B bounding boxes (x, y, w, h, conf)
    - C class probabilities
    Single forward pass = real-time detection
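As a rough sketch of the grid idea, the snippet below finds which cell is responsible for an object and turns one cell-relative prediction back into pixel coordinates. It assumes the YOLOv1 conventions (x, y are offsets within the cell; w, h are fractions of the whole image) and the paper's Pascal VOC settings S=7, B=2, C=20. The function names are made up for this example.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box center."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def decode_cell_box(row, col, x_off, y_off, w_rel, h_rel, img_w, img_h, S=7):
    """Convert one cell-relative prediction to pixel (x1, y1, x2, y2)."""
    # Center: cell index plus the predicted offset inside that cell.
    cx = (col + x_off) / S * img_w
    cy = (row + y_off) / S * img_h
    # Width and height are predicted relative to the whole image.
    w, h = w_rel * img_w, h_rel * img_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Size of the output tensor for the assumed YOLOv1 settings.
S, B, C = 7, 2, 20
values_per_cell = B * 5 + C              # 5 = x, y, w, h, conf per box
output_size = S * S * values_per_cell    # S x S x (B*5 + C)

cell = responsible_cell(224, 224, img_w=448, img_h=448)
box = decode_cell_box(*cell, 0.5, 0.5, 0.25, 0.5, img_w=448, img_h=448)
print(cell, box, values_per_cell, output_size)
```

Later YOLO versions change the box parameterization, so treat this as an illustration of the original grid formulation only.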
YOLOv1 (2016): The original. Fast but struggled with small objects and overlapping detections.
YOLOv3-v5: Built on the anchor boxes introduced in YOLOv2 and added feature pyramids for multi-scale detection, with much better accuracy.
YOLOv8 (2023): Anchor-free detection, simpler architecture, and a strong speed-accuracy tradeoff at release.
SSD: Single Shot MultiBox Detector
Like YOLO, SSD is a one-stage detector, but it predicts from feature maps at multiple scales. It detects small objects better than early YOLO versions and offers a good balance of speed and accuracy.
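To make "feature maps at multiple scales" concrete, the arithmetic below counts the default boxes in the SSD300 configuration described in the SSD paper: six feature maps, with 4 or 6 default boxes per location. The variable names are illustrative.

```python
# SSD300 feature-map side lengths, from fine (small objects) to coarse (large objects).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
# Default boxes predicted at every location of each feature map.
boxes_per_location = [4, 6, 6, 6, 4, 4]

per_scale = [size * size * n
             for size, n in zip(feature_map_sizes, boxes_per_location)]
total_boxes = sum(per_scale)

print(per_scale)
print(total_boxes)
```

Most of the boxes come from the finest 38x38 map, which is where small objects are picked up.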
Image Segmentation
Segmentation takes detection further: instead of bounding boxes, it classifies EVERY pixel. Three types:
Semantic segmentation: Classifies each pixel but doesn't distinguish between objects of the same class. Two cars in one image? Both labeled "car."
Instance segmentation: Classifies pixels AND distinguishes between instances. Car 1 and Car 2 are separate objects.
Panoptic segmentation: Combines both. Every pixel gets a class label, and pixels that belong to countable objects (cars, people) also get an instance ID. The most complete understanding.
Segmentation Types
    Semantic:  both cars get the same class label
    Instance:  car1 and car2 get separate masks
    Panoptic:  both: class + instance
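The difference between the three outputs is easiest to see on a tiny hand-written label map. The sketch below uses a 4x6 "image" with two cars. The class IDs, the use of 0 for "no instance", and the tuple encoding of the panoptic map are assumptions made for illustration, not a standard file format.

```python
BACKGROUND, CAR = 0, 1

# Semantic map: one class ID per pixel; both cars share the label CAR.
semantic = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

# Instance map: one object ID per pixel (0 = not part of any object).
instance = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

# Panoptic map: (class ID, instance ID) for every pixel.
panoptic = [
    [(cls, inst) for cls, inst in zip(sem_row, inst_row)]
    for sem_row, inst_row in zip(semantic, instance)
]

classes_seen = {cls for row in semantic for cls in row}
instances_seen = {inst for row in instance for inst in row if inst != 0}
print(classes_seen, instances_seen, panoptic[1][1], panoptic[1][4])
```

The semantic map alone cannot tell how many cars there are; the instance IDs supply that, and the panoptic map carries both pieces of information for every pixel.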
Real-Time Detection
Speed matters for practical applications. Self-driving cars need detection in milliseconds, not seconds.
YOLO variants dominate real-time detection. YOLOv8-nano can run at 500+ FPS on a modern GPU while maintaining good accuracy, though the exact figure depends on the hardware, input resolution, and inference runtime.
Model optimization techniques:
Quantization: Convert 32-bit floats to 8-bit integers. Roughly 4× smaller and up to about 3× faster, typically with minimal accuracy loss.
Pruning: Remove redundant neurons. Can cut model size by 50-90% with small accuracy drops.
Knowledge distillation: Train a small "student" model to mimic a large "teacher" model. The student learns the teacher's expertise in a compact form.
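The first two techniques can be illustrated on a handful of weights. This is a minimal sketch under stated assumptions: it uses symmetric per-tensor int8 quantization and unstructured magnitude pruning (zeroing individual small weights, a simpler variant than removing whole neurons). The function names are hypothetical, and production toolchains do considerably more, such as calibration and fine-tuning.

```python
import struct


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    drop = set(smallest)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.50, -1.27, 0.03, 0.90, -0.02, 0.25]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage ratio: 4 bytes per float32 value versus 1 byte per int8 value.
size_ratio = struct.calcsize("<f") // struct.calcsize("<b")

pruned = magnitude_prune(weights, sparsity=0.5)
print(q, max_error, size_ratio, pruned)
```

The rounding error per weight is bounded by half the scale, which is why accuracy usually drops only slightly when the weight range is well behaved.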
Non-Maximum Suppression
Without NMS, you'd get dozens of overlapping boxes for the same object. NMS keeps the highest-scoring box and removes the others that overlap it too much:
Non-Maximum Suppression
    Before NMS:          After NMS:
    ┌─────────────┐      ┌─────────────┐
    │ ┌─────────┐ │      │             │
    │ │ ┌─────┐ │ │      │   ┌─────┐   │
    │ │ │ cat │ │ │      │   │ cat │   │
    │ │ └─────┘ │ │      │   └─────┘   │
    │ └─────────┘ │      │             │
    └─────────────┘      └─────────────┘
    Multiple overlapping boxes → Best single box
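A common way to implement this is greedy NMS based on intersection over union (IoU). The sketch below is a plain-Python version for a single class, assuming boxes in (x1, y1, x2, y2) pixel format and an IoU threshold of 0.5. Both the threshold and the sample boxes are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [
    (10, 10, 110, 110),    # three near-duplicate boxes on one object
    (12, 8, 112, 108),
    (15, 15, 105, 115),
    (200, 200, 260, 260),  # a separate object elsewhere in the image
]
scores = [0.90, 0.75, 0.60, 0.80]
kept = nms(boxes, scores)
print(kept)
```

In multi-class detectors this is typically run per class, so a box for one class does not suppress an overlapping box of another.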
Getting Started
For practical projects, start with a pre-trained model. YOLOv8 from Ultralytics is an easy entry point: install, load, predict. For research or maximum accuracy, use Detectron2 (Meta AI, formerly Facebook AI Research) or MMDetection (OpenMMLab).
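The "install, load, predict" flow with Ultralytics might look like the sketch below. It assumes the package has been installed with `pip install ultralytics` and that the pre-trained `yolov8n.pt` weights can be downloaded on first use. The wrapper function `detect_with_yolov8` and the dictionary layout it returns are conveniences invented for this example, not part of the library.

```python
def detect_with_yolov8(image_path, weights="yolov8n.pt", conf=0.25):
    """Run a pre-trained YOLOv8 model on one image and return plain dicts."""
    # Imported lazily so this file can be loaded without the package installed.
    from ultralytics import YOLO

    model = YOLO(weights)                      # load (or download) the weights
    result = model(image_path, conf=conf)[0]   # one result object per image

    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        detections.append({
            "label": result.names[int(box.cls[0])],
            "confidence": float(box.conf[0]),
            "box_xyxy": (x1, y1, x2, y2),
        })
    return detections
```

Calling `detect_with_yolov8("street.jpg")` (a placeholder file name) would return one dictionary per detected object. The `conf` argument is the minimum confidence a detection needs to be reported.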
Label your data carefully. Garbage annotations produce garbage models. Tools like LabelImg, CVAT, and Roboflow make annotation easier and support common formats (COCO, Pascal VOC, YOLO).
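The three formats mentioned above store the same box in different ways: Pascal VOC uses pixel corners (xmin, ymin, xmax, ymax), COCO uses a pixel corner plus size (x, y, width, height), and YOLO uses a center plus size normalized by the image dimensions. The sketch below covers only these coordinate conventions, not the surrounding XML, JSON, or text file layouts, and the function names are hypothetical.

```python
def voc_to_coco(box):
    """(xmin, ymin, xmax, ymax) in pixels -> (x, y, width, height) in pixels."""
    xmin, ymin, xmax, ymax = box
    return (xmin, ymin, xmax - xmin, ymax - ymin)


def voc_to_yolo(box, img_w, img_h):
    """(xmin, ymin, xmax, ymax) in pixels -> normalized (cx, cy, w, h)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2 / img_w,
            (ymin + ymax) / 2 / img_h,
            (xmax - xmin) / img_w,
            (ymax - ymin) / img_h)


def yolo_to_voc(box, img_w, img_h):
    """Normalized (cx, cy, w, h) -> (xmin, ymin, xmax, ymax) in pixels."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


voc_box = (100, 200, 300, 400)            # example box in a 640x480 image
coco_box = voc_to_coco(voc_box)
yolo_box = voc_to_yolo(voc_box, img_w=640, img_h=480)
round_trip = yolo_to_voc(yolo_box, img_w=640, img_h=480)
print(coco_box, yolo_box, round_trip)
```

Mixing these conventions up is a frequent source of silently wrong labels, so a round-trip check like the one above is worth running on a few annotations after any conversion.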