Object Detection & Image Segmentation

Locating and labeling objects in images.

Object Detection

Object detection goes beyond classification. Instead of saying "this image contains a cat," it tells you WHERE the cat is — drawing a bounding box around it and labeling it. It's the technology behind self-driving cars identifying pedestrians, security cameras detecting intruders, and your phone's face detection.

How Object Detection Works

The general pipeline: scan the image, propose regions where objects might be, classify each region, and refine the bounding boxes. Different methods do this in different ways.


    Detection Pipeline
    ──────────────────────────────────────────────────
    │                                                 │
    │  Input Image                                    │
    │       │                                         │
    │       ▼                                         │
    │  Feature Extraction (CNN backbone)              │
    │       │                                         │
    │       ▼                                         │
    │  Region Proposals / Anchor Boxes                │
    │       │                                         │
    │       ▼                                         │
    │  Classification: What is it?                    │
    │  Regression: Where is it? (bounding box)        │
    │       │                                         │
    │       ▼                                         │
    │  Non-Maximum Suppression (remove duplicates)    │
    │       │                                         │
    │       ▼                                         │
    │  Final detections with confidence scores        │
    │                                                 │
    └─────────────────────────────────────────────────

Two-Stage Detectors: R-CNN Family

These detectors work in two stages: first propose regions, then classify them. Accurate but slower.

R-CNN (2014): The original. Uses selective search to propose ~2000 regions, extracts CNN features from each, then classifies. Painfully slow — took 47 seconds per image.

Fast R-CNN (2015): Processes the entire image once with a CNN, then projects region proposals onto the feature map. Much faster.
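
As a rough sketch of the projection step, a proposal given in image pixels can be mapped onto the feature map by dividing its coordinates by the backbone's total stride. The function name `project_roi`, the stride of 16, and the outward rounding are assumptions made for illustration, not details from the text above.

```python
import math

def project_roi(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) onto feature-map cells.

    Rounds outward (floor the top-left corner, ceil the bottom-right)
    so the projected region fully covers the original proposal.
    The stride of 16 is an assumed value for illustration.
    """
    x1, y1, x2, y2 = box
    return (
        math.floor(x1 / stride),
        math.floor(y1 / stride),
        math.ceil(x2 / stride),
        math.ceil(y2 / stride),
    )

# Hypothetical proposal on a 640x480 image.
roi = project_roi((100, 60, 420, 300))
print(roi)
```

Because every proposal is projected onto the same feature map, the expensive CNN pass runs once per image rather than once per region.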

Faster R-CNN (2015): Replaces selective search with a Region Proposal Network (RPN) that reuses the backbone's features, so the whole pipeline becomes end-to-end trainable. Long a standard baseline when accuracy matters more than speed.


    R-CNN Evolution
    ──────────────────────────────────────────────
    │                                             │
    │  R-CNN:  Image → 2000 crops → CNN → SVM    │
    │          (slow, 47s/image)                  │
    │                                             │
    │  Fast R-CNN: Image → CNN → ROI pool → FC   │
    │              (faster, 2s/image)             │
    │                                             │
    │  Faster R-CNN: Image → CNN → RPN → Detect  │
    │                (fast, 0.2s/image)           │
    │                                             │
    │  Accuracy: R-CNN < Fast < Faster            │
    │  Speed:    R-CNN < Fast < Faster            │
    ──────────────────────────────────────────────

One-Stage Detectors: YOLO Family

YOLO (You Only Look Once) skips the region proposal step. It divides the image into a grid and predicts bounding boxes and classes for every grid cell simultaneously. Much faster, with impressive accuracy.

The key insight: treat detection as a single regression problem, mapping image pixels directly to box coordinates and class probabilities, instead of running a classifier over many proposed regions.


    YOLO Grid Approach
    ──────────────────────────────────────────────
    │                                             │
    │  ┌───┬───┬───┬───┬───┬───┬───┐            │
    │  │   │   │   │   │   │   │   │            │
    │  ├───┼───┼───┼───┼───┼───┼───┤            │
    │  │   │   │ 🐱│   │   │   │   │            │
    │  ├───┼───┼───┼───┼───┼───┼───┤            │
    │  │   │   │   │   │   │ 🚗│   │            │
    │  ├───┼───┼───┼───┼───┼───┼───┤            │
    │  │   │   │   │   │   │   │   │            │
    │  └───┴───┴───┴───┴───┴───┴───┘            │
    │                                             │
    │  Each cell predicts:                        │
    │  - B bounding boxes (x, y, w, h, conf)     │
    │  - C class probabilities                    │
    │                                             │
    │  Single forward pass = real-time detection  │
    ──────────────────────────────────────────────
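
The per-cell predictions in the diagram determine the size of the network's output. The sketch below works through that arithmetic and shows which cell is responsible for an object. The helper names are hypothetical, and the values S=7, B=2, C=20 and the 448x448 input are the YOLOv1 settings, assumed here rather than taken from the text above.

```python
def yolo_output_shape(S, B, C):
    """Each of the S x S cells predicts B boxes (x, y, w, h, conf) plus C class scores."""
    return (S, S, B * 5 + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the grid cell that contains the object's center."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col

print(yolo_output_shape(S=7, B=2, C=20))   # (7, 7, 30)
print(responsible_cell(200, 150, 448, 448, S=7))
```

An object is assigned to the cell that contains its center, which is one reason the earliest version struggled when several small objects fell into the same cell.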

YOLOv1 (2016): The original. Fast but struggled with small objects and overlapping detections.

YOLOv3-v5: Built on the anchor boxes introduced in YOLOv2 and added feature pyramids for multi-scale detection, with much better accuracy.

YOLOv8 (2023): Anchor-free detection, simpler architecture, state-of-the-art speed-accuracy tradeoff.

SSD: Single Shot MultiBox Detector

Like YOLO, SSD is a one-stage detector, but it makes predictions from feature maps at multiple scales. This helps it detect small objects better than early YOLO versions. Good balance of speed and accuracy.

Image Segmentation

Segmentation takes detection further — instead of bounding boxes, it classifies EVERY pixel. Three types:

Semantic segmentation: Classifies each pixel but doesn't distinguish between objects of the same class. Two cars in one image? Both labeled "car."

Instance segmentation: Classifies pixels AND distinguishes between instances. Car 1 and Car 2 are separate objects.

Panoptic segmentation: Combines both. Every pixel gets a class label, and pixels that belong to countable objects (cars, people) also get an instance ID. The most complete understanding.


    Segmentation Types
    ──────────────────────────────────────────────
    │                                             │
    │  Semantic       Instance      Panoptic      │
    │  ┌──────┐      ┌──────┐     ┌──────┐      │
    │  │▓▓▓▓▓▓│      │▓▓▓▓▓▓│     │▓▓▓▓▓▓│      │
    │  │▓▓░░▓▓│      │▓▓██▓▓│     │▓▓██▓▓│      │
    │  │░░░░░░│      │░░░░░░│     │░░░░░░│      │
    │  └──────┘      └──────┘     └──────┘      │
    │  both cars     car1=▓▓     both class      │
    │  same class    car2=██     + instance       │
    ──────────────────────────────────────────────
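
The three label types can be illustrated with tiny masks. In this sketch the class IDs, the 3x6 image, and the `class * 1000 + instance` encoding are all assumptions chosen for illustration; real datasets define their own encodings.

```python
# Class IDs (illustrative): 0 = background, 1 = car
semantic = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
]
# Instance IDs: 0 = no instance, 1 = first car, 2 = second car
instance = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 2, 2],
    [1, 1, 0, 0, 2, 2],
]

def panoptic(semantic, instance, offset=1000):
    """Fuse class and instance IDs into one label per pixel: class * offset + instance."""
    return [
        [c * offset + i for c, i in zip(srow, irow)]
        for srow, irow in zip(semantic, instance)
    ]

pan = panoptic(semantic, instance)
labels = sorted({v for row in pan for v in row})
print(labels)  # [0, 1001, 1002]
```

The semantic mask alone cannot tell the two cars apart (both are class 1), while the fused labels 1001 and 1002 keep the class and the instance together.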

Real-Time Detection

Speed matters for practical applications. Self-driving cars need detection in milliseconds, not seconds.

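YOLO variants dominate real-time detection. The smallest models, such as YOLOv8-nano, can run at hundreds of frames per second on a modern GPU while maintaining usable accuracy; exact throughput depends on the hardware, input resolution, and inference runtime.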

Model optimization techniques:

Quantization: Convert 32-bit floats to 8-bit integers. Weights become about 4× smaller and inference is often around 3× faster on hardware with int8 support, usually with minimal accuracy loss.

Pruning: Remove redundant neurons. Can cut model size by 50-90% with small accuracy drops.

Knowledge distillation: Train a small "student" model to mimic a large "teacher" model. The student learns the teacher's expertise in a compact form.
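
Of the techniques above, quantization is the easiest to show in a few lines. The following is a minimal sketch of symmetric per-tensor quantization using only the standard library; the function names and the sample weights are hypothetical, and production toolchains use more elaborate schemes (per-channel scales, calibration data).

```python
from array import array

def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [v * scale for v in q]

weights = array("f", [0.5, -1.27, 0.03, 1.0])   # hypothetical float32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(list(q))
print(len(weights) * weights.itemsize, "bytes ->", len(q) * q.itemsize, "bytes")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is where the small accuracy loss comes from.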

Non-Maximum Suppression

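Without NMS, you'd get dozens of overlapping boxes for the same object. NMS keeps the highest-confidence box, removes any other box that overlaps it by more than an intersection-over-union (IoU) threshold, and repeats for the boxes that remain: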


    Non-Maximum Suppression
    ──────────────────────────────────────────────
    │                                             │
    │  Before NMS:           After NMS:           │
    │  ┌────────────┐        ┌────────────┐      │
    │  │ ┌────────┐ │        │            │      │
    │  │ │┌──────┐│ │        │   ┌────┐   │      │
    │  │ ││┌────┐││ │   →    │   │🐱│   │      │
    │  │ │││ 🐱│││ │        │   └────┘   │      │
    │  │ ││└────┘││ │        │            │      │
    │  │ │└──────┘│ │        └────────────┘      │
    │  │ └────────┘ │                             │
    │  └────────────┘                             │
    │  Multiple overlapping boxes → Best single  │
    ──────────────────────────────────────────────
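
A minimal sketch of greedy NMS in plain Python follows. Overlap is measured with intersection over union (IoU): the area two boxes share divided by the area they cover together. The 0.5 threshold and the sample boxes are assumptions for illustration; libraries ship faster, vectorized versions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the top-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Two near-duplicate boxes on one object, plus a separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive
```

In multi-class detectors this is usually applied per class, so a person standing in front of a car does not suppress the car's box.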

Getting Started

For practical projects, start with a pre-trained model. YOLOv8 from Ultralytics is the easiest entry point — install, load, predict. For research or maximum accuracy, use Detectron2 (Facebook) or MMDetection (OpenMMLab).
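
The install, load, predict flow might look like the sketch below. It assumes the `ultralytics` package is installed and that the `yolov8n.pt` weights can be downloaded on first use; `summarize`, `detect`, and the image path are hypothetical names chosen for illustration.

```python
def summarize(name, confidence, box):
    """Format one detection as a readable line."""
    x1, y1, x2, y2 = (round(v) for v in box)
    return f"{name} {confidence:.2f} at ({x1}, {y1}, {x2}, {y2})"

def detect(image_path, weights="yolov8n.pt"):
    """Run a pre-trained YOLOv8 model and return one summary line per detection."""
    from ultralytics import YOLO  # requires: pip install ultralytics

    model = YOLO(weights)            # downloads the weights on first use
    result = model(image_path)[0]    # one Results object per input image
    return [
        summarize(result.names[int(cls)], float(conf), xyxy.tolist())
        for xyxy, conf, cls in zip(
            result.boxes.xyxy, result.boxes.conf, result.boxes.cls
        )
    ]

# Usage (the image path is a placeholder):
# for line in detect("street.jpg"):
#     print(line)
```

The import sits inside `detect` so the formatting helper can be used and tested without the package installed.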

Label your data carefully. Garbage annotations produce garbage models. Tools like LabelImg, CVAT, and Roboflow make annotation easier and support common formats (COCO, Pascal VOC, YOLO).
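
The three formats mostly differ in how they encode a box, and mixing them up is a common source of silently wrong labels. The sketch below converts a Pascal VOC box to the other two conventions; the function names and the sample annotation are hypothetical.

```python
def voc_to_coco(box):
    """Pascal VOC (xmin, ymin, xmax, ymax) -> COCO (x, y, width, height), all in pixels."""
    xmin, ymin, xmax, ymax = box
    return (xmin, ymin, xmax - xmin, ymax - ymin)

def voc_to_yolo(box, img_w, img_h):
    """Pascal VOC pixels -> YOLO (x_center, y_center, width, height), normalized to 0..1."""
    xmin, ymin, xmax, ymax = box
    return (
        (xmin + xmax) / 2 / img_w,
        (ymin + ymax) / 2 / img_h,
        (xmax - xmin) / img_w,
        (ymax - ymin) / img_h,
    )

box = (100, 200, 300, 400)   # hypothetical annotation on a 640x480 image
print(voc_to_coco(box))      # (100, 200, 200, 200)
print(voc_to_yolo(box, 640, 480))
```

Annotation tools typically handle these conversions on export, but checking a few boxes by hand after converting is a cheap way to catch format mistakes.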

🧪 Quick Quiz

What is the difference between object detection and image classification?
