Computer Vision Basics

Making machines see and interpret images.

Computer vision teaches machines to "see" — interpreting images and videos the way humans do. It's behind facial recognition, self-driving cars, medical imaging, and Instagram filters.

An image is just a grid of numbers. A 224×224 color image is a 224×224×3 tensor (height × width × RGB channels). In a standard 8-bit image, each channel value is an integer from 0 to 255, so every color pixel is a triple of such numbers. The challenge is extracting meaning from these numbers.

Image Representation

Before any processing, you need to understand how images are stored:


    Image as Numbers
    ──────────────────────────────────────────────
    │                                             │
    │  Grayscale: 2D matrix (H × W)              │
    │  ┌────────────────────┐                     │
    │  │ 142  89  34  200  │  ← pixel values     │
    │  │ 67  180  95  178  │    (0 = black       │
    │  │ 234 120  45  156  │     255 = white)    │
    │  └────────────────────┘                     │
    │                                             │
    │  Color: 3D tensor (H × W × 3)              │
    │  ┌────────────────────┐                     │
    │  │ R: [142, 89, ...]  │  ← three channels  │
    │  │ G: [67, 180, ...]  │    stacked          │
    │  │ B: [234, 120, ...] │                     │
    │  └────────────────────┘                     │
    ──────────────────────────────────────────────
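The layouts above map directly onto arrays. Here is a minimal sketch, assuming NumPy and 8-bit images; the grayscale values are copied from the diagram, and the all-red color image is just a made-up stand-in:

```python
import numpy as np

# Grayscale: a 2D array of shape (H, W), one uint8 intensity per pixel.
gray = np.array(
    [[142, 89, 34, 200],
     [67, 180, 95, 178],
     [234, 120, 45, 156]],
    dtype=np.uint8,
)

# Color: a 3D array of shape (H, W, 3), one value per channel (R, G, B).
color = np.zeros((224, 224, 3), dtype=np.uint8)
color[..., 0] = 255  # fill the red channel, so every pixel is pure red

gray_shape = gray.shape         # (3, 4)
color_shape = color.shape       # (224, 224, 3)
top_left_pixel = color[0, 0]    # the (R, G, B) triple of one pixel
values_per_image = color.size   # 224 * 224 * 3 numbers in total
```

Note that some frameworks (PyTorch, for example) store images channels-first, as (3, H, W), so it is worth checking the expected layout before feeding arrays to a model.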

Image Preprocessing

Neural networks are picky about input. Standard preprocessing steps:

Resizing: All images in a batch must share the same size, and many architectures expect a fixed input resolution. Common sizes: 224×224, 256×256, 299×299.

Normalization: Scale pixel values to [0,1], or standardize each channel to mean 0 and standard deviation 1 (usually with statistics computed over the training set). This helps training converge faster.

Data augmentation: Artificially increase dataset size by applying random transformations — rotations, flips, crops, color jitter. Helps prevent overfitting.


    Data Augmentation Examples
    ──────────────────────────────────────────────
    │  Original  │  Flipped  │  Rotated  │ Cropped│
    │  ┌──────┐  │  ┌──────┐ │  ┌──────┐ │┌────┐ │
    │  │  🐱  │  │  │  🐱  │ │  │  /🐱 │ ││ 🐱 │ │
    │  │      │  │  │      │ │  │ /    │ ││    │ │
    │  └──────┘  │  └──────┘ │  └──────┘ │└────┘ │
    │                                             │
    │  One image → many training samples          │
    ──────────────────────────────────────────────
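The normalization and augmentation steps can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not a production pipeline: the function names are hypothetical, the input is random noise standing in for a photo, and the statistics are computed per image rather than over a whole training set.

```python
import numpy as np

def scale_to_unit(image):
    """Map uint8 pixel values from [0, 255] to floats in [0, 1]."""
    return image.astype(np.float32) / 255.0

def standardize(image, eps=1e-8):
    """Shift and scale each channel to mean 0 and standard deviation 1."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)

def random_augment(image, rng, crop_size):
    """Apply a random horizontal flip followed by a random square crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # reverse the width axis: horizontal flip
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]

rng = np.random.default_rng(seed=0)
raw = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in image

unit = scale_to_unit(raw)
standardized = standardize(unit)
# One source image yields several different 224x224 training samples.
samples = [random_augment(unit, rng, crop_size=224) for _ in range(4)]
```

Augmentation is normally applied on the fly during training, so the model sees a slightly different version of each image in every epoch.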

Feature Extraction

Before deep learning, engineers hand-crafted features to detect edges, corners, and textures. Although these techniques are outdated for most tasks, understanding them helps you appreciate what CNNs learn automatically.

Edge detection: Identifying boundaries where intensity changes sharply. Simple convolution kernels can detect horizontal, vertical, or diagonal edges.

Corner detection: Finding points where edges meet. Corners are distinctive and useful for matching features across images.

Histogram of Oriented Gradients (HOG): Captures shape information by building histograms of gradient (edge) directions in local regions. It was the go-to method for pedestrian detection before CNNs.
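To make the HOG idea concrete, here is a heavily simplified sketch of the histogram for a single cell, assuming NumPy. The function name and the 9-bin, 0 to 180 degree layout are illustrative choices; the full method also interpolates between bins and normalizes over blocks of cells, and both steps are omitted here.

```python
import numpy as np

def orientation_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient directions in one cell, weighted by magnitude."""
    gy, gx = np.gradient(cell.astype(np.float64))  # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0  # fold directions into [0, 180)
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    return hist

# An 8x8 cell whose brightness increases from left to right: every gradient
# points along the x axis (0 degrees), so the first bin collects all the weight.
cell = np.tile(np.arange(8, dtype=np.float64) * 10.0, (8, 1))
hist = orientation_histogram(cell)
dominant_bin = int(np.argmax(hist))
```

Concatenating such histograms over a grid of cells gives a descriptor that summarizes local shape while ignoring exact pixel values.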


    Edge Detection Kernels
    ──────────────────────────────────────────────
    │ Horizontal    │ Vertical      │ Diagonal    │
    │ ┌───────────┐ │ ┌───────────┐ │ ┌─────────┐│
    │ │ -1 -1 -1  │ │ │ -1  0  1  │ │ │ 0  1  1 ││
    │ │  0  0  0  │ │ │ -1  0  1  │ │ │-1  0  1 ││
    │ │  1  1  1  │ │ │ -1  0  1  │ │ │-1 -1  0 ││
    │ └───────────┘ │ └───────────┘ │ └─────────┘│
    ──────────────────────────────────────────────
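Applying such a kernel is a small loop. The sketch below, assuming NumPy, slides the horizontal and vertical kernels from the diagram over a tiny synthetic image with one vertical edge; `filter2d` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

vertical_kernel = np.array([[-1, 0, 1],
                            [-1, 0, 1],
                            [-1, 0, 1]], dtype=np.float64)
horizontal_kernel = vertical_kernel.T  # rows of -1, 0, 1 as in the diagram

image = np.zeros((5, 6), dtype=np.float64)
image[:, 3:] = 255.0  # dark left half, bright right half: one vertical edge

vertical_response = filter2d(image, vertical_kernel)      # large where the edge is
horizontal_response = filter2d(image, horizontal_kernel)  # zero: no horizontal edge
```

Strictly speaking this computes a cross-correlation (the kernel is not flipped), which is also what deep learning libraries usually call "convolution".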

Convolutional Neural Networks (CNNs)

CNNs are the backbone of computer vision. Instead of treating every pixel as an independent input, they use convolutions to detect local patterns, which later layers compose into more complex features.


    CNN Feature Hierarchy
    ──────────────────────────────────────────────
    │                                             │
    │  Input Image                                │
    │       │                                     │
    │       ▼                                     │
    │  Layer 1-3: Edges, corners, textures        │
    │  ┌─────────────────────────────────┐        │
    │  │ ╱ ─ │ ╲ │ ─ ╱ │ ─ │ ╲ ─ ╱    │        │
    │  └─────────────────────────────────┘        │
    │       │                                     │
    │       ▼                                     │
    │  Layer 4-6: Parts (eyes, wheels, leaves)    │
    │  ┌─────────────────────────────────┐        │
    │  │ ◯   ◯   ⬡   ◯   ⬡   ◯        │        │
    │  └─────────────────────────────────┘        │
    │       │                                     │
    │       ▼                                     │
    │  Layer 7-9: Whole objects (faces, cars)     │
    │  ┌─────────────────────────────────┐        │
    │  │ 🐱   🚗   🌳   🏠   👤        │        │
    │  └─────────────────────────────────┘        │
    │                                             │
    │  Each layer builds on the previous          │
    ──────────────────────────────────────────────

The magic of CNNs: they learn features automatically. You don't need to tell the model to look for edges — it discovers that edges are useful on its own.

Key CNN Components

Convolution layer: Slides small filters across the image, computing dot products at each position. Each filter detects a specific pattern.

Pooling layer: Reduces spatial dimensions while keeping important features. Max pooling takes the maximum value in each region — keeps the strongest activation.

Batch normalization: Normalizes each layer's activations using the mean and variance of the current mini-batch. Speeds up training and stabilizes learning.

Global Average Pooling: Averages each feature map into a single number. Often used in place of large fully connected layers before the classifier, which cuts the parameter count dramatically.
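The pooling and normalization components are easy to express directly. Below is a rough NumPy sketch; the function names and tensor shapes (such as a 7×7×512 feature map and 1000 output classes) are assumptions chosen for illustration, and the batch normalization shown leaves out the learnable scale and shift used in practice.

```python
import numpy as np

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2 on an (H, W) map; H and W must be even."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def batch_norm(batch, eps=1e-5):
    """Normalize each channel of an (N, H, W, C) batch to mean 0, variance 1."""
    mean = batch.mean(axis=(0, 1, 2), keepdims=True)
    var = batch.var(axis=(0, 1, 2), keepdims=True)
    return (batch - mean) / np.sqrt(var + eps)

def global_average_pool(fmaps):
    """Collapse (H, W, C) feature maps to a length-C vector of per-map means."""
    return fmaps.mean(axis=(0, 1))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]], dtype=np.float64)
pooled = max_pool2x2(fmap)  # keeps the largest value of each 2x2 block

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 16))
normalized = batch_norm(batch)

features = rng.random((7, 7, 512))
vector = global_average_pool(features)  # one number per feature map

# Weights needed by a 1000-class layer: flattened 7x7x512 input vs. pooled input.
flatten_weights = 7 * 7 * 512 * 1000
gap_weights = 512 * 1000
```

In this hypothetical setup, pooling first shrinks the final layer from about 25 million weights to about half a million.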

Classic CNN Architectures

Several architectures have shaped the field. Top-5 accuracy in the table refers to ImageNet classification:


    Architecture Evolution
    ──────────────────────────────────────────────
    │ Year │ Model        │ Layers │ Top-5 Acc │
    ──────────────────────────────────────────────
    │ 2012 │ AlexNet      │ 8      │ 83.6%     │
    │ 2014 │ VGGNet       │ 19     │ 92.7%     │
    │ 2014 │ GoogLeNet    │ 22     │ 93.3%     │
    │ 2015 │ ResNet       │ 152    │ 96.4%     │
    │ 2017 │ DenseNet     │ 264    │ 96.5%     │
    │ 2019 │ EfficientNet │  ~     │ 97.1%     │
    │ 2020 │ ViT          │  ~     │ 97.5%+    │
    ──────────────────────────────────────────────
    │ Trend: Deeper → Better (until diminishing    │
    │ returns → then architectural innovation)     │
    ──────────────────────────────────────────────

ResNet introduced skip connections, which make very deep networks trainable. EfficientNet scales network depth, width, and input resolution together using a single compound coefficient. ViT applies a pure Transformer to sequences of image patches, showing that attention mechanisms work beyond text.
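A skip connection simply adds a block's input back to its output. The following is a toy sketch of that idea, assuming NumPy and using plain matrix multiplications in place of the convolution and batch normalization layers that real ResNet blocks contain; `residual_block` and the weight shapes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(x + F(x)), where F is two linear maps with a ReLU in between."""
    fx = relu(x @ w1) @ w2
    return relu(x + fx)  # adding x back is the skip connection

rng = np.random.default_rng(0)
x = relu(rng.normal(size=(4, 16)))  # non-negative activations from an earlier layer

# With zero weights, F(x) = 0 and the block passes its input through unchanged.
zeros = np.zeros((16, 16))
identity_out = residual_block(x, zeros, zeros)

# With small random weights, the output is the input plus a small correction.
w1 = rng.normal(scale=0.1, size=(16, 16))
w2 = rng.normal(scale=0.1, size=(16, 16))
out = residual_block(x, w1, w2)
```

Because each block can fall back to the identity this easily, stacking many of them does not have to make the network harder to train.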

Practical Applications

Computer vision is everywhere:

Medical imaging: Detecting tumors, analyzing X-rays, segmenting organs. On some narrowly defined tasks, models have matched or exceeded radiologists in published studies.

Autonomous vehicles: Understanding the road by detecting lanes, pedestrians, signs, and other vehicles in real time.

Manufacturing: Quality control — detecting defects in products on assembly lines faster than human inspectors.

Augmented reality: Understanding the environment to overlay digital content — like measuring furniture with your phone camera.

The field moves fast. What was cutting-edge two years ago is now a commodity. Start with pre-trained models and transfer learning — it's rarely worth training from scratch.
