Computer Vision Basics
Computer vision teaches machines to "see": to interpret images and videos the way humans do. It's behind facial recognition, self-driving cars, medical imaging, and Instagram filters.
An image is just a grid of numbers. A 224×224 color image is a 224×224×3 tensor (height × width × RGB channels). Each value is typically an 8-bit integer from 0 to 255. The challenge is extracting meaning from these numbers.
Image Representation
Before any processing, you need to understand how images are stored:
Image as Numbers
┌──────────────────────────────────────────────┐
│                                              │
│  Grayscale: 2D matrix (H × W)                │
│  ┌────────────────────┐                      │
│  │ 142   89   34  200 │  ← pixel values      │
│  │  67  180   95  178 │    (0 = black,       │
│  │ 234  120   45  156 │     255 = white)     │
│  └────────────────────┘                      │
│                                              │
│  Color: 3D tensor (H × W × 3)                │
│  ┌────────────────────┐                      │
│  │ R: [142, 89, ...]  │  ← three channels    │
│  │ G: [67, 180, ...]  │    stacked           │
│  │ B: [234, 120, ...] │                      │
│  └────────────────────┘                      │
└──────────────────────────────────────────────┘
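To make the storage layout concrete, here is a minimal sketch in Python. It assumes NumPy is available; the variable names `gray` and `color` are illustrative, and the color image is a synthetic all-red array rather than a real photo.

```python
import numpy as np

# A tiny 3x4 grayscale image: one 8-bit intensity per pixel (0 = black, 255 = white).
gray = np.array(
    [[142,  89,  34, 200],
     [ 67, 180,  95, 178],
     [234, 120,  45, 156]],
    dtype=np.uint8,
)

# A 224x224 color image: height x width x 3 channels (R, G, B), initially all zeros.
color = np.zeros((224, 224, 3), dtype=np.uint8)
color[..., 0] = 255  # fill the red channel everywhere, giving a pure red image

print(gray.shape)     # (3, 4)
print(color.shape)    # (224, 224, 3)
print(gray[0, 3])     # 200, the pixel in row 0, column 3
print(color.size)     # 150528 numbers in total (224 * 224 * 3)
```

Indexing is row first, then column, then channel, which matches the height × width × channels order used above.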
Image Preprocessing
Neural networks are picky about input. Standard preprocessing steps:
Resizing: All images must be the same size. Common sizes: 224×224, 256×256, 299×299.
Normalization: Scale pixel values to [0,1] or standardize to have mean 0 and std 1. Helps training converge faster.
Data augmentation: Artificially increase dataset size by applying random transformations such as rotations, flips, crops, and color jitter. Helps prevent overfitting.
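The resizing and normalization steps above can be sketched with plain NumPy. This is a rough illustration, not a production pipeline: the helper names (`resize_nearest`, `scale_to_unit`, `standardize`) are made up for this example, nearest-neighbor sampling stands in for the smoother interpolation that image libraries normally use, and the input is random noise standing in for a photo.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Resize an H x W x C image with nearest-neighbor sampling."""
    in_h, in_w = image.shape[:2]
    # For each output row/column, pick the source index it maps back to.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols[None, :]]

def scale_to_unit(image):
    """Map 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return image.astype(np.float32) / 255.0

def standardize(image, eps=1e-8):
    """Shift and scale each channel to mean 0 and standard deviation 1."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)  # stand-in for a photo

resized = resize_nearest(raw, 224, 224)
unit = scale_to_unit(resized)
standardized = standardize(unit)

print(resized.shape)                    # (224, 224, 3)
print(unit.min() >= 0.0, unit.max() <= 1.0)
print(standardized.mean(axis=(0, 1)))   # close to 0 for each channel
```

In practice the mean and standard deviation are usually computed once over the whole training set and reused for every image, rather than per image as in this sketch.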
Data Augmentation Examples
┌──────────────────────────────────────────────┐
│  Original   Flipped    Rotated    Cropped    │
│  ┌──────┐   ┌──────┐   ┌──────┐   ┌────┐     │
│  │ 🐱   │   │   🐱 │   │  /🐱 │   │ 🐱 │     │
│  │      │   │      │   │ /    │   │    │     │
│  └──────┘   └──────┘   └──────┘   └────┘     │
│                                              │
│  One image → many training samples           │
└──────────────────────────────────────────────┘
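A few of these transformations can be sketched directly on arrays. The example below assumes NumPy; the function names and the 224×224 crop size are illustrative choices. Rotation is limited to multiples of 90 degrees because arbitrary angles need interpolation, which is better left to an image library.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror the image left to right."""
    return image[:, ::-1]

def rotate_90(image, k=1):
    """Rotate by k * 90 degrees counterclockwise in the height/width plane."""
    return np.rot90(image, k=k, axes=(0, 1))

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

def augment(image, rng):
    """Apply a random horizontal flip followed by a random 224x224 crop."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, 224, 224, rng)

rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in for a photo

samples = [augment(image, rng) for _ in range(4)]
print([s.shape for s in samples])  # four arrays of shape (224, 224, 3)
```

Because the randomness is drawn fresh on every call, the same source image yields a different training sample each epoch.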
Feature Extraction
Before deep learning, engineers hand-crafted features to detect edges, corners, and textures. While outdated for most tasks, understanding these helps appreciate what CNNs learn automatically.
Edge detection: Identifying boundaries where intensity changes sharply. Simple convolution kernels can detect horizontal, vertical, or diagonal edges.
Corner detection: Finding points where edges meet. Corners are distinctive and useful for matching features across images.
Histogram of Oriented Gradients (HOG): Captures shape information by counting edge directions in local regions. Was the go-to for pedestrian detection before CNNs.
Edge Detection Kernels
┌──────────────────────────────────────────────┐
│  Horizontal     Vertical       Diagonal      │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐  │
│  │ -1 -1 -1 │   │ -1  0  1 │   │  1  0 -1 │  │
│  │  0  0  0 │   │ -1  0  1 │   │  0  0  0 │  │
│  │  1  1  1 │   │ -1  0  1 │   │ -1  0  1 │  │
│  └──────────┘   └──────────┘   └──────────┘  │
└──────────────────────────────────────────────┘
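To see such kernels in action, the sketch below slides the horizontal and vertical kernels over a tiny synthetic image. It assumes NumPy; `filter2d` is a hypothetical helper written for clarity rather than speed, and it computes cross-correlation (no kernel flip), which is what deep learning frameworks call convolution.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over a 2D `image` (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

horizontal = np.array([[-1, -1, -1],
                       [ 0,  0,  0],
                       [ 1,  1,  1]], dtype=np.float32)
vertical = np.array([[-1, 0, 1],
                     [-1, 0, 1],
                     [-1, 0, 1]], dtype=np.float32)

# 6x6 test image: dark left half, bright right half, so it contains one vertical edge.
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 1.0

print(filter2d(image, vertical))    # every row is [0, 3, 3, 0]: strong response at the edge
print(filter2d(image, horizontal))  # all zeros: the image has no horizontal edge
```

The vertical kernel responds only where its window straddles the dark-to-bright boundary, while the horizontal kernel sees identical rows above and below and outputs zero everywhere.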
Convolutional Neural Networks (CNNs)
CNNs are the backbone of computer vision. Instead of processing every pixel independently, they use convolutions to detect local patterns that compose into complex features.
CNN Feature Hierarchy
┌──────────────────────────────────────────────┐
│                                              │
│  Input Image                                 │
│        │                                     │
│        ▼                                     │
│  Layers 1-3: Edges, corners, textures        │
│  ┌─────────────────────────────────┐         │
│  │  ╱   ─   ╲   │   ╱   ─   ╲   │  │         │
│  └─────────────────────────────────┘         │
│        │                                     │
│        ▼                                     │
│  Layers 4-6: Parts (eyes, wheels, leaves)    │
│  ┌─────────────────────────────────┐         │
│  │      ○   ○   ⬡   ○   ⬡   ○      │         │
│  └─────────────────────────────────┘         │
│        │                                     │
│        ▼                                     │
│  Layers 7-9: Whole objects (faces, cars)     │
│  ┌─────────────────────────────────┐         │
│  │   face        car        cat    │         │
│  └─────────────────────────────────┘         │
│                                              │
│  Each layer builds on the previous           │
└──────────────────────────────────────────────┘
The magic of CNNs: they learn features automatically. You don't need to tell the model to look for edges; it discovers on its own that edges are useful.
Key CNN Components
Convolution layer: Slides small filters across the image, computing dot products at each position. Each filter detects a specific pattern.
Pooling layer: Reduces spatial dimensions while keeping important features. Max pooling takes the maximum value in each region, keeping the strongest activation.
Batch normalization: Normalizes each channel's activations across the mini-batch. Speeds up training and stabilizes learning.
Global average pooling: Averages each feature map into a single number. Often replaces the large fully connected layers before the classifier, which reduces the parameter count dramatically.
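Three of these components are simple enough to sketch in a few lines of NumPy. The functions below are simplified illustrations with made-up names: the pooling assumes even height and width, and the batch normalization omits the learned scale and shift and the running statistics that real layers keep for inference.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on an H x W x C array (H and W must be even)."""
    h, w, c = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

def global_average_pool(feature_map):
    """Average each channel's H x W map down to a single number."""
    return feature_map.mean(axis=(0, 1))

def batch_norm(batch, eps=1e-5):
    """Normalize each channel of an N x H x W x C batch to mean 0 and variance 1."""
    mean = batch.mean(axis=(0, 1, 2), keepdims=True)
    var = batch.var(axis=(0, 1, 2), keepdims=True)
    return (batch - mean) / np.sqrt(var + eps)

# A 4x4 single-channel feature map holding the values 0..15.
fmap = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
print(max_pool_2x2(fmap)[..., 0])   # 2x2 result with rows (5, 7) and (13, 15)
print(global_average_pool(fmap))    # [7.5]

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 16)).astype(np.float32)
normed = batch_norm(batch)
print(normed.mean(), normed.std())  # approximately 0 and 1
```

Max pooling halves each spatial dimension, while global average pooling collapses the spatial dimensions entirely, leaving one value per channel.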
Classic CNN Architectures
Several architectures have shaped the field:
Architecture Evolution
┌──────┬──────────────┬────────┬───────────────┐
│ Year │ Model        │ Layers │ Top-5 Acc     │
├──────┼──────────────┼────────┼───────────────┤
│ 2012 │ AlexNet      │ 8      │ 83.6%         │
│ 2014 │ VGGNet       │ 19     │ 92.7%         │
│ 2014 │ GoogLeNet    │ 22     │ 93.3%         │
│ 2015 │ ResNet       │ 152    │ 96.4%         │
│ 2017 │ DenseNet     │ 264    │ 96.5%         │
│ 2019 │ EfficientNet │ ~      │ 97.1%         │
│ 2020 │ ViT          │ ~      │ 97.5%+        │
├──────┴──────────────┴────────┴───────────────┤
│ Trend: Deeper → Better (until diminishing    │
│ returns, then architectural innovation)      │
└──────────────────────────────────────────────┘
ResNet introduced skip connections, which let gradients bypass layers and made it practical to train very deep networks. EfficientNet scales network depth, width, and input resolution together using a single compound coefficient. ViT applies a pure Transformer to sequences of image patches, showing that attention mechanisms work beyond text.
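Two of these ideas can be illustrated in a few lines. The sketch below assumes NumPy and uses hypothetical helper names. The residual block uses small dense layers where a real ResNet block uses convolutions and batch normalization, and the patch split covers only the first step of a ViT-style model (16×16 patches are assumed here as an example size).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(x + F(x)), where F is two linear layers with a ReLU in between."""
    fx = relu(x @ w1) @ w2
    return relu(x + fx)  # adding x back is the skip connection

def image_to_patches(image, patch):
    """Split an H x W x C image into flattened, non-overlapping patch x patch tiles."""
    h, w, c = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)        # bring the two grid axes to the front
    return tiles.reshape(-1, patch * patch * c)   # one row per patch

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 64))
w1 = rng.normal(size=(64, 64)) * 0.01
w2 = np.zeros((64, 64))  # with F(x) = 0 the block reduces to relu(x)
print(np.allclose(residual_block(x, w1, w2), relu(x)))  # True

patches = image_to_patches(np.zeros((224, 224, 3)), patch=16)
print(patches.shape)  # (196, 768): a 14 x 14 grid of patches, each 16 * 16 * 3 values
```

The zero-weight case shows why skip connections help depth: a block that has learned nothing still passes its input through, so stacking more blocks does not have to degrade the signal.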
Practical Applications
Computer vision is everywhere:
Medical imaging: Detecting tumors, analyzing X-rays, segmenting organs. AI can match or exceed radiologists in specific tasks.
Autonomous vehicles: Understanding the road by detecting lanes, pedestrians, signs, and other vehicles in real time.
Manufacturing: Quality control, detecting defects in products on assembly lines faster than human inspectors.
Augmented reality: Understanding the environment to overlay digital content, like measuring furniture with your phone camera.
The field moves fast. What was cutting-edge two years ago is now a commodity. Start with pre-trained models and transfer learning; it's rarely worth training from scratch.