Feature Engineering

Transforming raw data into meaningful inputs.

Feature engineering is the art and science of transforming raw data into meaningful inputs for your models. It's often said that better features beat better algorithms: a simple model with well-chosen features will often outperform a complex model with mediocre ones.

Think of it like cooking: the quality of your ingredients matters more than how fancy your kitchen equipment is.

Feature Selection

Not all features are useful. Some are irrelevant, redundant, or harmful. Feature selection keeps the good stuff and throws out the noise.

Filter methods: Evaluate features independently of the model. Use correlation coefficients, chi-squared tests, or mutual information. Fast but ignores feature interactions.

Wrapper methods: Use the model itself to evaluate feature subsets. Try different combinations and measure performance. Accurate but computationally expensive.

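Embedded methods: The model selects features during training. L1 regularization (Lasso) penalizes the absolute size of the weights, which can drive the weights of uninformative features exactly to zero. Selection happens as a side effect of fitting, so it costs little beyond training the model once.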


    Feature Selection Strategies
    ──────────────────────────────────────────────
    │                                             │
    │  Filter (fast)     Wrapper (accurate)       │
    │  ┌─────────┐       ┌─────────────┐         │
    │  │ Correl- │       │ Try subset  │         │
    │  │ ation   │       │ Train model │         │
    │  │ matrix  │       │ Measure acc │         │
    │  │ Remove  │       │ Repeat      │         │
    │  │ low corr│       │ Find best   │         │
    │  └─────────┘       └─────────────┘         │
    │       │                 │                   │
    │       └────────┬────────┘                   │
    │                ▼                            │
    │           Embedded                         │
    │           ┌──────────┐                     │
    │           │ L1 Lasso │                     │
    │           │ weights  │                     │
    │           │ → zero   │                     │
    │           └──────────┘                     │
    ──────────────────────────────────────────────
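
As a rough illustration of a filter method, the sketch below scores each feature by its absolute Pearson correlation with the target and keeps those above a cutoff. The toy data, the `filter_select` helper, and the 0.5 threshold are all assumptions made for this example, not a recommended setting.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: one list of values per feature, one entry per house.
features = {
    "sqft":      [850, 1200, 1500, 2000, 2400],
    "bedrooms":  [2, 3, 3, 4, 4],
    "door_code": [7, 1, 9, 3, 5],   # arbitrary number, unrelated to price
}
price = [150, 210, 260, 340, 400]   # target, in thousands

def filter_select(features, target, threshold=0.5):
    """Keep features whose |correlation| with the target meets the threshold.
    Each feature is scored on its own, so interactions are not considered."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return kept, scores

kept, scores = filter_select(features, price)
print(kept)  # features that passed the filter
```

Here `door_code` is dropped because it barely correlates with price. A wrapper method would instead retrain a model on candidate subsets, which is why it costs more.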

Categorical Encoding

Machine learning models need numbers, but many features are categories — "red", "blue", "green" or "small", "medium", "large". How do you convert these?

Label encoding: Assign integers. red=0, blue=1, green=2. Simple but implies ordering (blue > red?), which may not exist.

One-hot encoding: Create binary columns for each category. red=[1,0,0], blue=[0,1,0], green=[0,0,1]. No false ordering, but creates many columns for high-cardinality features.

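Target encoding: Replace each category with the mean target value for that category. For "color": the average sale price of red, blue, and green items. Handles high cardinality with a single column, but it risks overfitting and target leakage, especially for rare categories, so compute the means on training data only.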


    Encoding Comparison
    ──────────────────────────────────────────────
    │ Original: ["red", "blue", "green", "red"]   │
    │                                             │
    │ Label:    [0, 1, 2, 0]                      │
    │                                             │
    │ One-Hot:  [[1,0,0],                         │
    │            [0,1,0],                         │
    │            [0,0,1],                         │
    │            [1,0,0]]                         │
    │                                             │
    │ Target:   [250, 310, 280, 250]  ← prices   │
    ──────────────────────────────────────────────
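
The three encodings can be sketched in plain Python as follows. The `prices` list is hypothetical, chosen so that the per-color averages match the diagram above. In practice a library would usually handle this.

```python
from collections import defaultdict

colors = ["red", "blue", "green", "red"]
prices = [240, 310, 280, 260]  # hypothetical target values, one per item

# Label encoding: assign integers in order of first appearance.
label_map = {}
for c in colors:
    label_map.setdefault(c, len(label_map))
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category.
n_categories = len(label_map)
one_hot = [[1 if label_map[c] == i else 0 for i in range(n_categories)]
           for c in colors]

# Target encoding: replace each category with its mean target value.
totals, counts = defaultdict(float), defaultdict(int)
for c, p in zip(colors, prices):
    totals[c] += p
    counts[c] += 1
target_encoded = [totals[c] / counts[c] for c in colors]

print(label_encoded)
print(one_hot)
print(target_encoded)
```

Note that the target encoding here uses every row, including the row being encoded. On real data the means would be computed from the training split (or out-of-fold) to limit leakage.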

Normalization & Scaling

Features on different scales distort many algorithms, especially distance-based and gradient-based ones. With a salary of 50,000 and an age of 30, the salary dominates simply because its numbers are bigger.

Min-Max scaling: Rescale to [0,1]. New = (old - min) / (max - min). Preserves distribution shape but sensitive to outliers.

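Standardization (Z-score): Transform to mean=0, std=1. New = (old - mean) / std. Less sensitive to outliers than min-max scaling, although the mean and standard deviation are still pulled by extreme values. A common default for linear models, SVMs, and neural networks.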

Robust scaling: Uses median and IQR instead of mean and std. Best when you have significant outliers.


    Scaling Example
    ──────────────────────────────────────────────
    │ Feature: Salary ($30k - $150k)              │
    │                                             │
    │ Raw:      [30000, 50000, 80000, 150000]    │
    │ MinMax:   [0.0, 0.17, 0.42, 1.0]          │
    │ Z-Score:  [-1.04, -0.60, 0.05, 1.59]        │
    │                                             │
    │ Feature: Age (18 - 65)                      │
    │                                             │
    │ Raw:      [18, 25, 35, 65]                 │
    │ MinMax:   [0.0, 0.15, 0.36, 1.0]          │
    │ Z-Score:  [-0.99, -0.60, -0.04, 1.63]       │
    │                                             │
    │ After scaling: both features have           │
    │ similar ranges and comparable influence     │
    ──────────────────────────────────────────────
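
The three formulas can be checked with a short standard-library sketch. It assumes the population standard deviation for the z-score and linearly interpolated quartiles for the IQR. Other conventions give slightly different numbers.

```python
from statistics import mean, median, pstdev, quantiles

def min_max(values):
    """Rescale to [0, 1]: (old - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardize: (old - mean) / std, using the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def robust(values):
    """Robust scaling: (old - median) / IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

salary = [30000, 50000, 80000, 150000]
age = [18, 25, 35, 65]

for name, column in (("salary", salary), ("age", age)):
    print(name, [round(v, 2) for v in min_max(column)])
    print(name, [round(v, 2) for v in z_score(column)])
    print(name, [round(v, 2) for v in robust(column)])
```

Whichever scaler is used, its statistics (min, max, mean, std, median, IQR) would normally be computed on the training set and then reused unchanged on validation and test data.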

Dimensionality Reduction

Too many features cause the "curse of dimensionality" — models overfit, training slows, and visualization becomes impossible. Dimensionality reduction finds a lower-dimensional representation that preserves the important information.

PCA (Principal Component Analysis): Find orthogonal directions of maximum variance. Projects data onto these "principal components." Linear, fast, and widely used.

t-SNE: Preserves local structure — nearby points stay nearby. Great for visualization but not for preprocessing (computationally expensive, doesn't generalize to new data).

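UMAP: Similar in spirit to t-SNE but typically faster, and it often preserves more of the global structure. Unlike t-SNE, it can also transform new data points, which makes it increasingly popular for both visualization and general dimensionality reduction.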


    PCA Intuition
    ──────────────────────────────────────────────
    │                                             │
    │  2D Data:          1D Projection:           │
    │                                             │
    │  ● ● ●             ● ● ●                   │
    │    ● ● ●     →       ● ● ●                 │
    │      ● ● ●             ● ● ●               │
    │        ● ● ●             ● ● ●             │
    │                                             │
    │  Find direction of     Project onto that    │
    │  maximum variance      direction            │
    │                                             │
    │  PC1 = main variation direction             │
    │  PC2 = orthogonal to PC1                    │
    ──────────────────────────────────────────────
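
One way to make the PCA intuition concrete is a small NumPy sketch that centers the data and takes its singular value decomposition. The `pca` helper and the sample points are assumptions made for illustration. A library implementation would add options such as whitening and more careful sign handling.

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = samples) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing variance and mutually orthogonal.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S**2 / np.sum(S**2))[:n_components]
    projected = X_centered @ components.T
    return projected, components, explained_ratio

# Hypothetical 2D points lying roughly along a diagonal line.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])

projected, components, ratio = pca(X, n_components=1)
print(components)  # direction of PC1 (its sign is arbitrary)
print(ratio)       # share of total variance captured by PC1
```

Because the points sit close to a line, a single component captures almost all of the variance, which is the situation the diagram depicts.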

Handling Missing Data

Real data is messy. Customers skip survey questions. Sensors fail. Records get corrupted. How you handle missing values matters:

Deletion: Remove rows or columns with missing values. Simple but wastes data. Only safe if data is Missing Completely At Random (MCAR).

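Mean/Median/Mode imputation: Fill each gap with the column's mean or median (for numeric features) or its mode (for categorical ones). Quick and dirty. Works acceptably for small amounts of missing data, but it shrinks the column's variance and weakens its correlations with other features.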

KNN imputation: Find the K most similar complete records and use their values. Preserves relationships between features.

Model-based imputation: Train a model to predict missing values from other features. Most accurate but most complex.
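
To show how median and KNN imputation differ, here is a minimal standard-library sketch. The records and both helper functions are hypothetical, and the KNN version assumes that only the target column has gaps and that the other columns are already on comparable scales.

```python
from statistics import median

# Hypothetical records: (age, income in thousands). None marks a missing value.
rows = [(25, 40), (30, 52), (45, None), (47, 90), (52, 95)]

def impute_median(rows, col):
    """Fill missing entries in one column with that column's median."""
    fill = median(r[col] for r in rows if r[col] is not None)
    return [tuple(fill if i == col and v is None else v for i, v in enumerate(r))
            for r in rows]

def impute_knn(rows, col, k=2):
    """Fill missing entries with the mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if row[col] is None:
            # Squared Euclidean distance on every column except the missing one.
            def dist(other, row=row):
                return sum((other[i] - row[i]) ** 2
                           for i in range(len(row)) if i != col)
            nearest = sorted(complete, key=dist)[:k]
            estimate = sum(n[col] for n in nearest) / len(nearest)
            row = tuple(estimate if i == col else v for i, v in enumerate(row))
        filled.append(row)
    return filled

print(impute_median(rows, col=1)[2])
print(impute_knn(rows, col=1)[2])
```

The median fill ignores the person's age, while the KNN fill borrows from the two records closest in age, so it lands nearer the incomes of similar people. That is the sense in which KNN imputation preserves relationships between features.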

Feature Creation

Sometimes the raw features aren't enough. Creating new features from existing ones can dramatically improve performance:

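Domain knowledge: If predicting house prices, create "house age" from the year built and the sale date. Avoid deriving features from the target itself (such as price per square foot when price is what you are predicting), because that leaks the answer into the inputs. If predicting churn, create "days since last purchase."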

Interaction features: Multiply two features together. Height × Width = Area. Sometimes the combination matters more than individual features.

Polynomial features: Create squared or cubed terms. Captures non-linear relationships in linear models.

Date features: Extract day of week, month, season, is_weekend from timestamps. Patterns often depend on time.


    Feature Engineering Ideas
    ──────────────────────────────────────────────
    │ Raw Features        │ Created Features     │
    ──────────────────────────────────────────────
    │ height, width       │ area, aspect_ratio   │
    │ price, sqft         │ price_per_sqft       │
    │ date, time          │ day_of_week, hour    │
    │ first_name          │ name_length, gender  │
    │ text                │ word_count, caps_pct │
    │ latitude, longitude │ distance_to_center   │
    ──────────────────────────────────────────────
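
A few of these ideas can be sketched with the standard library alone. The record layout and field names (`height`, `width`, `timestamp`) are assumptions made for this example.

```python
from datetime import datetime

def make_features(record):
    """Derive new features from one raw record (hypothetical field names)."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "area": record["height"] * record["width"],          # interaction term
        "aspect_ratio": record["width"] / record["height"],
        "height_squared": record["height"] ** 2,             # polynomial term
        "day_of_week": ts.weekday(),                          # Monday = 0
        "month": ts.month,
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,                      # Saturday or Sunday
    }

raw = {"height": 2.0, "width": 3.0, "timestamp": "2024-03-16T14:30:00"}
features = make_features(raw)
print(features)
```

Each derived value is computed only from the inputs of the same record, so the same function can be applied unchanged to training data and to new data at prediction time.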

Practical Workflow

Feature engineering is iterative. Here's a practical approach:

Start by understanding your data — visualize distributions, check for missing values, look at correlations. Create obvious features from domain knowledge. Try simple models first. Look at what the model gets wrong. Engineer features to address those failures. Repeat.

The best feature engineers combine technical skill with domain expertise. They ask "what would a human expert look at?" and encode that knowledge into features the model can use.
