Model Evaluation & Metrics

How to measure if your model actually works.

How Do You Know If Your Model Works?

Building a model is only half the battle. You also need to evaluate its performance rigorously. A model that memorizes its training data but fails on new data is useless; this failure is called overfitting. This lesson covers the key metrics and techniques for evaluating ML models.

Train/Test Split

The most basic evaluation technique: split your data into a training set (usually 70-80%) and a test set (20-30%). Train on the training set, evaluate on the test set. The test set acts as a proxy for "unseen real-world data."


  ┌─────────────────────────────────────────────┐
  │           Full Dataset                      │
  │ ┌───────────────────────┬─────────────────┐ │
  │ │    Training Set       │   Test Set      │ │
  │ │    (80%)              │   (20%)         │ │
  │ │                       │                 │ │
  │ │  Model learns from    │ Model is        │ │
  │ │  this data            │ evaluated on    │ │
  │ └───────────────────────┴─────────────────┘ │
  └─────────────────────────────────────────────┘
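As a minimal sketch of the split described above, the following uses only the Python standard library. The helper name `split_train_test`, its `test_fraction` and `seed` parameters, and the toy data are assumptions made for illustration; they are not taken from any particular library.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))                      # stand-in for 100 labelled examples
train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))                 # 80 20
```

Shuffling before splitting matters when the data is stored in some order (by date or by class, for example), since an unshuffled split could put only one kind of example in the test set.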

Regression Metrics

MAE (Mean Absolute Error) — Average of absolute differences between predictions and actual values. Easy to interpret.
MSE (Mean Squared Error) — Average of squared differences. Penalizes large errors more heavily.
RMSE (Root Mean Squared Error) — Square root of MSE. Back in the original units of the target variable.
R² Score — Proportion of the target's variance explained by the model. 1.0 is perfect, 0.0 means the model is no better than always predicting the mean, and negative values mean it is worse than that baseline.
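The four metrics above can be computed directly from their definitions. The sketch below is one possible plain-Python version; the function name `regression_metrics` and the sample values are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot    # undefined when every target value is identical
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 12.0]    # errors: 1, 0, -1, -3
metrics = regression_metrics(y_true, y_pred)
print(metrics["mae"])             # 1.25
print(metrics["mse"])             # 2.75
```

In this example the single error of 3 contributes 9 of the 11 units of squared error, which shows how MSE weights large errors more heavily than MAE does. R² comes out at roughly 0.45.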

Classification Metrics


  Confusion Matrix:
  ┌─────────────────────────────────────────────┐
  │                Predicted                    │
  │            Positive    Negative             │
  │  ┌─────────┬─────────┬──────────┐           │
  │  │Actual   │  TP     │   FN     │           │
  │  │Positive │ (True +)│ (False -)│           │
  │  ├─────────┼─────────┼──────────┤           │
  │  │Actual   │  FP     │   TN     │           │
  │  │Negative │(False +)│ (True -) │           │
  │  └─────────┴─────────┴──────────┘           │
  │                                             │
  │  Accuracy  = (TP + TN) / Total              │
  │  Precision = TP / (TP + FP)                 │
  │  Recall    = TP / (TP + FN)                 │
  │  F1 Score  = 2 × (Precision × Recall)       │
  │                   / (Precision + Recall)    │
  └─────────────────────────────────────────────┘

Accuracy is the percentage of correct predictions, but it can be misleading on imbalanced datasets. If 99% of emails are not spam, a model that predicts "not spam" for everything gets 99% accuracy yet catches zero spam.
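The spam example can be checked with a few lines of arithmetic. The inbox below (1000 emails, 10 of them spam) is a made-up illustration, and the "model" is just a list of constant predictions.

```python
# Hypothetical inbox: 10 spam emails (label 1) and 990 non-spam emails (label 0).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000                # always predicts "not spam"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = spam_caught / sum(y_true)  # fraction of actual spam that was flagged

print(accuracy)                     # 0.99
print(recall)                       # 0.0
```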

Precision measures how many of the positive predictions were correct. Recall measures how many of the actual positives were found. F1 Score is the harmonic mean of the two, so it is high only when both precision and recall are high.
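To make the formulas concrete, here is a small sketch that counts the confusion-matrix cells from two label lists and derives precision, recall, and F1 from them. The function names and labels are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas themselves dictate.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1            # predicted positive, actually positive
            else:
                fp += 1            # predicted positive, actually negative
        else:
            if t == positive:
                fn += 1            # predicted negative, actually positive
            else:
                tn += 1            # predicted negative, actually negative
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Compute the three scores; 0.0 is returned when a denominator is zero (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)              # 2 1 2 5
print(recall)                      # 0.5
```

Here precision is 2/3 and recall is 1/2, giving an F1 of 4/7 (about 0.57), which sits closer to the lower of the two scores than a plain average would.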

Cross-Validation

Cross-validation gives a more robust evaluation than a single train/test split. K-Fold cross-validation splits the data into K folds, trains on K-1 folds, and tests on the remaining fold. It repeats this K times, rotating the test fold so that every example is used for testing exactly once. The average performance across all folds is your estimate of how the model performs on unseen data.
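The fold rotation can be sketched in plain Python as follows. The helper names are hypothetical, and the "model" is a deliberately trivial baseline that predicts the mean of the training targets, scored with mean absolute error. For brevity the folds are taken in order; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each index lands in exactly one test fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate_mean_baseline(y, k):
    """Return (average MAE, per-fold MAE list) for a predict-the-training-mean baseline."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        train_mean = sum(y[i] for i in train_idx) / len(train_idx)   # "training"
        mae = sum(abs(y[i] - train_mean) for i in test_idx) / len(test_idx)
        fold_scores.append(mae)
    return sum(fold_scores) / k, fold_scores

y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
avg_mae, fold_scores = cross_validate_mean_baseline(y, k=3)
print(fold_scores)                 # [6.0, 1.0, 6.0]
```

The per-fold scores differ noticeably (6.0, 1.0, 6.0), which is the kind of variation a single train/test split would hide; the average of about 4.33 is the cross-validated estimate.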

🧪 Quick Quiz

What is the purpose of a test set in model evaluation?
