Model Evaluation & Metrics

How to measure if your model actually works.

How Do You Know If Your Model Works?

Building a model is only half the battle. You need to evaluate its performance rigorously. A model that memorizes training data but fails on new data is useless. That's called overfitting. This lesson covers the key metrics and techniques for evaluating ML models.

Train/Test Split

The most basic evaluation technique: split your data into a training set (usually 70-80%) and a test set (20-30%). Train on the training set, evaluate on the test set. The test set acts as a proxy for "unseen real-world data."


  ┌─────────────────────────────────────────────┐
  │           Full Dataset                      │
  │ ┌───────────────────────┬─────────────────┐ │
  │ │    Training Set       │   Test Set      │ │
  │ │    (80%)              │   (20%)         │ │
  │ │                       │                 │ │
  │ │  Model learns from    │ Model is        │ │
  │ │  this data            │ evaluated on    │ │
  │ └───────────────────────┴─────────────────┘ │
  └─────────────────────────────────────────────┘
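
The split in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not a production utility: the function name `split_train_test`, the `seed` parameter, and the 100-item stand-in dataset are all assumptions made for the example. Libraries such as scikit-learn ship their own `train_test_split` helper.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))                    # stand-in for 100 labeled examples
train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))               # 80 20
```

Shuffling before splitting matters when the data is stored in some order (for example, sorted by date or by class), because an unshuffled split could put all of one kind of example in the test set.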

Regression Metrics

  • MAE (Mean Absolute Error): Average of absolute differences between predictions and actual values. Easy to interpret.
  • MSE (Mean Squared Error): Average of squared differences. Penalizes large errors more heavily.
  • RMSE (Root Mean Squared Error): Square root of MSE. Back in the original units of the target variable.
  • R² Score: Proportion of variance explained by the model. 1.0 is perfect, 0.0 means the model is no better than predicting the mean, and negative values mean it is worse than predicting the mean.
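
As a rough sketch, the four metrics above can be computed by hand as follows. The function name `regression_metrics` and the sample values are made up for illustration. R² is computed here as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)                                 # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                            # undefined if y_true is constant
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

actual = [3.0, 5.0, 7.0, 9.0]        # illustrative values only
predicted = [2.0, 5.0, 8.0, 11.0]
metrics = regression_metrics(actual, predicted)
print(metrics["mae"], metrics["mse"])   # 1.0 1.5
```

The errors here are 1, 0, -1 and -2. The single error of 2 contributes 4 of the 6 squared-error units, which is why MSE (1.5) is larger than MAE (1.0).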

Classification Metrics


  Confusion Matrix:
  ┌─────────────────────────────────────────────┐
  │                  Predicted                  │
  │             Positive   Negative             │
  │  ┌─────────┬─────────┬──────────┐           │
  │  │Actual   │  TP     │   FN     │           │
  │  │Positive │ (True +)│ (False -)│           │
  │  ├─────────┼─────────┼──────────┤           │
  │  │Actual   │  FP     │   TN     │           │
  │  │Negative │(False +)│ (True -) │           │
  │  └─────────┴─────────┴──────────┘           │
  │                                             │
  │  Accuracy  = (TP + TN) / Total              │
  │  Precision = TP / (TP + FP)                 │
  │  Recall    = TP / (TP + FN)                 │
  │  F1 Score  = 2 × (Precision × Recall)       │
  │                   / (Precision + Recall)    │
  └─────────────────────────────────────────────┘

Accuracy is the percentage of correct predictions. But it can be misleading for imbalanced datasets: if 99% of emails are not spam, a model that predicts "not spam" for everything gets 99% accuracy but catches zero spam.
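
The spam scenario can be reproduced with a tiny simulation. The counts below (10 spam emails out of 1,000) are invented to match the 99% figure. The "model" is just a baseline that always answers "not spam".

```python
# 1 = spam, 0 = not spam; 10 spam emails among 1,000 (a 99% / 1% split)
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000                      # baseline that always predicts "not spam"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

caught_spam = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught_spam / sum(y_true)       # share of actual spam that was flagged

print(f"accuracy = {accuracy:.2f}")      # accuracy = 0.99
print(f"recall   = {recall:.2f}")        # recall   = 0.00
```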

Precision measures how many of the positive predictions were correct. Recall measures how many of the actual positives were found. F1 Score is the harmonic mean of the two, so it is high only when both precision and recall are high.
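
Here is one way to compute these metrics from raw labels, following the confusion-matrix formulas above. The helper name `precision_recall_f1` and the ten sample labels are hypothetical. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, not a universal rule.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Count TP/FP/FN for the positive class and derive the three metrics."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1                      # predicted positive, actually positive
        elif p == positive:
            fp += 1                      # predicted positive, actually negative
        elif t == positive:
            fn += 1                      # predicted negative, actually positive
    # Assumption: report 0.0 instead of failing when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 actual positives
predictions = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # TP=2, FN=2, FP=1
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(round(precision, 3), round(recall, 3), round(f1, 3))   # 0.667 0.5 0.571
```

In this example the model is right about two of its three positive predictions (precision 2/3) but finds only two of the four actual positives (recall 1/2). F1 lands between them at 4/7.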

Cross-Validation

Cross-validation gives a more robust evaluation than a single train/test split. K-Fold cross-validation splits data into K folds, trains on K-1 folds, and tests on the remaining fold. It repeats this K times, rotating the test fold. The average performance across all folds is your estimate.
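
The rotation of folds can be sketched without any library. Everything named here (`k_fold_indices`, `cross_validate`, and the toy `fit_mean` / `mae_score` model) is a hypothetical stand-in for illustration. The sketch does not shuffle the data first, which you would normally want when the data is ordered. Libraries such as scikit-learn provide `KFold` and `cross_val_score` for real use.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on K-1 folds, score on the held-out fold, and average the K scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores), scores

# Toy "model": always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mae_score(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)

xs = list(range(10))
ys = [float(x) for x in xs]
mean_score, fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mae_score)
print(len(fold_scores))                  # 5
```

Because every example lands in the test fold exactly once, the averaged score depends less on one lucky or unlucky split than a single train/test evaluation does.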

🧪 Quick Quiz

What is the purpose of a test set in model evaluation?