How Do You Know If Your Model Works?
Building a model is only half the battle. You need to evaluate its performance rigorously. A model that memorizes training data but fails on new data is useless; this failure is called overfitting. This lesson covers the key metrics and techniques for evaluating ML models.
Train/Test Split
The most basic evaluation technique: split your data into a training set (usually 70-80%) and a test set (20-30%). Train on the training set, evaluate on the test set. The test set acts as a proxy for "unseen real-world data."
┌─────────────────────────────────────────────┐
│                Full Dataset                 │
│ ┌───────────────────────┬─────────────────┐ │
│ │ Training Set          │ Test Set        │ │
│ │ (80%)                 │ (20%)           │ │
│ │                       │                 │ │
│ │ Model learns from     │ Model is        │ │
│ │ this data             │ evaluated on    │ │
│ └───────────────────────┴─────────────────┘ │
└─────────────────────────────────────────────┘
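The split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper name `train_test_split`, its `test_fraction` and `seed` parameters, and the toy `data` list are all assumptions made for this example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and return (train, test) lists.

    Hypothetical helper written for this lesson.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))                     # stand-in for 100 labeled examples
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))
```

Shuffling before splitting matters when the data is stored in some order (for example, sorted by date or by label), because otherwise the test set may not resemble the training set. scikit-learn ships a function of the same name, `sklearn.model_selection.train_test_split`, that is normally used in practice.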
Regression Metrics
- MAE (Mean Absolute Error): Average of absolute differences between predictions and actual values. Easy to interpret.
- MSE (Mean Squared Error): Average of squared differences. Penalizes large errors more heavily.
- RMSE (Root Mean Squared Error): Square root of MSE. Back in the original units of the target variable.
- R² Score: Proportion of variance explained by the model. 1.0 is perfect, 0.0 means the model is no better than predicting the mean, and a negative value means it is worse than predicting the mean.
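The four metrics above can be computed directly from their definitions. The sketch below assumes two equal-length lists of numbers; the function name `regression_metrics` and the sample values are invented for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot   # undefined (division by zero) if all targets are equal
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

y_true = [3.0, 5.0, 7.0, 9.0]    # made-up targets
y_pred = [2.0, 5.0, 8.0, 11.0]   # made-up predictions
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Baseline: always predicting the mean of the targets gives R² = 0.
baseline = regression_metrics(y_true, [6.0] * 4)
print(baseline["r2"])
```

Note how the single error of 2 contributes 4 to the squared sum but only 2 to the absolute sum, which is why MSE and RMSE react more strongly to large errors than MAE does.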
Classification Metrics
Confusion Matrix:
┌─────────────────────────────────────────────┐
│                     Predicted               │
│                Positive   Negative          │
│   ┌──────────┬──────────┬──────────┐        │
│   │ Actual   │    TP    │    FN    │        │
│   │ Positive │ (True +) │ (False -)│        │
│   ├──────────┼──────────┼──────────┤        │
│   │ Actual   │    FP    │    TN    │        │
│   │ Negative │ (False +)│ (True -) │        │
│   └──────────┴──────────┴──────────┘        │
│                                             │
│   Accuracy  = (TP + TN) / Total             │
│   Precision = TP / (TP + FP)                │
│   Recall    = TP / (TP + FN)                │
│   F1 Score  = 2 × (Precision × Recall)      │
│               / (Precision + Recall)        │
└─────────────────────────────────────────────┘
Accuracy is the percentage of correct predictions. But it can be misleading for imbalanced datasets: if 99% of emails are not spam, a model that predicts "not spam" for everything gets 99% accuracy but catches zero spam.
Precision measures how many of the positive predictions were correct. Recall measures how many of the actual positives were found. F1 Score is the harmonic mean of the two, so it is high only when both precision and recall are high.
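To make the formulas concrete, here is a small sketch that counts the four confusion-matrix cells and derives the metrics from them. The function names, the label encoding (1 = spam, 0 = not spam), and the choice to report 0.0 when a denominator is zero are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced case from the text: 1 spam email (label 1) among 100.
y_true = [1] + [0] * 99
always_not_spam = [0] * 100
lazy = classification_metrics(*confusion_counts(y_true, always_not_spam))
print(lazy)   # accuracy is 0.99, yet recall is 0.0: no spam is caught

# A small balanced-ish case with one false negative and one false positive.
counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(counts, classification_metrics(*counts))
```

The first example reproduces the spam scenario: accuracy looks excellent while recall and F1 expose that the model never finds a positive.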
Cross-Validation
Cross-validation gives a more robust evaluation than a single train/test split. K-Fold cross-validation splits data into K folds, trains on K-1 folds, and tests on the remaining fold. It repeats this K times, rotating the test fold. The average performance across all folds is your estimate.
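The rotation of folds can be sketched as follows. Everything named here is hypothetical: `k_fold_indices`, `cross_validate`, and the toy "model" that simply predicts the mean of its training targets are illustrative stand-ins, and the folds are contiguous slices with no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores), scores

# Toy "model": always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

# Score the toy model with mean absolute error on the held-out fold.
def mae_score(model, test_x, test_y):
    return sum(abs(y - model) for y in test_y) / len(test_y)

xs = list(range(10))
ys = [2.0 * x for x in xs]
mean_mae, fold_maes = cross_validate(xs, ys, k=5, fit=fit_mean, score=mae_score)
print(mean_mae, fold_maes)
```

Because this toy data is sorted, the per-fold scores vary a lot, which illustrates why a single split can be misleading; in practice the data is usually shuffled before folding. scikit-learn provides `KFold` and `cross_val_score` in `sklearn.model_selection` for real use.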