Decision Trees & Random Forest

Tree-based models for classification and regression.

Decision trees are intuitive: they ask a series of yes/no questions to classify data, like a flowchart. Random forests take this further by combining many trees, often hundreds, for better accuracy. Both are among the most widely used algorithms in practice.

How Decision Trees Work

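A decision tree splits data based on feature values. At each node, it picks the feature and threshold that best separate the classes, as measured by an impurity score such as Gini impurity (scikit-learn's default for classifiers). The result is a tree structure that you can visualize and interpret.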
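To make "best separates the classes" concrete, here is a minimal sketch of how a single split could be scored with Gini impurity. The helper names (gini, split_impurity, best_threshold) and the toy data are illustrative assumptions, not scikit-learn's internal implementation, which is more elaborate and much faster.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels: 1 - sum(p_k^2)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of the two groups produced by feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def best_threshold(feature, labels):
    """Score midpoints between sorted unique values; return the lowest-impurity one."""
    values = np.unique(feature)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [split_impurity(feature, labels, t) for t in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Toy data (hypothetical): one feature, two classes separable between 2 and 3
feature = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])

threshold, impurity = best_threshold(feature, labels)
print(f"Best threshold: {threshold}, weighted Gini: {impurity}")
```

On this toy data the midpoint 2.5 separates the classes perfectly, so its weighted impurity is 0. A real tree repeats this search over every feature at every node.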


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Synthetic data: 200 samples, 4 features; the label depends on the first three
np.random.seed(42)
X = np.random.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Shallow tree: at most 3 levels of splits
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

preds = tree.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

The max_depth parameter limits tree growth. With scikit-learn's defaults, an unrestricted tree keeps splitting until every leaf is pure, which usually means it memorizes noise in the training data. Shallower trees tend to generalize better.
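One way to see the effect of max_depth is to compare training and test accuracy with and without the limit. The sketch below reuses the same kind of synthetic data as the example above; the fixed random_state values are assumptions added for repeatability.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated the same way as in the example above
rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

results = {}
for depth in (None, 3):  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc, tree.get_depth())
    print(f"max_depth={depth}: train={train_acc:.4f}, "
          f"test={test_acc:.4f}, actual depth={tree.get_depth()}")
```

The unrestricted tree fits the training set perfectly, so its training accuracy says little about how it will do on new data. The gap between training and test accuracy is the number to watch; how much the shallow tree helps depends on the data.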

Random Forest

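A single decision tree is prone to overfitting. A random forest trains many trees, each on a random sample of the training data (drawn with replacement) and with a random subset of features considered at each split, then combines their predictions by voting or averaging. Because the individual trees make different errors, combining them usually reduces overfitting substantially.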
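The idea can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions: it uses 25 trees and a hard majority vote, whereas scikit-learn's RandomForestClassifier averages the trees' predicted probabilities. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

n_trees = 25  # odd number (illustrative choice) so a 0/1 vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.randint(0, len(X_train), size=len(X_train))
    # max_features="sqrt": each split considers a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean vote is the fraction voting 1
votes = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_preds = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean(trees[0].predict(X_test) == y_test)
forest_acc = np.mean(forest_preds == y_test)
print(f"First tree alone: {single_acc:.4f}")
print(f"Vote of {n_trees} trees: {forest_acc:.4f}")
```

In practice you would use RandomForestClassifier, which does the sampling, training, and aggregation for you and can train trees in parallel.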


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Same synthetic data as in the decision tree example
np.random.seed(42)
X = np.random.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# 100 trees, each limited to a depth of 5
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

preds = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, preds):.4f}")

More trees generally means better performance, but with diminishing returns, and training time grows with every tree you add. 100 to 200 trees is usually enough. The max_depth setting still matters: a forest of very deep trees can still overfit, especially on noisy data.
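A quick way to check for diminishing returns on your own data is to train forests of increasing size and compare test accuracy. The list of sizes below is an arbitrary illustrative choice, and on a dataset this small the scores will be noisy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for n in (1, 10, 50, 100, 200):  # forest sizes to compare
    rf = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=42)
    rf.fit(X_train, y_train)
    scores[n] = rf.score(X_test, y_test)  # mean accuracy on the test set
    print(f"n_estimators={n:>3}: test accuracy={scores[n]:.4f}")
```

If accuracy stops improving past a certain size, the extra trees only add training and prediction time.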

Feature Importance

Random forests tell you which features matter most by measuring how much each feature's splits reduce impurity across the trees. This is useful for understanding your data and selecting relevant variables.


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

np.random.seed(42)
feature_names = ["temperature", "humidity", "wind", "pressure"]
X = np.random.randn(200, 4)
# Only the first two features (temperature, humidity) determine the label
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# One importance score per feature, in the same order as the columns of X
importances = rf.feature_importances_
for name, imp in zip(feature_names, importances):
    print(f"{name}: {imp:.4f}")

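Feature importance scores sum to 1. Higher values mean the feature contributes more to the model's splits, and therefore to its predictions. Low-importance features are candidates for removal if you want a simpler model, but confirm that accuracy holds up after dropping them.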
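As a rough sketch of that workflow, the code below keeps only the features whose importance is at least the average and retrains on the reduced set. The cutoff (the mean importance) is an assumption chosen for illustration, not a recommended rule; pick a threshold that suits your data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
feature_names = np.array(["temperature", "humidity", "wind", "pressure"])
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on all features and read the importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
full_acc = rf.score(X_test, y_test)

# Illustrative cutoff: keep features at or above the mean importance
keep = importances >= importances.mean()
print("Kept features:", list(feature_names[keep]))

# Retrain using only the kept columns and compare test accuracy
rf_small = RandomForestClassifier(n_estimators=100, random_state=42)
rf_small.fit(X_train[:, keep], y_train)
reduced_acc = rf_small.score(X_test[:, keep], y_test)

print(f"All features:  {full_acc:.4f}")
print(f"Kept features: {reduced_acc:.4f}")
```

If the reduced model scores about the same as the full one, the dropped features were not adding much.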

Key Takeaways

Decision trees are interpretable but prone to overfitting
Random forests combine many trees for better generalization
Feature importance reveals which variables drive predictions
Limit max_depth to prevent overfitting in both approaches
