Decision Trees and Random Forest
Decision trees are intuitive — they ask a series of yes/no questions to classify data, like a flowchart. Random forests take this further by combining hundreds of trees for better accuracy. They're one of the most popular algorithms in practice.
How Decision Trees Work
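A decision tree splits data based on feature values. At each node, it picks the feature and threshold that best separate the classes, as measured by an impurity criterion such as Gini impurity (the default in scikit-learn's DecisionTreeClassifier). The result is a tree structure that you can visualize and interpret.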
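To make "the split that best separates the classes" concrete, here is a minimal sketch of a split search on a single feature using Gini impurity. The helper names `gini` and `best_threshold` and the toy data are made up for illustration; scikit-learn's own implementation is more elaborate and this is not a copy of it.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Try midpoints between sorted unique values of one feature.

    Returns the threshold with the lowest size-weighted impurity
    of the two child nodes, together with that impurity.
    """
    values = np.unique(feature)
    candidates = (values[:-1] + values[1:]) / 2
    best_t, best_score = None, np.inf
    for t in candidates:
        left = labels[feature <= t]
        right = labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical toy data: one feature, classes separable between 2.0 and 3.0
feature = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])

threshold, impurity = best_threshold(feature, labels)
print(f"Best threshold: {threshold}, weighted Gini: {impurity:.4f}")
```

A real tree repeats this search over every feature at every node and then recurses into the two child nodes.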
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Synthetic data: 200 samples, 4 features; the label depends on the first three
np.random.seed(42)
X = np.random.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Hold out 25% of the samples for testing (the split uses the global seed set above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fit a shallow tree and score it on the held-out samples
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
preds = tree.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
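Since interpretability is the main selling point, it may help to see the learned rules as text. The sketch below refits the same kind of tree and prints it with scikit-learn's `export_text`. The feature names `x0` to `x3` are placeholders chosen for this example, and the tree is fitted on all 200 samples only to keep the snippet short.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same style of synthetic data as in the example above
rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Placeholder feature names, assumed for illustration only
feature_names = ["x0", "x1", "x2", "x3"]

# export_text renders the fitted tree as indented if/else rules
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented line is one yes/no question on a feature threshold, and each leaf line reports the predicted class.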
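The max_depth parameter limits tree growth. Without it, the tree keeps splitting until every leaf is pure (or too small to split further), which usually means it memorizes noise in the training data. Shallow trees tend to generalize better, so treat max_depth as a setting to tune rather than leaving it unlimited.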
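One way to see the effect of max_depth is to compare training and test accuracy at several depths. This is a rough sketch on the same kind of synthetic data; the list of depths is an arbitrary choice, and the exact numbers will differ on other datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

results = {}
for depth in [1, 2, 3, 5, None]:  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(
        f"max_depth={depth}: train={train_acc:.3f}, "
        f"test={test_acc:.3f}, actual depth={tree.get_depth()}"
    )
```

A large gap between training and test accuracy at the deeper settings is the usual sign of overfitting.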
Random Forest
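A single decision tree is prone to overfitting. A random forest trains many trees, each on a bootstrap sample of the rows and with a random subset of features considered at each split, then combines their predictions (by voting or by averaging class probabilities). Because the individual trees make different errors, the combined prediction has lower variance and overfits far less than any single tree.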
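The idea can be sketched by hand with plain decision trees. This is a simplified illustration of bootstrap sampling plus random feature subsets, not a reproduction of RandomForestClassifier's internals; the tree count of 50 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

n_trees = 50  # arbitrary ensemble size for this sketch
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.randint(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Average the per-tree class probabilities, then pick the most likely class.
# The classes are 0 and 1, so the argmax column index equals the class label.
avg_proba = np.mean([t.predict_proba(X_test) for t in trees], axis=0)
ensemble_preds = avg_proba.argmax(axis=1)
ensemble_acc = np.mean(ensemble_preds == y_test)

single_acc = trees[0].score(X_test, y_test)
print(f"First tree alone: {single_acc:.4f}")
print(f"Averaged ensemble: {ensemble_acc:.4f}")
```

In practice RandomForestClassifier does all of this for you, in parallel if you ask for it.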
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Same synthetic data as in the decision tree example
np.random.seed(42)
X = np.random.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# 100 trees, each limited to depth 5
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, preds):.4f}")
More trees generally mean better performance, but with diminishing returns; 100 to 200 trees is usually enough, and each extra tree adds training and prediction time. The max_depth setting still matters: very deep trees can still fit noise, especially on small or noisy datasets.
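To check the diminishing returns on your own data, you could fit forests of increasing size and compare held-out accuracy. The sketch below does this on the same synthetic data; the tree counts are arbitrary, and with only 50 test samples the scores will be noisy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for n in [1, 10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=42)
    rf.fit(X_train, y_train)
    scores[n] = rf.score(X_test, y_test)  # accuracy on held-out samples
    print(f"n_estimators={n:>3}: test accuracy={scores[n]:.3f}")
```

If accuracy stops changing between two sizes, the smaller forest is usually the better choice because it is cheaper to train and to run.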
Feature Importance
Random forests tell you which features matter most. After fitting, the feature_importances_ attribute holds one score per feature, which is useful for understanding your data and selecting relevant variables.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

np.random.seed(42)
feature_names = ["temperature", "humidity", "wind", "pressure"]

# Only the first two features (temperature, humidity) determine the label
X = np.random.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# One importance score per feature, in the same order as the columns of X
importances = rf.feature_importances_
for name, imp in zip(feature_names, importances):
    print(f"{name}: {imp:.4f}")
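Feature importance scores sum to 1. Higher values mean the feature contributes more to the forest's splits. Here temperature and humidity should score highest, since they are the only features used to build the label. Low-importance features are candidates for removal, but confirm on held-out data that accuracy does not drop before removing them.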
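As a rough sketch of that workflow, the snippet below keeps only the features whose importance is at least the mean importance, retrains, and compares test accuracy. The mean-importance cutoff is an assumption made for this example, not a recommended rule.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

feature_names = ["temperature", "humidity", "wind", "pressure"]
rng = np.random.RandomState(42)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
print(f"Sum of importances: {importances.sum():.4f}")

# Boolean mask of features to keep (cutoff chosen arbitrarily for this sketch)
keep = importances >= importances.mean()
kept_names = [name for name, k in zip(feature_names, keep) if k]
print(f"Kept features: {kept_names}")

# Retrain on the reduced feature set and compare held-out accuracy
rf_small = RandomForestClassifier(n_estimators=100, random_state=42)
rf_small.fit(X_train[:, keep], y_train)
full_acc = rf.score(X_test, y_test)
small_acc = rf_small.score(X_test[:, keep], y_test)
print(f"All features: {full_acc:.4f}")
print(f"Kept features only: {small_acc:.4f}")
```

If the reduced model scores about the same as the full one, the dropped features were probably not contributing much.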
Key Takeaways
- Decision trees are interpretable but prone to overfitting
- Random forests combine many trees for better generalization
- Feature importance reveals which variables drive predictions
- Limit max_depth to prevent overfitting in both approaches