Random Forest: Ensembling, Bagging, and Feature Importance
Definition
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions (classification) or mean prediction (regression). The algorithm combines two powerful techniques: Bagging (Bootstrap Aggregating), which trains each tree on a random subset of data with replacement, and Random Subspace Method, which considers only a random subset of features at each split. This dual randomness decorrelates trees, reducing variance while maintaining low bias. Leo Breiman's 2001 introduction established Random Forest as a robust, off-the-shelf classifier requiring minimal hyperparameter tuning. The method naturally provides feature importance measures through mean decrease in impurity or permutation importance, and out-of-bag (OOB) error estimation eliminates the need for cross-validation. With built-in handling of missing values, robustness to outliers, and resistance to overfitting, Random Forest has become a benchmark algorithm for tabular data.
Intuition
Imagine asking 100 experts for their opinion on a complex question, where each expert only sees partial information. Random Forest creates a 'wisdom of crowds' effect: individual trees might make mistakes, but the collective vote smooths out errors. Just as diverse viewpoints lead to better decisions, the randomness in each tree ensures they capture different aspects of the data.
Mathematical Formula
Step-by-Step Explanation:
- Each tree is trained on a bootstrap sample - random selection with replacement
- m features considered at each split introduces diversity among trees
- Final prediction aggregates individual tree predictions (vote or average)
- OOB error uses predictions from trees that didn't see each training sample
- Feature importance sums impurity reduction across all splits using that feature
- The square root heuristic for m balances tree strength and diversity
Real-World Use Cases
Credit scoring and fraud detection: combines hundreds of weak signals into robust predictions while providing feature importance for regulatory explanation.
Disease risk prediction from EHR data: handles missing values common in medical records and identifies key biomarkers.
Product recommendation and churn prediction: naturally handles mixed categorical and numerical customer features.
Land cover classification from satellite imagery: ensembles handle noise and sensor variations across geographic regions.
Implementation
Manual Implementation (No Libraries)
import numpy as np
from collections import Counter
from decision_tree import DecisionTree # Assumes previous DecisionTree class
class RandomForest:
"""
Manual implementation of Random Forest classifier.
Demonstrates bagging and feature subsampling principles.
"""
def __init__(self, n_estimators=100, max_depth=10, max_features='sqrt',
min_samples_split=2, random_state=None):
self.n_estimators = n_estimators
self.max_depth = max_depth
self.max_features = max_features
self.min_samples_split = min_samples_split
self.random_state = random_state
self.trees = []
self.feature_indices = []
def _bootstrap_sample(self, X, y):
"""Create a bootstrap sample with replacement."""
n_samples = X.shape[0]
indices = np.random.choice(n_samples, size=n_samples, replace=True)
return X[indices], y[indices]
def _get_max_features(self, n_features):
"""Determine number of features to consider at each split."""
if self.max_features == 'sqrt':
return int(np.sqrt(n_features))
elif self.max_features == 'log2':
return int(np.log2(n_features))
elif isinstance(self.max_features, int):
return self.max_features
elif isinstance(self.max_features, float):
return int(self.max_features * n_features)
else:
return n_features
def fit(self, X, y):
"""Train the random forest."""
np.random.seed(self.random_state)
n_samples, n_features = X.shape
max_features = self._get_max_features(n_features)
self.trees = []
self.feature_indices = []
for i in range(self.n_estimators):
# Bootstrap sampling
X_boot, y_boot = self._bootstrap_sample(X, y)
# Random feature selection for this tree
features = np.random.choice(
n_features, size=max_features, replace=False
)
self.feature_indices.append(features)
# Train tree on bootstrap sample with selected features
X_subset = X_boot[:, features]
tree = DecisionTree(
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
criterion='gini'
)
tree.fit(X_subset, y_boot)
self.trees.append(tree)
return self
def predict(self, X):
"""Predict using majority voting."""
predictions = np.zeros((X.shape[0], self.n_estimators))
for i, (tree, features) in enumerate(zip(self.trees, self.feature_indices)):
predictions[:, i] = tree.predict(X[:, features])
# Majority vote
return np.array([
Counter(predictions[i]).most_common(1)[0][0]
for i in range(X.shape[0])
])
def predict_proba(self, X):
"""Predict class probabilities."""
predictions = np.zeros((X.shape[0], self.n_estimators))
for i, (tree, features) in enumerate(zip(self.trees, self.feature_indices)):
predictions[:, i] = tree.predict(X[:, features])
# Calculate probabilities
classes = np.unique(predictions)
probas = np.zeros((X.shape[0], len(classes)))
for i, cls in enumerate(classes):
probas[:, i] = np.mean(predictions == cls, axis=1)
return probas
# Demonstration
if __name__ == '__main__':
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Generate data
X, y = make_classification(
n_samples=500, n_features=10, n_informative=5,
n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Custom implementation
rf_custom = RandomForest(
n_estimators=50, max_depth=5, max_features='sqrt', random_state=42
)
rf_custom.fit(X_train, y_train)
# Compare with sklearn
rf_sklearn = RandomForestClassifier(
n_estimators=50, max_depth=5, max_features='sqrt', random_state=42
)
rf_sklearn.fit(X_train, y_train)
print(f'Custom RF Accuracy: {np.mean(rf_custom.predict(X_test) == y_test):.3f}')
print(f'Sklearn RF Accuracy: {rf_sklearn.score(X_test, y_test):.3f}')
Using Libraries (scikit-learn, numpy, pandas, matplotlib)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.metrics import (
accuracy_score, classification_report, confusion_matrix,
mean_squared_error, r2_score
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# CLASSIFICATION: Breast Cancer Dataset
print('=== RANDOM FOREST CLASSIFICATION ===')
data = load_breast_cancer()
X_cls, y_cls = data.data, data.target
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_cls, y_cls, test_size=0.2, random_state=42, stratify=y_cls
)
# Train Random Forest
rf_clf = RandomForestClassifier(
n_estimators=200, # Number of trees
max_depth=10, # Maximum tree depth
min_samples_split=5, # Min samples to split a node
min_samples_leaf=2, # Min samples in leaf node
max_features='sqrt', # Features per split (sqrt(n_features))
bootstrap=True, # Use bootstrap sampling
oob_score=True, # Calculate out-of-bag error
n_jobs=-1, # Use all CPU cores
random_state=42
)
rf_clf.fit(X_train_c, y_train_c)
# Predictions and evaluation
y_pred_c = rf_clf.predict(X_test_c)
y_proba_c = rf_clf.predict_proba(X_test_c)[:, 1]
print(f'Accuracy: {accuracy_score(y_test_c, y_pred_c):.3f}')
print(f'OOB Score: {rf_clf.oob_score_:.3f}')
print('
Classification Report:')
print(classification_report(y_test_c, y_pred_c, target_names=data.target_names))
# Feature importance
importance_df = pd.DataFrame({
'feature': data.feature_names,
'importance': rf_clf.feature_importances_
}).sort_values('importance', ascending=False)
print('
Top 10 Most Important Features:')
print(importance_df.head(10))
# REGRESSION: California Housing
print('
=== RANDOM FOREST REGRESSION ===')
housing = fetch_california_housing()
X_reg, y_reg = housing.data, housing.target
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
X_reg, y_reg, test_size=0.2, random_state=42
)
rf_reg = RandomForestRegressor(
n_estimators=200,
max_depth=15,
min_samples_split=5,
max_features='auto', # n_features/3 for regression
oob_score=True,
n_jobs=-1,
random_state=42
)
rf_reg.fit(X_train_r, y_train_r)
y_pred_r = rf_reg.predict(X_test_r)
print(f'R² Score: {r2_score(y_test_r, y_pred_r):.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test_r, y_pred_r)):.3f}')
print(f'OOB Score: {rf_reg.oob_score_:.3f}')
# HYPERPARAMETER TUNING
print('
=== HYPERPARAMETER TUNING ===')
param_dist = {
'n_estimators': [100, 200, 500],
'max_depth': [5, 10, 15, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None]
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions=param_dist,
n_iter=20,
cv=5,
scoring='roc_auc',
n_jobs=-1,
random_state=42
)
random_search.fit(X_train_c, y_train_c)
print(f'Best parameters: {random_search.best_params_}')
print(f'Best CV ROC-AUC: {random_search.best_score_:.3f}')
# Permutation importance (more reliable than MDI)
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(
rf_clf, X_test_c, y_test_c, n_repeats=10, random_state=42
)
perm_df = pd.DataFrame({
'feature': data.feature_names,
'importance_mean': perm_importance.importances_mean,
'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)
print('
Permutation Importance (Top 5):')
print(perm_df.head())
When to Use
✅ Appropriate Use Cases:
- Tabular data with mixed feature types (numerical and categorical)
- When you need feature importance for interpretability
- Default choice for classification/regression before trying complex methods
- Missing values in data (sklearn RF handles missing via surrogate splits)
- When you need out-of-bag error estimation (no separate validation needed)
- Robustness to outliers and noise is important
- Non-linear relationships without feature engineering
❌ Avoid When:
- Very large datasets where training time is critical (use XGBoost/LightGBM)
- Real-time predictions with latency constraints (linear models are faster)
- When you need exact probability calibration (RF probabilities are biased)
- High-dimensional sparse data (text, genomics): try linear models first
- When interpretability of individual predictions is required
- Time series with temporal dependencies (use specialized models)
Common Pitfalls
- Not tuning max_depth: Default None leads to fully grown, overfitting trees
- Ignoring class imbalance: Use class_weight='balanced' for skewed data
- Over-relying on feature importance: MDI can be biased toward high-cardinality features
- Not using OOB score: Wastes data on validation when OOB provides unbiased estimate
- Setting n_estimators too low: Need enough trees for stable predictions (100+)
- Forgetting random_state: Results not reproducible
- Not using parallel processing: n_jobs=-1 significantly speeds training