Random Forest: Ensembling, Bagging, and Feature Importance

Definition

Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions (classification) or mean prediction (regression). The algorithm combines two powerful techniques: Bagging (Bootstrap Aggregating), which trains each tree on a random subset of data with replacement, and Random Subspace Method, which considers only a random subset of features at each split. This dual randomness decorrelates trees, reducing variance while maintaining low bias. Leo Breiman's 2001 introduction established Random Forest as a robust, off-the-shelf classifier requiring minimal hyperparameter tuning. The method naturally provides feature importance measures through mean decrease in impurity or permutation importance, and out-of-bag (OOB) error estimation eliminates the need for cross-validation. With built-in handling of missing values, robustness to outliers, and resistance to overfitting, Random Forest has become a benchmark algorithm for tabular data.

Intuition

💡

Imagine asking 100 experts for their opinion on a complex question, where each expert only sees partial information. Random Forest creates a 'wisdom of crowds' effect: individual trees might make mistakes, but the collective vote smooths out errors. Just as diverse viewpoints lead to better decisions, the randomness in each tree ensures they capture different aspects of the data.

Mathematical Formula

Bootstrap Sample (with replacement):

T^{(b)} \sim \text{Bootstrap}(T), \quad b = 1, ..., B

Random Feature Subset at each split:

m = \sqrt{p} \text{ (classification) or } m = p/3 \text{ (regression)}

Prediction (Classification - Majority Vote):

\hat{y} = \text{mode}\{h_b(x)\}_{b=1}^{B}

Prediction (Regression - Average):

\hat{y} = \frac{1}{B}\sum_{b=1}^{B} h_b(x)

Out-of-Bag Error:

\hat{err}_{OOB} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \bar{h}_{OOB}(x_i))

Gini Importance (Mean Decrease Impurity):

Importance(x_j) = \frac{1}{B}\sum_{b=1}^{B} \sum_{t \in T_b^{(j)}} p(t) \Delta i(t)

Step-by-Step Explanation:

Each tree is trained on a bootstrap sample - random selection with replacement
m features considered at each split introduces diversity among trees
Final prediction aggregates individual tree predictions (vote or average)
OOB error uses predictions from trees that didn't see each training sample
Feature importance sums impurity reduction across all splits using that feature
The square root heuristic for m balances tree strength and diversity

Real-World Use Cases

Finance

Credit scoring and fraud detection: combines hundreds of weak signals into robust predictions while providing feature importance for regulatory explanation.

Healthcare

Disease risk prediction from EHR data: handles missing values common in medical records and identifies key biomarkers.

E-commerce

Product recommendation and churn prediction: naturally handles mixed categorical and numerical customer features.

Remote Sensing

Land cover classification from satellite imagery: ensembles handle noise and sensor variations across geographic regions.

Implementation

Manual Implementation (No Libraries)

This implementation shows the core concepts: bootstrap sampling creates diverse training sets, feature subsampling decorrelates trees, and majority voting combines predictions. Each tree sees different data and features, reducing variance through diversity.

import numpy as np
from collections import Counter
from decision_tree import DecisionTree  # Assumes previous DecisionTree class

class RandomForest:
    """
    Manual implementation of Random Forest classifier.
    Demonstrates bagging and feature subsampling principles.
    """
    
    def __init__(self, n_estimators=100, max_depth=10, max_features='sqrt', 
                 min_samples_split=2, random_state=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.min_samples_split = min_samples_split
        self.random_state = random_state
        self.trees = []
        self.feature_indices = []
    
    def _bootstrap_sample(self, X, y):
        """Create a bootstrap sample with replacement."""
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, size=n_samples, replace=True)
        return X[indices], y[indices]
    
    def _get_max_features(self, n_features):
        """Determine number of features to consider at each split."""
        if self.max_features == 'sqrt':
            return int(np.sqrt(n_features))
        elif self.max_features == 'log2':
            return int(np.log2(n_features))
        elif isinstance(self.max_features, int):
            return self.max_features
        elif isinstance(self.max_features, float):
            return int(self.max_features * n_features)
        else:
            return n_features
    
    def fit(self, X, y):
        """Train the random forest."""
        np.random.seed(self.random_state)
        n_samples, n_features = X.shape
        max_features = self._get_max_features(n_features)
        
        self.trees = []
        self.feature_indices = []
        
        for i in range(self.n_estimators):
            # Bootstrap sampling
            X_boot, y_boot = self._bootstrap_sample(X, y)
            
            # Random feature selection for this tree
            features = np.random.choice(
                n_features, size=max_features, replace=False
            )
            self.feature_indices.append(features)
            
            # Train tree on bootstrap sample with selected features
            X_subset = X_boot[:, features]
            tree = DecisionTree(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                criterion='gini'
            )
            tree.fit(X_subset, y_boot)
            self.trees.append(tree)
        
        return self
    
    def predict(self, X):
        """Predict using majority voting."""
        predictions = np.zeros((X.shape[0], self.n_estimators))
        
        for i, (tree, features) in enumerate(zip(self.trees, self.feature_indices)):
            predictions[:, i] = tree.predict(X[:, features])
        
        # Majority vote
        return np.array([
            Counter(predictions[i]).most_common(1)[0][0] 
            for i in range(X.shape[0])
        ])
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        predictions = np.zeros((X.shape[0], self.n_estimators))
        
        for i, (tree, features) in enumerate(zip(self.trees, self.feature_indices)):
            predictions[:, i] = tree.predict(X[:, features])
        
        # Calculate probabilities
        classes = np.unique(predictions)
        probas = np.zeros((X.shape[0], len(classes)))
        
        for i, cls in enumerate(classes):
            probas[:, i] = np.mean(predictions == cls, axis=1)
        
        return probas

# Demonstration
if __name__ == '__main__':
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    
    # Generate data
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5,
        n_redundant=2, random_state=42
    )
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Custom implementation
    rf_custom = RandomForest(
        n_estimators=50, max_depth=5, max_features='sqrt', random_state=42
    )
    rf_custom.fit(X_train, y_train)
    
    # Compare with sklearn
    rf_sklearn = RandomForestClassifier(
        n_estimators=50, max_depth=5, max_features='sqrt', random_state=42
    )
    rf_sklearn.fit(X_train, y_train)
    
    print(f'Custom RF Accuracy: {np.mean(rf_custom.predict(X_test) == y_test):.3f}')
    print(f'Sklearn RF Accuracy: {rf_sklearn.score(X_test, y_test):.3f}')

Using Libraries (scikit-learn, numpy, pandas, matplotlib)

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# CLASSIFICATION: Breast Cancer Dataset
print('=== RANDOM FOREST CLASSIFICATION ===')
data = load_breast_cancer()
X_cls, y_cls = data.data, data.target

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42, stratify=y_cls
)

# Train Random Forest
rf_clf = RandomForestClassifier(
    n_estimators=200,           # Number of trees
    max_depth=10,               # Maximum tree depth
    min_samples_split=5,        # Min samples to split a node
    min_samples_leaf=2,         # Min samples in leaf node
    max_features='sqrt',        # Features per split (sqrt(n_features))
    bootstrap=True,             # Use bootstrap sampling
    oob_score=True,             # Calculate out-of-bag error
    n_jobs=-1,                  # Use all CPU cores
    random_state=42
)
rf_clf.fit(X_train_c, y_train_c)

# Predictions and evaluation
y_pred_c = rf_clf.predict(X_test_c)
y_proba_c = rf_clf.predict_proba(X_test_c)[:, 1]

print(f'Accuracy: {accuracy_score(y_test_c, y_pred_c):.3f}')
print(f'OOB Score: {rf_clf.oob_score_:.3f}')
print('
Classification Report:')
print(classification_report(y_test_c, y_pred_c, target_names=data.target_names))

# Feature importance
importance_df = pd.DataFrame({
    'feature': data.feature_names,
    'importance': rf_clf.feature_importances_
}).sort_values('importance', ascending=False)

print('
Top 10 Most Important Features:')
print(importance_df.head(10))

# REGRESSION: California Housing
print('
=== RANDOM FOREST REGRESSION ===')
housing = fetch_california_housing()
X_reg, y_reg = housing.data, housing.target

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    max_features='auto',       # n_features/3 for regression
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
rf_reg.fit(X_train_r, y_train_r)

y_pred_r = rf_reg.predict(X_test_r)
print(f'R² Score: {r2_score(y_test_r, y_pred_r):.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test_r, y_pred_r)):.3f}')
print(f'OOB Score: {rf_reg.oob_score_:.3f}')

# HYPERPARAMETER TUNING
print('
=== HYPERPARAMETER TUNING ===')
param_dist = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train_c, y_train_c)

print(f'Best parameters: {random_search.best_params_}')
print(f'Best CV ROC-AUC: {random_search.best_score_:.3f}')

# Permutation importance (more reliable than MDI)
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    rf_clf, X_test_c, y_test_c, n_repeats=10, random_state=42
)

perm_df = pd.DataFrame({
    'feature': data.feature_names,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)

print('
Permutation Importance (Top 5):')
print(perm_df.head())

When to Use

✅ Appropriate Use Cases:

Tabular data with mixed feature types (numerical and categorical)
When you need feature importance for interpretability
Default choice for classification/regression before trying complex methods
Missing values in data (sklearn RF handles missing via surrogate splits)
When you need out-of-bag error estimation (no separate validation needed)
Robustness to outliers and noise is important
Non-linear relationships without feature engineering

❌ Avoid When:

Very large datasets where training time is critical (use XGBoost/LightGBM)
Real-time predictions with latency constraints (linear models are faster)
When you need exact probability calibration (RF probabilities are biased)
High-dimensional sparse data (text, genomics): try linear models first
When interpretability of individual predictions is required
Time series with temporal dependencies (use specialized models)

Common Pitfalls

Not tuning max_depth: Default None leads to fully grown, overfitting trees
Ignoring class imbalance: Use class_weight='balanced' for skewed data
Over-relying on feature importance: MDI can be biased toward high-cardinality features
Not using OOB score: Wastes data on validation when OOB provides unbiased estimate
Setting n_estimators too low: Need enough trees for stable predictions (100+)
Forgetting random_state: Results not reproducible
Not using parallel processing: n_jobs=-1 significantly speeds training