Imbalanced Data: SMOTE, ADASYN, and Class Weights

Advanced Preprocessing


Prerequisites:

Feature Engineering: Polynomials, Interactions, Binning, and Domain Features
Feature Scaling: Standardization, Normalization, and Robust Scaling

Definition

Class imbalance occurs when the distribution of target classes in a classification dataset is heavily skewed, with some classes having many more samples than others. It is common in real-world applications such as fraud detection (fraud < 1%), disease diagnosis (disease < 5%), and churn prediction (churners < 10%). Most standard learning algorithms minimize an average error over all samples, so the majority class dominates training and the model can score high accuracy by predicting the majority class almost everywhere, ignoring the minority class that often matters most. Techniques for handling imbalanced data address this by modifying either the training data (resampling) or the learning algorithm (cost-sensitive learning). Oversampling techniques like SMOTE (Synthetic Minority Over-sampling Technique) create synthetic minority samples by interpolating between existing minority instances. ADASYN (Adaptive Synthetic Sampling) generates more synthetic samples near minority instances that are harder to learn. Undersampling removes majority class samples. Class weights penalize misclassification of minority classes more heavily. The right choice depends on data size, imbalance severity, and the cost of different types of errors.

Intuition


Imagine you're training a security guard to spot shoplifters. If only 1 in 1000 customers steals, the guard could achieve 99.9% accuracy by never accusing anyone, but they'd miss every actual theft. A standard ML model trained on imbalanced data tends to behave the same way. SMOTE is like showing the guard synthetic examples of shoplifting behavior created by blending features of known thieves: 'someone who walks like thief A but dresses like thief B.' ADASYN is more targeted: it creates more synthetic examples in the 'hard to catch' areas where thieves look almost like normal shoppers. Class weights are like telling the guard 'mistakenly accusing an innocent person costs $1, but missing a thief costs $1000', which makes them more vigilant. The key insight is that accuracy is misleading under imbalance; you need metrics that track the errors the business actually pays for, such as precision, recall, and F1-score for the minority class.
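The guard analogy can be checked with a few lines of arithmetic. The sketch below builds a hypothetical store with 1 thief per 1000 customers (the array names are illustrative, not from any library) and scores a "guard" who never accuses anyone.

```python
import numpy as np

# Hypothetical store: 1000 customers, exactly one shoplifter (label 1).
n_customers = 1000
y_true = np.zeros(n_customers, dtype=int)
y_true[0] = 1

# The "lazy guard" predicts the majority class for everyone.
y_pred = np.zeros(n_customers, dtype=int)

accuracy = (y_pred == y_true).mean()
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
actual_positives = int((y_true == 1).sum())
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.3f}")  # 0.999
print(f"Recall:   {recall:.3f}")    # 0.000
```

Accuracy looks excellent while recall on the minority class is zero, which is why the rest of this section reports minority-class metrics.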

Mathematical Formula

\text{Class Weight:} \quad w_j = \frac{n_{samples}}{n_{classes} \times n_j}

\text{SMOTE Sample:} \quad s_{new} = s_i + \lambda \times (s_{nn} - s_i)

\text{where } \lambda \sim U(0, 1), \quad s_{nn} = \text{one of the } k \text{ nearest minority neighbors of } s_i \text{, chosen at random}

\text{ADASYN Ratio:} \quad r_i = \frac{\Delta_i}{K}

\text{where } \Delta_i = \text{majority neighbors of } s_i, \quad K = \text{total neighbors}

\text{Balanced Accuracy (binary):} \quad BA = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)

\text{F1-Score:} \quad F_1 = 2 \times \frac{precision \times recall}{precision + recall}

Step-by-Step Explanation:

Class Weight: Inverse frequency weighting—smaller classes get higher weights to penalize their misclassification more
SMOTE: Create a synthetic sample by interpolating between a minority sample and one of its k nearest minority neighbors, chosen at random
ADASYN: Generate more synthetic samples for minority instances that have more majority class neighbors (harder cases)
Balanced Accuracy: Average of recall obtained on each class—better metric than accuracy for imbalanced data
F1-Score: Harmonic mean of precision and recall—balances false positives and false negatives
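To make the formulas concrete, the following sketch plugs small numbers into each one. The 950/50 class counts match the toy dataset used later in this section; the two points, the fixed lambda, the neighbor counts, and the confusion-matrix counts are made-up values chosen only for easy arithmetic.

```python
import numpy as np

# Class weights: w_j = n_samples / (n_classes * n_j)
counts = {0: 950, 1: 50}
n_samples = sum(counts.values())
n_classes = len(counts)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print(weights)  # class 0 is about 0.526, class 1 is 10.0

# SMOTE: s_new = s_i + lambda * (s_nn - s_i), with lambda fixed for illustration
s_i = np.array([0.0, 0.0])
s_nn = np.array([1.0, 2.0])
lam = 0.25
s_new = s_i + lam * (s_nn - s_i)
print(s_new)  # [0.25 0.5]

# ADASYN ratio: 3 of the K = 5 nearest neighbors belong to the majority class
delta_i, K = 3, 5
r_i = delta_i / K
print(r_i)  # 0.6

# F1 and balanced accuracy from hypothetical confusion-matrix counts
tp, fp, fn, tn = 30, 10, 20, 940
precision = tp / (tp + fp)            # 0.75
recall = tp / (tp + fn)               # 0.60
specificity = tn / (tn + fp)          # about 0.989
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
print(f"F1={f1:.3f}, balanced accuracy={balanced_accuracy:.3f}")  # F1=0.667, balanced accuracy=0.795
```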

Real-World Use Cases

Healthcare

Disease screening where disease prevalence is <1%. Missing a disease (false negative) is usually more costly than a false alarm. SMOTE can balance the training data, and class weights can emphasize sensitivity. Metrics: recall (sensitivity), precision (PPV), AUC-ROC.

Finance

Fraud detection with a fraud rate of 0.1-1%. Extreme imbalance like this usually calls for resampling (SMOTE or ADASYN) or cost-sensitive learning with a high penalty for misclassifying fraud. Undersampling suits large datasets; ensemble methods suit small ones.

Retail

Churn prediction where 5-10% of customers churn. Moderate imbalance handled with class weights or mild SMOTE. Time-based split important—churn patterns change over time. Metrics: precision-recall curve, lift charts.
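A time-based split can be sketched in a few lines of pandas. The column names (`snapshot_date`, `monthly_spend`, `churned`) and the synthetic data are hypothetical; the point is only that the test set should come strictly after the training period rather than being sampled at random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical churn table: one row per customer snapshot, about 8% churners.
df = pd.DataFrame({
    "snapshot_date": pd.date_range("2023-01-01", periods=n, freq="D"),
    "monthly_spend": rng.normal(50, 10, size=n),
    "churned": (rng.random(n) < 0.08).astype(int),
})

# Sort chronologically, then train on the earliest 80% and test on the latest 20%.
df = df.sort_values("snapshot_date").reset_index(drop=True)
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

print(f"Train: {train['snapshot_date'].min().date()} to {train['snapshot_date'].max().date()}")
print(f"Test:  {test['snapshot_date'].min().date()} to {test['snapshot_date'].max().date()}")
print(f"Churn rate train={train['churned'].mean():.3f}, test={test['churned'].mean():.3f}")
```

Unlike a stratified random split, this does not guarantee equal churn rates in both parts, so it is worth printing them as above.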

Manufacturing

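Defect detection with defect rate <5%. SMOTE for tabular sensor data or for feature vectors extracted from images (interpolating raw pixels rarely yields realistic samples). Anomaly detection as an alternative. Cost matrix: defective product shipped >> good product scrapped.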
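One way to act on an asymmetric cost matrix is to move the decision threshold. Assuming calibrated probabilities and zero cost for correct decisions, expected cost is minimized by flagging an item when its predicted defect probability exceeds cost_fp / (cost_fp + cost_fn). The sketch below uses hypothetical costs (5 for scrapping a good product, 50 for shipping a defective one) and made-up probabilities; the helper names are illustrative.

```python
import numpy as np

def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold that minimizes expected cost, assuming calibrated
    probabilities and zero cost for correct decisions."""
    return cost_fp / (cost_fp + cost_fn)

def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Total misclassification cost at a given decision threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return fp * cost_fp + fn * cost_fn

cost_fp, cost_fn = 5, 50  # hypothetical: scrapped good part vs. shipped defect
threshold = cost_optimal_threshold(cost_fp, cost_fn)

# Made-up labels (1 = defective) and predicted defect probabilities
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.02, 0.05, 0.20, 0.40, 0.30, 0.80])

cost_default = total_cost(y_true, y_prob, 0.5, cost_fp, cost_fn)
cost_tuned = total_cost(y_true, y_prob, threshold, cost_fp, cost_fn)
print(f"Cost-optimal threshold: {threshold:.3f}")  # 0.091
print(f"Cost at 0.5: {cost_default}")              # 50 (one missed defect)
print(f"Cost at tuned threshold: {cost_tuned}")    # 10 (two false alarms)
```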

Tech

Bot detection (bots < 10%), click fraud, spam classification. High-dimensional text features need careful SMOTE (distance in sparse space). ADASYN for evolving attack patterns.

Implementation

Manual Implementation (No Libraries)

import numpy as np
import pandas as pd
from collections import Counter

# Create imbalanced dataset
np.random.seed(42)

# Majority class (95%)
X_majority = np.random.randn(950, 2) + np.array([2, 2])
y_majority = np.zeros(950)

# Minority class (5%)
X_minority = np.random.randn(50, 2) + np.array([0, 0])
y_minority = np.ones(50)

# Combine
X = np.vstack([X_majority, X_minority])
y = np.hstack([y_majority, y_minority])

df = pd.DataFrame(X, columns=['feature1', 'feature2'])
df['target'] = y

print("Dataset Distribution:")
print(f"Class 0 (Majority): {(y==0).sum()} ({(y==0).mean()*100:.1f}%)")
print(f"Class 1 (Minority): {(y==1).sum()} ({(y==1).mean()*100:.1f}%)")
print(f"Imbalance Ratio: {(y==0).sum() / (y==1).sum():.1f}:1")

# 1. RANDOM OVERSAMPLING (Manual)
def random_oversample(X, y, random_state=42):
    """
    Manual random oversampling - duplicate minority class samples.
    """
    np.random.seed(random_state)
    
    # Separate classes
    X_min = X[y == 1]
    X_maj = X[y == 0]
    
    n_min = len(X_min)
    n_maj = len(X_maj)
    
    # Randomly sample with replacement from minority
    indices = np.random.choice(n_min, size=n_maj - n_min, replace=True)
    X_min_oversampled = X_min[indices]
    
    # Combine
    X_balanced = np.vstack([X_maj, X_min, X_min_oversampled])
    y_balanced = np.hstack([np.zeros(n_maj), np.ones(n_min), np.ones(n_maj - n_min)])
    
    # Shuffle
    shuffle_idx = np.random.permutation(len(X_balanced))
    
    return X_balanced[shuffle_idx], y_balanced[shuffle_idx]

print("\n=== 1. RANDOM OVERSAMPLING (Manual) ===")
X_over, y_over = random_oversample(X, y)
print(f"After oversampling: Class 0={(y_over==0).sum()}, Class 1={(y_over==1).sum()}")

# 2. RANDOM UNDERSAMPLING (Manual)
def random_undersample(X, y, random_state=42):
    """
    Manual random undersampling - remove majority class samples.
    """
    np.random.seed(random_state)
    
    X_min = X[y == 1]
    X_maj = X[y == 0]
    
    n_min = len(X_min)
    
    # Randomly sample from majority without replacement
    indices = np.random.choice(len(X_maj), size=n_min, replace=False)
    X_maj_undersampled = X_maj[indices]
    
    # Combine
    X_balanced = np.vstack([X_maj_undersampled, X_min])
    y_balanced = np.hstack([np.zeros(n_min), np.ones(n_min)])
    
    # Shuffle
    shuffle_idx = np.random.permutation(len(X_balanced))
    
    return X_balanced[shuffle_idx], y_balanced[shuffle_idx]

print("\n=== 2. RANDOM UNDERSAMPLING (Manual) ===")
X_under, y_under = random_undersample(X, y)
print(f"After undersampling: Class 0={(y_under==0).sum()}, Class 1={(y_under==1).sum()}")
print(f"Warning: Lost {(y==0).sum() - (y_under==0).sum()} majority samples!")

# 3. SMOTE (Manual Implementation)
def euclidean_distance(x1, x2):
    """Calculate Euclidean distance"""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def get_neighbors(X, sample_idx, k=5):
    """Get k nearest neighbors for a sample"""
    distances = []
    for i in range(len(X)):
        if i != sample_idx:
            dist = euclidean_distance(X[sample_idx], X[i])
            distances.append((i, dist))
    distances.sort(key=lambda x: x[1])
    return [idx for idx, _ in distances[:k]]

def smote_manual(X, y, k=5, random_state=42):
    """
    Manual SMOTE implementation.
    1. Repeatedly pick a random minority sample
    2. Find k nearest minority neighbors
    3. Randomly select one neighbor
    4. Create synthetic sample between them
    """
    np.random.seed(random_state)
    
    X_min = X[y == 1]
    X_maj = X[y == 0]
    n_min = len(X_min)
    n_maj = len(X_maj)
    
    # Number of synthetic samples needed
    n_synthetic = n_maj - n_min
    
    synthetic_samples = []
    
    for _ in range(n_synthetic):
        # Random minority sample
        idx = np.random.randint(0, n_min)
        sample = X_min[idx]
        
        # Find k nearest neighbors within minority class
        distances = [(i, euclidean_distance(sample, X_min[i])) 
                     for i in range(n_min) if i != idx]
        distances.sort(key=lambda x: x[1])
        neighbors = [X_min[i] for i, _ in distances[:k]]
        
        # Random neighbor
        neighbor = neighbors[np.random.randint(0, len(neighbors))]
        
        # Generate synthetic sample
        alpha = np.random.random()
        synthetic = sample + alpha * (neighbor - sample)
        synthetic_samples.append(synthetic)
    
    # Combine
    X_smote = np.vstack([X_maj, X_min, np.array(synthetic_samples)])
    y_smote = np.hstack([np.zeros(n_maj), np.ones(n_min + n_synthetic)])
    
    # Shuffle
    shuffle_idx = np.random.permutation(len(X_smote))
    
    return X_smote[shuffle_idx], y_smote[shuffle_idx]

print("\n=== 3. SMOTE (Manual) ===")
X_smote, y_smote = smote_manual(X, y, k=5)
print(f"After SMOTE: Class 0={(y_smote==0).sum()}, Class 1={(y_smote==1).sum()}")
# n_synthetic is local to smote_manual, so derive the count from the array sizes
print(f"Synthetic samples created: {len(y_smote) - len(y)}")

# 4. CLASS WEIGHTS CALCULATION
def calculate_class_weights(y, method='balanced'):
    """
    Calculate class weights for imbalanced data.
    """
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    
    if method == 'balanced':
        # sklearn's balanced formula
        weights = {cls: n_samples / (n_classes * count) 
                   for cls, count in counts.items()}
    elif method == 'sqrt':
        # Square root weighting (less aggressive)
        weights = {cls: np.sqrt(n_samples / (n_classes * count)) 
                   for cls, count in counts.items()}
    elif method == 'log':
        # Logarithmic weighting
        weights = {cls: np.log(n_samples / count) + 1 
                   for cls, count in counts.items()}
    else:
        # Custom ratio-based
        max_count = max(counts.values())
        weights = {cls: max_count / count 
                   for cls, count in counts.items()}
    
    return weights

print("\n=== 4. CLASS WEIGHTS ===")
for method in ['balanced', 'sqrt', 'log']:
    weights = calculate_class_weights(y, method=method)
    print(f"\n{method}: {weights}")

# 5. EVALUATION METRICS FOR IMBALANCED DATA
def evaluate_classification(y_true, y_pred, y_prob=None):
    """
    Calculate metrics appropriate for imbalanced data.
    """
    from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                                  f1_score, confusion_matrix, roc_auc_score,
                                  average_precision_score)
    
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'specificity': tn / (tn + fp),
        'balanced_accuracy': (recall_score(y_true, y_pred) + tn / (tn + fp)) / 2
    }
    
    if y_prob is not None:
        metrics['roc_auc'] = roc_auc_score(y_true, y_prob)
        metrics['average_precision'] = average_precision_score(y_true, y_prob)
    
    return metrics

print("\n=== 5. EVALUATION METRICS ===")
print("Confusion Matrix Components:")
print("  TN: True Negative (correctly predicted majority)")
print("  FP: False Positive (majority predicted as minority)")
print("  FN: False Negative (minority predicted as majority) - COSTLY!")
print("  TP: True Positive (correctly predicted minority)")
print("\nKey Metrics for Imbalanced Data:")
print("  - Recall (Sensitivity): TP / (TP + FN) - How many minorities caught?")
print("  - Precision: TP / (TP + FP) - How many predicted minorities are correct?")
print("  - F1-Score: Harmonic mean of precision and recall")
print("  - Balanced Accuracy: Average of recall for each class")
print("  - ROC-AUC: Area under ROC curve (threshold independent)")
print("  - PR-AUC: Area under Precision-Recall curve (better for imbalance)")

# 6. COST MATRIX
def calculate_cost_weighted_accuracy(y_true, y_pred, cost_matrix):
    """
    Calculate cost-sensitive accuracy using a cost matrix.
    """
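    # confusion_matrix is not imported at module level in this script
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_true, y_pred)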
    total_cost = np.sum(cm * cost_matrix)
    max_cost = np.sum(cm) * np.max(cost_matrix)
    
    # Normalize to accuracy-like metric (1 - normalized cost)
    cost_weighted_accuracy = 1 - (total_cost / max_cost)
    
    return cost_weighted_accuracy, total_cost

print("\n=== 6. COST MATRIX ===")
# Define cost matrix (rows=true, cols=predicted)
# Cost of FN (missing minority) is 10x cost of FP
cost_matrix = np.array([[1, 5],   # TN, FP
                        [50, 1]]) # FN, TP
print("Cost Matrix:")
print("         Pred 0  Pred 1")
print(f"True 0    {cost_matrix[0,0]}      {cost_matrix[0,1]}")
print(f"True 1    {cost_matrix[1,0]}     {cost_matrix[1,1]}")
print("\nFN (False Negative) costs 10x more than FP!")

# 7. STRATIFIED SAMPLING
def stratified_split(X, y, test_size=0.2, random_state=42):
    """
    Manual stratified train-test split preserving class proportions.
    """
    np.random.seed(random_state)
    
    X_train, X_test = [], []
    y_train, y_test = [], []
    
    for cls in np.unique(y):
        cls_idx = np.where(y == cls)[0]
        n_test = int(len(cls_idx) * test_size)
        
        test_idx = np.random.choice(cls_idx, size=n_test, replace=False)
        train_idx = np.setdiff1d(cls_idx, test_idx)
        
        X_test.append(X[test_idx])
        y_test.append(y[test_idx])
        X_train.append(X[train_idx])
        y_train.append(y[train_idx])
    
    X_train = np.vstack(X_train)
    y_train = np.hstack(y_train)
    X_test = np.vstack(X_test)
    y_test = np.hstack(y_test)
    
    # Shuffle
    train_idx = np.random.permutation(len(X_train))
    test_idx = np.random.permutation(len(X_test))
    
    return X_train[train_idx], X_test[test_idx], y_train[train_idx], y_test[test_idx]

print("\n=== 7. STRATIFIED SPLIT ===")
X_train, X_test, y_train, y_test = stratified_split(X, y, test_size=0.2)
print(f"Train: Class 0={(y_train==0).sum()}, Class 1={(y_train==1).sum()}")
print(f"Test: Class 0={(y_test==0).sum()}, Class 1={(y_test==1).sum()}")
print(f"Class proportions preserved!")

print("\n=== COMPARISON SUMMARY ===")
methods = {
    'Original': (X, y),
    'Random Oversample': (X_over, y_over),
    'Random Undersample': (X_under, y_under),
    'SMOTE': (X_smote, y_smote)
}
for name, (X_m, y_m) in methods.items():
    c0, c1 = (y_m==0).sum(), (y_m==1).sum()
    print(f"{name}: Class 0={c0}, Class 1={c1}, Ratio={c0/c1:.1f}:1")
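The manual implementations above cover random resampling and SMOTE but not ADASYN. The sketch below is one simplified reading of the ADASYN ratio from the formula section, not the reference implementation: it computes r_i for every minority sample, normalizes the ratios, and allocates synthetic samples in proportion. The function name `adasyn_sketch` is illustrative, the data is regenerated with the same shape as the toy dataset, and the code assumes binary labels (1 = minority) with no exact duplicate points.

```python
import numpy as np

def adasyn_sketch(X, y, k=5, random_state=42):
    """Simplified ADASYN: more synthetic samples where minority points
    have more majority-class neighbors. Assumes label 1 is the minority."""
    rng = np.random.default_rng(random_state)
    X_min, X_maj = X[y == 1], X[y == 0]
    n_needed = len(X_maj) - len(X_min)

    # r_i: share of majority samples among the k nearest neighbors (all classes).
    d_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    nn_all = np.argsort(d_all, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    r = (y[nn_all] == 0).sum(axis=1) / k

    # Normalize the ratios into a distribution; fall back to uniform if all are 0.
    r_hat = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1 / len(X_min))
    g = np.round(r_hat * n_needed).astype(int)  # synthetic samples per minority point

    # Interpolate toward a random one of the k nearest *minority* neighbors.
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    nn_min = np.argsort(d_min, axis=1)[:, 1:k + 1]
    synthetic = []
    for i, g_i in enumerate(g):
        for _ in range(g_i):
            j = rng.choice(nn_min[i])
            lam = rng.random()
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))

    X_syn = np.array(synthetic).reshape(-1, X.shape[1])
    X_res = np.vstack([X, X_syn])
    y_res = np.hstack([y, np.ones(len(X_syn))])
    return X_res, y_res, r

# Same shape as the toy dataset: 950 majority points near (2, 2), 50 minority near (0, 0)
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(2, 1, (950, 2)), data_rng.normal(0, 1, (50, 2))])
y = np.hstack([np.zeros(950), np.ones(50)])

X_ada, y_ada, r = adasyn_sketch(X, y, k=5)
print(f"After ADASYN sketch: Class 0={(y_ada==0).sum()}, Class 1={(y_ada==1).sum()}")
print(f"Mean r_i: {r.mean():.2f}, max r_i: {r.max():.2f}")
```

Because each allocation is rounded, the resampled classes end up close to balanced rather than exactly 1:1, which is also typical of library ADASYN implementations.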

Using Libraries

import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score,
                             average_precision_score, precision_recall_curve, f1_score,
                             balanced_accuracy_score, recall_score, precision_score)
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.95, 0.05],  # 95% majority, 5% minority
    flip_y=0,
    random_state=42
)

print("Dataset Statistics:")
print(f"Total samples: {len(y)}")
print(f"Class 0 (Majority): {(y==0).sum()} ({(y==0).mean()*100:.1f}%)")
print(f"Class 1 (Minority): {(y==1).sum()} ({(y==1).mean()*100:.1f}%)")
print(f"Imbalance Ratio: {(y==0).sum() / (y==1).sum():.1f}:1")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"\nTrain set: Class 0={(y_train==0).sum()}, Class 1={(y_train==1).sum()}")
print(f"Test set: Class 0={(y_test==0).sum()}, Class 1={(y_test==1).sum()}")

# 1. RANDOM OVERSAMPLING
print("\n" + "="*60)
print("1. RANDOM OVERSAMPLING")
print("="*60)

ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print(f"Before: Class 0={(y_train==0).sum()}, Class 1={(y_train==1).sum()}")
print(f"After: Class 0={(y_train_ros==0).sum()}, Class 1={(y_train_ros==1).sum()}")

# 2. SMOTE
print("\n" + "="*60)
print("2. SMOTE (Synthetic Minority Over-sampling)")
print("="*60)

smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Before: Class 0={(y_train==0).sum()}, Class 1={(y_train==1).sum()}")
print(f"After: Class 0={(y_train_smote==0).sum()}, Class 1={(y_train_smote==1).sum()}")
print(f"Synthetic samples created: {(y_train_smote==1).sum() - (y_train==1).sum()}")

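# SMOTE variants: current imbalanced-learn exposes Borderline-SMOTE as its own class
# (the old SMOTE(kind=...) argument was removed)
from imblearn.over_sampling import BorderlineSMOTE
smote_borderline = BorderlineSMOTE(random_state=42, k_neighbors=5, kind='borderline-1')
X_train_bl, y_train_bl = smote_borderline.fit_resample(X_train, y_train)
print(f"\nBorderline-SMOTE: Class 1={(y_train_bl==1).sum()}")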

# 3. ADASYN
print("\n" + "="*60)
print("3. ADASYN (Adaptive Synthetic Sampling)")
print("="*60)

adasyn = ADASYN(random_state=42, n_neighbors=5)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)

print(f"Before: Class 0={(y_train==0).sum()}, Class 1={(y_train==1).sum()}")
print(f"After: Class 0={(y_train_adasyn==0).sum()}, Class 1={(y_train_adasyn==1).sum()}")
print(f"Synthetic samples created: {(y_train_adasyn==1).sum() - (y_train==1).sum()}")
print("ADASYN generates more samples for harder-to-learn minority instances")

# 4. RANDOM UNDERSAMPLING
print("\n" + "="*60)
print("4. RANDOM UNDERSAMPLING")
print("="*60)

rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print(f"Before: Class 0={(y_train==0).sum()}, Class 1={(y_train==1).sum()}")
print(f"After: Class 0={(y_train_rus==0).sum()}, Class 1={(y_train_rus==1).sum()}")
print(f"Majority samples removed: {(y_train==0).sum() - (y_train_rus==0).sum()}")
print("Warning: May lose valuable information from majority class!")

# 5. COMBINED METHODS
print("\n" + "="*60)
print("5. COMBINED METHODS")
print("="*60)

# SMOTE + Tomek Links
smote_tomek = SMOTETomek(random_state=42)
X_train_st, y_train_st = smote_tomek.fit_resample(X_train, y_train)
print(f"SMOTE+Tomek: Class 0={(y_train_st==0).sum()}, Class 1={(y_train_st==1).sum()}")
print(f"Tomek links removed: {len(y_train_smote) - len(y_train_st)} borderline samples")

# 6. CLASS WEIGHTS
print("\n" + "="*60)
print("6. CLASS WEIGHTS IN SKLEARN")
print("="*60)

# Calculate class weights
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = {cls: weight for cls, weight in zip(classes, class_weights)}

print(f"Auto-calculated class weights: {class_weight_dict}")
print(f"Class 0 weight: {class_weight_dict[0]:.3f}")
print(f"Class 1 weight: {class_weight_dict[1]:.3f}")
print(f"Ratio: {class_weight_dict[1]/class_weight_dict[0]:.1f}x")

# Custom weights
custom_weights = {0: 1.0, 1: 10.0}  # Penalize minority misclassification 10x
print(f"\nCustom weights: {custom_weights}")

# 7. MODEL COMPARISON
print("\n" + "="*60)
print("7. MODEL PERFORMANCE COMPARISON")
print("="*60)

def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    return {
        'Model': model_name,
        'Accuracy': model.score(X_test, y_test),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'Balanced_Acc': balanced_accuracy_score(y_test, y_pred),
        'ROC_AUC': roc_auc_score(y_test, y_prob),
        'PR_AUC': average_precision_score(y_test, y_prob)
    }

results = []

# Baseline (No resampling)
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)
results.append(evaluate_model(rf_baseline, X_test, y_test, 'Baseline'))

# With SMOTE
rf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_smote.fit(X_train_smote, y_train_smote)
results.append(evaluate_model(rf_smote, X_test, y_test, 'SMOTE'))

# With ADASYN
rf_adasyn = RandomForestClassifier(n_estimators=100, random_state=42)
rf_adasyn.fit(X_train_adasyn, y_train_adasyn)
results.append(evaluate_model(rf_adasyn, X_test, y_test, 'ADASYN'))

# With Class Weights
rf_weighted = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42
)
rf_weighted.fit(X_train, y_train)
results.append(evaluate_model(rf_weighted, X_test, y_test, 'Class Weights'))

# With Undersampling
rf_under = RandomForestClassifier(n_estimators=100, random_state=42)
rf_under.fit(X_train_rus, y_train_rus)
results.append(evaluate_model(rf_under, X_test, y_test, 'Undersample'))

# Display results
results_df = pd.DataFrame(results)
print("\nPerformance Comparison:")
print(results_df.round(3).to_string(index=False))

# 8. THRESHOLD TUNING
print("\n" + "="*60)
print("8. THRESHOLD TUNING")
print("="*60)

# Find the F1-optimal threshold.
# Note: tuned on the test set here for brevity; in practice, tune on a validation set.
y_prob = rf_weighted.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
# precision/recall have one more entry than thresholds; drop the final (recall=0) point
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]

print("Default threshold: 0.5")
print(f"Optimal F1 threshold: {best_threshold:.3f}")
print(f"F1 at default: {f1_score(y_test, (y_prob >= 0.5).astype(int)):.3f}")
print(f"F1 at optimal: {f1_scores[best_idx]:.3f}")

# 9. CROSS-VALIDATION WITH IMBALANCED DATA
print("\n" + "="*60)
print("9. STRATIFIED CROSS-VALIDATION")
print("="*60)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []

for train_idx, val_idx in skf.split(X_train, y_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    
    # Apply SMOTE only to training fold
    smote_cv = SMOTE(random_state=42)
    X_tr_res, y_tr_res = smote_cv.fit_resample(X_tr, y_tr)
    
    # Train and evaluate
    rf_cv = RandomForestClassifier(n_estimators=50, random_state=42)
    rf_cv.fit(X_tr_res, y_tr_res)
    
    y_pred = rf_cv.predict(X_val)
    cv_scores.append(f1_score(y_val, y_pred))

print(f"Stratified CV F1 scores: {[round(s, 3) for s in cv_scores]}")
print(f"Mean F1: {np.mean(cv_scores):.3f} (+/- {np.std(cv_scores):.3f})")

# 10. COST-SENSITIVE EVALUATION
print("\n" + "="*60)
print("10. COST-SENSITIVE EVALUATION")
print("="*60)

def cost_sensitive_score(y_true, y_pred, cost_fn=10, cost_fp=1):
    """
    Calculate cost-weighted score.
    cost_fn: Cost of False Negative (missing minority)
    cost_fp: Cost of False Positive (false alarm)
    """
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    total_cost = fn * cost_fn + fp * cost_fp
    max_cost = (fn + tp) * cost_fn + (tn + fp) * cost_fp
    
    return 1 - (total_cost / max_cost), total_cost

# Compare different approaches with cost
y_pred_baseline = rf_baseline.predict(X_test)
y_pred_smote = rf_smote.predict(X_test)
y_pred_weighted = rf_weighted.predict(X_test)

for name, y_pred in [('Baseline', y_pred_baseline),
                      ('SMOTE', y_pred_smote),
                      ('Weighted', y_pred_weighted)]:
    cost_acc, total_cost = cost_sensitive_score(y_test, y_pred, cost_fn=10, cost_fp=1)
    print(f"{name}: Cost-Accuracy={cost_acc:.3f}, Total Cost={total_cost}")

# 11. BEST PRACTICES SUMMARY
print("\n" + "="*60)
print("11. BEST PRACTICES FOR IMBALANCED DATA")
print("="*60)

best_practices = {
    'Technique': [
        'Never use accuracy',
        'Stratified split',
        'Use appropriate metrics',
        'Try class weights first',
        'SMOTE for small data',
        'ADASYN for complex boundaries',
        'Avoid undersampling with small data',
        'Tune decision threshold',
        'Ensemble methods',
        'Validate on original distribution'
    ],
    'Recommendation': [
        'Accuracy is misleading; use F1, PR-AUC, or balanced accuracy',
        'Always use stratified train/test split to preserve class ratios',
        'PR-AUC better than ROC-AUC for severe imbalance (>1:10)',
        'Least invasive; no data augmentation needed',
        'Creates synthetic samples; good when minority samples < 1000',
        'Focuses on hard examples; better for complex decision boundaries',
        'Only when majority data is abundant (>100k)',
        'Optimize threshold on validation set; default 0.5 rarely optimal',
        'BalancedRandomForest, EasyEnsemble often outperform single models',
        'Apply resampling only to training data; evaluate on original distribution'
    ]
}
print(pd.DataFrame(best_practices).to_string(index=False))
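The best-practices table recommends optimizing the decision threshold on a validation set. A minimal sketch of that workflow, using only scikit-learn, is below: the threshold is chosen on a held-out validation split and then applied once to the test split. The 60/20/20 split, the 2000-sample dataset, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10, n_redundant=5,
    weights=[0.95, 0.05], flip_y=0, random_state=42,
)

# 60% train, 20% validation, 20% test, all stratified
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Choose the threshold that maximizes F1 on the validation set only.
val_prob = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_prob)
# precision and recall have one more entry than thresholds; drop the last point
f1_val = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_val)]

# Apply the frozen threshold to the untouched test set.
test_prob = model.predict_proba(X_test)[:, 1]
f1_default = f1_score(y_test, (test_prob >= 0.5).astype(int))
f1_tuned = f1_score(y_test, (test_prob >= best_threshold).astype(int))

print(f"Threshold chosen on validation: {best_threshold:.3f}")
print(f"Test F1 at 0.5: {f1_default:.3f}")
print(f"Test F1 at tuned threshold: {f1_tuned:.3f}")
```

The tuned threshold is not guaranteed to beat 0.5 on the test set, especially with few minority samples, but the comparison is an honest one because the test data played no part in choosing it.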

When to Use

✅ Appropriate Use Cases:

SMOTE: Use when minority class has 100-1000 samples, features are continuous, k-NN makes sense in feature space
ADASYN: Use when decision boundary is complex, some minority samples are harder to classify than others
Class Weights: Use when you want minimal intervention, large datasets, or as first approach before resampling
Random Oversampling: Use as a baseline, or when SMOTE creates poor synthetic samples (for example, with high-cardinality categorical features)
Random Undersampling: Use only when majority class has >100k samples and storage/compute is limited
SMOTE+Tomek: Use when you want to clean decision boundary after oversampling

❌ Avoid When:

Don't apply SMOTE before train/test split—data leakage and optimistic evaluation
Avoid SMOTE with high-dimensional sparse data (text, one-hot)—distance metrics fail
Don't use standard accuracy as metric—always use F1, PR-AUC, or balanced accuracy
Avoid plain SMOTE with categorical features (nominal or ordinal); interpolation may create invalid values
Don't undersample when minority class is already small (<100)—removes too much information
Avoid class weights without tuning—default balanced may not match business costs
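On the last point, class weights can be treated like any other hyperparameter. The sketch below is one possible setup, not a prescribed recipe: it searches a handful of candidate weightings with scikit-learn's `GridSearchCV`. The candidate values, the logistic regression model, and the F1 scoring choice are assumptions; a scorer built from real business costs would be a better fit where those costs are known.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5,
    weights=[0.95, 0.05], flip_y=0, random_state=42,
)

# Candidate weightings: none, sklearn's 'balanced', and a few manual ratios.
param_grid = {
    "class_weight": [None, "balanced", {0: 1, 1: 5}, {0: 1, 1: 10}, {0: 1, 1: 50}],
}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1",  # minority-class F1; swap in a cost-based scorer if available
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

print(f"Best class_weight: {search.best_params_['class_weight']}")
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```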

Common Pitfalls

Resampling before cross-validation—synthetic samples leak into validation folds
Using Euclidean distance on mixed data types—SMOTE creates unrealistic synthetic points
Ignoring class distribution in test set—must reflect real-world imbalance
Treating imbalance as only problem—often data quality or feature issues matter more
Over-sampling to exactly 1:1—can cause overfitting; try 1:2 or 1:5 ratios
Not trying class weights first—often as effective as SMOTE with less complexity
