Naive Bayes: Bayes Theorem, Conditional Independence, and Text Classification
Definition
Naive Bayes is a family of probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between features. Despite its simplifying assumption that features are conditionally independent given the class - which is rarely true in practice - Naive Bayes classifiers often perform surprisingly well, particularly for text classification. The algorithm calculates posterior probabilities for each class and selects the class with highest probability. Variants include Gaussian (continuous features), Multinomial (discrete counts like word frequencies), and Bernoulli (binary features). The mathematical tractability from independence assumptions enables extremely fast training and prediction, even with high-dimensional data. Naive Bayes provides well-calibrated probability estimates and naturally handles missing values by ignoring them during probability calculation. Its simplicity, speed, and effectiveness with text data have made it a staple for spam filtering, sentiment analysis, and document categorization.
Intuition
Imagine you're a detective trying to determine if a crime was committed by suspect A or suspect B. You look at each piece of evidence separately - fingerprints, motive, alibi - and update your belief about each suspect's guilt. Even though the evidence pieces might actually be related (motive and opportunity often go together), treating them independently simplifies the math and often works well. The 'naive' assumption is like assuming each clue is independent - not quite true, but surprisingly effective.
Mathematical Formula
Step-by-Step Explanation:
- Bayes theorem inverts conditional probability: from P(evidence|class) to P(class|evidence)
- The naive assumption: P(x₁, x₂, ..., xₙ|y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)
- Log-probabilities prevent underflow from multiplying many small probabilities
- Multinomial: word frequency with Laplace smoothing (alpha) for unseen words
- Gaussian: assumes features normally distributed within each class
- Prior P(y) can incorporate domain knowledge or use class frequencies
Real-World Use Cases
Spam filtering: classifies emails based on word frequencies. Despite its simplicity, remains highly effective for this task.
Classifying product reviews as positive/negative using bag-of-words features and Multinomial NB.
Categorizing articles into topics (sports, politics, technology) based on word presence.
Disease prediction from symptom presence/absence using Bernoulli NB with binary features.
Implementation
Manual Implementation (No Libraries)
import numpy as np
from collections import defaultdict
class MultinomialNaiveBayes:
"""
Manual implementation of Multinomial Naive Bayes for text classification.
Uses Laplace smoothing and log-probabilities for numerical stability.
"""
def __init__(self, alpha=1.0):
self.alpha = alpha # Laplace smoothing parameter
self.class_priors = {}
self.feature_log_probs = {}
self.classes = None
self.n_features = None
def fit(self, X, y):
"""
Train Multinomial Naive Bayes.
X: Document-term matrix (n_samples, n_features) - word counts
y: Class labels
"""
self.classes = np.unique(y)
n_samples, self.n_features = X.shape
for c in self.classes:
# Get samples for this class
X_c = X[y == c]
n_c = X_c.shape[0]
# Calculate prior: P(y) = N_c / N
self.class_priors[c] = n_c / n_samples
# Calculate feature probabilities with Laplace smoothing
# P(x_i|y) = (count(x_i, y) + alpha) / (sum_counts(y) + alpha * n_features)
feature_counts = np.sum(X_c, axis=0) + self.alpha
total_count = np.sum(feature_counts)
# Store log probabilities for numerical stability
self.feature_log_probs[c] = np.log(feature_counts / total_count)
return self
def _compute_log_likelihood(self, X, c):
"""Compute log P(X|y=c) for all samples."""
# log P(X|y) = sum over features of (count * log P(feature|y))
return X @ self.feature_log_probs[c]
def predict_log_proba(self, X):
"""Compute log P(y|X) for all samples and classes."""
log_probs = np.zeros((X.shape[0], len(self.classes)))
for idx, c in enumerate(self.classes):
# log P(y|X) = log P(y) + log P(X|y)
log_prior = np.log(self.class_priors[c])
log_likelihood = self._compute_log_likelihood(X, c)
log_probs[:, idx] = log_prior + log_likelihood
return log_probs
def predict_proba(self, X):
"""Compute P(y|X) for all samples."""
log_probs = self.predict_log_proba(X)
# Normalize to probabilities using softmax
log_probs -= np.max(log_probs, axis=1, keepdims=True) # Numerical stability
probs = np.exp(log_probs)
probs /= np.sum(probs, axis=1, keepdims=True)
return probs
def predict(self, X):
"""Predict class labels."""
log_probs = self.predict_log_proba(X)
return self.classes[np.argmax(log_probs, axis=1)]
class GaussianNaiveBayes:
"""
Manual implementation of Gaussian Naive Bayes for continuous features.
"""
def __init__(self):
self.class_priors = {}
self.means = {}
self.vars = {}
self.classes = None
def fit(self, X, y):
"""Train Gaussian Naive Bayes."""
self.classes = np.unique(y)
n_samples = X.shape[0]
for c in self.classes:
X_c = X[y == c]
# Prior
self.class_priors[c] = X_c.shape[0] / n_samples
# Mean and variance for each feature (class conditional)
self.means[c] = np.mean(X_c, axis=0)
self.vars[c] = np.var(X_c, axis=0) + 1e-9 # Add epsilon for stability
return self
def _gaussian_pdf(self, x, mean, var):
"""Gaussian probability density function."""
return -0.5 * np.log(2 * np.pi * var) - 0.5 * ((x - mean) ** 2) / var
def predict_log_proba(self, X):
"""Compute log probabilities."""
log_probs = np.zeros((X.shape[0], len(self.classes)))
for idx, c in enumerate(self.classes):
log_prior = np.log(self.class_priors[c])
# Log likelihood for Gaussian
log_likelihood = np.sum(
self._gaussian_pdf(X, self.means[c], self.vars[c]),
axis=1
)
log_probs[:, idx] = log_prior + log_likelihood
return log_probs
def predict(self, X):
"""Predict class labels."""
log_probs = self.predict_log_proba(X)
return self.classes[np.argmax(log_probs, axis=1)]
# Demonstration
if __name__ == '__main__':
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
# Example with Gaussian NB on continuous data
X, y = make_classification(
n_samples=300, n_features=5, n_redundant=0,
n_informative=5, n_clusters_per_class=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Custom Gaussian NB
gnb = GaussianNaiveBayes()
gnb.fit(X_train, y_train)
acc = np.mean(gnb.predict(X_test) == y_test)
print(f'Gaussian NB Accuracy: {acc:.3f}')
Using Libraries (scikit-learn, numpy)
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups, load_iris, make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import numpy as np
# GAUSSIAN NAIVE BAYES (Continuous Features)
print('=== GAUSSIAN NAIVE BAYES ===')
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X_iris, y_iris, test_size=0.2, random_state=42
)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(f'Accuracy: {gnb.score(X_test, y_test):.3f}')
print(f'Class Priors: {gnb.class_prior_}')
# MULTINOMIAL NAIVE BAYES (Text Classification)
print('
=== MULTINOMIAL NAIVE BAYES (TEXT) ===')
# Load 20 newsgroups dataset (sample)
categories = ['alt.atheism', 'sci.space', 'talk.religion.misc', 'comp.graphics']
newsgroups = fetch_20newsgroups(
subset='train', categories=categories, shuffle=True, random_state=42
)
# Split before vectorization to prevent data leakage
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
newsgroups.data, newsgroups.target, test_size=0.2, random_state=42
)
# Create pipeline: Vectorization + Naive Bayes
pipeline = Pipeline([
('vect', CountVectorizer(stop_words='english', max_features=10000)),
('clf', MultinomialNB(alpha=0.1))
])
pipeline.fit(X_train_text, y_train_text)
print(f'Text Classification Accuracy: {pipeline.score(X_test_text, y_test_text):.3f}')
# Compare different smoothing values
print('
=== LAPLACE SMOOTHING COMPARISON ===')
alphas = [0.01, 0.1, 0.5, 1.0, 2.0]
for alpha in alphas:
mnb = MultinomialNB(alpha=alpha)
mnb.fit(pipeline.named_steps['vect'].transform(X_train_text), y_train_text)
acc = mnb.score(pipeline.named_steps['vect'].transform(X_test_text), y_test_text)
print(f'Alpha={alpha}: {acc:.3f}')
# BERNOULLI NAIVE BAYES (Binary Features)
print('
=== BERNOULLI NAIVE BAYES ===')
bnb = BernoulliNB(alpha=0.1)
# Use binary features
vectorizer_binary = CountVectorizer(stop_words='english', binary=True, max_features=5000)
X_train_binary = vectorizer_binary.fit_transform(X_train_text)
X_test_binary = vectorizer_binary.transform(X_test_text)
bnb.fit(X_train_binary, y_train_text)
print(f'Bernoulli NB Accuracy: {bnb.score(X_test_binary, y_test_text):.3f}')
# COMPLEMENT NAIVE BAYES (for imbalanced data)
print('
=== COMPLEMENT NAIVE BAYES ===')
cnb = ComplementNB(alpha=0.1)
cnb.fit(pipeline.named_steps['vect'].transform(X_train_text), y_train_text)
print(f'Complement NB Accuracy: {cnb.score(pipeline.named_steps['vect'].transform(X_test_text), y_test_text):.3f}')
# Feature Log Probabilities (most discriminative words)
print('
=== MOST DISCRIMINATIVE WORDS ===')
feature_names = pipeline.named_steps['vect'].get_feature_names_out()
for i, category in enumerate(categories):
# Get log probabilities for this class
log_probs = pipeline.named_steps['clf'].feature_log_prob_[i]
# Get top 5 features
top_indices = np.argsort(log_probs)[-5:]
top_words = [feature_names[j] for j in top_indices]
print(f'{category}: {top_words}')
# Cross-validation
print('
=== CROSS-VALIDATION ===')
cv_scores = cross_val_score(
GaussianNB(), X_iris, y_iris, cv=5, scoring='accuracy'
)
print(f'CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})')
When to Use
✅ Appropriate Use Cases:
- Text classification (spam, sentiment, topic categorization)
- High-dimensional data with many features
- Real-time applications requiring fast prediction
- Baseline model before trying complex classifiers
- When you need probability estimates quickly
- Multi-class problems with categorical features
- When conditional independence is approximately true
❌ Avoid When:
- When feature correlations are critical: independence assumption breaks down
- Complex decision boundaries requiring interaction terms
- When you need calibrated probabilities for all classes
- Regression problems (predict continuous values)
- Very small datasets where prior choice dominates
- When features have very different scales (except Gaussian variant)
Common Pitfalls
- Zero probability problem: Always use Laplace smoothing (alpha > 0)
- Feature correlation: Strongly correlated features get double-counted
- Discretizing continuous features for Multinomial NB loses information
- Ignoring prior probabilities: Can lead to biased predictions
- Not using log-probabilities: Underflow from multiplying small values
- Wrong variant choice: Gaussian for continuous, Multinomial for counts
- Imbalanced classes: Consider Complement NB or adjust priors