Clustering: K-means, Hierarchical, and DBSCAN Comparison
Definition
Clustering is unsupervised learning's fundamental task: grouping data points into clusters where members are more similar to each other than to those in other clusters. Unlike classification, clusters are discovered rather than predefined. K-means partitions data into k spherical clusters by minimizing within-cluster sum of squares, iteratively assigning points to nearest centroids and recomputing centroids. Hierarchical clustering builds a tree of nested clusters through agglomerative (bottom-up merging) or divisive (top-down splitting) approaches, producing dendrograms that reveal cluster structure at multiple scales. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as dense regions separated by sparse areas, naturally handling outliers and discovering clusters of arbitrary shape. Each algorithm embodies different clustering philosophies: K-means assumes globular clusters and requires k specification; hierarchical provides rich structure but is computationally expensive; DBSCAN finds arbitrary shapes but struggles with varying densities. Understanding these trade-offs is essential for effective cluster analysis.
Intuition
Imagine organizing a library. K-means is like sorting books onto k shelves by repeatedly moving each book to the shelf with the most similar books, then moving the shelf label to the average position. Hierarchical clustering is like building a family tree of books - grouping the two most similar first, then progressively merging groups. DBSCAN is like finding islands in an ocean - dense areas of books form clusters, while isolated books are just noise.
Mathematical Formula
Step-by-Step Explanation:
- K-means minimizes the sum of squared distances from points to their cluster centroids
- Centroids are the mean of all points in the cluster - minimizes WCSS
- Ward's method merges clusters that minimize increase in total within-cluster variance
- DBSCAN's epsilon defines neighborhood radius, MinPts defines density threshold
- Core points have at least MinPts within epsilon distance (including themselves)
- Silhouette score ranges from -1 (wrong cluster) to 1 (well-clustered), 0 indicates overlap
Real-World Use Cases
Customer segmentation: group customers by purchasing behavior for targeted campaigns. K-means for simple segments, DBSCAN for detecting outliers (fraud).
Image compression: K-means clustering reduces color palette by grouping similar pixels. Used in image quantization.
Gene expression analysis: hierarchical clustering reveals relationships between genes with similar expression patterns.
Network intrusion detection: DBSCAN identifies normal traffic clusters while flagging sparse points as potential attacks.
Implementation
Manual Implementation (No Libraries)
import numpy as np
from collections import defaultdict
class KMeans:
"""
Manual implementation of K-means clustering.
Uses Lloyd's algorithm with k-means++ initialization.
"""
def __init__(self, n_clusters=3, max_iter=300, tol=1e-4, random_state=None):
self.n_clusters = n_clusters
self.max_iter = max_iter
self.tol = tol
self.random_state = random_state
self.centroids = None
self.labels_ = None
self.inertia_ = None
def _kmeans_plus_plus(self, X):
"""K-means++ initialization for better starting centroids."""
np.random.seed(self.random_state)
n_samples, n_features = X.shape
centroids = np.zeros((self.n_clusters, n_features))
# First centroid: random
centroids[0] = X[np.random.choice(n_samples)]
for k in range(1, self.n_clusters):
# Compute distances from each point to nearest centroid
distances = np.min([
np.sum((X - centroids[i]) ** 2, axis=1)
for i in range(k)
], axis=0)
# Choose next centroid with probability proportional to distance²
probs = distances / distances.sum()
centroids[k] = X[np.random.choice(n_samples, p=probs)]
return centroids
def fit(self, X):
"""Fit K-means to data."""
np.random.seed(self.random_state)
n_samples = X.shape[0]
# Initialize centroids
self.centroids = self._kmeans_plus_plus(X)
for iteration in range(self.max_iter):
# Assign points to nearest centroid
distances = np.sqrt(
np.sum((X[:, np.newaxis] - self.centroids) ** 2, axis=2)
)
labels = np.argmin(distances, axis=1)
# Update centroids
new_centroids = np.array([
X[labels == k].mean(axis=0) if np.sum(labels == k) > 0
else self.centroids[k]
for k in range(self.n_clusters)
])
# Check convergence
centroid_shift = np.sum((new_centroids - self.centroids) ** 2)
self.centroids = new_centroids
if centroid_shift < self.tol:
break
# Final assignment
distances = np.sqrt(
np.sum((X[:, np.newaxis] - self.centroids) ** 2, axis=2)
)
self.labels_ = np.argmin(distances, axis=1)
# Compute inertia (within-cluster sum of squares)
self.inertia_ = np.sum(
(X - self.centroids[self.labels_]) ** 2
)
return self
def predict(self, X):
"""Predict cluster labels for new data."""
distances = np.sqrt(
np.sum((X[:, np.newaxis] - self.centroids) ** 2, axis=2)
)
return np.argmin(distances, axis=1)
class DBSCAN:
"""
Manual implementation of DBSCAN clustering.
"""
def __init__(self, eps=0.5, min_samples=5, metric='euclidean'):
self.eps = eps
self.min_samples = min_samples
self.metric = metric
self.labels_ = None
def _region_query(self, X, point_idx):
"""Find all points within eps of point_idx."""
distances = np.sqrt(np.sum((X - X[point_idx]) ** 2, axis=1))
return np.where(distances <= self.eps)[0]
def fit(self, X):
"""Fit DBSCAN to data."""
n_samples = X.shape[0]
self.labels_ = np.full(n_samples, -1) # -1 = unclassified/noise
cluster_id = 0
for point_idx in range(n_samples):
if self.labels_[point_idx] != -1:
continue
# Find neighbors
neighbors = self._region_query(X, point_idx)
# Check if core point
if len(neighbors) < self.min_samples:
self.labels_[point_idx] = -1 # Mark as noise temporarily
continue
# Start new cluster
self._expand_cluster(X, point_idx, neighbors, cluster_id)
cluster_id += 1
return self
def _expand_cluster(self, X, core_idx, neighbors, cluster_id):
"""Expand cluster from core point."""
self.labels_[core_idx] = cluster_id
i = 0
while i < len(neighbors):
point_idx = neighbors[i]
if self.labels_[point_idx] == -1:
self.labels_[point_idx] = cluster_id # Change noise to border
if self.labels_[point_idx] != -1:
i += 1
continue
self.labels_[point_idx] = cluster_id
point_neighbors = self._region_query(X, point_idx)
if len(point_neighbors) >= self.min_samples:
# Core point - add its neighbors
neighbors = np.concatenate([neighbors, point_neighbors])
i += 1
# Demonstration
if __name__ == '__main__':
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, y_true = make_blobs(
n_samples=300, centers=4, cluster_std=0.6, random_state=42
)
# K-means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
print(f'K-means Inertia: {kmeans.inertia_:.2f}')
# DBSCAN
dbscan = DBSCAN(eps=0.8, min_samples=5)
dbscan.fit(X)
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print(f'DBSCAN clusters found: {n_clusters}')
print(f'Noise points: {np.sum(dbscan.labels_ == -1)}')
Using Libraries (scikit-learn, scipy, numpy, matplotlib)
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs, make_moons
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
import matplotlib.pyplot as plt
# Generate datasets for comparison
print('=== CLUSTERING ALGORITHM COMPARISON ===')
# Dataset 1: Well-separated blobs
X_blobs, y_blobs = make_blobs(
n_samples=300, centers=4, cluster_std=0.6, random_state=42
)
# Dataset 2: Non-convex shapes (moons)
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
# Standardize features
scaler = StandardScaler()
X_blobs_scaled = scaler.fit_transform(X_blobs)
X_moons_scaled = scaler.fit_transform(X_moons)
# K-MEANS
print('
=== K-MEANS CLUSTERING ===')
# Elbow method to find optimal k
inertias = []
silhouettes = []
k_range = range(2, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_blobs_scaled)
inertias.append(kmeans.inertia_)
silhouettes.append(silhouette_score(X_blobs_scaled, kmeans.labels_))
optimal_k = k_range[np.argmax(silhouettes)]
print(f'Optimal k by Silhouette: {optimal_k}')
print(f'Silhouette Scores: {[f"{s:.3f}" for s in silhouettes]}')
# Final K-means with optimal k
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans.fit(X_blobs_scaled)
print(f'K-means Inertia: {kmeans.inertia_:.2f}')
print(f'K-means Silhouette: {silhouette_score(X_blobs_scaled, kmeans.labels_):.3f}')
# AGGLOMERATIVE CLUSTERING
print('
=== HIERARCHICAL CLUSTERING ===')
# Different linkage methods
linkages = ['ward', 'complete', 'average', 'single']
for link in linkages:
agg = AgglomerativeClustering(n_clusters=4, linkage=link)
labels = agg.fit_predict(X_blobs_scaled)
sil = silhouette_score(X_blobs_scaled, labels)
print(f'{link.capitalize()} linkage Silhouette: {sil:.3f}')
# Ward is best for Euclidean distance
agg_ward = AgglomerativeClustering(n_clusters=4, linkage='ward')
agg_labels = agg_ward.fit_predict(X_blobs_scaled)
print(f'Ward Silhouette: {silhouette_score(X_blobs_scaled, agg_labels):.3f}')
# DBSCAN
print('
=== DBSCAN CLUSTERING ===')
# Find optimal eps using k-distance graph
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5)
neigh.fit(X_blobs_scaled)
distances, indices = neigh.kneighbors(X_blobs_scaled)
distances = np.sort(distances[:, 4]) # 4th nearest neighbor
# Use elbow of k-distance plot to determine eps (typically around the knee)
eps_optimal = np.percentile(distances, 90) # Heuristic
print(f'Suggested eps (90th percentile): {eps_optimal:.2f}')
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_blobs_scaled)
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print(f'DBSCAN clusters: {n_clusters}')
print(f'Noise points: {np.sum(dbscan.labels_ == -1)}')
# Silhouette excluding noise points
mask = dbscan.labels_ != -1
if np.sum(mask) > 1 and n_clusters > 1:
sil_dbscan = silhouette_score(X_blobs_scaled[mask], dbscan.labels_[mask])
print(f'DBSCAN Silhouette (no noise): {sil_dbscan:.3f}')
# COMPARISON ON NON-CONVEX DATA
print('
=== PERFORMANCE ON MOONS (NON-CONVEX) ===')
kmeans_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_moons.fit(X_moons_scaled)
print(f'K-means Silhouette: {silhouette_score(X_moons_scaled, kmeans_moons.labels_):.3f}')
dbscan_moons = DBSCAN(eps=0.3, min_samples=5)
dbscan_moons.fit(X_moons_scaled)
mask_moons = dbscan_moons.labels_ != -1
if np.sum(mask_moons) > 1:
sil_dbscan_moons = silhouette_score(X_moons_scaled[mask_moons], dbscan_moons.labels_[mask_moons])
print(f'DBSCAN Silhouette: {sil_dbscan_moons:.3f}')
# ADVANCED METRICS
print('
=== CLUSTERING QUALITY METRICS ===')
# Calinski-Harabasz Index (higher is better)
ch_score = calinski_harabasz_score(X_blobs_scaled, kmeans.labels_)
print(f'Calinski-Harabasz: {ch_score:.2f}')
# Davies-Bouldin Index (lower is better)
db_index = davies_bouldin_score(X_blobs_scaled, kmeans.labels_)
print(f'Davies-Bouldin: {db_index:.3f}')
# CLUSTERING PIPELINE EXAMPLE
print('
=== COMPLETE CLUSTERING PIPELINE ===')
from sklearn.decomposition import PCA
# Dimensionality reduction before clustering
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_blobs_scaled)
kmeans_pca = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_pca.fit(X_pca)
print(f'PCA + K-means Silhouette: {silhouette_score(X_pca, kmeans_pca.labels_):.3f}')
When to Use
✅ Appropriate Use Cases:
- K-means: Spherical clusters, known approximate number of clusters, large datasets
- Hierarchical: Exploring cluster structure at multiple levels, small datasets
- DBSCAN: Arbitrary-shaped clusters, detecting outliers, unknown cluster count
- Customer segmentation for targeted marketing (all methods)
- Image compression via color quantization (K-means)
- Anomaly detection where outliers are expected (DBSCAN)
- Document/taxonomy organization (hierarchical)
❌ Avoid When:
- K-means: Non-convex shapes, clusters of different densities, outliers present
- Hierarchical: Large datasets (>10,000 points due to O(n²) complexity)
- DBSCAN: Varying density clusters, high-dimensional data (curse of dimensionality)
- When cluster count must be determined and no domain knowledge exists
- When clusters overlap significantly (all methods struggle)
- When absolute distance interpretation is unreliable
Common Pitfalls
- Not scaling features: Distance-based clustering assumes comparable scales
- Wrong choice of k: Use elbow method, silhouette analysis, or domain knowledge
- Random initialization in K-means: Use k-means++ and multiple inits
- Ignoring outliers: K-means is sensitive, consider DBSCAN or outlier removal first
- Euclidean distance in high dimensions: Consider dimensionality reduction first
- Not handling categorical features: Use appropriate encoding or distance metrics
- Assuming clusters exist: Always validate with silhouette score or visualization