Clustering: K-means, Hierarchical, and DBSCAN Comparison

Intermediate Ml
~8 min read Ml

Definition

Clustering is unsupervised learning's fundamental task: grouping data points into clusters where members are more similar to each other than to those in other clusters. Unlike classification, clusters are discovered rather than predefined. K-means partitions data into k spherical clusters by minimizing within-cluster sum of squares, iteratively assigning points to nearest centroids and recomputing centroids. Hierarchical clustering builds a tree of nested clusters through agglomerative (bottom-up merging) or divisive (top-down splitting) approaches, producing dendrograms that reveal cluster structure at multiple scales. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as dense regions separated by sparse areas, naturally handling outliers and discovering clusters of arbitrary shape. Each algorithm embodies different clustering philosophies: K-means assumes globular clusters and requires k specification; hierarchical provides rich structure but is computationally expensive; DBSCAN finds arbitrary shapes but struggles with varying densities. Understanding these trade-offs is essential for effective cluster analysis.

Intuition

💡

Imagine organizing a library. K-means is like sorting books onto k shelves by repeatedly moving each book to the shelf with the most similar books, then moving the shelf label to the average position. Hierarchical clustering is like building a family tree of books - grouping the two most similar first, then progressively merging groups. DBSCAN is like finding islands in an ocean - dense areas of books form clusters, while isolated books are just noise.

Mathematical Formula

K-means Objective (Within-Cluster Sum of Squares):
\[ \min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2 \]
Centroid Update:
\[ \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \]
Ward's Method (Hierarchical - minimizing variance):
\[ \Delta(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} ||\mu_i - \mu_j||^2 \]
DBSCAN Core Points:
\[ N_\epsilon(p) = \{q \in D | d(p, q) \leq \epsilon\} \]
\[ p \text{ is core if } |N_\epsilon(p)| \geq MinPts \]
Silhouette Score (clustering quality):
\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]
where $a(i)$ = average intra-cluster distance, $b(i)$ = average nearest-cluster distance

Step-by-Step Explanation:

  1. K-means minimizes the sum of squared distances from points to their cluster centroids
  2. Centroids are the mean of all points in the cluster - minimizes WCSS
  3. Ward's method merges clusters that minimize increase in total within-cluster variance
  4. DBSCAN's epsilon defines neighborhood radius, MinPts defines density threshold
  5. Core points have at least MinPts within epsilon distance (including themselves)
  6. Silhouette score ranges from -1 (wrong cluster) to 1 (well-clustered), 0 indicates overlap

Real-World Use Cases

Marketing

Customer segmentation: group customers by purchasing behavior for targeted campaigns. K-means for simple segments, DBSCAN for detecting outliers (fraud).

Image Processing

Image compression: K-means clustering reduces color palette by grouping similar pixels. Used in image quantization.

Biology

Gene expression analysis: hierarchical clustering reveals relationships between genes with similar expression patterns.

Anomaly Detection

Network intrusion detection: DBSCAN identifies normal traffic clusters while flagging sparse points as potential attacks.

Implementation

Manual Implementation (No Libraries)

The K-means implementation uses k-means++ initialization for better convergence. Lloyd's algorithm iteratively assigns points to centroids and updates centroids until convergence. DBSCAN identifies core points (with sufficient neighbors), expands clusters from them, and labels sparse points as noise. These implementations demonstrate the fundamental mechanics of both algorithms.
import numpy as np
from collections import defaultdict

class KMeans:
    """
    Manual implementation of K-means clustering.
    Uses Lloyd's algorithm with k-means++ initialization.
    """
    
    def __init__(self, n_clusters=3, max_iter=300, tol=1e-4, random_state=None):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state
        self.centroids = None
        self.labels_ = None
        self.inertia_ = None
    
    def _kmeans_plus_plus(self, X):
        """K-means++ initialization for better starting centroids."""
        np.random.seed(self.random_state)
        n_samples, n_features = X.shape
        centroids = np.zeros((self.n_clusters, n_features))
        
        # First centroid: random
        centroids[0] = X[np.random.choice(n_samples)]
        
        for k in range(1, self.n_clusters):
            # Compute distances from each point to nearest centroid
            distances = np.min([
                np.sum((X - centroids[i]) ** 2, axis=1) 
                for i in range(k)
            ], axis=0)
            
            # Choose next centroid with probability proportional to distance²
            probs = distances / distances.sum()
            centroids[k] = X[np.random.choice(n_samples, p=probs)]
        
        return centroids
    
    def fit(self, X):
        """Fit K-means to data."""
        np.random.seed(self.random_state)
        n_samples = X.shape[0]
        
        # Initialize centroids
        self.centroids = self._kmeans_plus_plus(X)
        
        for iteration in range(self.max_iter):
            # Assign points to nearest centroid
            distances = np.sqrt(
                np.sum((X[:, np.newaxis] - self.centroids) ** 2, axis=2)
            )
            labels = np.argmin(distances, axis=1)
            
            # Update centroids
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.sum(labels == k) > 0 
                else self.centroids[k]
                for k in range(self.n_clusters)
            ])
            
            # Check convergence
            centroid_shift = np.sum((new_centroids - self.centroids) ** 2)
            self.centroids = new_centroids
            
            if centroid_shift < self.tol:
                break
        
        # Final assignment
        distances = np.sqrt(
            np.sum((X[:, np.newaxis] - self.centroids) ** 2, axis=2)
        )
        self.labels_ = np.argmin(distances, axis=1)
        
        # Compute inertia (within-cluster sum of squares)
        self.inertia_ = np.sum(
            (X - self.centroids[self.labels_]) ** 2
        )
        
        return self
    
    def predict(self, X):
        """Predict cluster labels for new data."""
        distances = np.sqrt(
            np.sum((X[:, np.newaxis] - self.centroids) ** 2, axis=2)
        )
        return np.argmin(distances, axis=1)

class DBSCAN:
    """
    Manual implementation of DBSCAN clustering.
    """
    
    def __init__(self, eps=0.5, min_samples=5, metric='euclidean'):
        self.eps = eps
        self.min_samples = min_samples
        self.metric = metric
        self.labels_ = None
    
    def _region_query(self, X, point_idx):
        """Find all points within eps of point_idx."""
        distances = np.sqrt(np.sum((X - X[point_idx]) ** 2, axis=1))
        return np.where(distances <= self.eps)[0]
    
    def fit(self, X):
        """Fit DBSCAN to data."""
        n_samples = X.shape[0]
        self.labels_ = np.full(n_samples, -1)  # -1 = unclassified/noise
        cluster_id = 0
        
        for point_idx in range(n_samples):
            if self.labels_[point_idx] != -1:
                continue
            
            # Find neighbors
            neighbors = self._region_query(X, point_idx)
            
            # Check if core point
            if len(neighbors) < self.min_samples:
                self.labels_[point_idx] = -1  # Mark as noise temporarily
                continue
            
            # Start new cluster
            self._expand_cluster(X, point_idx, neighbors, cluster_id)
            cluster_id += 1
        
        return self
    
    def _expand_cluster(self, X, core_idx, neighbors, cluster_id):
        """Expand cluster from core point."""
        self.labels_[core_idx] = cluster_id
        i = 0
        
        while i < len(neighbors):
            point_idx = neighbors[i]
            
            if self.labels_[point_idx] == -1:
                self.labels_[point_idx] = cluster_id  # Change noise to border
            
            if self.labels_[point_idx] != -1:
                i += 1
                continue
            
            self.labels_[point_idx] = cluster_id
            point_neighbors = self._region_query(X, point_idx)
            
            if len(point_neighbors) >= self.min_samples:
                # Core point - add its neighbors
                neighbors = np.concatenate([neighbors, point_neighbors])
            
            i += 1

# Demonstration
if __name__ == '__main__':
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    
    # Generate sample data
    X, y_true = make_blobs(
        n_samples=300, centers=4, cluster_std=0.6, random_state=42
    )
    
    # K-means
    kmeans = KMeans(n_clusters=4, random_state=42)
    kmeans.fit(X)
    print(f'K-means Inertia: {kmeans.inertia_:.2f}')
    
    # DBSCAN
    dbscan = DBSCAN(eps=0.8, min_samples=5)
    dbscan.fit(X)
    n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
    print(f'DBSCAN clusters found: {n_clusters}')
    print(f'Noise points: {np.sum(dbscan.labels_ == -1)}')

Using Libraries (scikit-learn, scipy, numpy, matplotlib)

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs, make_moons
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
import matplotlib.pyplot as plt

# Generate datasets for comparison
print('=== CLUSTERING ALGORITHM COMPARISON ===')

# Dataset 1: Well-separated blobs
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=4, cluster_std=0.6, random_state=42
)

# Dataset 2: Non-convex shapes (moons)
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

# Standardize features
scaler = StandardScaler()
X_blobs_scaled = scaler.fit_transform(X_blobs)
X_moons_scaled = scaler.fit_transform(X_moons)

# K-MEANS
print('
=== K-MEANS CLUSTERING ===')

# Elbow method to find optimal k
inertias = []
silhouettes = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_blobs_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_blobs_scaled, kmeans.labels_))

optimal_k = k_range[np.argmax(silhouettes)]
print(f'Optimal k by Silhouette: {optimal_k}')
print(f'Silhouette Scores: {[f"{s:.3f}" for s in silhouettes]}')

# Final K-means with optimal k
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans.fit(X_blobs_scaled)
print(f'K-means Inertia: {kmeans.inertia_:.2f}')
print(f'K-means Silhouette: {silhouette_score(X_blobs_scaled, kmeans.labels_):.3f}')

# AGGLOMERATIVE CLUSTERING
print('
=== HIERARCHICAL CLUSTERING ===')

# Different linkage methods
linkages = ['ward', 'complete', 'average', 'single']
for link in linkages:
    agg = AgglomerativeClustering(n_clusters=4, linkage=link)
    labels = agg.fit_predict(X_blobs_scaled)
    sil = silhouette_score(X_blobs_scaled, labels)
    print(f'{link.capitalize()} linkage Silhouette: {sil:.3f}')

# Ward is best for Euclidean distance
agg_ward = AgglomerativeClustering(n_clusters=4, linkage='ward')
agg_labels = agg_ward.fit_predict(X_blobs_scaled)
print(f'Ward Silhouette: {silhouette_score(X_blobs_scaled, agg_labels):.3f}')

# DBSCAN
print('
=== DBSCAN CLUSTERING ===')

# Find optimal eps using k-distance graph
from sklearn.neighbors import NearestNeighbors

neigh = NearestNeighbors(n_neighbors=5)
neigh.fit(X_blobs_scaled)
distances, indices = neigh.kneighbors(X_blobs_scaled)
distances = np.sort(distances[:, 4])  # 4th nearest neighbor

# Use elbow of k-distance plot to determine eps (typically around the knee)
eps_optimal = np.percentile(distances, 90)  # Heuristic
print(f'Suggested eps (90th percentile): {eps_optimal:.2f}')

dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_blobs_scaled)
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
print(f'DBSCAN clusters: {n_clusters}')
print(f'Noise points: {np.sum(dbscan.labels_ == -1)}')

# Silhouette excluding noise points
mask = dbscan.labels_ != -1
if np.sum(mask) > 1 and n_clusters > 1:
    sil_dbscan = silhouette_score(X_blobs_scaled[mask], dbscan.labels_[mask])
    print(f'DBSCAN Silhouette (no noise): {sil_dbscan:.3f}')

# COMPARISON ON NON-CONVEX DATA
print('
=== PERFORMANCE ON MOONS (NON-CONVEX) ===')

kmeans_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_moons.fit(X_moons_scaled)
print(f'K-means Silhouette: {silhouette_score(X_moons_scaled, kmeans_moons.labels_):.3f}')

dbscan_moons = DBSCAN(eps=0.3, min_samples=5)
dbscan_moons.fit(X_moons_scaled)
mask_moons = dbscan_moons.labels_ != -1
if np.sum(mask_moons) > 1:
    sil_dbscan_moons = silhouette_score(X_moons_scaled[mask_moons], dbscan_moons.labels_[mask_moons])
    print(f'DBSCAN Silhouette: {sil_dbscan_moons:.3f}')

# ADVANCED METRICS
print('
=== CLUSTERING QUALITY METRICS ===')

# Calinski-Harabasz Index (higher is better)
ch_score = calinski_harabasz_score(X_blobs_scaled, kmeans.labels_)
print(f'Calinski-Harabasz: {ch_score:.2f}')

# Davies-Bouldin Index (lower is better)
db_index = davies_bouldin_score(X_blobs_scaled, kmeans.labels_)
print(f'Davies-Bouldin: {db_index:.3f}')

# CLUSTERING PIPELINE EXAMPLE
print('
=== COMPLETE CLUSTERING PIPELINE ===')
from sklearn.decomposition import PCA

# Dimensionality reduction before clustering
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_blobs_scaled)

kmeans_pca = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_pca.fit(X_pca)
print(f'PCA + K-means Silhouette: {silhouette_score(X_pca, kmeans_pca.labels_):.3f}')

When to Use

✅ Appropriate Use Cases:

  • K-means: Spherical clusters, known approximate number of clusters, large datasets
  • Hierarchical: Exploring cluster structure at multiple levels, small datasets
  • DBSCAN: Arbitrary-shaped clusters, detecting outliers, unknown cluster count
  • Customer segmentation for targeted marketing (all methods)
  • Image compression via color quantization (K-means)
  • Anomaly detection where outliers are expected (DBSCAN)
  • Document/taxonomy organization (hierarchical)

❌ Avoid When:

  • K-means: Non-convex shapes, clusters of different densities, outliers present
  • Hierarchical: Large datasets (>10,000 points due to O(n²) complexity)
  • DBSCAN: Varying density clusters, high-dimensional data (curse of dimensionality)
  • When cluster count must be determined and no domain knowledge exists
  • When clusters overlap significantly (all methods struggle)
  • When absolute distance interpretation is unreliable

Common Pitfalls

  • Not scaling features: Distance-based clustering assumes comparable scales
  • Wrong choice of k: Use elbow method, silhouette analysis, or domain knowledge
  • Random initialization in K-means: Use k-means++ and multiple inits
  • Ignoring outliers: K-means is sensitive, consider DBSCAN or outlier removal first
  • Euclidean distance in high dimensions: Consider dimensionality reduction first
  • Not handling categorical features: Use appropriate encoding or distance metrics
  • Assuming clusters exist: Always validate with silhouette score or visualization