Convolutional Neural Networks: Architecture and Components

Definition

Convolutional Neural Networks (CNNs) are specialized deep learning architectures designed to automatically learn spatial hierarchies of features from grid-like data, particularly images. Unlike fully-connected networks, CNNs use convolution operations that preserve spatial relationships and employ parameter sharing through learnable filters (kernels). Key components include convolutional layers for feature extraction, pooling layers for dimensionality reduction, and fully-connected layers for classification. Introduced by LeCun in 1989 and popularized by AlexNet in 2012, CNNs have become a dominant architecture for computer vision tasks, with reported results that match or exceed human-level accuracy on some image classification benchmarks and strong performance in object detection and segmentation.

Intuition

Imagine examining a photograph by sliding a magnifying glass across it, focusing on small patches at a time. Each position reveals different local features - edges, colors, textures. Now imagine having multiple magnifying glasses, each tuned to detect specific patterns: one finds vertical edges, another finds red colors, another finds circular shapes. This is how convolutions work. The 'magnifying glasses' are learnable filters that slide across the image computing dot products. Stacking many such layers creates a hierarchy: early layers detect simple lines and colors, middle layers combine these into shapes and textures, and deep layers recognize complete objects like eyes, wheels, or faces. Pooling layers act like summarization - instead of examining every pixel, we keep only the most important information from each region.

Mathematical Formula

2D Convolution:

(I * K)(i, j) = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I(i+m, j+n) \cdot K(m, n)
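
The formula above can be sketched directly in NumPy. This is a minimal illustration for a single 2D image and a single kernel with no padding and stride 1; the function name `conv2d_single` and the hand-picked edge kernel are assumptions made for this example, not part of the original text. Note that the formula as written does not flip the kernel, so strictly speaking it is a cross-correlation, which is what deep learning libraries compute under the name "convolution".

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide one 2D kernel over one 2D image (no padding, stride 1)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Sum over m, n of I(i+m, j+n) * K(m, n)
            out[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return out

# 5x5 image: dark columns (0) on the left, bright columns (1) on the right
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# Hand-picked kernel that responds to dark-to-bright transitions from left to right
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

response = conv2d_single(image, vertical_edge)
print(response)
# [[0. 3. 3.]
#  [0. 3. 3.]
#  [0. 3. 3.]]
```

The response is zero over the uniform dark region and positive wherever the 3x3 window straddles the edge, which is the "magnifying glass tuned to vertical edges" from the intuition section.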

With Stride s and Padding p:

O_h = \left\lfloor \frac{I_h + 2p - k_h}{s} \right\rfloor + 1

O_w = \left\lfloor \frac{I_w + 2p - k_w}{s} \right\rfloor + 1
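
The output-size formulas above are easy to check with a small helper. The function name `conv_output_size` is hypothetical; the same expression applies to height and width independently, and to pooling layers when the pool size is used as the kernel size.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size along one dimension: floor((I + 2p - k) / s) + 1."""
    return (in_size + 2 * padding - kernel) // stride + 1

print(conv_output_size(28, 3, stride=1, padding=1))   # 28: 3x3 kernel with padding 1 keeps the size
print(conv_output_size(28, 3))                        # 26: no padding shrinks the map by k - 1
print(conv_output_size(28, 2, stride=2))              # 14: 2x2 window with stride 2 halves the size
print(conv_output_size(224, 7, stride=2, padding=3))  # 112
```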

Max Pooling:

\text{MaxPool}(x)_{i,j} = \max_{m,n \in \text{pool region}} x_{i+m, j+n}

Average Pooling:

\text{AvgPool}(x)_{i,j} = \frac{1}{k_h k_w} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} x_{i+m, j+n}
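
To make the two pooling formulas concrete, the following sketch applies both to the same 4x4 input with a 2x2 window and stride 2. The helper `pool2d` and its `mode` argument are assumptions made for illustration only.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a single 2D feature map with a square window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

x = np.array([[1, 3, 2, 0],
              [5, 7, 4, 6],
              [9, 8, 1, 1],
              [2, 0, 3, 3]], dtype=float)

max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="avg")
print(max_out)  # [[7. 6.] [9. 3.]]
print(avg_out)  # [[4. 3.] [4.75 2.]]
```

Max pooling keeps only the strongest activation in each window, while average pooling blends all four values, so the same input yields a sharper or a smoother summary depending on the choice.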

Batch Normalization:

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

y = \gamma \hat{x} + \beta
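
A minimal sketch of the two batch normalization equations for convolutional feature maps follows. It assumes training-mode behaviour only (statistics computed from the current batch, per channel, over the batch and spatial axes) and omits the running averages that a full layer would keep for inference. The function name `batch_norm_2d` is hypothetical.

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per channel, then scale and shift."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # mu_B, one value per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)    # sigma_B^2, one value per channel
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 6, 6))  # activations far from zero mean
y = batch_norm_2d(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=(0, 2, 3)))  # approximately 0 for every channel
print(y.std(axis=(0, 2, 3)))   # approximately 1 for every channel
```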

Output Feature Map Size:

O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1

Step-by-Step Explanation:

2D Convolution: Element-wise multiplication of kernel K with image patch I, summed to produce single output value
Output Dimensions: Formula calculates output spatial size given input size W, kernel size K, padding P, and stride S
Max Pooling: Takes maximum value from each pooling region, preserving strongest activation
Average Pooling: Computes mean of pooling region, providing smoother downsampling
Batch Norm: Normalizes activations using batch statistics (μ, σ), then scales by \(\gamma\) and shifts by \(\beta\)
Multiple Channels: Each filter produces one feature map; stack of filters produces volume of feature maps
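
The multiple-channel case in the last item can be illustrated with shapes. In this sketch (function name `conv_multi_channel` assumed for illustration), each filter spans all input channels and produces one feature map, so 4 filters applied to a 3-channel input give a 4-channel output volume.

```python
import numpy as np

def conv_multi_channel(x, weights):
    """x: (in_c, H, W); weights: (out_c, in_c, k_h, k_w); no padding, stride 1."""
    out_c, in_c, k_h, k_w = weights.shape
    _, in_h, in_w = x.shape
    out = np.zeros((out_c, in_h - k_h + 1, in_w - k_w + 1))
    for oc in range(out_c):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # One filter multiplies a patch across ALL input channels and sums the result
                out[oc, i, j] = np.sum(x[:, i:i + k_h, j:j + k_w] * weights[oc])
    return out

x = np.ones((3, 5, 5))           # 3-channel 5x5 input
weights = np.ones((4, 3, 3, 3))  # 4 filters, each of size 3 x 3 x 3
features = conv_multi_channel(x, weights)

print(features.shape)     # (4, 3, 3)
print(features[0, 0, 0])  # 27.0, the sum over 3 channels x 3 x 3 positions
```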

Real-World Use Cases

Image Classification

ResNet-50 classifying 1000 object categories in ImageNet with roughly 76% top-1 accuracy

Medical Imaging

Detecting diabetic retinopathy from retinal scans with accuracy reported to be on par with ophthalmologists

Autonomous Vehicles

Tesla's FSD system using CNNs for lane detection and object recognition

Facial Recognition

FaceID on iPhones using depth-aware CNNs for secure authentication

Satellite Imagery

Analyzing crop health and predicting yields from satellite photographs

Manufacturing QC

Detecting defects in semiconductor wafers and automotive parts

Implementation

Manual Implementation (No Libraries)

This implementation includes a Conv2D layer that performs a sliding-window convolution (computed as cross-correlation, as in the formula above) with configurable stride and zero padding, a MaxPool2D layer that downsamples by taking the maximum value in each pooling region, and a SimpleCNN that stacks these layers. Only the forward pass is implemented. The convolution operates on 4D tensors (batch, channel, height, width) and uses Xavier (Glorot) uniform initialization for the weights.

import numpy as np

class Conv2D:
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size if isinstance(kernel_size, tuple) else (kernel_size, kernel_size)
        self.stride = stride
        self.padding = padding
        
        # Xavier (Glorot) uniform initialization: limit = sqrt(6 / (fan_in + fan_out))
        k_h, k_w = self.kernel_size
        fan_in = in_channels * k_h * k_w
        fan_out = out_channels * k_h * k_w
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        self.weights = np.random.uniform(-limit, limit, (out_channels, in_channels, *self.kernel_size))
        self.bias = np.zeros(out_channels)
        
    def forward(self, x):
        # x has shape (batch, in_channels, height, width)
        batch_size, in_c, in_h, in_w = x.shape
        k_h, k_w = self.kernel_size
        
        # Pad input
        if self.padding > 0:
            x_padded = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)), mode='constant')
        else:
            x_padded = x
        
        # Calculate output dimensions
        out_h = (in_h + 2 * self.padding - k_h) // self.stride + 1
        out_w = (in_w + 2 * self.padding - k_w) // self.stride + 1
        
        # Initialize output
        output = np.zeros((batch_size, self.out_channels, out_h, out_w))
        
        # Perform convolution
        for b in range(batch_size):
            for oc in range(self.out_channels):
                for ic in range(in_c):
                    for i in range(out_h):
                        for j in range(out_w):
                            i_start = i * self.stride
                            i_end = i_start + k_h
                            j_start = j * self.stride
                            j_end = j_start + k_w
                            
                            patch = x_padded[b, ic, i_start:i_end, j_start:j_end]
                            output[b, oc, i, j] += np.sum(patch * self.weights[oc, ic])
                output[b, oc] += self.bias[oc]
        
        self.input = x_padded
        return output

class MaxPool2D:
    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size if isinstance(pool_size, tuple) else (pool_size, pool_size)
        self.stride = stride
    
    def forward(self, x):
        batch_size, channels, in_h, in_w = x.shape
        pool_h, pool_w = self.pool_size
        
        out_h = (in_h - pool_h) // self.stride + 1
        out_w = (in_w - pool_w) // self.stride + 1
        
        output = np.zeros((batch_size, channels, out_h, out_w))
        self.mask = {}  # Store for backpropagation
        
        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        i_start = i * self.stride
                        i_end = i_start + pool_h
                        j_start = j * self.stride
                        j_end = j_start + pool_w
                        
                        patch = x[b, c, i_start:i_end, j_start:j_end]
                        output[b, c, i, j] = np.max(patch)
                        
                        # Store mask for backprop
                        max_idx = np.unravel_index(np.argmax(patch), patch.shape)
                        self.mask[(b, c, i, j)] = (i_start + max_idx[0], j_start + max_idx[1])
        
        return output

class SimpleCNN:
    def __init__(self):
        self.conv1 = Conv2D(1, 8, kernel_size=3, padding=1)
        self.pool1 = MaxPool2D(pool_size=2, stride=2)
        self.conv2 = Conv2D(8, 16, kernel_size=3, padding=1)
        self.pool2 = MaxPool2D(pool_size=2, stride=2)
        
    def relu(self, x):
        return np.maximum(0, x)
    
    def forward(self, x):
        # Conv1 -> ReLU -> Pool1
        x = self.conv1.forward(x)
        x = self.relu(x)
        x = self.pool1.forward(x)
        
        # Conv2 -> ReLU -> Pool2
        x = self.conv2.forward(x)
        x = self.relu(x)
        x = self.pool2.forward(x)
        
        return x

# Test with random image
np.random.seed(42)
input_image = np.random.randn(2, 1, 28, 28)  # 2 samples, 1 channel, 28x28 image
cnn = SimpleCNN()
output = cnn.forward(input_image)
print(f'Input shape: {input_image.shape}')
print(f'Output shape: {output.shape}')  # Should be (2, 16, 7, 7)

Using Libraries (torch, torch.nn, torchvision, tensorflow, keras)

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

# PyTorch CNN Implementation
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        
        # Layer 1: Conv -> BatchNorm -> ReLU -> Pool
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Layer 2: Conv -> BatchNorm -> ReLU -> Pool
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Layer 3: Conv -> BatchNorm -> ReLU -> Pool
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool3 = nn.AdaptiveAvgPool2d((1, 1))
        
        # Classifier
        self.fc1 = nn.Linear(128, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)
    
    def forward(self, x):
        # Layer 1
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.pool1(x)
        
        # Layer 2
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = self.pool2(x)
        
        # Layer 3
        x = self.conv3(x)
        x = self.bn3(x)
        x = F.relu(x)
        x = self.pool3(x)
        
        # Flatten and classify
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Training setup
model = ConvNet(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Load MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Training loop
model.train()  # Make training-mode behaviour of BatchNorm and Dropout explicit
for epoch in range(2):
    for i, (images, labels) in enumerate(trainloader):
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/2], Step [{i+1}/{len(trainloader)}], Loss: {loss.item():.4f}')

# TensorFlow/Keras Implementation
import tensorflow as tf
from tensorflow.keras import layers, models

def create_cnn_tf():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(28, 28, 1)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),
        
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    return model

model_tf = create_cnn_tf()
model_tf.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
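
As a rough check on the size of the PyTorch ConvNet above, its learnable parameters can be counted by hand. The helper functions below are hypothetical and use the standard counts: a convolution has in_channels x out_channels x k x k weights plus one bias per output channel, batch normalization has one scale and one shift per channel (its running statistics are buffers, not learnable parameters), and a linear layer has in x out weights plus out biases.

```python
def conv_params(in_c, out_c, k):
    """Weights plus biases of a k x k convolution."""
    return in_c * out_c * k * k + out_c

def bn_params(channels):
    """Learnable scale (gamma) and shift (beta) per channel."""
    return 2 * channels

def linear_params(in_f, out_f):
    """Weights plus biases of a fully-connected layer."""
    return in_f * out_f + out_f

layer_params = {
    "conv1": conv_params(1, 32, 3),
    "bn1": bn_params(32),
    "conv2": conv_params(32, 64, 3),
    "bn2": bn_params(64),
    "conv3": conv_params(64, 128, 3),
    "bn3": bn_params(128),
    "fc1": linear_params(128, 256),
    "fc2": linear_params(256, 10),
}
total = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:>5}: {count:>6}")
print(f"total: {total}")  # total: 128714
```

Because the global average pool reduces each of the 128 feature maps to a single value, the first fully-connected layer needs only 128 x 256 weights; flattening a 7x7x128 volume instead would multiply that layer's weight count by 49.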

When to Use

✅ Appropriate Use Cases:

Image classification, object detection, and segmentation tasks
Any grid-like data where spatial relationships matter (spectrograms, time-series)
When you need translation equivariance or approximate invariance (a pattern should be detected wherever it appears in the image)
Problems with local patterns that compose into global structures
When parameter efficiency is important (shared weights reduce parameters)
Video analysis where spatial and temporal convolutions combine
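
The translation property mentioned in the list above can be checked numerically. Strictly, a convolution is translation-equivariant (shifting the input shifts the feature map by the same amount), and invariance only appears once a pooling step discards position. The sketch below assumes a single-channel image, a fixed 3x3 kernel, and a pattern kept away from the borders so that edge effects do not interfere; `correlate_valid` is a hypothetical helper.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Sliding-window correlation with no padding and stride 1."""
    k_h, k_w = kernel.shape
    out = np.zeros((image.shape[0] - k_h + 1, image.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return out

# A small pattern on an otherwise empty 8x8 image
image = np.zeros((8, 8))
image[2:4, 2:4] = [[1, 2], [3, 4]]

# The same pattern moved 2 pixels down and 2 pixels right
shifted = np.roll(image, shift=(2, 2), axis=(0, 1))

kernel = np.arange(9, dtype=float).reshape(3, 3)
out = correlate_valid(image, kernel)
out_shifted = correlate_valid(shifted, kernel)

# Equivariance: the feature map moves by the same offset as the input
equivariant = np.allclose(out_shifted, np.roll(out, shift=(2, 2), axis=(0, 1)))
# Invariance after global max pooling: the strongest response is unchanged
invariant = out.max() == out_shifted.max()
print(equivariant, invariant)  # True True
```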

❌ Avoid When:

Non-grid data where spatial relationships don't exist (tabular data)
When permutation invariance is needed (order of elements doesn't matter)
Very small datasets where simpler models with regularization work better
When interpretability requires knowing which specific pixels matter (attention-based models or saliency methods may be a better fit)
Text processing where sequential structure is more important than spatial
Problems requiring global context before local processing

Common Pitfalls

Using kernels that are too large (stacked 3x3 kernels are usually sufficient)
Forgetting padding, which shrinks the spatial dimensions at every layer
Not using batch normalization, which can lead to training instability
Using a pool size that is too aggressive and loses critical spatial information
Using too many pooling layers, which reduces the spatial dimensions to 1x1 too early
Not accounting for receptive field when designing architecture
Using fully-connected layers with too many parameters after convolutions
Forgetting to normalize inputs (images should be zero-centered, unit variance)
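
The receptive-field pitfall and the claim about stacked 3x3 kernels can be made concrete with the standard recurrence: each layer grows the receptive field by (kernel - 1) times the product of the strides of all earlier layers. The function name `receptive_field` and the (kernel, stride) list format are assumptions made for this sketch.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 convolutions see the same 5x5 region as one 5x5 kernel
print(receptive_field([(3, 1), (3, 1)]))          # 5
# Three stacked 3x3 convolutions match a 7x7 kernel
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# Conv 3x3 -> Pool 2x2/2 -> Conv 3x3 -> Pool 2x2/2, as in SimpleCNN above
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))  # 10
```

For C input and output channels, two stacked 3x3 layers use 18C² weights against 25C² for a single 5x5 layer, which is one reason small stacked kernels are usually preferred.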