Convolutional Neural Networks: Architecture and Components
Definition
Convolutional Neural Networks (CNNs) are specialized deep learning architectures designed to automatically learn spatial hierarchies of features from grid-like data, particularly images. Unlike fully-connected networks, CNNs use convolution operations that preserve spatial relationships and share parameters through learnable filters (kernels). Key components include convolutional layers for feature extraction, pooling layers for spatial downsampling, and fully-connected layers for classification. Introduced by LeCun et al. in 1989 and popularized by AlexNet in 2012, CNNs became the dominant architecture for computer vision, surpassing reported human-level error on ImageNet classification and delivering strong results in object detection and segmentation.
Intuition
Imagine examining a photograph by sliding a magnifying glass across it, focusing on small patches at a time. Each position reveals different local features - edges, colors, textures. Now imagine having multiple magnifying glasses, each tuned to detect specific patterns: one finds vertical edges, another finds red colors, another finds circular shapes. This is how convolutions work. The 'magnifying glasses' are learnable filters that slide across the image computing dot products. Stacking many such layers creates a hierarchy: early layers detect simple lines and colors, middle layers combine these into shapes and textures, and deep layers recognize complete objects like eyes, wheels, or faces. Pooling layers act like summarization - instead of examining every pixel, we keep only the most important information from each region.
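The sliding "magnifying glass" can be made concrete with a few lines of NumPy. The sketch below is illustrative only: the toy image, the hand-written vertical-edge filter, and the helper name slide_filter are assumptions, whereas a real CNN learns its filter values during training.

```python
import numpy as np

# Hypothetical 5x6 grayscale image: dark left half (0), bright right half (1).
image = np.array([[0, 0, 0, 1, 1, 1]] * 5, dtype=float)

# One "magnifying glass": a 3x3 filter that responds to dark-to-bright vertical edges.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

def slide_filter(img, kernel):
    """Slide the kernel over every valid position and take the dot product with the patch under it."""
    k_h, k_w = kernel.shape
    out_h = img.shape[0] - k_h + 1
    out_w = img.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i:i + k_h, j:j + k_w]
            out[i, j] = np.sum(patch * kernel)
    return out

feature_map = slide_filter(image, vertical_edge)
print(feature_map)
# Each of the 3 rows is [0. 3. 3. 0.]: the response is high only where a patch straddles the edge.
```

The feature map is zero over the flat dark and flat bright regions and peaks at the two positions where the window covers the boundary, which is the sense in which a filter "detects" a pattern.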
Mathematical Formula
Step-by-Step Explanation:
- 2D Convolution: Element-wise multiplication of kernel K with image patch I, summed to produce single output value
- Output Dimensions: \(O = \lfloor (I + 2P - K) / S \rfloor + 1\) gives the output spatial size \(O\) for input size I, kernel size K, padding P, and stride S
- Max Pooling: Takes maximum value from each pooling region, preserving strongest activation
- Average Pooling: Computes mean of pooling region, providing smoother downsampling
- Batch Norm: Normalizes activations using batch statistics (μ, σ), then scales by \(\gamma\) and shifts by \(\beta\)
- Multiple Channels: Each filter produces one feature map; stack of filters produces volume of feature maps
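In standard notation, the operations listed above can be written as follows. This is a reconstruction from the descriptions rather than a quotation of the original formulas: the index names, the pooling region \(R_{i,j}\), the channel subscripts, the bias \(b_o\), and the small constant \(\epsilon\) are assumptions. As in the list, I and K denote the image and kernel in the first line and their spatial sizes in the second.

```latex
\begin{align}
(I * K)(i, j) &= \sum_{m} \sum_{n} I(i + m,\; j + n)\, K(m, n) \\
O &= \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1 \\
\mathrm{MaxPool}(i, j) &= \max_{(m, n) \in R_{i,j}} x_{m,n} \\
\mathrm{AvgPool}(i, j) &= \frac{1}{|R_{i,j}|} \sum_{(m, n) \in R_{i,j}} x_{m,n} \\
\hat{x} &= \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad y = \gamma\, \hat{x} + \beta \\
Y_{o} &= b_{o} + \sum_{c} X_{c} * K_{o,c}
\end{align}
```

The first line is written without flipping the kernel (strictly a cross-correlation), which is the convention deep learning libraries generally follow. The last line expresses the multi-channel case: each output feature map \(Y_o\) sums the per-channel responses of one filter.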
Real-World Use Cases
ResNet-50 classifying 1000 object categories in ImageNet with 76% top-1 accuracy
Detecting diabetic retinopathy from retinal scans with reported accuracy comparable to that of ophthalmologists
Tesla's FSD system using CNNs for lane detection and object recognition
FaceID on iPhones using depth-aware CNNs for secure authentication
Analyzing crop health and predicting yields from satellite photographs
Detecting defects in semiconductor wafers and automotive parts
Implementation
Manual Implementation (NumPy Only)
import numpy as np

class Conv2D:
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size if isinstance(kernel_size, tuple) else (kernel_size, kernel_size)
        self.stride = stride
        self.padding = padding
        # Xavier (Glorot) uniform initialization; fan-in and fan-out include the kernel area
        k_h, k_w = self.kernel_size
        fan_in = in_channels * k_h * k_w
        fan_out = out_channels * k_h * k_w
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        self.weights = np.random.uniform(-limit, limit, (out_channels, in_channels, *self.kernel_size))
        self.bias = np.zeros(out_channels)

    def forward(self, x):
        batch_size, in_c, in_h, in_w = x.shape
        k_h, k_w = self.kernel_size
        # Pad input (zeros on height and width only)
        if self.padding > 0:
            x_padded = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)), mode='constant')
        else:
            x_padded = x
        # Calculate output dimensions
        out_h = (in_h + 2 * self.padding - k_h) // self.stride + 1
        out_w = (in_w + 2 * self.padding - k_w) // self.stride + 1
        # Initialize output
        output = np.zeros((batch_size, self.out_channels, out_h, out_w))
        # Perform convolution (the kernel is not flipped, so this is a cross-correlation)
        for b in range(batch_size):
            for oc in range(self.out_channels):
                for ic in range(in_c):
                    for i in range(out_h):
                        for j in range(out_w):
                            i_start = i * self.stride
                            i_end = i_start + k_h
                            j_start = j * self.stride
                            j_end = j_start + k_w
                            patch = x_padded[b, ic, i_start:i_end, j_start:j_end]
                            output[b, oc, i, j] += np.sum(patch * self.weights[oc, ic])
                # Add the bias once per output channel, after summing over all input channels
                output[b, oc] += self.bias[oc]
        # Cache the padded input for a later backward pass
        self.input = x_padded
        return output
class MaxPool2D:
    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size if isinstance(pool_size, tuple) else (pool_size, pool_size)
        self.stride = stride

    def forward(self, x):
        batch_size, channels, in_h, in_w = x.shape
        pool_h, pool_w = self.pool_size
        out_h = (in_h - pool_h) // self.stride + 1
        out_w = (in_w - pool_w) // self.stride + 1
        output = np.zeros((batch_size, channels, out_h, out_w))
        self.mask = {}  # Store for backpropagation
        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        i_start = i * self.stride
                        i_end = i_start + pool_h
                        j_start = j * self.stride
                        j_end = j_start + pool_w
                        patch = x[b, c, i_start:i_end, j_start:j_end]
                        output[b, c, i, j] = np.max(patch)
                        # Store mask for backprop: input coordinates of the maximum
                        max_idx = np.unravel_index(np.argmax(patch), patch.shape)
                        self.mask[(b, c, i, j)] = (i_start + max_idx[0], j_start + max_idx[1])
        return output
class SimpleCNN:
    def __init__(self):
        self.conv1 = Conv2D(1, 8, kernel_size=3, padding=1)
        self.pool1 = MaxPool2D(pool_size=2, stride=2)
        self.conv2 = Conv2D(8, 16, kernel_size=3, padding=1)
        self.pool2 = MaxPool2D(pool_size=2, stride=2)

    def relu(self, x):
        return np.maximum(0, x)

    def forward(self, x):
        # Conv1 -> ReLU -> Pool1
        x = self.conv1.forward(x)
        x = self.relu(x)
        x = self.pool1.forward(x)
        # Conv2 -> ReLU -> Pool2
        x = self.conv2.forward(x)
        x = self.relu(x)
        x = self.pool2.forward(x)
        return x
# Test with random image
np.random.seed(42)
input_image = np.random.randn(2, 1, 28, 28) # 2 samples, 1 channel, 28x28 image
cnn = SimpleCNN()
output = cnn.forward(input_image)
print(f'Input shape: {input_image.shape}')
print(f'Output shape: {output.shape}') # Should be (2, 16, 7, 7)
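The nested Python loops in the manual implementation are easy to follow but slow. One common way to speed up the forward pass is to extract all patches at once and contract them with the weights. The sketch below is one possible approach, not part of the implementation above: the function names conv2d_loops and conv2d_vectorized are hypothetical, it assumes NumPy 1.20 or later for sliding_window_view, and it expects the caller to have already padded the input.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_loops(x, weights, bias, stride=1):
    """Reference version: explicit loops over batch, output channel and output position."""
    n, c_in, h, w = x.shape
    c_out, _, k_h, k_w = weights.shape
    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    out = np.zeros((n, c_out, out_h, out_w))
    for b in range(n):
        for oc in range(c_out):
            for i in range(out_h):
                for j in range(out_w):
                    patch = x[b, :, i * stride:i * stride + k_h, j * stride:j * stride + k_w]
                    out[b, oc, i, j] = np.sum(patch * weights[oc]) + bias[oc]
    return out

def conv2d_vectorized(x, weights, bias, stride=1):
    """Same computation with all patches extracted as a strided view."""
    k_h, k_w = weights.shape[2:]
    # windows has shape (N, C_in, H - k_h + 1, W - k_w + 1, k_h, k_w)
    windows = sliding_window_view(x, (k_h, k_w), axis=(2, 3))
    # Keep only the window positions that the stride would visit
    windows = windows[:, :, ::stride, ::stride]
    # Sum over input channels and kernel positions for every output channel
    out = np.einsum('nchwij,ocij->nohw', windows, weights)
    return out + bias[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 7, 7))
weights = rng.standard_normal((4, 3, 3, 3))
bias = rng.standard_normal(4)
print(np.allclose(conv2d_loops(x, weights, bias), conv2d_vectorized(x, weights, bias)))
```

Checking the fast version against the loop version on random inputs, as in the last line, is a cheap way to catch indexing mistakes before relying on it.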
Using Libraries (torch, torch.nn, torchvision, tensorflow, keras)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

# PyTorch CNN Implementation
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Layer 1: Conv -> BatchNorm -> ReLU -> Pool
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Layer 2: Conv -> BatchNorm -> ReLU -> Pool
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Layer 3: Conv -> BatchNorm -> ReLU -> Global average pool
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool3 = nn.AdaptiveAvgPool2d((1, 1))
        # Classifier
        self.fc1 = nn.Linear(128, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        # Layer 1
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.pool1(x)
        # Layer 2
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = self.pool2(x)
        # Layer 3
        x = self.conv3(x)
        x = self.bn3(x)
        x = F.relu(x)
        x = self.pool3(x)
        # Flatten and classify (returns raw logits; CrossEntropyLoss applies log-softmax)
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
# Training setup
model = ConvNet(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Load MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Training loop
for epoch in range(2):
    for i, (images, labels) in enumerate(trainloader):
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/2], Step [{i+1}/{len(trainloader)}], Loss: {loss.item():.4f}')
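The training loop above reports only the training loss. Because the network contains BatchNorm and Dropout layers, which behave differently at inference time, measuring accuracy calls for switching the model to evaluation mode. The following is a minimal sketch under stated assumptions: the evaluate helper, the small stand-in model, and the random fake_loader batches are hypothetical placeholders so the block runs on its own. In practice the trained ConvNet and a DataLoader over a held-out MNIST split would take their place.

```python
import torch
import torch.nn as nn

# Stand-in model with BatchNorm and Dropout (hypothetical; substitute the trained ConvNet)
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(8, 10),
)

def evaluate(model, loader):
    """Return classification accuracy over an iterable of (images, labels) batches."""
    model.eval()  # BatchNorm uses running statistics; Dropout is disabled
    correct, total = 0, 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for images, labels in loader:
            predictions = model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    model.train()  # restore training mode in case training continues
    return correct / total

# Random stand-in batches shaped like MNIST (assumption for illustration only)
fake_loader = [(torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,))) for _ in range(3)]
accuracy = evaluate(model, fake_loader)
print(f'Accuracy: {accuracy:.3f}')
```

With random weights and random labels the printed accuracy is meaningless; the point is the eval/no_grad pattern around the forward passes.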
# TensorFlow/Keras Implementation
import tensorflow as tf
from tensorflow.keras import layers, models

def create_cnn_tf():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(28, 28, 1)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    return model

model_tf = create_cnn_tf()
model_tf.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
When to Use
✅ Appropriate Use Cases:
- Image classification, object detection, and segmentation tasks
- Any grid-like data where spatial relationships matter (spectrograms, time-series)
- When you need translation equivariance, or approximate translation invariance after pooling (object location shouldn't matter)
- Problems with local patterns that compose into global structures
- When parameter efficiency is important (shared weights reduce parameters)
- Video analysis where spatial and temporal convolutions combine
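The parameter-efficiency point can be checked with simple arithmetic. The sketch below compares a 3x3 convolution with a fully-connected layer that maps the same 28x28 single-channel input to the same 32x28x28 output volume. The helper names conv2d_params and dense_params are hypothetical, and the layer sizes are chosen to mirror the first layer of the examples above.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Shared filter weights plus one bias per output channel; independent of image size."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

def dense_params(in_features, out_features):
    """One weight per input-output pair plus one bias per output."""
    return in_features * out_features + out_features

height, width = 28, 28
conv = conv2d_params(in_channels=1, out_channels=32, kernel_size=3)
dense = dense_params(1 * height * width, 32 * height * width)

print(conv)   # 320
print(dense)  # 19694080
```

The convolution needs 320 parameters no matter how large the image is, while the equivalent dense layer needs roughly 19.7 million, because every filter is reused at every spatial position.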
❌ Avoid When:
- Non-grid data where spatial relationships don't exist (tabular data)
- When permutation invariance is needed (order of elements doesn't matter)
- Very small datasets where simpler models with regularization work better
- When interpretability requires knowing which specific pixels matter (use attention)
- Text processing where sequential structure is more important than spatial
- Problems requiring global context before local processing
Common Pitfalls
- Using too large kernels (3x3 is usually sufficient when stacked)
- Forgetting padding, which shrinks the spatial dimensions at every layer
- Omitting batch normalization, which can lead to training instability
- Pooling too aggressively and losing critical spatial information
- Too many pooling layers reducing spatial dimensions to 1x1 too early
- Not accounting for receptive field when designing architecture
- Using fully-connected layers with too many parameters after convolutions
- Forgetting to normalize inputs (images should be zero-centered, unit variance)
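Several of these pitfalls (missing padding, over-pooling, ignoring the receptive field) can be caught before training by tracing sizes through the planned layer stack. The sketch below is a minimal calculator for that purpose, assuming square inputs and kernels. The helper name trace_layers and the (name, kernel, stride, padding) tuple format are hypothetical, and the receptive field uses the standard recurrence r_out = r_in + (k - 1) * jump, where jump is the product of the preceding strides.

```python
def trace_layers(input_size, layers):
    """Track output size and receptive field through a stack of conv/pool layers.

    Each layer is a tuple (name, kernel, stride, padding).
    """
    size, jump, rf = input_size, 1, 1
    rows = []
    for name, k, s, p in layers:
        size = (size + 2 * p - k) // s + 1   # output spatial size
        rf = rf + (k - 1) * jump             # receptive field in input pixels
        jump = jump * s                      # spacing between adjacent outputs, in input pixels
        rows.append((name, size, rf))
    return rows

# The Conv -> Pool -> Conv -> Pool stack used in the manual implementation above
layers = [
    ('conv1', 3, 1, 1),
    ('pool1', 2, 2, 0),
    ('conv2', 3, 1, 1),
    ('pool2', 2, 2, 0),
]
for name, size, rf in trace_layers(28, layers):
    print(f'{name}: output {size}x{size}, receptive field {rf}x{rf}')
# conv1: output 28x28, receptive field 3x3
# pool1: output 14x14, receptive field 4x4
# conv2: output 14x14, receptive field 8x8
# pool2: output 7x7, receptive field 10x10
```

After two conv/pool stages each output unit sees only a 10x10 region of the 28x28 input, which is the kind of mismatch this trace is meant to expose. Setting the padding of conv1 to 0 in the list shows the output shrinking to 26x26 at the first layer.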