CNN Architectures: Classic → ResNet → ViT
“Local pattern detectors that see edges, then textures, then faces — by stacking filters”
Convolutional networks from scratch: convolution op, pooling, receptive field, then classic (LeNet/VGG), Inception, ResNet (skip connections), and Vision Transformer (ViT, patch embeddings).
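Key Formulas
Convolution
(I * K)(i, j) = Σ_m Σ_n I(i+m, j+n) · K(m, n)
Slide a kernel K over input I, computing a dot product at each position (i, j). Deep learning libraries do not flip the kernel, so strictly speaking this is a cross-correlation.
Output Size
H_out = ⌊(H + 2p − k) / s⌋ + 1
H = input height, k = kernel size, p = padding, s = stride; the same rule applies to width.
ResNet Skip
y = F(x) + x
Residual connection: add the block's input x directly to its output F(x).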
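The output-size rule is easy to check numerically. The sketch below assumes the standard form floor((H + 2p - k) / s) + 1 for one spatial dimension; the helper name `conv_output_size` is illustrative, not from any library.

```python
def conv_output_size(h: int, k: int, p: int = 0, s: int = 1) -> int:
    """Spatial size along one dimension after a convolution or pooling layer."""
    if h + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (h + 2 * p - k) // s + 1

# 224-pixel input, 7x7 kernel, padding 3, stride 2 (the ResNet stem convolution)
print(conv_output_size(224, k=7, p=3, s=2))  # 112
# A 3x3 kernel with padding 1 and stride 1 preserves the size
print(conv_output_size(32, k=3, p=1, s=1))   # 32
```

Integer floor division implements the floor in the formula, so input sizes that the stride does not divide evenly simply drop the last partial window.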
Why Spatial Structure Matters
Flattening a 224×224 image into a vector discards spatial structure: in a dense network, pixel (0,0) has no special relationship to its neighbor (0,1). CNNs build in translation equivariance: the filter that detects a horizontal edge responds the same way whether the edge is at the top or the bottom of the image, and the response simply moves with it. Pooling then adds a degree of translation invariance. This weight sharing drastically reduces the parameter count and gives CNNs their inductive bias for vision.
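A small sketch of what weight sharing buys, using a 1D signal for brevity (the 2D case works the same way). The helper `correlate1d` and the toy data are illustrative assumptions: the same two-tap edge filter is applied to a pattern and to a copy shifted two pixels right, and the response shifts by the same amount.

```python
def correlate1d(signal, kernel):
    """Valid (no padding), stride-1 sliding dot product of kernel over signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge_filter = [-1, 1]                     # +1 on a step up, -1 on a step down
row         = [0, 0, 1, 1, 1, 0, 0, 0]
shifted_row = [0, 0, 0, 0, 1, 1, 1, 0]    # same pattern, moved 2 pixels right

out = correlate1d(row, edge_filter)
out_shifted = correlate1d(shifted_row, edge_filter)

print(out)          # [0, 1, 0, 0, -1, 0, 0]
print(out_shifted)  # [0, 0, 0, 1, 0, 0, -1]
```

The filter has two weights regardless of the signal length, whereas a dense layer would need a separate weight for every input-output pair and would have to relearn the edge detector at each position.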
Feature Hierarchy: From Edges to Objects
Layer 1 detectors: oriented edges and color blobs (Gabor-like filters). Layer 2: textures built from edge combinations. Layer 3: object parts (wheels, eyes, windows). Final layers: complete objects. This hierarchy was visualized by Zeiler & Fergus (2013) using DeconvNets — you can literally see what each layer 'sees'.
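With stride 1 and no dilation, stacking L convolution layers with k×k kernels gives each output neuron a receptive field of (k-1)·L+1 pixels per side. For example, three 3×3 layers give a 7×7 effective receptive field, the same as one 7×7 layer, but with fewer parameters (27 versus 49 weights per channel pair) and more non-linearities.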
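The receptive-field arithmetic can be sketched as a short loop. This assumes undilated square kernels and uses the usual recurrence in which each layer adds (k - 1) times the product of all earlier strides; the function name is illustrative.

```python
def receptive_field(layers):
    """Receptive field (pixels per side) of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples in input-to-output order.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 3))  # 7
print(receptive_field([(7, 1)]))      # 7

# Weights per input/output channel pair, biases ignored
three_3x3 = 3 * (3 * 3)   # 27
one_7x7 = 7 * 7           # 49
```

With stride 1 everywhere the loop reduces to (k-1)·L+1; the `jump` term only matters once strided or pooling layers are added.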
The Convolution Operation
A 2D convolution slides a K×K kernel across the input, computing a dot product at every position. With C_in input channels and C_out output channels, the layer has C_in × C_out × K² weights (plus C_out biases), independent of the image size. A fully connected layer mapping the same input to the same output would need (H·W·C_in) × (H·W·C_out) weights.
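A deliberately naive, loop-based sketch of the operation and the two parameter counts. It assumes stride 1, no padding, no bias, and no kernel flip (cross-correlation, as deep learning libraries compute it); all function names and the toy data are illustrative.

```python
def conv2d(image, kernels):
    """Valid, stride-1 multi-channel cross-correlation on nested lists.

    image:   [C_in][H][W]
    kernels: [C_out][C_in][K][K]
    returns: [C_out][H-K+1][W-K+1]
    """
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    k = len(kernels[0][0])
    out = []
    for kernel in kernels:                      # one feature map per output channel
        fmap = []
        for i in range(h - k + 1):
            row = []
            for j in range(w - k + 1):
                acc = 0
                for c in range(c_in):           # sum over input channels
                    for m in range(k):
                        for n in range(k):
                            acc += image[c][i + m][j + n] * kernel[c][m][n]
                row.append(acc)
            fmap.append(row)
        out.append(fmap)
    return out

def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k

def dense_params(h, w, c_in, c_out):
    return (h * w * c_in) * (h * w * c_out)

image = [[[1, 2, 3, 0],
          [4, 5, 6, 0],
          [7, 8, 9, 0]]]                        # C_in=1, H=3, W=4
vertical_edge = [[[[1, -1],
                   [1, -1]]]]                   # C_out=1, C_in=1, K=2

print(conv2d(image, vertical_edge))             # [[[-2, -2, 9], [-2, -2, 15]]]
print(conv_params(3, 64, 3))                    # 1728
print(dense_params(224, 224, 3, 64))            # roughly 4.8e11
```

The large responses in the last column mark the vertical edge where the image drops to zero. Real implementations vectorize these loops, but the arithmetic is the same.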
ResNet: Solving the Degradation Problem
In principle, adding layers to a plain CNN should never hurt: the extra layers could simply learn the identity mapping. Yet in practice, very deep plain networks reached higher training error than shallower ones. He et al. (2015) attributed this degradation to optimization difficulty, not overfitting. Skip connections let each block learn a residual F(x) = H(x) - x instead of the target mapping H(x) directly, with the block outputting F(x) + x. If the identity is the optimal mapping, the network just pushes F(x) → 0. This makes training networks with 100+ layers tractable.
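A minimal sketch of the residual idea on plain vectors. It is a simplification, not the block from the paper: the convolutions are replaced by small square weight matrices, and batch normalization and the ReLU after the addition are omitted. All names are illustrative. With the weights of F set to zero, the block reduces exactly to the identity.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weight):
    """Matrix-vector product; weight is a square matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weight]

def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = W2 · relu(W1 · x)."""
    fx = linear(relu(linear(x, w1)), w2)
    return [f + xi for f, xi in zip(fx, x)]

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(residual_block(x, zeros, zeros))        # [1.0, -2.0, 3.0]  (F(x) = 0, so y = x)
print(residual_block(x, identity, identity))  # [2.0, -2.0, 6.0]  (relu(x) + x)
```

A plain (non-residual) stack with the same zero weights would output all zeros, which is why representing the identity is much easier for the residual form.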
Modern Training Recipe (ResNet / EfficientNet)
Augmentation: RandomHorizontalFlip, RandomCrop, ColorJitter, MixUp/CutMix
Architecture: start from pretrained weights (ImageNet), which usually beats random init, especially on small datasets
Unfreeze schedule: freeze backbone, train head for 5 epochs, then unfreeze all
Learning rate: layer-wise LR decay (blocks closer to the input get a smaller LR, e.g. × 0.1 per block moving away from the head)
Regularization: Dropout before final FC, weight decay 1e-4, label smoothing 0.1
Optimizer: AdamW + CosineAnnealing with warmup
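Two items in the recipe above are pure arithmetic and can be sketched without a framework: the warmup plus cosine schedule and the per-block learning rates. The function names, the linear warmup shape, and the reading of layer-wise decay as "smaller LR for blocks nearer the input" are assumptions for illustration.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def layerwise_lrs(head_lr, num_blocks, decay):
    """LR for each backbone block; index 0 is the block nearest the input."""
    return [head_lr * decay ** (num_blocks - i) for i in range(num_blocks)]

# 100 steps in total, the first 10 of them warmup
for step in (0, 9, 10, 55, 100):
    print(step, lr_at_step(step, total_steps=100, warmup_steps=10, base_lr=1e-3))

# Three backbone blocks below a head trained at 1e-3, decay 0.1 per block
print(layerwise_lrs(head_lr=1e-3, num_blocks=3, decay=0.1))
```

In a PyTorch training loop, the per-block values would typically be passed as separate optimizer parameter groups, and the schedule applied as a multiplier on each group's learning rate.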
Fine-tuning EfficientNet for Custom Classification
import timm
import torch.nn as nn
import torch.optim as optim

# Config
num_classes = 10  # e.g. 10-class image dataset

# Load pretrained EfficientNet-B4 as a feature extractor
model = timm.create_model(
    'efficientnet_b4',
    pretrained=True,
    num_classes=0,   # remove classifier head
    global_pool='',  # keep the (B, C, H, W) feature map so the custom head can pool it
)

# Freeze backbone initially
for param in model.parameters():
    param.requires_grad = False

# Custom head
classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.BatchNorm1d(model.num_features),
    nn.Dropout(0.4),
    nn.Linear(model.num_features, num_classes),
)
# Forward pass: logits = classifier(model(images))

# Stage 1: train head only (high LR)
optimizer = optim.AdamW(classifier.parameters(), lr=1e-3)
# ... train for 5 epochs

# Stage 2: unfreeze + fine-tune all (low LR)
for param in model.parameters():
    param.requires_grad = True
optimizer = optim.AdamW([
    {'params': model.parameters(), 'lr': 1e-5},
    {'params': classifier.parameters(), 'lr': 1e-4},
])