ML Learning Hub
Deep Learning (intermediate)

CNN Architectures: Classic → ResNet → ViT

Local pattern detectors that see edges, then textures, then faces — by stacking filters

Convolutional networks from scratch: convolution op, pooling, receptive field, then classic (LeNet/VGG), Inception, ResNet (skip connections), and Vision Transformer (ViT, patch embeddings).

55 min
14 diagrams
7 Concepts Covered

Prerequisites

Neural Networks
Deep Learning Optimization

Concepts Covered

Convolution, Pooling, Skip Connections, Inception, ResNet, ViT, Patch Embeddings

Key Formulas

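Convolution

$(I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$

Slide a kernel K over input I, computing a dot product at each position

Output Size

$H_{\text{out}} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1$

H=height, k=kernel size, p=padding, s=stride

ResNet Skip

$y = F(x) + x$

Residual connection: add input directly to output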

🎯

Why Spatial Structure Matters

motivation

Flattening a 224×224 image into a vector discards spatial structure: to a dense layer, pixel (0,0) is no more related to its neighbor (0,1) than to any other pixel. CNNs instead exploit translation equivariance: a filter that detects a horizontal edge responds the same way whether the edge is at the top or the bottom of the image, and the response simply moves with the edge. Pooling then adds a degree of translation invariance. This weight sharing drastically reduces the parameter count and gives CNNs their inductive bias for vision.
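
Both claims can be checked with a small sketch. It assumes a 1D signal and a hypothetical two-tap edge kernel, chosen only for illustration; the layer sizes in the weight comparison (3 input channels, 64 output channels, 3×3 kernel) are likewise assumptions, not values from any particular model.

```python
def correlate1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[m] * signal[i + m] for m in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [-1, 1]                      # hypothetical edge detector
signal = [0, 0, 1, 1, 0, 0, 0, 0]     # a bump near the left
shifted = [0, 0] + signal[:-2]        # the same bump moved 2 steps right

response = correlate1d(signal, kernel)
response_shifted = correlate1d(shifted, kernel)

# Equivariance: the response to the shifted input is the shifted response.
print(response)
print(response_shifted)

# Weight sharing: one small kernel per channel pair, reused at every position,
# versus one weight per (input pixel, output unit) pair in a dense layer.
h, w, c_in, c_out, k = 224, 224, 3, 64, 3
conv_weights = c_in * c_out * k * k
dense_weights = (h * w * c_in) * (h * w * c_out)
print(conv_weights, dense_weights)
```

The same kernel weights produce the same edge response at the new location, which is why a single learned filter can cover the whole image.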

💡

Feature Hierarchy: From Edges to Objects

intuition

Layer 1 detectors: oriented edges and color blobs (Gabor-like filters). Layer 2: textures built from edge combinations. Layer 3: object parts (wheels, eyes, windows). Final layers: complete objects. This hierarchy was visualized by Zeiler & Fergus (2013) using DeconvNets — you can literally see what each layer 'sees'.

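With stride 1 and no dilation, a stack of L k×k convolutions gives each output neuron a receptive field of L·(k-1)+1 pixels per side. For example, three 3×3 layers give a 7×7 effective receptive field, the same as one 7×7 layer, but with fewer parameters (27C² versus 49C² weights for C input and output channels) and more non-linearities.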
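
The receptive-field arithmetic can be sketched in a few lines. This is a minimal illustration that assumes square kernels and no dilation; the helper names `receptive_field` and `conv_weights` and the channel count of 64 are hypothetical choices for the example.

```python
def receptive_field(layers):
    """Receptive field (pixels per side) of a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf, jump = 1, 1          # jump = spacing of this layer's units in input pixels
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

def conv_weights(k, channels):
    """Weights in one k×k conv with `channels` inputs and outputs (no bias)."""
    return k * k * channels * channels

three_3x3 = receptive_field([(3, 1)] * 3)
one_7x7 = receptive_field([(7, 1)])

C = 64                                   # assumed channel count
stack_params = 3 * conv_weights(3, C)    # 27 * C^2
single_params = conv_weights(7, C)       # 49 * C^2

print(three_3x3, one_7x7)
print(stack_params, single_params)
```

With stride 1 everywhere, the loop reduces to the L·(k-1)+1 rule; the `jump` term only matters once strided layers or pooling enter the stack.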

The Convolution Operation

math

A 2D convolution slides a K×K kernel across the input, computing a dot product at every position. With C_in input channels and C_out output channels, the layer has C_in × C_out × K² weights (plus C_out biases), vastly fewer than a fully connected layer between feature maps of the same spatial size, which would need (H·W·C_in) × (H·W·C_out) weights.

Convolution output and parameter count
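
A from-scratch sketch of the operation, assuming a single-channel input, a square kernel, and zero padding. As in most deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation. The function names and the 4×4 test image are hypothetical, written only to make the output-size and parameter-count rules concrete.

```python
def conv_output_size(h, k, p=0, s=1):
    """Output height (or width) for input size h, kernel k, padding p, stride s."""
    return (h + 2 * p - k) // s + 1

def conv_param_count(c_in, c_out, k, bias=True):
    """Learnable parameters in a k×k conv layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def conv2d_single(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D convolution over nested lists."""
    h, w, k = len(image), len(image[0]), len(kernel)
    # Zero-pad the input on all four sides.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = conv_output_size(h, k, padding, stride)
    out_w = conv_output_size(w, k, padding, stride)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the window it currently covers.
            acc = 0.0
            for m in range(k):
                for n in range(k):
                    acc += padded[i * stride + m][j * stride + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out

# A 4×4 image with a vertical edge, and a vertical-edge kernel.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[-1, 0, 1] for _ in range(3)]
edges = conv2d_single(image, kernel)

print(edges)
print(conv_output_size(224, 3, p=1, s=1))
print(conv_param_count(3, 64, 3))
```

The four nested loops are deliberately unoptimized; real implementations vectorize the same computation.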
🔬

ResNet: Solving the Degradation Problem

deepdive

In principle, adding layers to a plain CNN should never hurt: the extra layers could simply learn the identity mapping. Yet in practice, very deep plain networks reach higher training error than shallower ones. He et al. (2015) showed that this degradation comes from optimization difficulty, not overfitting. Skip connections let a block learn a residual F(x) = H(x) - x instead of the target mapping H(x) directly, so the block outputs F(x) + x. If the identity is the optimal mapping, the network only has to push F(x) → 0. This makes training networks with 100+ layers tractable.

Residual block: output = learned residual + identity shortcut
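
A toy sketch of the idea, assuming a fully connected residual branch on plain Python lists instead of the convolutional branch used in ResNet, and omitting batch normalization and the ReLU that ResNet applies after the addition. The weights are made up for illustration.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(weights, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Return F(x) + x, where F(x) = w2 · relu(w1 · x)."""
    residual = matvec(w2, relu(matvec(w1, x)))     # learned residual F(x)
    return [r + a for r, a in zip(residual, x)]    # identity shortcut adds x back

x = [1.0, -2.0, 3.0]
w1 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
zeros = [[0.0] * 3 for _ in range(3)]

# With the last layer's weights at zero, F(x) = 0 and the block is the identity.
identity_out = residual_block(x, w1, zeros)

# With non-zero weights, the block outputs x plus a learned correction.
learned_out = residual_block(x, w1, w1)

print(identity_out)
print(learned_out)
```

Reaching the identity only requires driving the branch weights toward zero, which is a much easier optimization target than making a stack of non-linear layers reproduce their input exactly.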
⚙️

Modern Training Recipe (ResNet / EfficientNet)

algorithm

1. Augmentation: RandomHorizontalFlip, RandomCrop, ColorJitter, MixUp/CutMix
2. Architecture: start from pretrained weights (ImageNet); this usually beats random initialization, especially on small datasets
3. Unfreeze schedule: freeze the backbone, train the head for 5 epochs, then unfreeze all layers
4. Learning rate: layer-wise LR decay (blocks closer to the input get smaller LRs, e.g. × 0.1 per block)
5. Regularization: Dropout before the final FC layer, weight decay 1e-4, label smoothing 0.1
6. Optimizer: AdamW + CosineAnnealing with warmup
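
Two of these steps are easy to misread, so here is a hedged sketch of the arithmetic behind them: a linear-warmup plus cosine-annealing schedule, and layer-wise LR decay. The function names, the linear warmup shape, and the step counts are assumptions for illustration; in practice a framework scheduler would normally handle this.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine annealing."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def layerwise_lrs(head_lr, num_blocks, decay=0.1):
    """LR per backbone block, index 0 = closest to the input.

    Each block further from the head gets its LR multiplied by `decay` again.
    """
    return [head_lr * decay ** (num_blocks - i) for i in range(num_blocks)]

schedule = [lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(101)]
block_lrs = layerwise_lrs(head_lr=1e-3, num_blocks=3)

print(schedule[0], schedule[10], schedule[100])
print(block_lrs)
```

A decay of 0.1 per block shrinks the LR very quickly, so with many blocks the earliest layers barely move; milder decay factors are a common alternative when the backbone needs more adaptation.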

</>

Fine-tuning EfficientNet for Custom Classification

code
import timm
import torch.nn as nn
import torch.optim as optim

# ── Config ─────────────────────────────────────────────────────────────
num_classes = 10   # e.g. 10-class image dataset

# Load pretrained EfficientNet-B4 as a feature extractor
model = timm.create_model(
    'efficientnet_b4',
    pretrained=True,
    num_classes=0,         # Remove classifier head
    global_pool=''         # Return the unpooled (B, C, H, W) feature map; the head below pools it
)

# Freeze backbone initially
for param in model.parameters():
    param.requires_grad = False
model.eval()               # Keep BatchNorm running statistics fixed while frozen

# Custom head
classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.BatchNorm1d(model.num_features),
    nn.Dropout(0.4),
    nn.Linear(model.num_features, num_classes)
)

# Stage 1: train head only (high LR)
optimizer = optim.AdamW(classifier.parameters(), lr=1e-3)
# ... train for 5 epochs; the forward pass is classifier(model(images))

# Stage 2: unfreeze + fine-tune all (low LR)
for param in model.parameters():
    param.requires_grad = True
model.train()              # Let the backbone's BatchNorm and dropout layers train again
optimizer = optim.AdamW([
    {'params': model.parameters(), 'lr': 1e-5},
    {'params': classifier.parameters(), 'lr': 1e-4}
])
