Deep Learning · intermediate

CNN Architectures: Classic → ResNet → ViT

“Local pattern detectors that see edges, then textures, then faces — by stacking filters”

Convolutional networks from scratch: convolution op, pooling, receptive field, then classic (LeNet/VGG), Inception, ResNet (skip connections), and Vision Transformer (ViT, patch embeddings).

55 min

14 diagrams

7 Concepts Covered

Prerequisites

→ Neural Networks

→ Deep Learning Optimization

Concepts Covered

Convolution, Pooling, Skip Connections, Inception, ResNet, ViT, Patch Embeddings

∑Key Formulas

Convolution

Slide a kernel K over input I, computing dot products
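
In symbols, the sliding dot product can be written as follows. This is a sketch in standard notation, assuming a single channel, a k×k kernel, stride 1 and no padding; the indices i, j, m, n are introduced here for illustration. Deep learning frameworks compute this cross-correlation form and call it convolution.

```latex
(I * K)(i, j) \;=\; \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i + m,\; j + n)\, K(m, n)
```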

Output Size

H=height, k=kernel size, p=padding, s=stride
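
With these symbols, the usual output-size rule is H_out = floor((H + 2p - k) / s) + 1. The helper below is a minimal sketch of that rule, assuming no dilation; the function name `conv_output_size` is made up for illustration.

```python
def conv_output_size(h: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output size along one spatial axis for a conv or pooling layer.

    h: input size, k: kernel size, p: padding per side, s: stride.
    Assumes dilation = 1.
    """
    if k < 1 or s < 1 or p < 0:
        raise ValueError("k and s must be >= 1, p must be >= 0")
    return (h + 2 * p - k) // s + 1


if __name__ == "__main__":
    # 3x3 conv with padding 1 and stride 1 keeps the spatial size.
    print(conv_output_size(224, k=3, p=1, s=1))
    # 7x7 conv with padding 3 and stride 2 halves it.
    print(conv_output_size(224, k=7, p=3, s=2))
```

The same function applies to the width axis by passing W instead of H.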

ResNet Skip

Residual connection: add input directly to output
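
Written out, with x the block input, F the stacked layers inside the block and y the block output (y is a symbol assumed here for illustration):

```latex
y \;=\; F(x) + x
```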

🎯

Why Spatial Structure Matters

motivation

Flattening a 224×224 image into a vector discards spatial structure: in a dense network, pixel (0,0) has no special relationship to its neighbor (0,1). CNNs instead share one set of filter weights across all positions, which makes convolution translation-equivariant: the filter that detects a horizontal edge responds the same way whether the edge is at the top or the bottom of the image (pooling then adds a degree of translation invariance). This weight sharing drastically reduces the parameter count and gives CNNs their inductive bias for vision.

💡

Feature Hierarchy: From Edges to Objects

intuition

Layer 1 detects oriented edges and color blobs (Gabor-like filters). Layer 2 detects textures built from edge combinations. Layer 3 detects object parts (wheels, eyes, windows). The final layers respond to complete objects. Zeiler & Fergus (2013) visualized this hierarchy using DeconvNets, so you can literally see what each layer 'sees'.

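With stride 1 and no dilation, a stack of L convolutional layers with k×k kernels gives each output neuron a receptive field of L·(k-1)+1 pixels per side. For example, three 3×3 layers give a 7×7 effective receptive field, the same as a single 7×7 layer, but with fewer parameters (27·C² versus 49·C² weights for C input and output channels) and more non-linearities.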
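
The receptive-field arithmetic can be checked with a few lines of Python. This is a minimal sketch assuming no dilation; the helper names `receptive_field` and `stack_weights` are hypothetical, and the stride handling uses the standard recurrence (each layer adds (k-1) times the product of the strides before it).

```python
def receptive_field(layers):
    """Receptive field (pixels per side) of a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf, jump = 1, 1          # jump = input-pixel distance between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf


def stack_weights(kernel_sizes, channels):
    """Weight count (no biases) for a stack of conv layers with C in/out channels."""
    return sum(k * k * channels * channels for k in kernel_sizes)


if __name__ == "__main__":
    print(receptive_field([(3, 1)] * 3))   # three 3x3 layers, stride 1
    print(receptive_field([(7, 1)]))       # one 7x7 layer
    print(stack_weights([3, 3, 3], 64), stack_weights([7], 64))
```

With 64 channels, the three 3×3 layers use 27·64² weights against 49·64² for the single 7×7 layer, while covering the same 7×7 window.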

∑

The Convolution Operation

math

A 2D convolution slides a K×K kernel across the input, computing a dot product at every position. With C_in input channels and C_out output channels, the layer has C_in × C_out × K² weights (plus C_out biases). That is vastly fewer than a fully connected layer mapping the same input to an output of the same spatial size, which would need (H·W·C_in) × (H·W·C_out) weights.
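
To make the sliding dot product and the parameter counts concrete, here is a minimal pure-Python sketch. It assumes a single channel, stride 1 and no padding, and computes the cross-correlation form that deep learning frameworks call convolution; the function names are illustrative, not from any library.

```python
def conv2d_valid(image, kernel):
    """Naive single-channel 2D convolution (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Dot product between the kernel and the patch at (i, j).
            for m in range(kh):
                for n in range(kw):
                    acc += image[i + m][j + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out


def conv_params(c_in, c_out, k, bias=True):
    """Parameter count of a k x k conv layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def dense_params(h, w, c_in, c_out, bias=True):
    """Parameter count of a dense layer mapping H*W*C_in to H*W*C_out."""
    n_in, n_out = h * w * c_in, h * w * c_out
    return n_in * n_out + (n_out if bias else 0)


if __name__ == "__main__":
    # A vertical edge between columns 1 and 2, and a horizontal-gradient kernel.
    image = [[0, 0, 1, 1]] * 4
    kernel = [[-1, 1], [-1, 1]]
    for row in conv2d_valid(image, kernel):
        print(row)
    print(conv_params(3, 64, 3), dense_params(224, 224, 3, 64, bias=False))
```

The same kernel fires at the edge in every output row, which is the weight sharing described above.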

🔬

ResNet: Solving the Degradation Problem

deepdive

In principle, adding layers to a plain CNN should never hurt: the extra layers could learn the identity mapping and reproduce the shallower network. Yet in practice, very deep plain networks reached higher training error than shallower ones. He et al. (2015) identified the culprit as optimization difficulty, not overfitting. Skip connections let the network learn the residual F(x) = H(x) - x instead of H(x) directly, so the block outputs F(x) + x. If the identity is the optimal mapping, the network just pushes F(x) → 0. This makes training networks with 100+ layers tractable.
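
One common way to see why the residual form is easier to optimize is to differentiate through a block. The derivation below is a sketch, assuming an identity shortcut and a scalar loss; the symbols y (block output) and \mathcal{L} (loss) are introduced here for illustration.

```latex
\begin{aligned}
y &= H(x) = F(x) + x \\
\frac{\partial y}{\partial x} &= \frac{\partial F}{\partial x} + I \\
\frac{\partial \mathcal{L}}{\partial x}
  &= \frac{\partial \mathcal{L}}{\partial y}
     \left( \frac{\partial F}{\partial x} + I \right)
\end{aligned}
```

The identity term I passes the upstream gradient to earlier layers unchanged even when the gradient of F is small, and setting F(x) = 0 recovers the identity mapping exactly.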

⚙️

Modern Training Recipe (ResNet / EfficientNet)

algorithm

Augmentation: RandomHorizontalFlip, RandomCrop, ColorJitter, MixUp/CutMix

Architecture: start from pretrained weights (ImageNet), which usually beat random init, especially on small datasets

Unfreeze schedule: freeze backbone, train head for 5 epochs, then unfreeze all

Learning rate: layer-wise LR decay (blocks closer to the input get a smaller LR, e.g. × 0.1 per block moving away from the head)

Regularization: Dropout before final FC, weight decay 1e-4, label smoothing 0.1

Optimizer: AdamW + CosineAnnealing with warmup
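
Two items from the recipe above, the warmup plus cosine schedule and layer-wise LR decay, can be sketched in plain Python. This is an illustration under assumptions, not a library API: the function names, the linear warmup shape, and the convention that block 0 is the one closest to the input are all choices made here.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / span, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def layerwise_lrs(head_lr, num_blocks, decay=0.1):
    """LR per backbone block; index 0 is the block closest to the input.

    The block next to the head gets head_lr * decay, and each block
    further from the head is scaled by decay once more.
    """
    return [head_lr * decay ** (num_blocks - i) for i in range(num_blocks)]


if __name__ == "__main__":
    for step in (0, 9, 10, 55, 100):
        print(step, lr_at_step(step, 100, 1e-3, warmup_steps=10))
    print(layerwise_lrs(1e-3, num_blocks=3))
```

In a PyTorch setup, the values from `layerwise_lrs` would typically be assigned as per-group `lr` entries in the optimizer's parameter groups.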

</>

Fine-tuning EfficientNet for Custom Classification

code

python

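import timm
import torch.nn as nn
import torch.optim as optim

# ── Config ─────────────────────────────────────────────────────────────
num_classes = 10   # e.g. 10-class image dataset

# Load pretrained EfficientNet-B4 as a feature extractor.
# With num_classes=0, timm removes the classifier but keeps global pooling,
# so model(x) returns pooled features of shape (batch, model.num_features).
model = timm.create_model(
    'efficientnet_b4',
    pretrained=True,
    num_classes=0         # Remove classifier head
)

# Freeze backbone initially
for param in model.parameters():
    param.requires_grad = False
# Note: requires_grad=False does not stop BatchNorm running statistics from
# updating; call model.eval() during stage 1 to keep them fixed.

# Custom head. The features are already pooled and flattened, so no
# AdaptiveAvgPool2d / Flatten is needed here.
classifier = nn.Sequential(
    nn.BatchNorm1d(model.num_features),
    nn.Dropout(0.4),
    nn.Linear(model.num_features, num_classes)
)

# Forward pass in both stages: logits = classifier(model(images))

# Stage 1: train head only (high LR)
optimizer = optim.AdamW(classifier.parameters(), lr=1e-3)
# ... train for 5 epochs

# Stage 2: unfreeze + fine-tune all (low LR for backbone, higher for head)
for param in model.parameters():
    param.requires_grad = True
optimizer = optim.AdamW([
    {'params': model.parameters(), 'lr': 1e-5},
    {'params': classifier.parameters(), 'lr': 1e-4}
])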
