ML Learning Hub
Deep Learning · advanced

Generative Models: VAE & GAN

Learning the shape of data — then sampling new reality from the learned distribution

Autoencoder → Variational Autoencoder (ELBO, reparameterization trick, latent space interpolation) → GAN (generator/discriminator adversarial training, DCGAN, mode collapse solutions).

80 min
10 diagrams
7 Concepts Covered

Prerequisites

Neural Networks
CNN Architectures
Information Theory

Concepts Covered

Autoencoder, ELBO, Reparameterization, Latent Space, GAN, DCGAN, Mode Collapse

Key Formulas

ELBO (VAE)

Reconstruction term − KL divergence (regularizes latent space)
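
Written out in the usual notation, the two terms look as follows. The symbols are assumptions, not taken from this page: encoder q_φ(z|x), decoder p_θ(x|z), prior p(z) = N(0, I).

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
  \;\le\; \log p_\theta(x)
```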

Reparameterization

Allows gradients to flow through the sampling operation
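
As a formula, with μ and σ produced by the encoder and ⊙ denoting element-wise multiplication:

```latex
z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
```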

GAN Objective

Generator fools Discriminator; Discriminator detects fakes
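
The standard minimax form of this game, with p_data as an assumed name for the real data distribution, can be written as:

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\log\big(1 - D(G(z))\big)\big]
```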

🎯

From Discrimination to Generation

motivation

The models covered so far are discriminative: they learn P(y|x), predicting an output given an input. Generative models learn P(x), the distribution of the data itself, or at least a way to sample from it. Once you have learned the distribution, you can sample new data points, interpolate between examples, flag anomalies (points with low probability under the model), and do conditional generation. The same principle, learning a data distribution and then sampling from it, underlies systems such as Stable Diffusion, GPT, and DALL-E.

💡

VAE: The Probabilistic Compression

intuition

Autoencoders compress data to a latent code and then reconstruct it. But nothing constrains how those codes are arranged: the latent space can have gaps and irregular structure, so a point sampled between or away from the training codes may decode to nothing meaningful. VAEs fix this by having the encoder output the parameters (μ, σ) of a distribution instead of a single point, and by penalizing that distribution's deviation from N(0, I) with a KL divergence term. This encourages a smooth, continuous latent space where interpolation and sampling make semantic sense.

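The reparameterization trick z = μ + σ⊙ε, with ε ~ N(0, I), is the key insight that makes VAE training with backpropagation practical. Sampling z directly from N(μ, σ²) is a non-differentiable operation, so no gradients could reach the encoder. After reparameterization the randomness lives entirely in ε, and z is a differentiable function of μ and σ.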
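
A minimal PyTorch sketch of the idea: an encoder that outputs μ and log σ², a reparameterized sample, and a decoder. The class name `TinyVAE`, the layer sizes, and the choice to predict the log-variance (a common convention for numerical stability) are illustrative assumptions, not part of this lesson's code.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Illustrative fully connected VAE; all sizes are arbitrary assumptions."""
    def __init__(self, x_dim=784, h_dim=128, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.fc_mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.fc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
        eps = torch.randn_like(std)     # all randomness lives in eps ~ N(0, I)
        return mu + std * eps           # differentiable in mu and logvar

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

vae = TinyVAE()
x = torch.rand(4, 784)          # stand-in batch of flattened images in [0, 1]
x_recon, mu, logvar = vae(x)
x_recon.sum().backward()        # gradients reach the encoder through z
```

After the backward call, `vae.fc_mu.weight.grad` and `vae.fc_logvar.weight.grad` are populated, which is exactly what direct sampling would prevent.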

The ELBO: Evidence Lower Bound

math

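We want to maximize log p(x), the likelihood of our data under the model. Computing it directly is intractable, because it requires integrating over all z. Instead, we maximize the ELBO, a lower bound on log p(x): a reconstruction term (how well the decoder reproduces x from z) minus the KL divergence between the encoder's distribution q(z|x) and the prior (how far the encoder deviates from a standard Gaussian). β-VAE multiplies the KL term by a weight β; values of β > 1 encourage more disentangled representations, usually at some cost in reconstruction quality.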

ELBO — the objective being maximized in VAE training
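
In code, training usually minimizes the negative ELBO. The sketch below assumes a Bernoulli decoder (so the reconstruction term becomes binary cross-entropy on pixels in [0, 1]) and a diagonal Gaussian encoder, for which the KL term has the closed form −½ Σ (1 + log σ² − μ² − σ²). The function name `negative_elbo` and the `beta` argument are illustrative.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x_recon, x, mu, logvar, beta=1.0):
    """Per-sample average of -(reconstruction - beta * KL); lower is better."""
    # Reconstruction term: negative Bernoulli log-likelihood, summed over pixels
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # beta = 1 gives the plain VAE objective; beta > 1 is the beta-VAE variant
    return (recon + beta * kl) / x.size(0)
```

With `mu = 0` and `logvar = 0` the encoder matches the prior exactly and the KL term vanishes, which is a quick sanity check for an implementation.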
🔬

GAN Training: The Adversarial Game

deepdive

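Generator G takes noise z ~ N(0, I) and produces fake samples G(z). Discriminator D tries to distinguish real samples from fakes; its output is the probability that its input is real. They play a minimax game: D maximizes log D(x) + log(1 - D(G(z))) for real samples x, while G minimizes log(1 - D(G(z))). In practice G is usually trained to maximize log D(G(z)) instead. This non-saturating loss is not identical to the minimax one, but it pushes G in the same direction and gives stronger gradients early in training, when D rejects fakes easily. At the Nash equilibrium, G's samples are indistinguishable from real data and D outputs 1/2 everywhere.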
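
One alternating update of this game might look like the sketch below. It uses toy MLPs on 2-D data, the non-saturating generator loss, and a discriminator that returns a logit (so `BCEWithLogitsLoss` applies the sigmoid). The network sizes, the Adam settings, and the name `train_step` are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim, batch = 8, 2, 64

# Toy networks (hypothetical sizes, for illustration only)
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    n = real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(n, latent_dim)
    fake = G(z).detach()  # do not backpropagate into G during the D update
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step (non-saturating): maximize log D(G(z))
    z = torch.randn(n, latent_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

real_batch = torch.randn(batch, data_dim) * 0.5 + 2.0  # stand-in "real" data
loss_d, loss_g = train_step(real_batch)
```

Labeling fakes as real in the generator step is what turns the same cross-entropy loss into "maximize log D(G(z))".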

Mode collapse: the generator maps most noise vectors to one or a few outputs that fool the current discriminator, and ignores the rest of the data distribution. Common mitigations: the Wasserstein loss with gradient penalty (WGAN-GP), spectral normalization, and minibatch discrimination.
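
Of these mitigations, spectral normalization is the simplest to try in PyTorch: wrapping a layer with `torch.nn.utils.spectral_norm` rescales its weight by an estimate of its largest singular value on every forward pass. The discriminator below is a hypothetical sketch (the class name, channel widths, and global average pooling are assumptions), and it reduces the risk of unstable training without guaranteeing that mode collapse disappears.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNDiscriminator(nn.Module):
    """Illustrative discriminator with spectral normalization on every layer."""
    def __init__(self, img_channels=3, base=32):
        super().__init__()
        self.features = nn.Sequential(
            spectral_norm(nn.Conv2d(img_channels, base, 4, 2, 1)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(base, base * 2, 4, 2, 1)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(base * 2, base * 4, 4, 2, 1)), nn.LeakyReLU(0.2),
        )
        self.head = spectral_norm(nn.Linear(base * 4, 1))

    def forward(self, x):
        h = self.features(x).mean(dim=(2, 3))  # global average pool over H and W
        return self.head(h)                    # one unbounded score (logit) per image

D = SNDiscriminator()
scores = D(torch.randn(2, 3, 64, 64))
```

Because of the global pooling, the same module accepts other input resolutions without changes.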

</>

DCGAN Implementation

code
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN-style generator: noise vector -> 128x128 image with values in [-1, 1]."""
    def __init__(self, latent_dim=100, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # Project noise and reshape to a 512 x 4 x 4 feature map
            nn.Linear(latent_dim, 512 * 4 * 4),
            nn.Unflatten(1, (512, 4, 4)),
            # Upsample blocks, each doubling resolution: 4 -> 8 -> 16 -> 32 -> 64
            *self._block(512, 256), *self._block(256, 128),
            *self._block(128, 64),  *self._block(64, 32),
            # Final upsample to 128 x 128; Tanh maps pixel values to [-1, 1]
            nn.ConvTranspose2d(32, img_channels, 4, 2, 1),
            nn.Tanh()
        )

    def _block(self, in_c, out_c):
        # kernel 4, stride 2, padding 1 doubles H and W; bias is redundant before BatchNorm
        return [nn.ConvTranspose2d(in_c, out_c, 4, 2, 1, bias=False),
                nn.BatchNorm2d(out_c), nn.ReLU(True)]

    def forward(self, z):
        return self.net(z)

# WGAN-GP gradient penalty (WGAN-GP training is usually more stable than vanilla GAN)
def gradient_penalty(D, real, fake, device):
    # One interpolation coefficient per sample, broadcast over C, H, W
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interpolated = alpha * real + (1 - alpha) * fake
    interpolated.requires_grad_(True)
    d_interp = D(interpolated)
    gradients = torch.autograd.grad(d_interp, interpolated,
                grad_outputs=torch.ones_like(d_interp),
                create_graph=True)[0]
    # Flatten first so the L2 norm covers all of C, H, W for each sample
    gradients = gradients.reshape(gradients.size(0), -1)
    return ((gradients.norm(2, dim=1) - 1) ** 2).mean()
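
To show where such a penalty enters training, here is a self-contained sketch of a WGAN-GP critic loss. The function name `critic_loss_wgan_gp`, the toy linear critic, and the 8x8 inputs are assumptions; the weight `lambda_gp = 10` is the value commonly used with WGAN-GP. The penalty is recomputed inline so the block runs on its own.

```python
import torch
import torch.nn as nn

def critic_loss_wgan_gp(critic, real, fake, lambda_gp=10.0):
    """Wasserstein critic loss plus gradient penalty (a quantity to minimize)."""
    fake = fake.detach()  # the critic update should not touch the generator
    # Penalty evaluated on random interpolates between real and fake samples
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    score = critic(x_hat)
    grads = torch.autograd.grad(score, x_hat,
                                grad_outputs=torch.ones_like(score),
                                create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    # The critic wants high scores on real data and low scores on fakes
    return critic(fake).mean() - critic(real).mean() + lambda_gp * gp

# Toy critic: raw score output, no sigmoid and no BatchNorm
critic = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))
real = torch.randn(4, 3, 8, 8)
fake = torch.randn(4, 3, 8, 8)
loss = critic_loss_wgan_gp(critic, real, fake)
loss.backward()
```

The matching generator loss would be `-critic(G(z)).mean()`, typically applied once for every few critic updates.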
