Generative Adversarial Networks — Core Concepts
The Generative Modeling Problem
Generating realistic images is fundamentally hard. What makes a face look real isn’t a list of rules — it emerges from thousands of subtle correlations: how lighting falls on skin texture, how hair moves, how facial proportions relate to each other. Traditional methods tried to hand-engineer these rules and fell far short.
Ian Goodfellow and colleagues at the University of Montreal proposed an alternative in 2014: don’t define what makes an image good. Instead, train two networks to compete, and let the definition of “good” emerge from the competition.
The Two-Network Architecture
Generator (G): Takes a random noise vector $z$ (sampled from a simple distribution like Gaussian) as input and outputs a synthetic data sample — typically an image. The generator has no direct access to real data during training; it only receives feedback through the discriminator.
Discriminator (D): Takes an image (either real from the training set, or fake from the generator) and outputs a single number: the probability that the input is real.
During training:
- Sample real images from the training data
- Sample noise vectors $z$ → Generator produces fake images
- Discriminator classifies both real and fake images
- Update discriminator to be better at classification
- Update generator to make images the discriminator classifies as real
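The loop above can be sketched on a deliberately tiny problem. Everything specific here is an assumption made for illustration: the real data is a 1-D Gaussian centred at 4, the generator is the linear map G(z) = a*z + b, the discriminator is the logistic model D(x) = sigmoid(w*x + c), gradients are written out by hand, and the generator step minimizes -log D(G(z)). The name `train_toy_gan` is hypothetical.

```python
import math
import random

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(steps=2000, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # Sample real data (assumed: Gaussian with mean 4, std 1).
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
        # Sample noise and let the generator produce fakes.
        z = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * zi + b for zi in z]
        # Discriminator step: minimise -[log D(x) + log(1 - D(G(z)))].
        gw = gc = 0.0
        for xr, xf in zip(real, fake):
            dr, df = sigmoid(w * xr + c), sigmoid(w * xf + c)
            gw += (dr - 1.0) * xr + df * xf
            gc += (dr - 1.0) + df
        w -= lr * gw / batch
        c -= lr * gc / batch
        # Generator step: minimise -log D(G(z)) against the updated D.
        ga = gb = 0.0
        for zi in z:
            xf = a * zi + b
            dx = (sigmoid(w * xf + c) - 1.0) * w  # d(loss)/d(x_fake)
            ga += dx * zi
            gb += dx
        a -= lr * ga / batch
        b -= lr * gb / batch
    return a, b, w, c
```

Calling `train_toy_gan()` returns the final `(a, b, w, c)`. A linear discriminator can only react to a shift in the mean, so this toy says nothing about how image GANs behave; it only makes the alternating update order concrete.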
The Minimax Game
Goodfellow formalized this as a minimax game:
$$\min_G \max_D \mathcal{V}(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
The discriminator wants to maximize $\mathcal{V}$ (classify correctly). The generator wants to minimize it (fool the discriminator).
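To make the two opposing goals concrete, the value can be estimated from a batch of discriminator outputs. This is a minimal sketch: the function name `value_estimate` and the example probabilities are made up for illustration.

```python
import math

def value_estimate(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator separates real from fake almost perfectly: V is close to 0.
confident = value_estimate([0.99, 0.98], [0.01, 0.02])
# Discriminator is fooled and outputs 0.5 everywhere: V is much lower.
fooled = value_estimate([0.5, 0.5], [0.5, 0.5])
```

Here `confident` is larger than `fooled`, which is the direction each player pushes: the discriminator toward the first case, the generator toward the second.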
The theoretical result: at the Nash equilibrium, the generator’s distribution $p_g$ equals the real data distribution $p_{data}$, and the discriminator outputs exactly $\frac{1}{2}$ for every input (it can’t tell real from fake).
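This result can be checked numerically on small discrete distributions. The sketch below uses the closed form from the original analysis, where for a fixed generator the best discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)). The function names and the example probabilities are illustrative, and every probability is assumed to be strictly positive so the logarithms are defined.

```python
import math

def optimal_discriminator(p_data, p_g):
    # D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x.
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_at(d, p_data, p_g):
    # Exact V(D, G) when both distributions are known and discrete.
    return sum(pd * math.log(di) + pg * math.log(1.0 - di)
               for di, pd, pg in zip(d, p_data, p_g))

p_data = [0.5, 0.3, 0.2]
p_mismatch = [0.2, 0.3, 0.5]   # a generator that has not matched the data yet

d_star_mismatch = optimal_discriminator(p_data, p_mismatch)
d_star_matched = optimal_discriminator(p_data, p_data)   # all entries are 0.5

v_mismatch = value_at(d_star_mismatch, p_data, p_mismatch)
v_matched = value_at(d_star_matched, p_data, p_data)     # equals -log(4)
```

With the generator matched to the data, the best discriminator can do no better than 0.5 everywhere, and the value drops to its minimum of -log 4; any mismatch leaves it strictly higher.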
GAN Training Instability: The Real Challenge
In practice, reaching that equilibrium is extremely difficult. GANs suffer from several pathological behaviors:
Mode collapse: The generator discovers that producing one type of output (e.g., always generating the same face) reliably fools the discriminator. It stops exploring diversity and collapses to a single mode. Training continues, but the generator produces essentially identical outputs.
Vanishing gradients: When the discriminator becomes too good, $D(G(z)) \approx 0$ for all generator outputs. In that regime $\log(1 - D(G(z)))$ saturates: it sits near 0 and is almost flat with respect to the discriminator’s pre-sigmoid output, so the generator receives near-zero gradients and can’t improve.
Training oscillation: The two networks can cycle — generator improves, discriminator improves, generator improves again — without converging.
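A standard toy illustration of this cycling is the bilinear game V(x, y) = x * y, where x plays the minimizing role and y the maximizing role. It is not a GAN, only an assumed stand-in with an equilibrium at (0, 0), but it shows how simultaneous gradient steps can circle the equilibrium instead of approaching it.

```python
import math

def simultaneous_steps(x=1.0, y=1.0, lr=0.1, steps=100):
    """Simultaneous gradient descent on x and ascent on y for V(x, y) = x * y."""
    radii = []
    for _ in range(steps):
        gx, gy = y, x                    # dV/dx = y, dV/dy = x
        x, y = x - lr * gx, y + lr * gy  # both players move at the same time
        radii.append(math.hypot(x, y))   # distance from the equilibrium (0, 0)
    return radii

radii = simultaneous_steps()
```

Each step multiplies the distance from the equilibrium by sqrt(1 + lr^2), so the iterates spiral outward rather than settling down.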
Alternative loss formulations were introduced to address these issues. The most common fix, already suggested in the original paper, is to have the generator minimize $-\log D(G(z))$ (the non-saturating loss) instead of $\log(1 - D(G(z)))$. This keeps gradients large early in training, when the discriminator easily rejects the generator’s samples.
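The difference between the two generator losses shows up when each is differentiated with respect to the discriminator's pre-sigmoid output, written `a` here. This is a minimal sketch that assumes D ends in a sigmoid, so D(G(z)) = sigmoid(a); the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def non_saturating_grad(a):
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# A confident discriminator: D(G(z)) = sigmoid(-10), roughly 4.5e-5.
weak = saturating_grad(-10.0)        # tiny magnitude, almost no learning signal
strong = non_saturating_grad(-10.0)  # magnitude close to 1
```

Both losses push `a` in the same direction; they differ in how much signal survives when the discriminator is winning.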
Key Architectural Advances
DCGAN (2015): Radford et al. established architectural guidelines for stable GAN training: use strided convolutions instead of pooling, batch normalization in both networks, ReLU in the generator (with Tanh at the output layer), and LeakyReLU in the discriminator. DCGAN was among the first GANs to generate recognizable faces reliably.
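Replacing pooling with strided convolutions means each layer changes the spatial size on its own. The arithmetic can be sketched as follows, assuming the kernel size 4, stride 2, padding 1 setting that many DCGAN implementations use (these hyperparameters are an assumption, not something stated above) and ignoring dilation and output padding.

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    # Output size of a transposed (fractionally strided) convolution.
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel=4, stride=2, padding=1):
    # Output size of an ordinary strided convolution.
    return (size + 2 * padding - kernel) // stride + 1

# Generator side: each layer doubles the feature map, starting from 4x4.
up_sizes = [4]
for _ in range(4):
    up_sizes.append(conv_transpose_out(up_sizes[-1]))

# Discriminator side: each layer halves it, starting from 64x64.
down_sizes = [64]
for _ in range(4):
    down_sizes.append(conv_out(down_sizes[-1]))
```

With these settings `up_sizes` runs 4, 8, 16, 32, 64 and `down_sizes` runs the same values in reverse, so the two networks mirror each other without any pooling layers.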
Wasserstein GAN (2017): Arjovsky et al. replaced the Jensen-Shannon divergence with the Wasserstein distance as the training objective. This provided meaningful gradients even when the generated and real distributions don’t overlap, dramatically improving training stability. WGAN-GP (gradient penalty variant) became a standard baseline.
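The non-overlap problem can be seen in the simplest possible case: real data concentrated at 0 and generated data concentrated at some offset theta. The sketch below is illustrative only; the helper names are made up, and the Wasserstein computation assumes equal-size 1-D samples, where sorting gives the optimal matching.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between discrete distributions given as dicts."""
    support = set(p) | set(q)

    def kl_to_mixture(a):
        total = 0.0
        for x in support:
            ax = a.get(x, 0.0)
            if ax > 0.0:
                m = 0.5 * (p.get(x, 0.0) + q.get(x, 0.0))
                total += ax * math.log(ax / m)
        return total

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D samples via sorted matching."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

rows = []
for theta in [4.0, 2.0, 1.0, 0.5]:
    js = js_divergence({0.0: 1.0}, {theta: 1.0})
    w1 = wasserstein_1d([0.0], [theta])
    rows.append((theta, js, w1))
```

In every row the Jensen-Shannon value stays at log 2 no matter how close theta gets to 0, while the Wasserstein distance equals theta and shrinks with it. That shrinking value is the signal a generator can follow.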
Progressive GAN (2018): NVIDIA’s Karras et al. started training on low-resolution (4×4) images and progressively added layers to handle higher resolutions. This kept high-resolution training stable and produced some of the first photorealistic 1024×1024 human faces.
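The growth schedule itself is simple to write down. The sketch below lists the resolutions visited when each stage doubles the previous one, and shows the fade-in blend used to introduce a new layer smoothly. The function names and the flat-list representation of an image are assumptions made for illustration.

```python
def resolution_schedule(start=4, final=1024):
    # Each stage doubles the resolution of the previous one.
    stages = []
    res = start
    while res <= final:
        stages.append(res)
        res *= 2
    return stages

def fade_in(old_upsampled, new_output, alpha):
    # alpha ramps from 0 (only the old, upsampled path) to 1 (only the new layer).
    return [(1.0 - alpha) * o + alpha * n
            for o, n in zip(old_upsampled, new_output)]

stages = resolution_schedule()
halfway = fade_in([0.0, 0.0], [1.0, 2.0], alpha=0.5)
```

Going from 4×4 to 1024×1024 takes nine stages. Blending with `alpha` lets a freshly added layer take over gradually rather than disrupting the layers already trained.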
StyleGAN / StyleGAN2 (2019, 2020): NVIDIA’s follow-up separated the high-level style (face shape, identity) from fine details (freckles, hair texture) using a mapping network and adaptive instance normalization. StyleGAN2 produced the faces on ThisPersonDoesNotExist.com and remains a reference architecture for face generation.
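Adaptive instance normalization is a small operation: normalize one channel of a feature map to zero mean and unit variance, then rescale and shift it with values derived from the style. A minimal sketch for a single channel follows, with the feature map flattened to a list. In the original StyleGAN the `scale` and `bias` come from a learned transform of the mapping network's output; here they are passed in directly, which is a simplification.

```python
import math

def adain(features, scale, bias, eps=1e-8):
    """Adaptive instance normalization for one channel of one sample."""
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    std = math.sqrt(var + eps)
    # Strip the channel's own statistics, then impose the style's statistics.
    return [scale * (f - mean) / std + bias for f in features]

styled = adain([1.0, 2.0, 3.0, 4.0], scale=2.0, bias=5.0)
```

After the call, the channel's mean is approximately `bias` and its standard deviation approximately `scale`, whatever they were before. This is how the style can override statistics at each layer.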
BigGAN (2018): DeepMind scaled GANs to ImageNet with class-conditional generation. BigGAN was trained on up to 512 TPU cores and produced the highest-fidelity class-conditional images up to that point.
Conditional GANs
Vanilla GANs generate arbitrary samples with no control over the output. Conditional GANs (cGANs) add a conditioning signal — a class label, a text description, or another image:
$$\min_G \max_D \mathcal{V}(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x|y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z|y)))]$$
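For a class label, one simple way to supply the condition $y$ is to append a one-hot encoding of the label to the noise vector before it enters the generator. This is a sketch of that one option, not the only way to condition a GAN; the function names are illustrative.

```python
import random

def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def conditioned_input(z, label, num_classes):
    # The generator sees the noise and the label as one concatenated vector.
    return list(z) + one_hot(label, num_classes)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]
g_input = conditioned_input(z, label=3, num_classes=10)
```

The discriminator can be conditioned the same way, by receiving the label alongside the image, so that it judges whether the sample is realistic for that particular label.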
Pix2Pix (2017): Image-to-image translation conditioned on an input image. Learned to convert sketches to photos, satellite imagery to maps, grayscale to color.
CycleGAN (2017): Unpaired image-to-image translation (no paired training data). Used cycle-consistency loss to turn horses into zebras, photos into Monet paintings.
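The cycle-consistency idea is that translating to the other domain and back should return the original input. A minimal sketch of that loss term follows, with toy stand-ins for the two translators: `G` maps domain X to Y, `F` maps Y back to X, and both are plain functions on lists here rather than trained networks.

```python
def l1(a, b):
    # Mean absolute difference between two equal-length vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, G, F):
    """L1 reconstruction error for both round trips: X->Y->X and Y->X->Y."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

# Toy translators (assumptions for illustration): F exactly inverts G.
G = lambda v: [2.0 * t + 1.0 for t in v]
F = lambda v: [(t - 1.0) / 2.0 for t in v]
F_bad = lambda v: [t / 2.0 for t in v]   # forgets to undo the shift

x = [0.0, 0.5, 1.0]
y = [1.0, 2.0, 3.0]
perfect = cycle_consistency_loss(x, y, G, F)
broken = cycle_consistency_loss(x, y, G, F_bad)
```

The loss is zero when the two mappings invert each other and positive otherwise. In CycleGAN this term is added to the usual adversarial losses, which is what removes the need for paired examples.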
Why Diffusion Models Won
By 2021–2022, diffusion models (DALL-E 2, Stable Diffusion, Midjourney) had surpassed GANs on most image generation benchmarks:
- Better diversity: Diffusion models are far less prone to mode collapse
- More stable training: Single-network loss rather than adversarial minimax
- Text conditioning: Easier to integrate with CLIP-style text-image pretraining
- FID scores: Fréchet Inception Distance (the standard image quality metric) favored diffusion at comparable compute
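FID compares the mean and covariance of feature vectors extracted from real and generated images, and lower is better. The real metric uses Inception network features and full covariance matrices. The sketch below makes the simplifying assumption that both covariances are diagonal, so the matrix square root reduces to an elementwise one; the function name is illustrative.

```python
import math

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances."""
    # Squared distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # Tr(S1 + S2 - 2 * sqrt(S1 * S2)), evaluated per dimension.
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

same = fid_diagonal([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0])
shifted = fid_diagonal([0.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
wider = fid_diagonal([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [4.0, 1.0])
```

Identical statistics give a distance of zero. Shifting a mean or changing a variance each raises the score, which is why the metric responds to both image quality and diversity.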
GANs remain competitive for: real-time inference (GANs generate in one forward pass; diffusion requires many steps), video generation, and domain-specific applications where GAN-specific architectures are well-established.
One thing to remember: GANs pioneered generative AI by reframing image synthesis as a competition rather than a reconstruction task — that adversarial framing was the key creative leap that opened the door to everything that followed.