Generative Adversarial Networks — Core Concepts

The architecture Ian Goodfellow and colleagues introduced in 2014, which later produced the first photorealistic AI-generated faces: how GANs work, why they're hard to train, and what replaced them.

The Generative Modeling Problem

Generating realistic images is fundamentally hard. What makes a face look real isn’t a list of rules — it emerges from thousands of subtle correlations: how lighting falls on skin texture, how hair moves, how facial proportions relate to each other. Traditional methods tried to hand-engineer these rules and fell far short.

Ian Goodfellow and colleagues at the University of Montreal proposed an alternative in 2014: don’t define what makes an image good. Instead, train two networks to compete, and let the definition of “good” emerge from the competition.

The Two-Network Architecture

Generator (G): Takes a random noise vector $z$ (sampled from a simple distribution like Gaussian) as input and outputs a synthetic data sample — typically an image. The generator has no direct access to real data during training; it only receives feedback through the discriminator.

Discriminator (D): Takes an image (either real from the training set, or fake from the generator) and outputs a single number: the probability that the input is real.

During training:

Sample real images from the training data
Sample noise vectors $z$ → Generator produces fake images
Discriminator classifies both real and fake images
Update discriminator to be better at classification
Update generator to make images the discriminator classifies as real
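
The five steps above can be sketched on a toy problem. The following is a minimal illustration, not a practical GAN: it assumes 1-D data drawn from a Gaussian, a linear generator $G(z) = a z + b$, and a logistic discriminator $D(x) = \sigma(w x + c)$, with gradients derived by hand. The function name `train_toy_gan` and all parameter names are hypothetical.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_toy_gan(steps=500, batch=64, lr=0.01, real_mean=4.0, seed=0):
    """Toy 1-D GAN with hand-derived gradients (illustrative assumption)."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters: G(z) = a*z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)

    for _ in range(steps):
        # Step 1: sample real data.
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        # Step 2: sample noise and let the generator produce fakes.
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Steps 3-4: classify both batches, then take one gradient ascent
        # step on log D(real) + log(1 - D(fake)) with respect to w and c.
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (1.0 - d) * x
            grad_c += 1.0 - d
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w -= d * x
            grad_c -= d
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Step 5: one gradient descent step on log(1 - D(G(z)))
        # with respect to the generator parameters a and b.
        grad_a = grad_b = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            # d/dx log(1 - D(x)) = -d * w; dx/da = z and dx/db = 1.
            grad_a += -d * w * z
            grad_b += -d * w
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch

    return a, b, w, c
```

Note that the generator never touches `real`: its only training signal is the discriminator's output on its own samples. Whether the toy parameters settle or keep oscillating depends on the learning rate and step count, which is a small-scale version of the instability discussed later.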

The Minimax Game

Goodfellow formalized this as a minimax game:

$$\min_G \max_D \mathcal{V}(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The discriminator wants to maximize $\mathcal{V}$ (classify correctly). The generator wants to minimize it (fool the discriminator).

The theoretical result: at the Nash equilibrium, the generator’s distribution $p_g$ equals the real data distribution $p_{data}$, and the discriminator outputs exactly $\frac{1}{2}$ for every input (it can’t tell real from fake).
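
The equilibrium claim can be checked numerically. For a fixed generator, the discriminator that maximizes $\mathcal{V}$ is $D^*(x) = p_{data}(x) / (p_{data}(x) + p_g(x))$. The sketch below assumes small discrete distributions purely for illustration (real image distributions are continuous and unknown); the function names are hypothetical.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    result = []
    for pd, pg in zip(p_data, p_g):
        # Outcomes with zero mass under both distributions never occur;
        # 0.5 is an arbitrary placeholder for them.
        result.append(pd / (pd + pg) if pd + pg > 0 else 0.5)
    return result


def value(p_data, p_g, d):
    """V(D, G) = sum_x [p_data(x) log D(x) + p_g(x) log(1 - D(x))]."""
    total = 0.0
    for pd, pg, dx in zip(p_data, p_g, d):
        if pd > 0:
            total += pd * math.log(dx)
        if pg > 0:
            total += pg * math.log(1.0 - dx)
    return total


p_data = [0.2, 0.3, 0.5]

# Generator matches the data: D* is 1/2 everywhere and V equals -log 4.
d_equal = optimal_discriminator(p_data, p_data)
v_equal = value(p_data, p_data, d_equal)

# Generator does not match: the best discriminator achieves a higher V.
p_g = [0.5, 0.3, 0.2]
v_mismatch = value(p_data, p_g, optimal_discriminator(p_data, p_g))
```

When $p_g = p_{data}$ the value is $-\log 4 \approx -1.386$, the lowest the generator can push it against an optimal discriminator. Any mismatch leaves the discriminator room to score higher.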

GAN Training Instability: The Real Challenge

In practice, reaching that equilibrium is extremely difficult. GANs suffer from several pathological behaviors:

Mode collapse: The generator discovers that producing one type of output (e.g., always generating the same face) reliably fools the discriminator. It stops exploring diversity and collapses to a single mode. Training continues, but the generator produces essentially identical outputs.

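Vanishing gradients: When the discriminator becomes too good, $D(G(z)) \approx 0$ for all generator outputs. The generator’s loss term $\log(1 - D(G(z)))$ then saturates: its value sits near 0 and, for a discriminator with a sigmoid output, its gradient with respect to the discriminator’s logit is $-D(G(z)) \approx 0$. The generator receives near-zero gradients and can’t improve.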

Training oscillation: The two networks can cycle — generator improves, discriminator improves, generator improves again — without converging.

The original GAN loss was replaced by alternative formulations to address these issues. The most common fix: use $-\log D(G(z))$ as the generator loss (non-saturating loss) instead of $\log(1 - D(G(z)))$ to avoid vanishing gradients early in training.
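
The difference between the two generator losses shows up in their gradients. The sketch below assumes the discriminator ends in a sigmoid, so $D = \sigma(a)$ for a logit $a$, and compares the derivative of each loss with respect to that logit. The function names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(logit):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(logit)


def grad_non_saturating(logit):
    """d/da [-log sigmoid(a)] = sigmoid(a) - 1."""
    return sigmoid(logit) - 1.0


# A confident discriminator assigns very negative logits to fake samples.
for logit in (0.0, -2.0, -5.0, -10.0):
    print(
        f"logit={logit:6.1f}  D={sigmoid(logit):.5f}  "
        f"saturating={grad_saturating(logit):+.5f}  "
        f"non-saturating={grad_non_saturating(logit):+.5f}"
    )
```

As the logit becomes more negative, the saturating gradient shrinks toward zero while the non-saturating gradient approaches $-1$, so the generator still gets a usable signal exactly when it is losing badly.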

Key Architectural Advances

DCGAN (2015): Radford et al. established architectural guidelines for stable GAN training: use strided convolutions instead of pooling, batch normalization in both networks, ReLU in the generator, LeakyReLU in the discriminator. DCGAN was the first GAN to reliably generate recognizable faces.
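
To see how strided convolutions take over the job of pooling and upsampling, it helps to trace feature-map sizes through the layers. The sketch below uses the standard output-size formulas for convolutions and transposed convolutions (dilation 1, no output padding). The specific layer settings (kernel 4, stride 2, padding 1) are an assumption borrowed from common DCGAN implementations for 64×64 images, not something stated in this article.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel, stride, padding):
    """Spatial output size of a regular strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical (kernel, stride, padding) settings per layer.
GENERATOR_LAYERS = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
DISCRIMINATOR_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]


def trace_sizes(start, layers, size_fn):
    """Return the spatial size after each layer, starting from `start`."""
    sizes = [start]
    for kernel, stride, padding in layers:
        sizes.append(size_fn(sizes[-1], kernel, stride, padding))
    return sizes


# Generator: the noise vector is treated as a 1x1 feature map and upsampled.
gen_sizes = trace_sizes(1, GENERATOR_LAYERS, conv_transpose_out)
# Discriminator: a 64x64 image is downsampled to a single score.
disc_sizes = trace_sizes(64, DISCRIMINATOR_LAYERS, conv_out)
```

Under these assumptions each stride-2 layer doubles (generator) or halves (discriminator) the resolution, so the network learns its own resampling instead of relying on fixed pooling.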

Wasserstein GAN (2017): Arjovsky et al. replaced the Jensen-Shannon divergence with the Wasserstein distance as the training objective. This provided meaningful gradients even when the generated and real distributions don’t overlap, dramatically improving training stability. WGAN-GP (gradient penalty variant) became a standard baseline.
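
A small example illustrates why the Wasserstein distance gives a more useful signal than the Jensen-Shannon divergence when the distributions don’t overlap. The sketch assumes two point masses on a 1-D integer grid, a deliberately simplified stand-in for real and generated distributions; the function names are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between discrete distributions (natural log)."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """Wasserstein-1 distance on the grid 0..n-1: sum of absolute CDF gaps."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total


def point_mass(position, n=10):
    """All probability on a single grid cell."""
    return [1.0 if i == position else 0.0 for i in range(n)]


real = point_mass(0)
for shift in (1, 3, 6, 9):
    fake = point_mass(shift)
    print(shift, round(js_divergence(real, fake), 4), wasserstein_1d(real, fake))
```

The Jensen-Shannon divergence stays at $\log 2$ no matter how far apart the two masses are, so it cannot tell the generator which direction is better. The Wasserstein distance grows with the shift, which is what lets its gradient point toward the data.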

Progressive GAN (2018): NVIDIA’s Karras et al. started training on low-resolution (4×4) images and progressively added layers to handle higher resolutions. This allowed training stable high-resolution GANs — the system that first generated photorealistic 1024×1024 human faces.

StyleGAN / StyleGAN2 (2019, 2020): NVIDIA’s follow-up separated the high-level style (face shape, identity) from fine details (freckles, hair texture) using a mapping network and adaptive instance normalization. StyleGAN2 produced the faces on ThisPersonDoesNotExist.com and remains a reference architecture for face generation.
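
Adaptive instance normalization is simple to state: normalize each channel of a feature map to zero mean and unit variance, then rescale and shift it with style-dependent values. The sketch below applies it to a single channel stored as a flat list. In StyleGAN the scale and bias come from a learned transform of the mapping network’s output; here they are passed in directly, which is a simplifying assumption.

```python
import math


def adain(features, scale, bias, eps=1e-8):
    """Normalize one channel's activations, then apply a style scale and bias."""
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    std = math.sqrt(var + eps)  # eps guards against division by zero
    return [scale * (x - mean) / std + bias for x in features]


def mean_and_std(values):
    """Population mean and standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


channel = [1.0, 2.0, 3.0, 4.0]
styled = adain(channel, scale=2.0, bias=5.0)
styled_mean, styled_std = mean_and_std(styled)
```

After the operation the channel’s statistics are set by the style alone (mean close to the bias, standard deviation close to the scale), whatever they were before. That is how a style vector can control the look of each layer’s output.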

BigGAN (2018): DeepMind scaled class-conditional GANs to ImageNet. Training used up to 512 TPU cores, and the model produced the highest-fidelity class-conditional images up to that point.

Conditional GANs

Vanilla GANs generate arbitrary samples with no control over the output. Conditional GANs (cGANs) add a conditioning signal — a class label, a text description, or another image:

$$\min_G \max_D \mathcal{V}(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x|y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z|y)))]$$
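
One simple way to realize $G(z|y)$ for a class label is to append a one-hot encoding of $y$ to the noise vector before it enters the generator. The sketch below shows only that input construction. It is one possible conditioning scheme, assumed here for illustration, and the helper names are hypothetical.

```python
import random


def one_hot(label, num_classes):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditional_generator_input(z, label, num_classes):
    """Concatenate the noise vector with the one-hot label."""
    return list(z) + one_hot(label, num_classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # noise vector of length 8
g_input = conditional_generator_input(z, label=3, num_classes=10)
```

The discriminator would receive the same label alongside the image, so it can penalize samples that look realistic but belong to the wrong class.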

Pix2Pix (2017): Image-to-image translation conditioned on an input image. Learned to convert sketches to photos, satellite imagery to maps, grayscale to color.

CycleGAN (2017): Unpaired image-to-image translation (no paired training data). Used cycle-consistency loss to turn horses into zebras, photos into Monet paintings.
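
The cycle-consistency idea is that translating to the other domain and back should return the original input. The sketch below assumes an L1 reconstruction penalty and uses toy list-to-list functions in place of the two generator networks, here called `g` (X to Y) and `f` (Y to X); all names are illustrative.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, g, f):
    """Round-trip error: x -> g -> f should return x, y -> f -> g should return y."""
    return l1(f(g(x)), x) + l1(g(f(y)), y)


def g(values):
    """Toy stand-in for the X -> Y generator."""
    return [2.0 * v for v in values]


def f_inverse(values):
    """Toy Y -> X mapping that exactly inverts g."""
    return [v / 2.0 for v in values]


def f_biased(values):
    """Toy Y -> X mapping that does not invert g."""
    return [v / 2.0 + 1.0 for v in values]


x = [1.0, 2.0]
y = [4.0, 6.0]
loss_consistent = cycle_consistency_loss(x, y, g, f_inverse)
loss_inconsistent = cycle_consistency_loss(x, y, g, f_biased)
```

With a true inverse the loss is zero, and any mismatch between the two mappings is penalized. This term stands in for the paired supervision that unpaired data lacks, and in training it is added to the usual adversarial losses.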

Why Diffusion Models Won

By 2021–2022, diffusion models (DALL-E 2, Stable Diffusion, Midjourney) had surpassed GANs for most image generation benchmarks:

Better diversity: Diffusion models are far less prone to mode collapse
More stable training: Single-network loss rather than adversarial minimax
Text conditioning: Easier to integrate with CLIP-style text-image pretraining
FID scores: Fréchet Inception Distance (the standard image quality metric) favored diffusion at comparable compute
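
For reference, FID compares the statistics of Inception-network features extracted from real and generated images, treating each set as a Gaussian. With feature mean $\mu_r$ and covariance $\Sigma_r$ for real images, and $\mu_g$, $\Sigma_g$ for generated ones (subscripts chosen here for illustration), the distance is:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

Lower is better, and the score is 0 when the two sets of feature statistics match exactly. Because the covariance term is sensitive to missing variety, a generator that collapses to a few modes is penalized even if each individual sample looks sharp.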

GANs remain competitive for: real-time inference (GANs generate in one forward pass; diffusion requires many steps), video generation, and domain-specific applications where GAN-specific architectures are well-established.

One thing to remember: GANs pioneered generative AI by reframing image synthesis as a competition rather than a reconstruction task — that adversarial framing was the key creative leap that opened the door to everything that followed.
