Generative Adversarial Networks — Deep Dive

GAN theory, Wasserstein distance, mode collapse analysis, StyleGAN's mapping network and AdaIN, FID metric deep dive, and the diffusion-GAN landscape in 2024.

Theoretical Foundations: f-Divergences

The original GAN minimizes Jensen-Shannon (JS) divergence between the generated distribution $p_g$ and the real data distribution $p_{data}$.

At the optimal discriminator $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$, the generator’s loss becomes:

$$\mathcal{L}_G = 2 \cdot JSD(p_{data} \| p_g) - \log 4$$

Training the generator minimizes JSD. The problem: when $p_{data}$ and $p_g$ have disjoint support (common early in training), $JSD(p_{data} || p_g) = \log 2$ regardless of how close the distributions are. The gradient is zero — the generator can’t learn which direction to improve.
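The saturation is easy to check numerically. The sketch below uses discrete distributions on a 10-bin grid as a stand-in for image distributions; the helper names (`kl`, `jsd`, `point_mass`) and the toy setup are illustrative assumptions, not part of any library.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(k, n=10):
    """Toy distribution with all mass in bin k of an n-bin grid."""
    return [1.0 if i == k else 0.0 for i in range(n)]

p_data = point_mass(0)              # real data sits entirely in bin 0
near = jsd(p_data, point_mass(1))   # generator one bin away
far = jsd(p_data, point_mass(9))    # generator nine bins away

print(near, far, math.log(2))
```

Both `near` and `far` equal $\log 2$, so the loss cannot tell a generator that is almost right from one that is far off.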

f-GAN (Nowozin et al., 2016) generalized the formulation to arbitrary f-divergences $D_f(p||q) = \int q(x) f(\frac{p(x)}{q(x)}) dx$, showing that many GAN variants correspond to different divergences.
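As a rough illustration of the definition, the following sketch evaluates $D_f$ on discrete distributions for three standard choices of $f$. The function names and the example distributions are assumptions made for this sketch; f-GAN itself works with a variational lower bound rather than direct evaluation.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)), assuming q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def f_kl(t):
    """f(t) = t log t recovers KL(p || q); the limit at t = 0 is 0."""
    return t * math.log(t) if t > 0 else 0.0

def f_reverse_kl(t):
    """f(t) = -log t recovers KL(q || p)."""
    return -math.log(t)

def f_total_variation(t):
    """f(t) = |t - 1| / 2 recovers total variation distance."""
    return abs(t - 1) / 2

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

kl_pq = f_divergence(p, q, f_kl)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
rkl_pq = f_divergence(p, q, f_reverse_kl)
rkl_direct = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
tv = f_divergence(p, q, f_total_variation)
```

Swapping `f` changes which mismatches between $p$ and $q$ are penalized most, which is the sense in which different GAN objectives correspond to different divergences.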

Wasserstein Distance and WGAN

The Wasserstein-1 (Earth Mover’s) distance provides meaningful gradients even when distributions are disjoint:

$$W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$

Intuitively: the minimum “work” required to transport mass from $p_g$ to $p_{data}$, where work = distance × mass moved.
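In one dimension the transport cost has a closed form: it is the summed absolute difference between the two CDFs. The sketch below relies on that 1-D special case, using point masses on a unit-spaced grid; `wasserstein_1d` and `point_mass` are hypothetical helper names.

```python
def wasserstein_1d(p, q):
    """W1 between two distributions on the same unit-spaced 1-D grid.

    In one dimension W1 equals the summed absolute difference of the CDFs.
    """
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi        # running difference of the two CDFs
        total += abs(cdf_gap)
    return total

def point_mass(k, n=10):
    """Toy distribution with all mass in bin k of an n-bin grid."""
    return [1.0 if i == k else 0.0 for i in range(n)]

p_data = point_mass(0)
distances = [wasserstein_1d(p_data, point_mass(k)) for k in range(10)]
print(distances)
```

The distance grows linearly with the gap between the two point masses, so a generator that moves closer to the data is rewarded even while the supports stay disjoint.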

By the Kantorovich-Rubinstein duality, this is equivalent to:

$$W(p_{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_{data}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$

Where the supremum is over all 1-Lipschitz functions. WGAN replaces the discriminator with a critic (not bounded by sigmoid) that estimates this supremum, with 1-Lipschitz enforced via weight clipping.
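A minimal sketch of the two ingredients, assuming a linear critic in place of a neural network. The clipping constant `c=0.01` matches the value commonly quoted for WGAN, but treat it, like the function names here, as an assumption of this example.

```python
def critic(w, x):
    """Hypothetical linear critic f(x) = w . x (unbounded output, no sigmoid)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def wasserstein_estimate(w, real, fake):
    """Critic objective: mean f(real) - mean f(fake), which the critic maximizes."""
    mean_real = sum(critic(w, x) for x in real) / len(real)
    mean_fake = sum(critic(w, x) for x in fake) / len(fake)
    return mean_real - mean_fake

def clip_weights(w, c=0.01):
    """Weight clipping: force every parameter into [-c, c] after each update."""
    return [max(-c, min(c, wi)) for wi in w]

real = [[1.0, 2.0], [1.2, 1.8]]
fake = [[0.0, 0.0], [0.2, -0.2]]

w = clip_weights([0.5, -3.0], c=0.01)   # -> [0.01, -0.01]
estimate = wasserstein_estimate(w, real, fake)
```

For a linear critic the Lipschitz constant is just the norm of `w`, so clipping bounds it directly. In a deep network the bound is much looser, which is one reason clipping tends to push weights toward the extremes of the allowed range.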

WGAN-GP replaced weight clipping with a gradient penalty:

$$\mathcal{L}_{GP} = \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]$$

Where $\hat{x}$ is sampled uniformly along straight lines between real and fake samples. This enforces the gradient norm constraint approximately, giving more stable training than weight clipping.
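The penalty can be sketched without a deep learning framework. In the example below the critic is a hypothetical closed-form function and its gradient is taken by finite differences; a real implementation would use a network and automatic differentiation. The weight `lam=10.0` is an assumed default.

```python
import math
import random

def critic(x):
    """Hypothetical smooth critic on 2-D inputs, standing in for a neural network."""
    return 2.0 * x[0] + math.sin(x[1])

def grad(f, x, h=1e-5):
    """Central finite-difference gradient (autograd would replace this in practice)."""
    g = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        g.append((f(up) - f(down)) / (2 * h))
    return g

def gradient_penalty(f, real, fake, lam=10.0, rng=random):
    """lam * mean((||grad f(x_hat)||_2 - 1)^2) over random interpolates x_hat."""
    total = 0.0
    for xr, xf in zip(real, fake):
        eps = rng.random()  # uniform position on the line between the pair
        x_hat = [eps * r + (1 - eps) * g for r, g in zip(xr, xf)]
        norm = math.sqrt(sum(gi * gi for gi in grad(f, x_hat)))
        total += (norm - 1.0) ** 2
    return lam * total / len(real)

rng = random.Random(0)
real = [[rng.gauss(1, 1), rng.gauss(1, 1)] for _ in range(64)]
fake = [[rng.gauss(-1, 1), rng.gauss(-1, 1)] for _ in range(64)]

penalty = gradient_penalty(critic, real, fake, rng=rng)
unit_penalty = gradient_penalty(lambda x: x[0], real, fake, rng=rng)
```

The toy critic has gradient norm between 2 and $\sqrt{5}$ everywhere, so it is penalized, while the unit-slope critic `lambda x: x[0]` incurs essentially no penalty.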

Mode Collapse: Formal Analysis

Mode collapse occurs when the generator finds a local equilibrium where producing a subset of the real data distribution reliably fools the discriminator. Why doesn’t the discriminator correct this?

When the generator concentrates all probability mass on a single mode $x^*$:

The discriminator learns that $D(x^*) \approx 0$ (everything at $x^*$ is fake)
The generator shifts to another mode
The discriminator updates again

This cycling — generator chasing, discriminator catching up — never converges. The generator and discriminator are in a non-equilibrium limit cycle.
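The same failure to settle shows up in a much simpler game than a real GAN. The sketch below is not a GAN: it assumes the bilinear objective $V(\theta, \psi) = \theta\psi$ as a stand-in, with one scalar "generator" parameter and one scalar "discriminator" parameter updated simultaneously.

```python
import math

def simulate(theta=1.0, psi=0.0, lr=0.1, steps=200):
    """Simultaneous gradient steps on V(theta, psi) = theta * psi.

    theta (the 'generator') descends V, psi (the 'discriminator') ascends it.
    The unique equilibrium is (0, 0).
    """
    radii = []
    for _ in range(steps):
        grad_theta, grad_psi = psi, theta   # dV/dtheta, dV/dpsi
        theta, psi = theta - lr * grad_theta, psi + lr * grad_psi
        radii.append(math.hypot(theta, psi))  # distance from the equilibrium
    return radii

radii = simulate()
print(radii[0], radii[-1])
```

In this toy setting the iterates rotate around the equilibrium and the distance from it grows by a factor of $\sqrt{1 + lr^2}$ per step, so the players keep chasing each other instead of converging.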

Minibatch discrimination (Salimans et al., 2016): The discriminator receives multiple samples simultaneously and can detect if they’re too similar. This provides a direct signal against mode collapse.
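A simplified sketch of the idea: give each sample an extra feature that measures how close it is to the rest of its batch. This version compares raw feature vectors directly and omits the learned projection used in the original method, so treat it as an illustration of the signal rather than the published layer.

```python
import math

def similarity_features(batch):
    """For each sample, sum exp(-L1 distance) to every other sample in the batch.

    Simplified from Salimans et al.: the learned projection tensor is omitted
    and raw feature vectors are compared directly.
    """
    out = []
    for i, fi in enumerate(batch):
        score = 0.0
        for j, fj in enumerate(batch):
            if i != j:
                l1 = sum(abs(a - b) for a, b in zip(fi, fj))
                score += math.exp(-l1)
        out.append(score)
    return out

diverse = [[0.0, 0.0], [3.0, 1.0], [-2.0, 4.0], [5.0, -3.0]]
collapsed = [[1.0, 1.0], [1.01, 1.0], [1.0, 0.99], [1.0, 1.0]]

diverse_scores = similarity_features(diverse)
collapsed_scores = similarity_features(collapsed)
```

The collapsed batch produces scores close to the batch size minus one, while the diverse batch produces scores near zero. A discriminator that sees this feature can flag a batch of near-duplicates as fake.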

Unrolled GANs: Train the generator against the discriminator after $k$ unrolled update steps (essentially optimizing against the future discriminator). More expensive but reduces mode collapse.

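MAD-GAN and VEEGAN: MAD-GAN trains multiple generators and asks the discriminator to also identify which generator produced each fake sample, which pushes the generators to specialize in different modes. VEEGAN instead adds a reconstructor network that maps data back to the noise space, penalizing a generator that maps many latent codes onto the same few modes.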

StyleGAN Architecture in Detail

StyleGAN (Karras et al., 2019) separated global style from stochastic variation through an 8-layer mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$.

The latent code $z \in \mathcal{Z}$ (standard Gaussian) is mapped to an intermediate latent space $w \in \mathcal{W}$. The mapping disentangles features: $\mathcal{W}$ is a learned, more organized latent space where directions correspond more cleanly to interpretable attributes.

Style is injected at each resolution level through AdaIN (Adaptive Instance Normalization):

$$\text{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

Where $y_{s,i}$ (scale) and $y_{b,i}$ (bias) are produced from $w$ by learned affine transformations. This controls the style at each resolution — coarse styles (pose, hair length) controlled by early layers, fine styles (freckles, exact hair color) by later layers.
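A minimal sketch of the AdaIN formula applied to a single channel, written with plain lists. The values of `y_s` and `y_b` are hard-coded here for illustration; in StyleGAN they would come from the learned affine transformation of $w$.

```python
import math

def adain(channel, scale, bias, eps=1e-8):
    """Normalize one feature-map channel to zero mean and unit std, then apply style."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var + eps)  # eps guards against division by zero
    return [scale * (v - mean) / std + bias for v in channel]

# One channel of a feature map, flattened over spatial positions.
x_i = [0.5, 2.0, -1.0, 3.5, 1.0]
y_s, y_b = 2.0, -0.5
styled = adain(x_i, y_s, y_b)

out_mean = sum(styled) / len(styled)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in styled) / len(styled))
```

After the operation the channel's mean equals `y_b` and its standard deviation equals `y_s`, regardless of the incoming statistics. That is why each layer's style overrides the statistics set by earlier layers.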

Stochastic variation (individual hair placement, skin pores) is added by injecting Gaussian noise at each layer before AdaIN.

StyleGAN2 fixed the characteristic “blob” artifacts, which it traced to AdaIN’s instance normalization, by replacing that normalization with weight demodulation. It also dropped progressive growing in favor of skip and residual connections.
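Weight demodulation moves the style into the convolution weights: each input channel is scaled by its style, then each output channel's weights are rescaled to unit norm. A sketch on a toy weight tensor stored as nested lists (the layout `weight[j][i][k]` and the function name are assumptions of this example):

```python
import math

def modulate_demodulate(weight, styles, eps=1e-8):
    """weight[j][i][k]: output channel j, input channel i, spatial tap k.

    Modulate: scale each input channel i by its style s_i.
    Demodulate: rescale each output channel j to unit L2 norm.
    """
    result = []
    for out_ch in weight:
        modulated = [[styles[i] * w for w in taps] for i, taps in enumerate(out_ch)]
        norm = math.sqrt(sum(w * w for taps in modulated for w in taps) + eps)
        result.append([[w / norm for w in taps] for taps in modulated])
    return result

# Toy layer: 2 output channels, 2 input channels, 3 spatial taps each.
weight = [
    [[0.2, -0.1, 0.4], [0.3, 0.0, -0.5]],
    [[1.0, 0.5, -0.2], [-0.7, 0.1, 0.6]],
]
styles = [3.0, 0.5]  # per-input-channel scales
new_weight = modulate_demodulate(weight, styles)

norms = [math.sqrt(sum(w * w for taps in out_ch for w in taps)) for out_ch in new_weight]
```

Because the rescaling depends only on the weights and styles, not on the actual activations, the layer no longer normalizes each feature map by its own measured statistics.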

Measuring GAN Quality: FID

Fréchet Inception Distance (FID, Heusel et al., 2017) is the standard quantitative metric for image generation quality:

$$FID = \|\mu_r - \mu_g\|_2^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$$

Where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception-v3 features for real and generated images respectively.

FID measures both quality (are samples realistic?) and diversity (do samples cover the real distribution?). Lower is better: a generator in mode collapse gets a poor (high) FID even if each individual sample looks realistic, because its feature covariance is far narrower than that of the real data.
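The diversity term can be seen in a simplified setting. The sketch below assumes diagonal covariances, in which case the trace term reduces to $\sum_k (\sigma_{r,k} - \sigma_{g,k})^2$, and uses random Gaussian vectors in place of Inception-v3 features. A full implementation needs the matrix square root of $\Sigma_r\Sigma_g$.

```python
import math
import random

def mean_and_var(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    n, d = len(features), len(features[0])
    mu = [sum(f[k] for f in features) / n for k in range(d)]
    var = [sum((f[k] - mu[k]) ** 2 for f in features) / n for k in range(d)]
    return mu, var

def fid_diagonal(real, gen):
    """FID under the simplifying assumption of diagonal covariances."""
    mu_r, var_r = mean_and_var(real)
    mu_g, var_g = mean_and_var(gen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    trace_term = sum(a + b - 2 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + trace_term

rng = random.Random(0)

def sample(std, n=2000, d=4):
    """Stand-in 'features': n Gaussian vectors of dimension d."""
    return [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(n)]

real = sample(1.0)
healthy = sample(1.0)      # matches the real spread
collapsed = sample(0.05)   # same mean, almost no diversity

fid_healthy = fid_diagonal(real, healthy)
fid_collapsed = fid_diagonal(real, collapsed)
```

The collapsed generator matches the real mean but scores far worse than the healthy one, because the covariance term penalizes the missing spread.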

Limitations: FID is sensitive to the number of samples used (standard: 50k), to the Inception model version, and doesn’t correlate perfectly with human preference. CLIP-based metrics (FID in CLIP feature space) have become complementary benchmarks.

Conditional Generation and Text-to-Image

AC-GAN (Odena et al., 2017): The class label is fed to the generator along with the noise vector. The discriminator does not receive the label; instead it gains an auxiliary head that classifies each sample into one of the class labels, in addition to predicting real vs. fake.

BigGAN: Used class-conditional batch normalization, where the batch norm scale and shift are class-specific and injected throughout the generator. Also introduced the truncation trick: sampling $z$ from a truncated Gaussian (resampling any value whose magnitude exceeds a threshold) improves quality at the expense of diversity.
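A minimal sketch of truncated sampling, assuming the variant that redraws out-of-range components rather than clamping them. The function name, latent dimension, and threshold are illustrative choices.

```python
import random

def truncated_normal(dim, threshold, rng):
    """Sample z ~ N(0, I), redrawing any component with |z_k| > threshold."""
    z = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        while abs(value) > threshold:
            value = rng.gauss(0.0, 1.0)
        z.append(value)
    return z

rng = random.Random(0)
z_full = [rng.gauss(0.0, 1.0) for _ in range(128)]       # ordinary latent
z_trunc = truncated_normal(128, threshold=0.5, rng=rng)  # truncated latent
```

A smaller threshold keeps $z$ near the high-density center of the prior, where the generator has been trained most heavily. That usually gives cleaner samples but fewer distinct ones.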

Text-to-image GANs: StackGAN (Zhang et al., 2017) used two-stage synthesis — coarse 64×64 then refined 256×256 — conditioned on text embeddings. AttnGAN added attention over words in the text description to generate specific visual attributes. These were superseded by DALL-E (2021, VQ-VAE + transformer) and DALL-E 2 (2022, diffusion + CLIP).

The GAN vs. Diffusion Landscape in 2024

By 2024, the generation landscape was:

Task	Dominant Approach
Text-to-image (quality)	Diffusion (Stable Diffusion 3, Flux, DALL-E 3)
Real-time image generation	GANs (StyleGAN3, or GAN-distilled diffusion)
Video generation	Diffusion + transformer (Sora, Runway)
Voice cloning	Diffusion (Voicebox, ElevenLabs)
Data augmentation	GANs (domain-specific)
Face swapping	GAN-based pipelines

Consistency models and flow matching (the latter used in Stable Diffusion 3) cut the number of sampling steps, bringing diffusion inference time closer to GAN speed. Research into distilling diffusion models into single-step generators (Distribution Matching Distillation, Adversarial Diffusion Distillation) produced models that combine diffusion quality with GAN-like inference speed.

Pure adversarial training hasn’t disappeared — it’s embedded in many modern architectures as a loss component even when the primary training signal is something else.

One thing to remember: the GAN’s theoretical contribution, framing generation as a two-player game, was more important than any specific architecture. That adversarial loss idea now appears as a component in many models that aren’t “GANs” in the classical sense.
