Generative AI — Deep Dive

The math behind diffusion, the emergent behavior of transformers at scale, mode collapse, RLHF alignment, and why inference costs will reshape the market more than training did.

The Core Technical Landscape

Generative modeling is, at its core, a problem of learning a probability distribution. Given a dataset of images, text, or audio, we want to learn the distribution P(x) well enough to sample new x that look like they came from that distribution.

The question that divides architectures is: how do you represent and sample from that distribution efficiently?

Diffusion Models: The Math That Actually Works

Forward Process: Destroying Data Systematically

Diffusion models (Ho et al., 2020, “Denoising Diffusion Probabilistic Models”) define a Markov chain that gradually corrupts data with Gaussian noise over T timesteps:

q(x_t | x_{t-1}) = N(x_t; sqrt(1-β_t) * x_{t-1}, β_t * I)

Where β_t is a noise schedule (linear, cosine, or learned). After enough steps, x_T ≈ N(0, I) — pure noise. The key insight from DDPM is that you can derive a closed-form for the noisy image at any timestep t directly from x_0:

x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε
where α_t = 1 - β_t and ᾱ_t = Π α_i for i = 1..t

This means you don’t need to run the full chain during training — you sample t randomly, corrupt x_0 directly to x_t, and train the network to predict ε.
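The closed-form corruption above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the 4×4 array stands in for a normalized image, and the linear schedule endpoints (1e-4 to 0.02 over 1,000 steps) are the values reported in Ho et al. (2020).

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1..beta_T (endpoints as in Ho et al., 2020)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 1-based timestep."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)        # alpha_bar_t = prod_{i<=t} (1 - beta_i)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))   # stand-in for a normalized image
t = int(rng.integers(1, T + 1))            # random timestep, as in training
x_t, eps = q_sample(x0, t, alpha_bar, rng)
# A network eps_theta(x_t, t) would be trained to regress onto eps (MSE loss).
```

With this schedule, alpha_bar at the last step is close to zero, so x_T is almost entirely noise.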

Reverse Process: Denoising as Generation

The model learns a neural network ε_θ(x_t, t) that predicts the noise added to x_0 to get x_t. At inference, starting from x_T ~ N(0,I), you iteratively denoise:

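x_{t-1} = (1/sqrt(α_t)) * (x_t - (β_t/sqrt(1-ᾱ_t)) * ε_θ(x_t, t)) + σ_t * z
where z ~ N(0, I) is fresh noise (z = 0 on the final step) and σ_t² = β_t is one common choice of variance.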

This is slow. The original DDPM required 1,000 denoising steps. DDIM (Song et al., 2020) cut this to roughly 20-50 steps with only a small quality loss by replacing the stochastic reverse chain with a deterministic sampler, which can be interpreted as discretizing an ODE.
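A single reverse step of the DDPM update can be sketched as follows. The sketch assumes σ_t² = β_t (one of the variance choices discussed by Ho et al.) and uses a hypothetical `dummy_eps_model` in place of a trained network, so the loop shows the mechanics only and does not produce a meaningful sample.

```python
import numpy as np

def ddpm_step(x_t, t, eps_pred, betas, alpha_bar, rng):
    """One ancestral sampling step x_t -> x_{t-1}; t is a 1-based timestep."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_pred) / np.sqrt(alpha_t)
    if t == 1:
        return mean                          # no noise is added on the final step
    sigma_t = np.sqrt(beta_t)                # assumption: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def dummy_eps_model(x_t, t):
    """Hypothetical stand-in for a trained noise predictor eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

T = 50                                       # short chain, for illustration only
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))              # start from x_T ~ N(0, I)
for t in range(T, 0, -1):
    x = ddpm_step(x, t, dummy_eps_model(x, t), betas, alpha_bar, rng)
```

A useful sanity check: at t = 1, feeding in the true noise recovers x_0 exactly, because 1 - ᾱ_1 = β_1.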

Classifier-Free Guidance: How Text Prompts Actually Work

Conditional diffusion models (text-to-image) use classifier-free guidance (Ho & Salimans, 2021). During training, the text conditioning c is randomly dropped 10-20% of the time. At inference, you compute two noise predictions — conditioned and unconditioned — and extrapolate:

ε_guided = ε_uncond + w * (ε_cond - ε_uncond)

The guidance weight w controls adherence to the prompt. w=7.5 is typical for Stable Diffusion. Higher w → more prompt-faithful but less diverse and potentially oversaturated. This is why “CFG scale” is a slider in most image generation UIs.
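The guidance formula is simple enough to state directly in code. In this minimal sketch, the two input arrays are made-up stand-ins for the conditioned and unconditioned outputs of a noise-prediction network, and `cfg_combine` is a hypothetical helper name.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for w > 1, past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # stand-in for the unconditioned prediction
eps_cond = np.array([1.0, 1.0])     # stand-in for the text-conditioned prediction

no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)   # equals eps_uncond
plain_cond = cfg_combine(eps_uncond, eps_cond, 1.0)    # equals eps_cond
guided = cfg_combine(eps_uncond, eps_cond, 7.5)        # extrapolates past eps_cond
```

With w = 0 the prompt is ignored, with w = 1 the model behaves as an ordinary conditional model, and with w > 1 the difference between the two predictions is amplified.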

Latent Diffusion: Why Stable Diffusion is Fast

Running diffusion in pixel space at 512×512 is expensive. Stable Diffusion (Rombach et al., 2022) operates in a compressed latent space: a pretrained VAE encodes each 512×512×3 image into a 64×64×4 latent tensor. Diffusion happens there, and the VAE decoder projects the result back to pixels.

The latent has 64x fewer spatial positions (8x downsampling per side) and about 48x fewer values overall, which is why SD can run on consumer hardware. The tradeoff: the VAE introduces artifacts. The characteristic soft edges in AI images come partly from decoding back through the VAE, not from the diffusion model itself.
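The size reduction is easy to verify with a little arithmetic. The shapes below are the ones quoted above (a 512×512 RGB image and a 64×64×4 latent); the variable names are only illustrative.

```python
from math import prod

pixel_shape = (512, 512, 3)    # height, width, RGB channels
latent_shape = (64, 64, 4)     # height, width, latent channels

# Reduction in spatial positions (8x per side)
spatial_ratio = (pixel_shape[0] * pixel_shape[1]) / (latent_shape[0] * latent_shape[1])
# Reduction in total number of values the diffusion model has to process
value_ratio = prod(pixel_shape) / prod(latent_shape)

print(spatial_ratio, value_ratio)  # 64.0 48.0
```

The two ratios differ because the latent trades three color channels for four latent channels.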

Transformer-Based Generative Models: Why Scale Is Nonlinear

The Emergent Behavior Problem

LLMs exhibit emergent capabilities: tasks they couldn’t do at smaller scale that they suddenly can at larger scale, with no smooth transition. Wei et al. (2022) documented this across dozens of tasks. For example, accuracy on 3-digit arithmetic can sit near zero across several model sizes and then jump sharply once training compute crosses some threshold.

Why? Hypotheses include:

Phase transitions: Multiple sub-skills must all work simultaneously; they all fail until they all succeed
Percolation: The skill requires a “critical path” of learned components; once all components are acquired, the skill unlocks
Measurement artifacts: Some benchmarks are binary (right/wrong), hiding smooth improvements in partial credit

This is practically important because it means capability forecasting is hard. A model 10x larger isn’t simply 10x better — it might not be meaningfully better until it crosses an unknown threshold, then be dramatically better.
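The measurement-artifact hypothesis can be illustrated with a toy calculation. The assumptions are deliberately simple and are mine, not taken from any cited paper: an answer is k tokens long, each token is correct independently with the same probability, and the benchmark only awards credit for an exact match.

```python
import numpy as np

k = 10                                       # hypothetical answer length in tokens
per_token_acc = np.linspace(0.5, 1.0, 6)     # smooth improvement: 0.5, 0.6, ..., 1.0
exact_match = per_token_acc ** k             # all k tokens must be right

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token accuracy {p:.1f} -> exact-match rate {em:.3f}")
```

Per-token accuracy rises in even steps, yet the exact-match rate stays near zero for most of the range and then climbs steeply, which looks like a sudden jump on a binary benchmark.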

Scaling Laws: The Chinchilla Correction

The original OpenAI scaling laws (Kaplan et al., 2020) suggested more parameters were usually better than more data. This led to the GPT-3 approach: 175B parameters, relatively modest training data.

Hoffmann et al. (2022), the Chinchilla paper, corrected this. Compute-optimal training scales parameters and tokens in roughly equal proportion, which works out to about 20 tokens per parameter. Chinchilla (70B params, 1.4T tokens) outperformed GPT-3 (175B params, 300B tokens) on most benchmarks while being about 2.5x cheaper per token at inference, since inference cost scales with parameter count (175B / 70B = 2.5).

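The post-Chinchilla era: major labs shifted toward training smaller models on more data. Llama 3 (8B) was trained on 15T tokens, far beyond the Chinchilla-optimal budget for a model that size, because the extra training cost is paid once while the inference savings recur on every request. Inference cost per token drops as the model shrinks, and smaller models can be deployed on edge hardware.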
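The tokens-per-parameter ratios behind these comparisons can be worked out directly. The parameter and token counts are the ones quoted above. The training-compute estimate uses the common rule of thumb C ≈ 6·N·D FLOPs, which is an approximation introduced here for illustration and not a figure from the text.

```python
models = {
    # name: (parameters N, training tokens D), as quoted in the text
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 3 8B": (8e9, 15e12),
}

def training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute, C ~ 6 * N * D (an approximation)."""
    return 6.0 * n_params * n_tokens

for name, (n, d) in models.items():
    print(f"{name}: {d / n:.1f} tokens/param, ~{training_flops(n, d):.2e} training FLOPs")

# Per-token inference cost scales roughly with parameter count
inference_ratio = models["GPT-3"][0] / models["Chinchilla"][0]
```

GPT-3 comes out at under 2 tokens per parameter, Chinchilla at about 20, and Llama 3 8B at nearly 1,900, which shows how far training practice has moved past the compute-optimal point in order to make serving cheaper.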

Instruction Tuning and RLHF: The Gap Between Pretraining and Usefulness

A pretrained LLM is a text completion engine. Ask “What is the capital of France?” and it might continue with “This is a common geography question that…” because that’s the kind of text that follows such prompts in its training distribution.

Instruction tuning fine-tunes on (instruction, response) pairs to teach the model to answer directly. RLHF (Reinforcement Learning from Human Feedback, Christiano et al., 2017; popularized by InstructGPT 2022) goes further:

Sample multiple completions from the model
Have humans rank them
Train a reward model on the rankings
Fine-tune the LLM with PPO to maximize reward
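The reward-model step is commonly trained with a pairwise ranking loss: for each human comparison, the reward of the preferred completion should exceed the reward of the rejected one. The sketch below shows that loss on made-up scalar rewards. The function name and the numbers are hypothetical, and a real reward model would produce these scores from a neural network.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs.

    Uses log(1 + exp(-d)) = logaddexp(0, -d) for numerical stability.
    """
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Made-up reward scores for three (preferred, rejected) pairs
loss_untrained = pairwise_reward_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
loss_better = pairwise_reward_loss([2.0, 1.5, 3.0], [0.0, 0.5, -1.0])
loss_wrong = pairwise_reward_loss([0.0, 0.5, -1.0], [2.0, 1.5, 3.0])
```

When the model cannot tell the completions apart the loss is log 2. It falls as preferred completions score higher and rises when the ranking is inverted.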

The practical result: the same base model, after RLHF, becomes dramatically more useful as an assistant. It is also pushed toward “helpful, harmless, honest” behavior, which reduces confident confabulation but does not eliminate it.

Constitutional AI (Anthropic, 2022) reduces reliance on human labelers by having the model critique its own outputs against a set of principles. Claude is trained this way.

Mode Collapse and Training Instabilities

GAN Mode Collapse

The adversarial training dynamic makes GANs notoriously unstable. Mode collapse occurs when the generator learns to produce a small subset of plausible outputs (e.g., one face type) because the discriminator fails to penalize repetition. Symptoms: all generated samples look similar; training curves oscillate.

Mitigations:

Wasserstein GAN (2017): Replaces the discriminator with a critic, uses Wasserstein distance — more stable gradients
Minibatch discrimination: Lets discriminator compare samples in a batch
Progressive growing (PGGAN): Train at low resolution, gradually add layers for higher resolution

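Despite these fixes, GANs largely lost ground to diffusion models after 2022. Diffusion training is a plain regression problem (mean-squared error on the predicted noise) with no adversarial game, so it is far more stable in practice.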

Memorization vs. Generalization

A key research question: do generative models memorize or generalize? The answer is both, depending on how often training examples appear.

Carlini et al. (2023) showed that data occurring many times in training has a measurable probability of being extractable. Single-occurrence data rarely is. This has implications for privacy and copyright — models trained on proprietary data can sometimes reproduce it verbatim under specific prompts.

Watermarking schemes (e.g., Kirchenbauer et al., 2023) embed statistical signatures in LLM output to enable provenance detection — important as synthetic text becomes pervasive.
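The detection side of a green-list watermark can be sketched with a one-proportion z-test. This is a heavily simplified toy, not the scheme as published: the hash-based `is_green` rule, the integer "tokens", and the greedy `generate_watermarked` helper are all assumptions made for illustration. The idea it shows is that a fraction γ of tokens counts as "green" at each position, watermarked text over-uses green tokens, and the detector measures how far the green count sits above chance.

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.5):
    """Toy green-list rule: hash the (previous, current) pair into [0, 1)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def watermark_z_score(tokens, gamma=0.5):
    """z = (green - gamma * n) / sqrt(n * gamma * (1 - gamma)) over n scored tokens."""
    n = len(tokens) - 1                      # the first token has no predecessor
    green = sum(is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))

def generate_watermarked(n_tokens, vocab_size=100, gamma=0.5, start=0):
    """Toy 'hard' watermark: always emit the first green token in the vocabulary."""
    tokens = [start]
    for _ in range(n_tokens):
        prev = tokens[-1]
        tokens.append(next(tok for tok in range(vocab_size) if is_green(prev, tok, gamma)))
    return tokens

marked = generate_watermarked(50)
z_marked = watermark_z_score(marked)         # every scored token is green
```

Because every scored token in `marked` is green, the z-score equals sqrt(n), about 7 for 50 tokens. Unwatermarked text would be expected to score near zero.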

Inference Cost: The Hidden Constraint

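Training a GPT-4-scale model reportedly costs on the order of $100M. But training is a one-time cost, while inference cost grows with usage. At ChatGPT’s reported scale (100M+ weekly active users by late 2023), cumulative inference spend can plausibly overtake the training bill within weeks to months.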

This creates strong pressure toward:

Quantization: Reducing parameter precision from float16 to int8 or int4, with 2-4x memory savings and modest quality loss
Speculative decoding: A small draft model generates tokens; a larger model verifies batches — same quality, 2-3x speedup
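KV cache management: Without caching, every new token would recompute keys and values for the whole prefix. Caching them means each decoding step only attends from the new token to the stored entries, so per-step cost is linear rather than quadratic in context length. The cache itself grows linearly with context, which makes efficient cache management critical for long contexts.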
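Mixture of Experts (MoE): Route each token through a small subset of “expert” networks. GPT-4 is reportedly an MoE that activates only a few experts per token out of a larger pool, though OpenAI has not confirmed the architecture. Total parameter count stays large, but each token uses only a fraction of the compute.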
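Of the techniques listed above, quantization is the easiest to show in miniature. The sketch below applies symmetric per-tensor int8 quantization to a random weight matrix. It is a simplified illustration under stated assumptions (one scale for the whole tensor, weights not all zero, hypothetical function names). Production schemes typically use per-channel or per-group scales and calibration data.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ~ scale * q, with q in [-127, 127].

    Assumes w is not all zeros (otherwise scale would be 0).
    """
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float16)   # stand-in fp16 weights

q, scale = quantize_int8(w.astype(np.float32))
w_hat = dequantize(q, scale)

memory_ratio = w.nbytes / q.nbytes                        # fp16 bytes / int8 bytes
max_err = float(np.max(np.abs(w.astype(np.float32) - w_hat)))
```

Storage drops by 2x going from float16 to int8, and the worst-case rounding error per weight is about half of `scale`. A single outlier weight inflates `scale` for the whole tensor, which is one reason finer-grained scales are used in practice.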

What Comes Next

Multimodal integration is already here — GPT-4V, Gemini Ultra, Claude 3 all reason across text and images. Video models (Sora) process spatiotemporal data at the cost of enormous compute.

Test-time compute scaling is the 2025 frontier. Instead of adding parameters, give the model more tokens to think with (chain-of-thought, o1-style reasoning). The scaling curve for inference-time compute appears to keep improving where pre-training curves flatten.

Regulatory pressure will reshape training pipelines. The EU AI Act (in force since August 2024, with obligations phasing in through 2026 and later) requires transparency about training data. US copyright litigation (Getty vs. Stability, various authors vs. OpenAI) will help establish whether training on copyrighted data without a license is infringement. The likely outcomes are licensing deals at scale or a shift to synthetic training data.

One Thing to Remember

The reason generative AI feels magical is that it learned the joint distribution of human expression well enough to sample from it. The reason it fails predictably is that sampling from a distribution isn’t the same as reasoning about it — and the difference only shows up when you push it off the distribution’s edge.
