Generative AI — Deep Dive
The Core Technical Landscape
Generative modeling is, at its core, a problem of learning a probability distribution. Given a dataset of images, text, or audio, we want to learn the distribution P(x) well enough to sample new x that look like they came from that distribution.
The question that divides architectures is: how do you represent and sample from that distribution efficiently?
Diffusion Models: The Math That Actually Works
Forward Process: Destroying Data Systematically
Diffusion models (Ho et al., 2020, “Denoising Diffusion Probabilistic Models”) define a Markov chain that gradually corrupts data with Gaussian noise over T timesteps:
q(x_t | x_{t-1}) = N(x_t; sqrt(1-β_t) * x_{t-1}, β_t * I)
where β_t follows a noise schedule (linear, cosine, or learned). After enough steps, x_T is approximately distributed as N(0, I), which is pure noise. A key result used by DDPM is that the noisy sample at any timestep t has a closed form in terms of x_0:
x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε
where α_t = 1 - β_t, ᾱ_t = Π α_i for i=1..t, and ε ~ N(0, I)
This means you don’t need to run the full chain during training — you sample t randomly, corrupt x_0 directly to x_t, and train the network to predict ε.
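A minimal sketch of that shortcut in plain Python, assuming a linear schedule with the endpoints commonly quoted for DDPM (1e-4 to 0.02 over 1,000 steps). The function names and the four-value toy "image" are illustrative choices, not taken from any library.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (endpoints are assumed defaults)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod over i <= t of (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Jump straight from x_0 to x_t (t is 1-indexed). Returns (x_t, eps)."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
betas = linear_beta_schedule(T=1000)
abar = alpha_bars(betas)

x0 = [0.5, -1.0, 0.25, 0.0]           # toy "image" with four pixels
t = rng.randint(1, 1000)              # training picks a random timestep
xt, eps = q_sample(x0, t, abar, rng)  # eps is the regression target for the network
```

In a real training loop, the loss would be the mean squared error between eps and the network's prediction for (xt, t).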
Reverse Process: Denoising as Generation
The model learns a neural network ε_θ(x_t, t) that predicts the noise added to x_0 to get x_t. At inference, starting from x_T ~ N(0,I), you iteratively denoise:
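x_{t-1} = (1/sqrt(α_t)) * (x_t - (β_t/sqrt(1-ᾱ_t)) * ε_θ(x_t, t)) + σ_t * z
Here z ~ N(0, I) is fresh noise (omitted on the final step) and σ_t sets how much noise is re-injected; σ_t² = β_t is one choice used in DDPM. Sampling this way is slow: the original DDPM used 1,000 denoising steps. DDIM (Song et al., 2020) cut this to roughly 20-50 steps with only a modest quality loss by replacing the Markov chain with a non-Markovian process whose sampler can be made deterministic. That deterministic sampler can be read as a discretization of an ODE.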
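The update rule can be sketched as a single function. This is an illustration under stated assumptions: σ_t² = β_t, the same linear schedule as DDPM's commonly quoted defaults, and a hypothetical `oracle_eps` that stands in for the trained network ε_θ by solving the closed-form corruption equation with the true x_0. A real sampler has no access to x_0.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
abar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    abar.append(running)

def ddpm_reverse_step(xt, t, eps_pred, betas, abar, rng):
    """One ancestral sampling step x_t -> x_{t-1}; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - abar[t - 1])
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(xt, eps_pred)]
    if t == 1:
        return mean                      # no noise is added on the final step
    sigma_t = math.sqrt(beta_t)          # assumes sigma_t^2 = beta_t
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]

def oracle_eps(xt, t, x0):
    """Hypothetical stand-in for eps_theta: recovers eps using the true x_0."""
    a = abar[t - 1]
    return [(x - math.sqrt(a) * x0_i) / math.sqrt(1.0 - a) for x, x0_i in zip(xt, x0)]

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 0.0]
x = [rng.gauss(0.0, 1.0) for _ in x0]    # start from x_T ~ N(0, I)
for t in range(T, 0, -1):
    x = ddpm_reverse_step(x, t, oracle_eps(x, t, x0), betas, abar, rng)
```

With a perfect noise predictor the chain ends at x_0, which is a useful sanity check on the coefficients. The loop also makes the cost visible: one network call per step, 1,000 times.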
Classifier-Free Guidance: How Text Prompts Actually Work
Conditional diffusion models (text-to-image) use classifier-free guidance (Ho & Salimans, 2021). During training, the text conditioning c is randomly dropped 10-20% of the time. At inference, you compute two noise predictions — conditioned and unconditioned — and extrapolate:
ε_guided = ε_uncond + w * (ε_cond - ε_uncond)
The guidance weight w controls adherence to the prompt. w=7.5 is typical for Stable Diffusion. Higher w → more prompt-faithful but less diverse and potentially oversaturated. This is why “CFG scale” is a slider in most image generation UIs.
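Both halves of classifier-free guidance fit in a few lines. The sketch below is illustrative only: `maybe_drop_condition` and `cfg_combine` are hypothetical helper names, the prompt string is made up, and the noise predictions are toy numbers, not model output.

```python
import random

def maybe_drop_condition(cond, p_drop, rng):
    """Training side: swap the conditioning for None (the null prompt) with probability p_drop."""
    return None if rng.random() < p_drop else cond

def cfg_combine(eps_uncond, eps_cond, w):
    """Inference side: eps_guided = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

rng = random.Random(0)
kept = sum(maybe_drop_condition("a red bicycle", 0.1, rng) is not None
           for _ in range(10_000))

eps_uncond = [0.10, -0.20, 0.30]   # toy values
eps_cond = [0.20, -0.10, 0.10]
unguided = cfg_combine(eps_uncond, eps_cond, 1.0)   # w = 1 recovers eps_cond
guided = cfg_combine(eps_uncond, eps_cond, 7.5)     # w > 1 extrapolates past eps_cond
```

Note that guidance doubles the cost of each denoising step, since the network is evaluated once with the prompt and once without it.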
Latent Diffusion: Why Stable Diffusion is Fast
Running diffusion in pixel space at 512×512 is expensive. Stable Diffusion (Rombach et al., 2022) instead operates in a compressed latent space: a pretrained VAE encodes each image into a 64×64×4 latent tensor, diffusion runs on that tensor, and the VAE decoder projects the result back to pixels.
This compression is why SD can run on consumer hardware: the latent has 64x fewer spatial positions than the image (8x per side) and about 48x fewer values overall (512×512×3 versus 64×64×4). The tradeoff is that the VAE introduces artifacts. The characteristic soft edges in AI images come partly from decoding back through the VAE, not from the diffusion model itself.
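The compression figures follow directly from the tensor shapes quoted above. A quick check, assuming a 3-channel RGB input:

```python
def numel(shape):
    """Number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_shape = (512, 512, 3)    # RGB image
latent_shape = (64, 64, 4)     # VAE latent

spatial_ratio = (pixel_shape[0] * pixel_shape[1]) / (latent_shape[0] * latent_shape[1])
value_ratio = numel(pixel_shape) / numel(latent_shape)
print(f"spatial positions: {spatial_ratio:.0f}x fewer, total values: {value_ratio:.0f}x fewer")
```

The spatial ratio is the one that matters most for attention layers, whose cost grows with the square of the number of positions.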
Transformer-Based Generative Models: Why Scale Is Nonlinear
The Emergent Behavior Problem
LLMs exhibit emergent capabilities: tasks they cannot do at smaller scale but can do at larger scale, with an abrupt rather than smooth transition. Wei et al. (2022) documented this across dozens of tasks. For example, accuracy on 3-digit arithmetic stays near zero as training compute grows, then rises sharply once compute passes a threshold.
Why? Hypotheses include:
- Phase transitions: Multiple sub-skills must all work simultaneously; they all fail until they all succeed
- Percolation: The skill requires a “critical path” of learned components; once all components are acquired, the skill unlocks
- Measurement artifacts: Some benchmarks are binary (right/wrong), hiding smooth improvements in partial credit
This is practically important because it means capability forecasting is hard. A model 10x larger isn’t simply 10x better — it might not be meaningfully better until it crosses an unknown threshold, then be dramatically better.
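The measurement-artifact hypothesis is easy to see in a toy calculation. The sketch assumes that each output token is correct independently with the same probability, and the accuracy values and answer length are invented for illustration. Under those assumptions, exact-match accuracy on a k-token answer is the per-token accuracy raised to the power k.

```python
def exact_match(per_token_acc, answer_len):
    """Probability that every token of the answer is right (independence assumed)."""
    return per_token_acc ** answer_len

ANSWER_LEN = 10   # assumed answer length in tokens
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]   # smooth improvement with scale

for p in per_token:
    print(f"per-token accuracy {p:.2f} -> exact match {exact_match(p, ANSWER_LEN):.3f}")

low = exact_match(0.50, ANSWER_LEN)
high = exact_match(0.99, ANSWER_LEN)
```

A steady climb in per-token accuracy looks like a sudden jump when scored as all-or-nothing, which is why the choice of metric matters when forecasting capabilities.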
Scaling Laws: The Chinchilla Correction
The original OpenAI scaling laws (Kaplan et al., 2020) suggested more parameters were usually better than more data. This led to the GPT-3 approach: 175B parameters, relatively modest training data.
Hoffmann et al. (2022), the Chinchilla paper, corrected this. For a fixed compute budget, parameters and training tokens should be scaled in roughly equal proportion, which works out to about 20 tokens per parameter. Chinchilla (70B params, 1.4T tokens) outperformed GPT-3 (175B params, 300B tokens) on most benchmarks while being about 2.5x cheaper per token at inference.
The post-Chinchilla era: most major labs shifted toward training smaller models on more data, often well past the compute-optimal ratio. Llama 3 (8B) was trained on 15T tokens. The extra training compute pays off at deployment, because inference cost per token drops as the model shrinks and smaller models can be deployed on edge hardware.
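The numbers quoted in this section can be compared directly. The sketch below uses the parameter and token counts from the text and the common rule of thumb that training cost is about 6 × parameters × tokens FLOPs. That rule is an approximation introduced here for illustration, not a figure from the text.

```python
def train_flops(params, tokens):
    """Rough training cost via the C ~ 6 * N * D approximation (an assumption)."""
    return 6.0 * params * tokens

models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 3 8B": (8e9, 15e12),
}

tokens_per_param = {name: d / n for name, (n, d) in models.items()}

for name, (n, d) in models.items():
    print(f"{name:>11}: {tokens_per_param[name]:8.1f} tokens/param, "
          f"~{train_flops(n, d):.2e} training FLOPs")
```

Under this approximation, GPT-3 sits below 2 tokens per parameter, Chinchilla at 20, and Llama 3 8B near 1,900. Chinchilla's estimated training compute is also larger than GPT-3's, so the comparison between those two says more about how compute is allocated than about equal budgets.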
Instruction Tuning and RLHF: The Gap Between Pretraining and Usefulness
A pretrained LLM is a text completion engine. Ask “What is the capital of France?” and it might continue with “This is a common geography question that…” because that’s what follows such prompts in its training distribution.
Instruction tuning fine-tunes on (instruction, response) pairs to teach the model to answer directly. RLHF (Reinforcement Learning from Human Feedback, Christiano et al., 2017; popularized by InstructGPT 2022) goes further:
- Sample multiple completions from the model
- Have humans rank them
- Train a reward model on the rankings
- Fine-tune the LLM with PPO to maximize reward
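The reward-model step is usually trained with a pairwise ranking loss. The sketch below shows one common form, -log σ(r_better - r_worse), and how a human ranking expands into training pairs. The function names and the completion labels are illustrative, and the reward values are made-up scalars, not model outputs.

```python
import math

def pairwise_ranking_loss(r_better, r_worse):
    """-log(sigmoid(r_better - r_worse)), written in a numerically stable form."""
    diff = r_better - r_worse
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def ranking_to_pairs(ranked):
    """Expand a best-first ranking into every (better, worse) pair."""
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]

ranked_completions = ["A", "B", "C", "D"]        # human ranking, best first
pairs = ranking_to_pairs(ranked_completions)     # 6 comparisons from one ranking

agrees = pairwise_ranking_loss(2.0, -1.0)        # reward model agrees with the human
disagrees = pairwise_ranking_loss(-1.0, 2.0)     # reward model disagrees
```

The loss is small when the reward model scores the preferred completion higher and large when it does not. Ranking K completions yields K(K-1)/2 comparisons, which is why labelers are asked to rank several outputs at once.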
The practical result: the same base model, after RLHF, becomes dramatically more useful as an assistant. It is also tuned toward being “helpful, harmless, and honest”, which tends to reduce confident confabulation but does not eliminate it.
Constitutional AI (Anthropic, 2022) reduces reliance on human labelers by having the model critique its own outputs against a set of principles. Claude is trained this way.
Mode Collapse and Training Instabilities
GAN Mode Collapse
The adversarial training dynamic makes GANs notoriously unstable. Mode collapse occurs when the generator learns to produce a small subset of plausible outputs (e.g., one face type) because the discriminator fails to penalize repetition. Symptoms: all generated samples look similar; training curves oscillate.
Mitigations:
- Wasserstein GAN (2017): Replaces the discriminator with a critic, uses Wasserstein distance — more stable gradients
- Minibatch discrimination: Lets discriminator compare samples in a batch
- Progressive growing (PGGAN): Train at low resolution, gradually add layers for higher resolution
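Despite these fixes, GANs largely lost ground to diffusion models for image generation from around 2022 onward. Diffusion training is far more stable: the objective is a plain regression loss on the predicted noise, with no adversarial game between two networks.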
Memorization vs. Generalization
A key research question: do generative models memorize or generalize? The answer is both, depending on how often training examples appear.
Carlini et al. (2023) showed that data occurring many times in training has a measurable probability of being extractable. Single-occurrence data rarely is. This has implications for privacy and copyright — models trained on proprietary data can sometimes reproduce it verbatim under specific prompts.
Watermarking schemes (e.g., Kirchenbauer et al., 2023) embed statistical signatures in LLM output to enable provenance detection — important as synthetic text becomes pervasive.
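The detection side of a green-list watermark reduces to a z-test. The sketch below is a simplification: the hash-based `is_green` rule, the vocabulary of 1,000 integer tokens, and γ = 0.5 are all assumptions made for illustration, and the "generator" simply rejects non-green tokens instead of biasing a real model's logits.

```python
import hashlib
import math
import random

GAMMA = 0.5   # assumed fraction of the vocabulary that is "green" at each step

def is_green(prev_token, token, gamma=GAMMA):
    """Hypothetical partition rule: hash (previous token, token) into [0, 1)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def watermark_z_score(tokens, gamma=GAMMA):
    """z = (green_count - gamma * n) / sqrt(n * gamma * (1 - gamma))."""
    n = len(tokens) - 1                   # the first token has no predecessor
    green = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

rng = random.Random(0)
vocab = list(range(1000))

plain = [rng.choice(vocab) for _ in range(201)]   # unwatermarked text

marked = [rng.choice(vocab)]                      # watermarked text
while len(marked) < 201:
    candidate = rng.choice(vocab)
    if is_green(marked[-1], candidate):           # only ever emit green tokens
        marked.append(candidate)

z_plain = watermark_z_score(plain)
z_marked = watermark_z_score(marked)
```

Unwatermarked text lands near z = 0, while text that favors green tokens produces a large positive z. Detection needs only the hashing rule, not the model itself.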
Inference Cost: The Hidden Constraint
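Training a GPT-4-scale model reportedly costs on the order of $100M. But inference happens at massive scale: at the usage levels reported for ChatGPT (100M+ weekly active users as of late 2023), cumulative inference cost can plausibly exceed the one-time training cost within weeks to months.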
This creates strong pressure toward:
- Quantization: Reducing parameter precision from float16 to int8 or int4, with 2-4x memory savings and modest quality loss
- Speculative decoding: A small draft model proposes several tokens; the larger model verifies them in one parallel pass and keeps the accepted prefix. Output quality matches the large model, with a typical 2-3x speedup
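- KV cache management: Caching the keys and values of earlier tokens lets each new token attend over the cache instead of recomputing the whole prefix, so per-token cost grows linearly with context length rather than quadratically. The cache itself grows with context length, so managing its memory is critical for long contexts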
- Mixture of Experts (MoE): Route each token through a small subset of “expert” networks. GPT-4 is rumored (not confirmed by OpenAI) to be an MoE with around 16 experts, of which about 2 are active per token. The total parameter count stays the same, but each token uses only a fraction of the compute.
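Of the techniques listed above, quantization is the simplest to show end to end. The sketch below uses symmetric absmax scaling, which is one of several possible schemes and is chosen here for brevity. The weight values are made up, and real systems typically quantize per channel or per block, not per tensor.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0   # guard against all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original floats."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.03, 0.5, -0.41]   # toy weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "int4": 0.5}
savings = {name: BYTES_PER_PARAM["float16"] / size
           for name, size in BYTES_PER_PARAM.items()}
```

The rounding error per weight is at most half the scale, and the byte counts give the 2x (int8) and 4x (int4) memory savings relative to float16.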
What Comes Next
Multimodal integration is already here — GPT-4V, Gemini Ultra, Claude 3 all reason across text and images. Video models (Sora) process spatiotemporal data at the cost of enormous compute.
Test-time compute scaling is the 2025 frontier. Instead of adding parameters, give the model more tokens to think with (chain-of-thought, o1-style reasoning). Performance appears to keep improving with inference-time compute even where pre-training curves flatten.
Regulatory pressure will reshape training pipelines. The EU AI Act (entered into force in August 2024, with obligations phasing in through 2026 and beyond) requires providers of general-purpose models to publish a summary of their training data. Ongoing copyright litigation (Getty Images v. Stability AI, various authors v. OpenAI) will help establish whether training on copyrighted data without a license is infringement. Likely outcomes include licensing deals at scale, a shift toward synthetic training data, or both.
One Thing to Remember
The reason generative AI feels magical is that it learned the joint distribution of human expression well enough to sample from it. The reason it fails predictably is that sampling from a distribution isn’t the same as reasoning about it — and the difference only shows up when you push it off the distribution’s edge.