Diffusion Models — Core Concepts

Why Stable Diffusion, DALL-E 3, and Midjourney all use the same underlying trick — and how text turns into an image through a carefully controlled noise-removal process.

What Diffusion Models Actually Do

Diffusion models are a class of generative AI — they create new data (images, audio, video) that resembles training data but didn’t exist before. Unlike older approaches, they work by learning to reverse a destruction process.

The key insight: it’s easier to teach a neural network to remove a small amount of noise from an image than to teach it to generate an image from scratch. String enough of these small denoising steps together, and you can go all the way from pure random noise to a coherent picture.

Stable Diffusion (released publicly in August 2022, which set off the first mainstream wave of AI-generated images), DALL-E 3, and Adobe Firefly all use variants of this approach, and Midjourney v5+ is widely understood to as well, though its architecture is not publicly documented. Before diffusion models, the dominant approach was GANs, and diffusion models have since matched or beaten them on standard image-quality benchmarks such as FID.

The Forward and Reverse Process

There are two phases, run in opposite directions:

Forward Process (Training Time)

Take a real image. Add a small amount of Gaussian noise. Then add a bit more. Then more. Repeat for ~1000 steps until the image is indistinguishable from random static. At each step, record what the noise looked like.

This isn’t done at inference time — it’s done millions of times during training to teach the neural network what noise looks like at every stage.
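
In practice the noise does not have to be added one step at a time: with a fixed noise schedule, the noisy image at any step t can be sampled directly from the clean image. The sketch below shows that shortcut in NumPy. The linear schedule (1e-4 to 0.02 over 1000 steps) follows a common DDPM-style setup and is an assumption here, as are the function and variable names.

```python
import numpy as np

T = 1000                                # number of noising steps, matching the ~1000 in the text
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule: noise variance added per step
alpha_bars = np.cumprod(1.0 - betas)    # share of the original signal variance left at step t


def add_noise(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps  # eps is the "recorded" noise the network learns to predict


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for a real image scaled to [-1, 1]

x_early, _ = add_noise(x0, 10, rng)        # still almost entirely image
x_late, eps = add_noise(x0, T - 1, rng)    # almost entirely noise

print(alpha_bars[10], alpha_bars[T - 1])   # close to 1, then close to 0
```

During training, a random step t is typically drawn for each image, and the network is penalized (usually with mean squared error) for the gap between its prediction and `eps`.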

Reverse Process (Generation Time)

Start with pure random noise. The neural network (a U-Net in most image diffusion models) looks at the noisy image and predicts: “what noise was probably added to get here?” Subtract a portion of that predicted noise, and you’re left with a slightly less noisy image. Do this again. And again. After 20–1000 steps (depending on the scheduler), you have a clean image.
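
The loop just described can be sketched in a few lines. This is a toy illustration, not a real generator: `predict_noise` is a hypothetical stand-in for the trained U-Net that cheats by knowing the target image, and the schedule and the choice of per-step noise level are assumptions borrowed from the usual DDPM formulation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
x0_true = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy "image" the oracle secretly knows


def predict_noise(x_t, t):
    """Hypothetical stand-in for the trained U-Net. It recovers the exact noise
    by cheating with x0_true; a real network has to estimate it."""
    return (x_t - np.sqrt(alpha_bars[t]) * x0_true) / np.sqrt(1.0 - alpha_bars[t])


x = rng.standard_normal(x0_true.shape)  # start from pure noise
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # Remove this step's share of the predicted noise (DDPM posterior mean).
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        # Re-inject a little fresh noise on every step except the last.
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(np.abs(x - x0_true).max())  # should be very close to zero with a perfect predictor
```

With a perfect noise predictor the loop lands on the hidden image. A trained network is only approximately right at each step, which is why many small steps work better than one large one.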

The U-Net was originally designed for medical image segmentation in 2015. Diffusion researchers repurposed it because it’s good at capturing patterns at multiple scales — both fine pixel details and broad compositional structure.

Where Text Comes In: Conditioning

The model described so far generates random images. To generate specific images from text prompts, you need conditioning — a way to steer the denoising process.

Here’s how it works:

1. Your text prompt (“an oil painting of a red fox in a library”) gets encoded by a text model (usually CLIP or a T5 variant) into an embedding: a list of numbers representing its meaning (in practice, one vector per token).
2. During each denoising step, the U-Net receives both the noisy image and this text embedding.
3. The network learns: when this text embedding is present, lean toward denoising in ways that produce images matching this description.
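
One common way for the U-Net to "receive" the text, used in Stable Diffusion-style models, is cross-attention: image features ask questions (queries) and the text token embeddings supply the answers (keys and values). The sketch below uses random weights and illustrative sizes, so it shows only the data flow, not a trained layer. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_positions, n_tokens, d = 16 * 16, 77, 32   # 77 is CLIP's token limit; the other sizes are illustrative
image_feats = rng.standard_normal((n_positions, d))  # one feature vector per image position
text_embeds = rng.standard_normal((n_tokens, d))     # one embedding per prompt token

# Randomly initialized projections; in a trained model these are learned.
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

Q = image_feats @ W_q   # queries come from the image
K = text_embeds @ W_k   # keys come from the text
V = text_embeds @ W_v   # values come from the text

scores = Q @ K.T / np.sqrt(d)                  # shape (n_positions, n_tokens)
scores -= scores.max(axis=1, keepdims=True)    # stabilize the softmax numerically
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # each position's attention over the tokens

conditioned = image_feats + weights @ V        # residual update carrying text information
print(conditioned.shape)  # (256, 32)
```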

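Feeding the text into the network is the conditioning itself. Classifier-free guidance (CFG) is a separate trick layered on top: at each step the model makes one prediction with the prompt and one without, then exaggerates the difference between them. Most image generators expose the strength of this exaggeration as a “guidance scale” slider. Higher values produce images that match the prompt more closely but can look oversaturated or artificial. Lower values produce more natural-looking images that may drift from what you asked for. A guidance scale of 7–9 is typical for Stable Diffusion.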
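
To make the guidance scale concrete: in the usual formulation of classifier-free guidance, the model produces two noise predictions per step, one with the prompt and one with an empty prompt, and the scale controls how far to extrapolate from the second toward the first. A minimal sketch, with random arrays standing in for the two network outputs and hypothetical names throughout:

```python
import numpy as np


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))  # stand-in: network output with an empty prompt
eps_cond = rng.standard_normal((4, 64, 64))    # stand-in: network output with the real prompt

eps_plain = guided_noise(eps_uncond, eps_cond, 1.0)   # scale 1: ordinary conditional prediction
eps_strong = guided_noise(eps_uncond, eps_cond, 7.5)  # scale 7.5: overshoots in the prompt's direction
```

A scale of 0 ignores the prompt entirely, 1 uses the conditional prediction as is, and larger values amplify whatever the prompt changes about the prediction.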

Latent Diffusion: Why It’s Fast Enough to Run on a Laptop

Early diffusion models operated directly on pixels. Generating a 512×512 image meant running the U-Net on all 786,432 pixel values, hundreds of times. Absurdly slow.

The breakthrough in Stable Diffusion (the “Latent” in “Latent Diffusion Models”) was to do everything in a compressed latent space instead:

1. An encoder (part of a VAE, or Variational Autoencoder) compresses the image 8× in each dimension. A 512×512 image becomes a 64×64 latent representation (with 4 channels in Stable Diffusion).
2. All the noising and denoising happens on this small latent.
3. After generation, a decoder expands the latent back to full resolution.

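The latent has 64× fewer spatial positions than the pixel grid (about 48× fewer values overall, since Stable Diffusion’s latent has 4 channels against 3 color channels), so every U-Net pass is far cheaper. This is a large part of why Stable Diffusion can generate images on a consumer GPU in seconds rather than minutes, and why it runs within the few gigabytes of memory such a GPU offers.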
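
The size comparison is easy to check by hand. The arithmetic below assumes a 4-channel latent, which is what Stable Diffusion's VAE uses; other latent diffusion models may differ.

```python
height, width, rgb_channels = 512, 512, 3
downsample = 8        # VAE compression per spatial dimension, as described above
latent_channels = 4   # assumption: the channel count of Stable Diffusion's latent

pixel_values = height * width * rgb_channels
latent_h, latent_w = height // downsample, width // downsample
latent_values = latent_h * latent_w * latent_channels

print(pixel_values)                               # 786432
print((latent_h, latent_w))                       # (64, 64)
print((height * width) // (latent_h * latent_w))  # 64: ratio of spatial positions
print(pixel_values // latent_values)              # 48: ratio of total values
```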

Key Concepts to Know

Sampling Schedulers

The 1000-step noising process during training doesn’t mean generation takes 1000 steps. Samplers like DDIM, DPM++, and Euler A are clever mathematical shortcuts that let you get good results in 20–50 steps. Different samplers have different tradeoffs: speed, detail sharpness, image diversity.
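
As one example of such a shortcut, a deterministic DDIM-style sampler keeps only a subset of the training timesteps and jumps between them. The sketch below uses 20 of 1000 steps. As a toy assumption, `predict_noise` is a hypothetical perfect denoiser standing in for the trained network, and the linear schedule is assumed rather than taken from any specific model.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed training schedule
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0_true = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy image known only to the oracle


def predict_noise(x_t, t):
    """Hypothetical perfect denoiser standing in for the trained network."""
    return (x_t - np.sqrt(alpha_bars[t]) * x0_true) / np.sqrt(1.0 - alpha_bars[t])


num_steps = 20
timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)  # 999, 946, ..., 0

x = rng.standard_normal(x0_true.shape)  # start from pure noise
for i, t in enumerate(timesteps):
    eps_hat = predict_noise(x, t)
    # Estimate the clean image implied by the current noise prediction.
    x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    # Jump to the next kept timestep; after the last one, keep the clean estimate.
    ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
    x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

print(np.abs(x - x0_true).max())  # should be very close to zero with a perfect predictor
```

No fresh noise is added between steps here, which is what makes this variant deterministic. Ancestral samplers such as Euler A do inject noise at each step, trading repeatability for diversity.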

ControlNet

Released in February 2023, ControlNet lets you feed structural information — edge maps, depth maps, human pose skeletons — as additional input to guide generation. This made precise composition control possible for the first time: “generate this image but with the character in this exact pose.”
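
To make "edge map" concrete, here is a minimal sketch of deriving one from a grayscale image with finite differences. It is a simplified stand-in: real pipelines typically use a proper edge detector such as Canny, and the function name and threshold here are hypothetical.

```python
import numpy as np


def edge_map(gray, threshold=0.25):
    """Return a binary edge map from a 2-D grayscale image with values in [0, 1]."""
    gy, gx = np.gradient(gray.astype(float))  # finite-difference gradients along rows and columns
    magnitude = np.hypot(gx, gy)
    return (magnitude > threshold).astype(np.uint8)


# Toy image: a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0

edges = edge_map(img)  # 1 along the square's outline, 0 elsewhere
```

An array like `edges` is the kind of structural hint that gets passed alongside the text prompt: it says where the outlines should go and leaves color, texture, and style to the model.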

LoRA (Low-Rank Adaptation)

A technique for fine-tuning diffusion models on a small dataset (sometimes as few as 10–30 images) without retraining the entire model. LoRA files are typically 50–150MB versus the 2–7GB of the full model. Most of the character/style fine-tuning models you find online are LoRAs.
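
The file-size gap follows from how LoRA represents its changes: instead of updating a full weight matrix, it trains two thin matrices whose product is added to the frozen original. A minimal sketch for a single layer, with illustrative sizes that are not taken from any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 768, 768, 8, 8.0    # illustrative sizes and scaling factor

W = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init so training starts from W


def lora_forward(x):
    """y = W x + (alpha / rank) * B (A x); only A and B would receive gradients."""
    return W @ x + (alpha / rank) * (B @ (A @ x))


x = rng.standard_normal(d_in)
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 589824 12288
```

For this layer the adapter stores about 2% as many numbers as the full matrix, and only the adapter needs to be saved and shared.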

Common Misconception

“Diffusion models are just mixing together existing images.”

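They’re not. A diffusion model has no database of training images to retrieve from. Its weights encode statistical patterns about what pixels tend to look like near other pixels, across millions of contexts. A model trained on 5 billion images is far too small to contain them; what it has learned is the underlying distribution. (Researchers have shown that images duplicated many times in the training data can occasionally be reproduced closely, but that is the exception rather than the mechanism.)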

This is why the same model can generate images of things that don’t exist — creatures that combine features of multiple animals, cityscapes from time periods that never happened. It’s genuinely generative, not a sophisticated collage tool.

The copyright debate is complicated for different reasons (the training data was images that someone created), but “it’s copying” isn’t quite technically accurate.

Why This Replaced GANs

For about six years (roughly 2016–2022), Generative Adversarial Networks were the dominant image generation method. GANs use a generator competing against a discriminator. They produced impressive results but had major problems: training instability, mode collapse (where the generator finds a few safe outputs and keeps making those), and poor text-image alignment.

Diffusion models train more stably (no adversarial game), achieve better diversity, and compose naturally with text conditioning. By late 2022, DALL-E 2 and Stable Diffusion had made GAN-based image generation essentially obsolete for most applications.

One Thing to Remember

Diffusion models learn one small skill, removing a bit of noise, and repeat it until a coherent image appears. The power comes not from any one step but from chaining dozens to hundreds of small corrections, each one guided by your text prompt.
