Generative AI — Core Concepts
What Generative AI Actually Is
Most AI is about classification. Show it a photo, it says “cat.” Feed it a transaction, it says “fraud” or “not fraud.” These systems learn to label or predict from existing data.
Generative AI flips this. Instead of labeling existing things, it creates new things. New text, new images, new audio, new video, new code. The input is a prompt or some noise; the output is something that didn’t exist before.
This isn’t new in principle — researchers were building generative models in the 1980s. What changed is scale. Between 2020 and 2023, compute got cheap enough, datasets got big enough, and architectures got smart enough that the output stopped looking like a blur of approximations and started looking like something a human made.
The Three Main Architectures
Generative Adversarial Networks (GANs) — 2014
Ian Goodfellow invented this while arguing with colleagues in a Montreal bar. The setup is adversarial: two neural networks compete.
The generator creates fake images. The discriminator tries to catch them. They train together — the generator learns to fool the discriminator, the discriminator learns to spot fakes. When they’re evenly matched, the generator is producing images so convincing the discriminator is basically guessing.
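To make the tug-of-war concrete, here is a minimal sketch of the two loss functions, assuming the common binary cross-entropy formulation with the "non-saturating" generator loss. The function names and the probability values are made up for illustration; in a real GAN those probabilities would come from a trained discriminator network.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: low when D scores real samples near 1 and fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when D scores the fakes near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
# Early in training: D separates real from fake easily.
early_d = discriminator_loss(d_real=[0.95, 0.9], d_fake=[0.05, 0.1])
early_g = generator_loss(d_fake=[0.05, 0.1])

# Evenly matched: D outputs 0.5 for everything, i.e. it is guessing.
balanced_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
balanced_g = generator_loss(d_fake=[0.5, 0.5])

print(f"early:    D loss {early_d:.3f}, G loss {early_g:.3f}")
print(f"balanced: D loss {balanced_d:.3f}, G loss {balanced_g:.3f}")
```

In this toy setup the discriminator's loss rises to 2·ln 2 (about 1.386) once it can only guess, while the generator's loss falls. That is the "evenly matched" state described above.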
GANs gave us Nvidia’s StyleGAN, known for photorealistic synthetic faces, and powered many deepfake tools. The problem: they’re unstable to train and they specialize. A GAN trained on faces makes faces. You can’t just ask it to also write poetry.
Diffusion Models — went mainstream 2021–2022
This is how Stable Diffusion, DALL-E 2, and Midjourney work. The idea is almost backwards from what you’d expect.
Training involves systematically destroying an image by adding random noise, step by step, until it’s pure static. The model learns to reverse this process — to predict what the clean image looked like given the noisy version.
At inference time, you start with pure noise and ask the model to denoise it, guided by your text prompt. The result: an image that plausibly fits your description, generated from nothing but structured chaos.
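The noising and denoising steps can be sketched in a few lines. This assumes a DDPM-style formulation with a linear noise schedule, which is one common choice and not necessarily what any particular product uses. The four-number "image" and all function names are hypothetical. There is no trained network here, so the sketch cheats by handing the true noise to the recovery step, and it leaves out text guidance entirely.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal surviving after each step (linear schedule)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(clean, alpha_bar, rng):
    """Jump straight to a noisy version: sqrt(a) * clean + sqrt(1 - a) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(clean, noise)]
    return noisy, noise

def recover_clean(noisy, predicted_noise, alpha_bar):
    """Invert the noising formula, given an estimate of the noise that was added."""
    return [(y - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for y, e in zip(noisy, predicted_noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
image = [0.2, -0.5, 0.9, 0.0]  # stand-in for pixel values

noisy, true_noise = add_noise(image, alpha_bars[499], rng)
# A trained model would *predict* the noise from `noisy`; here we pass the true noise.
restored = recover_clean(noisy, true_noise, alpha_bars[499])

print("signal left at the final step:", alpha_bars[-1])
print("restored:", [round(v, 6) for v in restored])
```

The point of the sketch: if the model can estimate the noise well, the clean image follows by simple arithmetic. Everything hard about diffusion lives in learning that estimate.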
Diffusion models are slower than GANs but more controllable and more versatile. Stable Diffusion (released open-source in August 2022) blew up partly because the whole 4 GB model could run on a $400 graphics card.
Large Language Models (LLMs)
GPT-4, Claude, Gemini, Llama: these are text-first generative models based on the Transformer architecture (2017). They learn by reading enormous amounts of text and predicting what token comes next. They’re so good at next-token prediction that when they chain hundreds or thousands of predictions together, they produce coherent paragraphs, arguments, code, and stories.
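Next-token prediction is easier to picture with a toy. The sketch below is not a Transformer. It is a word-level counting model (a bigram model) that only looks at the previous word, and the tiny corpus and function names are invented for illustration. Real LLMs use neural networks over subword tokens and much longer context, but the generate-one-token-then-repeat loop has the same shape.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    follow_counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follow_counts[current][nxt] += 1
    return follow_counts

def generate(follow_counts, start, max_words):
    """Greedy decoding: keep appending the most frequent next word."""
    output = [start]
    for _ in range(max_words - 1):
        candidates = follow_counts.get(output[-1])
        if not candidates:
            break  # no known continuation for this word
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)

# Hypothetical training text.
corpus = (
    "the capital of france is paris . "
    "the capital of france is paris . "
    "the capital of italy is rome ."
)
model = train_bigram(corpus)
sentence = generate(model, "the", 6)
print(sentence)  # the capital of france is paris
```

The model never stores a fact about France. It only records that "paris" followed "is" more often than "rome" did in its training text.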
Crucially, LLMs generalized to instruction-following once they hit a certain scale. GPT-2 (2019) could complete sentences. GPT-3 (2020) could write essays. GPT-4 (2023) could pass a simulated bar exam. Something nonlinear happened at scale.
Why It Exploded After 2022
Three things converged:
- Compute: A100 and H100 GPUs gave researchers far more processing power at reduced cost
- Scale: Training runs started using trillions of tokens instead of billions
- Instruction tuning + RLHF: Researchers learned how to align raw language models to follow instructions, not just complete them — making them actually usable
ChatGPT launched in November 2022 and reached an estimated 100 million users in about two months, faster than any consumer app before it.
What Generative AI Can and Can’t Do
Can do well:
- Generate plausible-sounding text, images, audio
- Transform content (translate, summarize, reformat)
- Complete patterns it’s seen variations of in training
- Write code in widely used programming languages
Can’t do well:
- Reliably get facts right (it’s not a database)
- Reason about novel situations it has no training analog for
- Count letters or do arithmetic reliably without tools
- Know what’s happening in the world after its training cutoff
The most important misconception: generative AI models don’t understand what they’re generating in any human sense. An LLM producing the word “Paris” doesn’t know what Paris is. It knows that “Paris” tends to follow “capital of France” in the text it was trained on. For most practical purposes, the distinction doesn’t matter. For some purposes (medical, legal, safety-critical), it matters a lot.
Real-World Applications Right Now
Text: Writing assistance (Grammarly, Notion AI), code generation (GitHub Copilot), customer service bots, search summaries
Images: Marketing visuals, concept art, product photography, medical imaging augmentation
Audio: Voice cloning, music generation (Suno, Udio), podcast production, accessibility tools
Video: Sora (OpenAI), Runway Gen-3 — still expensive and imperfect, but improving fast
The Legitimate Concerns
Generative AI can produce convincing disinformation at scale. Deepfake audio of politicians. Synthetic scientific papers. Personalized phishing emails.
Copyright is also genuinely murky. These models were trained on human-created content. When they produce work that resembles a specific artist’s style, is that fair use or theft? Courts in the US and EU are currently sorting this out; expect rulings in 2025–2026 to reshape how these models are built and licensed.
One Thing to Remember
Generative AI creates new content by learning the statistical patterns of human-made content. It’s powerful not because it understands, but because it imitates at a scale and fidelity where people often can’t tell its output from the real thing, which is both what makes it useful and what makes it risky.
See Also
- AI Hallucinations: ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why, and why it's not as simple as 'the AI lied.'
- Artificial Intelligence: What is AI really? Think of it as a dog that learned tricks: impressive, but it doesn't know why it's doing them.
- Bias-Variance Tradeoff: The fundamental tension in machine learning between being wrong in the same way vs. being wrong in different ways, and why the simplest model isn't always best.
- Deep Learning: Why your phone can spot your face in a messy photo album, and why that trick comes from practice, not magic.
- Embeddings: How do computers know that 'dog' and 'puppy' mean almost the same thing? They don't read definitions. They turn words into secret map coordinates, and nearby coordinates mean nearby meanings.