Large Language Models — Core Concepts

Why LLMs feel intelligent when they're really just very good at statistics — and why that distinction actually matters when you use them.

What’s Actually Going On Inside ChatGPT

By now you’ve used one of these things. But “AI that answers questions” undersells — and oversells — what they actually are. Let’s be precise.

A large language model is a statistical model that predicts text. Given some input text, it outputs a probability distribution over possible next tokens (roughly: words or word-pieces), picks one, appends it, and repeats. That’s the whole loop. What makes it remarkable is how much has to be learned to do that well.
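To make that loop concrete, here is a minimal sketch in Python. The "model" is a hypothetical hand-written lookup table (TOY_MODEL, with made-up probabilities) that only looks at the previous token. A real LLM computes the distribution with a neural network conditioned on the entire input, but the predict, pick, append, repeat cycle has the same shape.

```python
import random

# Hypothetical toy "model": maps the previous token to a probability
# distribution over next tokens. All numbers are made up for illustration.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.5, "<end>": 0.5},
    "sat": {"<end>": 1.0},
}


def next_token_distribution(tokens):
    """Return {token: probability} for the next position (toy: last token only)."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Repeatedly predict a distribution, sample one token, and append it."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        choice = rng.choices(candidates, weights=weights, k=1)[0]  # pick one
        if choice == "<end>":
            break
        tokens.append(choice)  # append it, then go around again
    return tokens


print(generate(["<start>"]))
```

Changing the seed can change the output, because the next token is sampled from the distribution rather than always taken as the single most likely one.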

The Architecture: Transformers

The modern LLM era started in 2017, when a Google team published a paper called “Attention Is All You Need.” It introduced the transformer architecture, the design that every major LLM today (GPT-4, Gemini, Claude, Llama) is built on.

The key innovation was the attention mechanism: a way for the model to consider how much each word in the input relates to every other word, simultaneously. This is what lets the model handle context across a long paragraph — understanding that “it” in “the dog chased the cat until it hid” refers to the cat, not the dog.

Before transformers, this kind of long-range understanding was hard. Earlier models (RNNs, LSTMs) processed text word-by-word and struggled with anything more than a few sentences back. Transformers solved this by looking at everything at once.
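For the curious, "looking at everything at once" can be sketched in a few lines. This is a simplified version of scaled dot-product attention, the core operation from the 2017 paper, written in plain Python. The vectors in VECTORS are made up, and the same vectors are reused as queries, keys, and values. A real transformer derives those three from learned projections and stacks many such layers, so treat this as an illustration under those assumptions rather than a faithful implementation.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """For each query, mix all values, weighted by query-key similarity."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token, scaled by sqrt(dimension).
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional vectors standing in for three tokens.
VECTORS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, weights = attention(VECTORS, VECTORS, VECTORS)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the input, itself included. Nothing in the loop depends on the order in which tokens are visited, which is why the computation can run in parallel instead of word by word.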

Scale Is the Secret Ingredient

The transformer wasn’t immediately world-changing. What changed everything was OpenAI’s bet in 2020 that making the model bigger would make it smarter — not just at completing sentences, but at reasoning, coding, and understanding nuance.

GPT-3, released that year, had 175 billion parameters. A parameter is a number the model can adjust during training — basically a dial that gets tuned millions of times until the model gets better at predictions. More dials = more room to store complex patterns.
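A quick back-of-the-envelope calculation shows what 175 billion dials means in practice. The sketch below only multiplies the parameter count by the bytes used to store each number. The storage precisions (2 bytes for 16-bit, 4 bytes for 32-bit) are assumptions for illustration, not a statement about how GPT-3 was actually stored or served.

```python
def weights_size_gb(num_parameters, bytes_per_parameter):
    """Size of the raw parameters in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


GPT3_PARAMETERS = 175_000_000_000  # figure from the text above

print(weights_size_gb(GPT3_PARAMETERS, 2))  # assuming 16-bit numbers -> 350.0
print(weights_size_gb(GPT3_PARAMETERS, 4))  # assuming 32-bit numbers -> 700.0
```

Either way, the weights alone run to hundreds of gigabytes, which is far more than a typical laptop holds in memory.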

The relationship between scale and capability turned out to be nonlinear. Models show emergent behavior: at certain size thresholds they unlock abilities that smaller models simply don’t have. Arithmetic. Step-by-step reasoning. Code generation. Nobody explicitly taught GPT-3 to write Python. It picked that up as a side effect of reading enough code in its training data.

Training vs. Inference

Two totally different phases:

Training is expensive and done once (or rarely). You feed the model enormous amounts of text, compute what it would have predicted vs. what actually came next, then adjust the parameters to reduce that error. Repeat billions of times. GPT-4’s training reportedly cost over $100 million in compute. Meta’s Llama 3 was trained on 15 trillion tokens.
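Here is the predict, compare, adjust cycle shrunk down to something you can run. This hypothetical toy has three parameters (one score per word in a three-word vocabulary) and a single training example, so it is a sketch of the idea only. Real training pushes the same kind of update through billions of parameters using backpropagation. The vocabulary, learning rate, and step count are all made up.

```python
import math

VOCAB = ["cat", "dog", "sat"]  # made-up three-word vocabulary


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent step on the cross-entropy prediction error."""
    probs = softmax(logits)                # what the model predicts
    loss = -math.log(probs[target_index])  # error vs. what actually came next
    # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [l - learning_rate * g for l, g in zip(logits, grads)]
    return new_logits, loss


logits = [0.0, 0.0, 0.0]  # start with no preference between words
target = VOCAB.index("sat")  # the word that "actually came next"
losses = []
for _ in range(20):
    logits, loss = training_step(logits, target)
    losses.append(loss)

print("first loss:", round(losses[0], 3))
print("last loss: ", round(losses[-1], 3))
```

The loss falls as the parameters shift toward the right answer. Inference corresponds to calling softmax(logits) on its own, with no update step afterwards.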

Inference is what happens when you type a message. The (now frozen) model just runs the forward pass — no learning happening, just math going in one direction. This is much cheaper, and what every API call is doing.

Most people conflate these. The model you’re chatting with stopped learning the moment training ended.

The Common Misconception: LLMs Don’t “Know” Things

This is where most people get surprised. An LLM doesn’t retrieve facts from a database. It doesn’t look things up. Everything it “knows” is encoded in those billions of parameters as statistical associations — words that tend to appear together, sentence patterns that correlate with correct answers.

When it gives you a confident wrong answer, it’s not malfunctioning. It’s doing exactly what it was trained to do: generate the most plausible-looking continuation of your prompt. Unfortunately, “plausible-looking” and “factually correct” don’t always match.

This is why LLMs hallucinate: they produce fluent text about things that aren’t true, with the same confidence they use for things that are.

Fine-Tuning and RLHF

Raw pretrained models are weird to talk to — they complete your sentences instead of answering questions. OpenAI’s ChatGPT breakthrough came from a two-step process on top of pretraining:

Supervised fine-tuning: Human trainers wrote example conversations showing how the model should behave.
RLHF (Reinforcement Learning from Human Feedback): Human raters compared model outputs and ranked them. Those rankings trained a separate “reward model” that scored responses. The main model was then tuned to maximize those scores.
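How do rankings become a training signal for the reward model? One common formulation, used here as an assumption since the text above doesn't specify one, is a pairwise loss: the reward model is penalized whenever it scores the response humans rejected higher than the one they preferred. The function name and the example scores below are hypothetical.

```python
import math


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written in a stable form."""
    diff = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-diff))


# Made-up reward scores for two responses to the same prompt.
print(pairwise_preference_loss(2.0, 0.0))  # reward model agrees with the raters
print(pairwise_preference_loss(0.0, 2.0))  # reward model disagrees
```

The loss is small when the reward model agrees with the human ranking and large when it disagrees, so minimizing it nudges the scores toward human preferences.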

The result is a model that feels helpful and conversational rather than like an autocomplete engine. This process — not the scale alone — is what made ChatGPT feel like a step change when it launched in November 2022.

Context Windows

Every LLM has a context window: the maximum amount of text it can “see” at once, measured in tokens (1 token ≈ 0.75 words in English). The original GPT-3 had a window of 2,048 tokens. More recent models range from 128,000 (GPT-4o) to 1 million (Gemini 1.5 Pro).

This matters practically: if a document is longer than the context window, the model can’t read it all at once. And even within the window, models tend to pay more attention to the beginning and end — the “lost in the middle” problem.
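A rough way to check whether a document fits, and to split it when it doesn't, is sketched below. It uses the 0.75 words-per-token rule of thumb from above, which is only an approximation for English. Real tokenizers give exact counts, and the function names and the 400-token limit are hypothetical.

```python
import math


def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) / words_per_token)


def split_into_chunks(text, max_tokens, words_per_token=0.75):
    """Split text into word-based chunks that each fit the token budget."""
    words = text.split()
    words_per_chunk = max(1, int(max_tokens * words_per_token))
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


document = "word " * 1000  # stand-in for a 1,000-word document
print(estimate_tokens(document))  # about 1,334 tokens by this estimate

chunks = split_into_chunks(document, max_tokens=400)
print(len(chunks), [estimate_tokens(c) for c in chunks])
```

Splitting on raw word counts can cut a sentence in half, so a real pipeline would usually break on paragraph or sentence boundaries instead.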

One Thing to Remember

LLMs don’t understand language — they’re extraordinarily good at modeling its statistical structure. That distinction is philosophically interesting and practically important: it explains both why they’re so capable and exactly where they break.
