AI Hallucinations — Deep Dive

The mechanistic reasons language models hallucinate — from probability distributions to knowledge boundaries — plus a practical breakdown of mitigation strategies and their actual limitations.

The Mechanistic Root

To understand hallucinations properly, you have to start with what a language model actually outputs. It doesn’t output words. It outputs a probability distribution over tokens.

At every step, the model considers its entire context window and assigns a probability to every possible next token in the vocabulary (often 50,000–100,000 tokens). In greedy decoding, the highest-probability token is selected; with temperature > 0, a token is sampled from the distribution instead. That token then becomes part of the context, and the process repeats.
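A minimal sketch of that decoding step, assuming a hypothetical four-token vocabulary and made-up logits (real models score tens of thousands of tokens). The function names are illustrative and not taken from any particular library:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(vocab, logits):
    """Greedy decoding: always take the most probable token."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best]

def sample_pick(vocab, logits, temperature=1.0, rng=random):
    """Sampling: draw one token in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and logits, invented for illustration.
vocab = ["Paris", "Lyon", "London", "unsure"]
logits = [4.0, 1.5, 1.0, 0.5]

probs = softmax(logits)
greedy_token = greedy_pick(vocab, logits)
sampled_token = sample_pick(vocab, logits, temperature=0.8, rng=random.Random(0))
```

Note that both functions always return some token. An "unsure" token wins only if its probability happens to dominate, which is the point the next paragraphs build on.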

The model has no “speak only when certain” mode. There’s no separate confidence score being tracked. The distribution over tokens is the output, period.

This is critical: the model will always generate some next token, even when the correct answer is “I have no idea.” The “I don’t know” response has to be learned as a pattern — and even then, it only fires when the training data made uncertainty-expressions more probable than a specific answer. For many topics, a specific confident-sounding answer has higher probability than “I’m not sure.”

Where Hallucinations Come From: A Taxonomy

1. Training Data Gaps

The model has seen a vast corpus, but not everything. For entities with sparse coverage — obscure people, niche technical papers, small companies, events after the training cutoff — the model has weak signal. It doesn’t know it has weak signal. It interpolates from similar entities.

A model asked about “Vladimír Horák, Czech mathematician, 1952” might produce a plausible biography assembled from patterns it’s seen around other mid-century Eastern European academics. Every detail feels right in isolation. The combination may be entirely fabricated or may match a real person by accident.

2. Association vs. Memorization

Language models learn associations between tokens, not pure memorization. A model that saw “penicillin” and “Alexander Fleming” thousands of times together can reliably produce one from the other. But for facts that appeared rarely or in inconsistent contexts, the association is weak — and the model fills the gap with the most statistically plausible continuation.

This is why models hallucinate more about:

Low-frequency proper nouns (lesser-known politicians, minor historical figures)
Specific numbers (dates, statistics, financial figures) that vary widely across sources
Citations and references, where the structure (author, year, journal) is common but the specific combination may not be

3. Sycophancy and Training Pressure

RLHF (Reinforcement Learning from Human Feedback) — the technique used to align models like GPT-4 and Claude — creates a subtle pressure toward confident-sounding answers. Human raters tend to prefer responses that sound authoritative over responses that hedge heavily. This bias propagates into the model’s output distribution.

A model trained purely to maximize human approval ratings may learn that “The Treaty of Westphalia was signed in 1648” gets better ratings than “I believe it was around the mid-17th century but you should verify.” Over millions of rating events, this pushes the model toward false confidence.

Anthropic and OpenAI have both documented this dynamic. Reducing sycophancy without making models frustratingly hedge-everything is a real alignment challenge.

4. Knowledge Boundary Blindness

The model has no metadata about its own training. It can’t query “how many times did I see this fact?” or “was this source reliable?” Its training process updated billions of parameter weights, but those weights don’t come with provenance information.

This is structurally different from, say, a database with a confidence column or a retrieval system that can return “no results found.” The weights simply encode statistical patterns, and the model reasons from those patterns without knowing where they came from.

Measuring Hallucination Rate

Several benchmarks attempt to quantify hallucination frequency:

TruthfulQA (Lin et al., 2022): 817 questions adversarially designed to elicit common human misconceptions. Tests whether models can avoid confidently repeating false beliefs. GPT-4 scores roughly 60-70% on this: better than GPT-3.5 (~50%), worse than ideal.
HaluEval: A dataset of 35,000 examples specifically testing summarization, QA, and dialogue for fabrications.
FELM (Factuality Evaluation of Language Models): Breaks down hallucination by domain (science, math, writing, world knowledge). Models perform very differently by domain; math tends to be better than history or biography.

The numbers you see in marketing materials are often on curated benchmarks. Real-world hallucination rates for open-ended queries remain higher.

Mitigation Strategies: What Actually Works

Retrieval-Augmented Generation (RAG)

RAG is currently the most effective production-grade mitigation. The architecture:

User query → embedding model → vector similarity search over a document corpus
Top-k documents retrieved → prepended to the prompt context
Model generates answer grounded in retrieved documents, not just training memory

The key insight: when the answer is in the context window, the model behaves as a reader/summarizer rather than a memory-retriever. Summarization hallucinations still occur, but at much lower rates than pure knowledge recall.
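The retrieval flow above can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the corpus is three hand-written sentences, and the prompt wording is invented for illustration. No actual model is called.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    """Prepend the retrieved documents to the question."""
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Hypothetical three-document corpus.
corpus = [
    "The Peace of Westphalia was signed in 1648.",
    "Penicillin was discovered by Alexander Fleming in 1928.",
    "Paris is the capital of France.",
]
query = "Who discovered penicillin?"
top_docs = retrieve(query, corpus, k=2)
prompt = build_prompt(query, top_docs)
```

Even in this toy, the second retrieved document is irrelevant to the question. That is a small preview of the retrieval-error limitation listed next.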

Limitations:

Only works when the correct information is in the retrieval corpus
Retrieval errors (wrong chunks, chunking artifacts) cause their own failure modes
Context window limits constrain how much can be retrieved
Models can still “drift” from retrieved content into confabulation mid-generation

Chain-of-Thought and Self-Consistency

Prompting models to reason step-by-step before answering reduces hallucination on structured tasks (math, logic, multi-step inference). The hypothesis: forcing explicit reasoning surfaces contradictions before they become confident wrong answers.

Self-consistency sampling — generating multiple reasoning chains and taking the majority answer — further reduces error rates for factual questions. Expensive but measurable.
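As a rough illustration of the majority-vote step, here is a sketch in which `fake_model` is a hypothetical stand-in that replays canned answers. A real system would sample full reasoning chains from a model at temperature > 0 and extract the final answer from each.

```python
import itertools
from collections import Counter

def self_consistent_answer(sample_answer, question, n_samples=5):
    """Sample several answers and return the most common one plus its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Hypothetical stand-in for a sampled model: cycles through canned final answers.
_canned = itertools.cycle(["42", "42", "41", "42", "40"])

def fake_model(question):
    return next(_canned)

answer, agreement = self_consistent_answer(fake_model, "What is 6 * 7?", n_samples=5)
```

The vote share doubles as a crude confidence signal: low agreement across samples suggests the answer deserves a second look. The cost is `n_samples` model calls per question.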

Grounding and Citation Requirements

Requiring the model to cite sources for each factual claim creates implicit self-monitoring. If the model can’t point to a retrieved document supporting a claim, the absence becomes detectable. This doesn’t stop fabrication of citations (see the Schwartz case above) unless citations are verified against a corpus.

Some production systems (Perplexity, Bing Copilot) automatically verify that quoted text appears verbatim or near-verbatim in retrieved documents. Fabricated quotes fail this check.
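How those products implement the check is not public, so the following is only a sketch of the general idea: accept a quote if it appears verbatim in a retrieved document, or if some same-length window of the document is nearly identical to it. The 0.9 threshold and the function names are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def quote_is_supported(quote, documents, threshold=0.9):
    """True if the quote appears verbatim or near-verbatim in any document."""
    q = normalize(quote)
    for doc in documents:
        d = normalize(doc)
        if q in d:
            return True  # exact match
        # Near-verbatim: compare the quote against each same-length window.
        window = len(q)
        for start in range(max(1, len(d) - window + 1)):
            candidate = d[start:start + window]
            if SequenceMatcher(None, q, candidate).ratio() >= threshold:
                return True
    return False

# Hypothetical retrieved document.
docs = ["The Peace of Westphalia was signed in 1648 and ended the Thirty Years' War."]

exact = quote_is_supported("signed in 1648", docs)
near = quote_is_supported("signed in 1648, and ended", docs)
fabricated = quote_is_supported("signed in 1748 by the Emperor of Prussia", docs)
```

The sliding-window comparison is slow on long documents. It is written for readability here, not for production use.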

Calibrated Uncertainty Expression

Trained specifically, models can learn to express graded uncertainty: “I’m confident that…”, “I believe but am less certain that…”, “I don’t have reliable information about…”. This requires targeted training data with calibrated labels.

Current SOTA models do this somewhat, but calibration degrades on out-of-distribution questions — which is exactly the category where hallucination is most likely.
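One common way to quantify calibration is expected calibration error (ECE): bucket answers by stated confidence, then compare each bucket's average confidence with its actual accuracy. The article does not name a specific metric, so treat this as one reasonable choice. The bin count and sample data below are made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: a model that says "90% sure" but is right one time in four.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
# A model that claims certainty and is always right has no gap.
well_calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

Running the same measurement separately on in-distribution and out-of-distribution questions is one way to see the degradation described above.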

What Doesn’t Work (and Why)

“Just tell it not to make things up” — Prompt instructions like “only state facts you’re certain of” have modest effects. They shift the output distribution slightly toward hedging, but the underlying mechanistic issue (the model doesn’t know what it knows) is unchanged.

Fine-tuning on a domain — This helps for within-domain recall but doesn’t prevent hallucination on edge cases within that domain. A medical fine-tuned model hallucinates about rare drugs or unusual presentations rather than common ones.

Larger models hallucinate less, but not never — GPT-4 vs GPT-3.5 is a real improvement. But scaling alone doesn’t solve the problem; it shifts the distribution of what gets hallucinated.

Emerging Research Directions

Factuality-tuned models: Training explicitly on factuality reward signals rather than general human preference. A related idea is factuality probing: inspecting the model mid-generation to detect when it is in uncertain territory.

Activation steering: Directly manipulating model internals (residual stream activations) to increase honesty-related features while generating. Anthropic and various academic groups have published early results on this.

External verifier loops: LLM generates answer → separate verifier model or retrieval system checks claims → flagged claims get regenerated or removed. Adds latency but dramatically reduces false-confident output.
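A minimal sketch of such a loop, assuming the answer has already been split into individual claims. `fake_generate`, `fake_check`, and `KNOWN_FACTS` are hypothetical stand-ins for a generator model and a verifier (a second model or a retrieval lookup).

```python
def verified_answer(generate, check_claim, question, max_rounds=2):
    """Regenerate until every claim passes the check, then drop any that still fail."""
    rejected = set()
    claims = []
    for _ in range(max_rounds):
        claims = generate(question, rejected)
        flagged = [c for c in claims if not check_claim(c)]
        if not flagged:
            return claims
        rejected.update(flagged)  # tell the generator what to avoid next round
    # Out of retries: keep only the claims that pass the check.
    return [c for c in claims if check_claim(c)]

# Hypothetical verifier backed by a tiny trusted fact set.
KNOWN_FACTS = {
    "Penicillin was discovered by Alexander Fleming.",
    "Paris is the capital of France.",
}

def fake_check(claim):
    return claim in KNOWN_FACTS

def fake_generate(question, rejected):
    # The second claim is deliberately fabricated to exercise the loop.
    draft = [
        "Penicillin was discovered by Alexander Fleming.",
        "Fleming discovered it in 1935.",
    ]
    return [c for c in draft if c not in rejected]

result = verified_answer(fake_generate, fake_check, "Who discovered penicillin?")
```

Each extra round costs another generation plus one check per claim, which is where the added latency comes from.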

Tool use and grounding: Models that can invoke web search or databases mid-generation can check claims against live sources instead of relying on training memory alone, though they can still misread or ignore what they retrieve. Current practice in agents (OpenAI Assistants, Claude tool use, etc.) is moving in this direction.

The Honest Prognosis

Hallucination rates have been improving roughly 30-50% per model generation, but the problem is structurally bound to next-token prediction. A model that generates text token-by-token, without explicit fact lookup, will always be capable of confabulation. The question is how often and how detectable.

For high-stakes factual applications — legal, medical, financial — the practical engineering answer is RAG + verification layers, not raw model improvement. For general-purpose assistants, the answer is better calibration and user education: knowing when to trust and when to verify.

The parrot analogy breaks down here. Parrots are cute and obviously not authoritative. Language models sound exactly like competent humans explaining things. That gap between appearance and reliability is the real challenge.

One Thing to Remember

A language model outputs probability distributions over tokens, not facts. The token “Paris” is likely after “the capital of France is” because millions of training documents made that association strong — not because the model “knows” Paris is the capital. Everything follows from this. When the statistical associations are reliable, it looks like knowledge. When they’re not, it looks like a confident lie.
