Retrieval Augmented Generation — Core Concepts
The Two Problems RAG Solves
Large language models have a quirky limitation nobody talks about enough in the marketing materials: they’re frozen in time.
When a company trains GPT-4 or Llama 3, it feeds the model an enormous snapshot of the internet: trillions of tokens of text, scraped up to a specific date. Everything after that “cutoff” is a black hole. Ask about a news story from last week, a change in your company’s pricing policy, or literally anything that happened after training ended, and the model will either say “I don’t know” or, more dangerously, confidently make something up.
That second failure mode has a name: hallucination. It’s not a bug in the traditional sense. The model is doing exactly what it was trained to do — generating plausible-sounding text. It just doesn’t have a mechanism to distinguish “I genuinely know this” from “this sounds right based on patterns I’ve seen.”
RAG attacks both problems with the same mechanism.
How RAG Actually Works
The process has three distinct phases:
Phase 1: Ingestion (Building the Library)
Before a user ever asks a question, you preprocess your documents. Say you have 50,000 customer support tickets, a 200-page product manual, and 3 years of internal Slack messages. You want your AI assistant to be able to answer questions about all of this.
Each document (or chunk of a document — typically 200-500 words) gets converted into a vector embedding: a list of hundreds or thousands of numbers that captures the meaning of the text in a mathematical form. Similar-meaning text ends up with similar numbers, even if the words are completely different.
These vectors get stored in a vector database: either a dedicated service like Pinecone or Weaviate, or an add-on to a database you already run, like pgvector (a Postgres extension). Chroma is popular for local development.
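Here is a minimal sketch of the ingestion phase, just to make the moving parts concrete. Everything in it is an assumption for illustration: `toy_embed` is a hashing trick standing in for a real embedding model (it only matches shared words, not shared meaning), `split_into_chunks` splits on plain word counts, and a Python list stands in for the vector database.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for a real embedding model (assumption, not a real API).

    Hashes each word into one of `dim` buckets and normalizes the counts.
    It captures shared words only, not shared meaning.
    """
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def split_into_chunks(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def ingest(documents: dict[str, str], max_words: int = 200) -> list[dict]:
    """Chunk and embed every document; the returned list plays the role of the vector database."""
    store = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(split_into_chunks(text, max_words)):
            store.append({"doc_id": doc_id, "chunk": n, "text": chunk, "vector": toy_embed(chunk)})
    return store


# Hypothetical sample documents.
docs = {
    "manual": "To cancel a subscription open Settings and choose Billing. " * 30,
    "ticket-1": "Customer asked about a refund after cancelling.",
}
store = ingest(docs)
print(len(store), "chunks stored")
```

In a real pipeline you would swap `toy_embed` for a call to your embedding model and write each record to your vector database instead of a list; the shape of the loop stays the same.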
Phase 2: Retrieval (Finding the Relevant Stuff)
When a user submits a query, the system converts that query into a vector embedding too, using the same model. Then it searches the vector database for the chunks whose vectors sit closest to the query’s vector (cosine similarity is a common measure), which in practice means “most semantically relevant.”
A search for “how do I cancel my subscription?” will surface chunks about cancellation, refund policies, billing, and similar topics — even if none of those chunks contain the exact phrase “cancel my subscription.”
The system typically retrieves the top-k most relevant chunks (often k = 3 to 10, depending on the context window size and your cost budget). These chunks become the context for the next phase.
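To see what top-k retrieval looks like in code, here is a small sketch. It assumes cosine similarity as the closeness measure (a common choice, though your vector database may use another), and the 3-dimensional vectors are made up by hand; a real embedding model would produce them, with hundreds or thousands of dimensions.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector, store, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity, best first."""
    scored = ((cosine(query_vector, vec), text) for text, vec in store)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Made-up vectors for illustration only.
store = [
    ("To end your plan, open Billing and choose Close Account.", [0.9, 0.1, 0.0]),
    ("Refunds are issued within 5 business days.", [0.7, 0.3, 0.1]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Reset your password from the login page.", [0.1, 0.9, 0.1]),
]
# Pretend this is the embedding of "how do I cancel my subscription?"
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Note that the top hit never mentions the word “cancel.” With real embeddings, that is the behavior you are paying for. A production vector database does the same ranking with an approximate index rather than scoring every row.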
Phase 3: Generation (Answering With Sources)
Now the language model gets a prompt that looks roughly like this:
You are a helpful assistant. Answer the user's question based ONLY on the context below.
CONTEXT:
[chunk 1 text]
[chunk 2 text]
[chunk 3 text]
USER QUESTION:
How do I cancel my subscription?
ANSWER:
The model generates an answer grounded in the retrieved text. Because it has actual relevant information in front of it right now, not just vague training memory, the answer tends to be more accurate and more up-to-date.
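Assembling that prompt is plain string work. The sketch below mirrors the template above; `build_prompt` is a hypothetical helper name, and the sample chunks are invented for the example. Sending the result to a model is left out, since that depends on which provider you use.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the RAG prompt template with retrieved chunks and the user's question."""
    context = "\n".join(chunks)
    return (
        "You are a helpful assistant. Answer the user's question based ONLY on the context below.\n"
        "CONTEXT:\n"
        f"{context}\n"
        "USER QUESTION:\n"
        f"{question}\n"
        "ANSWER:"
    )


# Invented chunks standing in for whatever retrieval returned.
retrieved = [
    "To end your plan, open Billing and choose Close Account.",
    "Refunds are issued within 5 business days.",
]
prompt = build_prompt("How do I cancel my subscription?", retrieved)
print(prompt)
```

Keeping the instruction, the context, and the question in clearly labeled sections makes it easier to log exactly what the model saw when an answer goes wrong.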
The Vector Embedding Magic
This is the piece most explanations gloss over, and it’s worth a minute.
Human language is messy. “Car,” “automobile,” “vehicle,” and “sedan” mean related things, but a keyword search won’t connect them unless you explicitly list all the synonyms. A vector embedding turns words and sentences into points in a high-dimensional space, where meaning determines location.
“The dog chased the cat” and “A hound pursued the feline” end up very close together in that space. “The quarterly earnings report exceeded analyst expectations” ends up somewhere completely different.
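“Close together” needs a yardstick. One common choice, assumed here for illustration since different systems use different measures, is cosine similarity between two embedding vectors a and b of dimension d:

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{d} a_i b_i}
         {\sqrt{\sum_{i=1}^{d} a_i^2} \, \sqrt{\sum_{i=1}^{d} b_i^2}}
\]
```

The score runs from -1 to 1, and 1 means the two vectors point in exactly the same direction. With a reasonable embedding model you would expect the two animal sentences to score close to 1 against each other, and the earnings sentence to score noticeably lower against both.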
This is what makes semantic search so much better than keyword search for retrieval. It’s also why the choice of embedding model matters a lot: OpenAI’s text-embedding-3-large, Cohere’s Embed v3 models, and open models like nomic-embed-text differ meaningfully in quality.
Common Misconception: RAG vs. Fine-Tuning
People often ask: why not just fine-tune the model on your data instead?
Fine-tuning is expensive (thousands of dollars for a serious run), slow (days to weeks), and becomes stale immediately — the moment your documents change, your fine-tuned model is out of date and you have to redo the whole process.
RAG is cheap (mostly just embedding costs, often fractions of a cent per document), fast to set up, and picks up document updates as soon as they’re indexed. Embed a new document and add it to the store, and it’s searchable right away, with no retraining needed.
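The “no retraining” point is easy to show with a toy store. Everything below is a hypothetical stand-in: `TinyStore` is an in-memory dictionary, `word_set` plays the role of the embedding model, and `jaccard` (plain word overlap) plays the role of vector similarity. The part that carries over to real systems is that an update touches one row in the store and nothing else.

```python
import re


def word_set(text):
    """Stand-in for an embedding model: the set of lowercase words in the text."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap score in [0, 1]; a real system would compare embedding vectors."""
    return len(a & b) / len(a | b) if a | b else 0.0


class TinyStore:
    """Hypothetical in-memory store keyed by document id."""

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, text):
        # Re-embedding this one document is the whole update; no model weights change.
        self._rows[doc_id] = (text, word_set(text))

    def search(self, query, k=1):
        q = word_set(query)
        ranked = sorted(self._rows.items(), key=lambda row: jaccard(q, row[1][1]), reverse=True)
        return [(doc_id, text) for doc_id, (text, _) in ranked[:k]]


store = TinyStore()
store.upsert("pricing", "The Pro plan costs 20 dollars per month.")
before = store.search("What is the price of the Pro plan?")

store.upsert("pricing", "The Pro plan costs 25 dollars per month.")  # the policy changed
store.upsert("trial", "New accounts get a 14 day free trial.")       # a brand-new document
after = store.search("How long is the free trial?")

print(before)
print(after)
```

The second `upsert` on `"pricing"` overwrites the old text, so the stale price can no longer be retrieved. A fine-tuned model has no equivalent of that one-line fix.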
The tradeoff: fine-tuning bakes knowledge into the model’s weights, so it can generalize more flexibly. RAG is more of a lookup mechanism — the model needs to be able to reason from context it hasn’t seen before.
Many production deployments combine both: a base model fine-tuned for task-specific behavior (tone, format, domain vocabulary), paired with RAG for current, accurate facts.
The Chunk Size Problem
One of the most underappreciated design decisions in a RAG system is chunking strategy: how you split documents before embedding them.
Too small (50 words): You lose context. “The deadline is Friday” means nothing without knowing what deadline.
Too large (2,000 words): Retrieval becomes noisy. The chunk might be relevant for 10% of its content, but it takes up precious context window space with the other 90%.
Most practitioners land between 200 and 600 tokens per chunk, often with overlap (the last 50 tokens of chunk N appear at the start of chunk N+1) to prevent answers from getting cut in half at a boundary.
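Overlapping chunks are simple to produce with a sliding window. This sketch assumes the text has already been split into a list of tokens (real pipelines use the embedding model's tokenizer; the fake `t0, t1, ...` tokens here are placeholders), and `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(tokens, chunk_size=400, overlap=50):
    """Split tokens into windows of `chunk_size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens standing in for a tokenized 1,000-token document.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens, chunk_size=400, overlap=50)

print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

The window advances by `chunk_size - overlap` tokens each time, so a sentence that straddles a boundary shows up whole in at least one chunk, provided it is shorter than the overlap.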
The right chunk size depends heavily on your document types. Legal contracts behave very differently from chat transcripts.
One Thing to Remember
RAG is not a replacement for a good language model — it’s a way to give that model access to your specific facts at the moment they’re needed. The AI still does the reasoning; RAG handles the “what does this document actually say” part.