Chain-of-Thought Reasoning — Core Concepts
The Paper That Changed Everything
Wei et al. (Google, 2022) “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” showed that providing a few examples of step-by-step reasoning dramatically improved LLM performance on multi-step problems.
The key comparison on GSM8K (grade school math word problems), using PaLM 540B:
- Standard prompting (just Q→A examples): ~18% accuracy
- CoT prompting (Q + reasoning steps → A examples): ~57% accuracy
This 3x improvement required zero additional training — just changing the prompt format.
The result was striking enough that it seemed too good to be true, but it was robustly replicated across many benchmarks and models. The capability was always there; CoT prompting unlocked it.
Few-Shot CoT vs. Zero-Shot CoT
Few-shot CoT: Provide 4–8 examples of (question + reasoning chain + answer) before the actual question. The model learns the reasoning format from examples.
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls?
A: Roger starts with 5. 2 cans × 3 = 6 more. Total: 5 + 6 = 11. Answer: 11.
Q: The cafeteria had 23 apples. 20 used for lunch. 6 more bought. How many?
A: [LLM generates reasoning chain here]
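A prompt like the one above can be assembled programmatically. The following is a minimal Python sketch: `Exemplar` and `build_few_shot_cot_prompt` are hypothetical names introduced for illustration, and the "Q:/A:" layout is an assumption copied from the example above, not a required format.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example: question, reasoning chain, final answer."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(exemplars, question):
    """Join the worked examples, then leave the last answer open for the model."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} Answer: {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


exemplars = [
    Exemplar(
        question="Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls?",
        reasoning="Roger starts with 5. 2 cans × 3 = 6 more. Total: 5 + 6 = 11.",
        answer="11",
    )
]
prompt = build_few_shot_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. 20 used for lunch. 6 more bought. How many?",
)
print(prompt)
```

In practice the list would hold several exemplars rather than one; a single exemplar keeps the sketch short.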
Zero-shot CoT (Kojima et al., 2022): Append “Let’s think step by step” after the question, with no worked examples. This single phrase prompts large models to write out their reasoning before answering.
In the original experiments, the benefit depended on scale: it appeared mainly in models around GPT-3 size and above (approximately 100B+ parameters). Smaller models tended to produce fluent but illogical chains, so CoT gave them little or no gain and sometimes hurt accuracy.
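A minimal Python sketch of the zero-shot variant follows. The function names are hypothetical, the completion string is hand-written to stand in for model output, and taking the last number in the completion as the answer is a simplifying assumption for arithmetic questions, not the extraction procedure used in the paper.

```python
import re

TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question):
    """Append the trigger phrase where the model's answer would begin."""
    return f"Q: {question}\nA: {TRIGGER}"


def extract_final_number(completion):
    """Return the last number mentioned in the completion, or None."""
    # Drop thousands separators so "1,000" is read as one number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


prompt = zero_shot_cot_prompt(
    "The cafeteria had 23 apples. 20 used for lunch. 6 more bought. How many?"
)

# Hand-written stand-in for what a model might return for this prompt.
completion = "There were 23 apples. 23 - 20 = 3 left. 3 + 6 = 9. The answer is 9."
answer = extract_final_number(completion)
print(prompt)
print(answer)
```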
Self-Consistency: Sampling Multiple Reasoning Paths
A single CoT answer can be wrong if the model takes a wrong reasoning branch. Wang et al. (2022) “Self-Consistency Improves Chain-of-Thought Reasoning in Language Models” addressed this.
Instead of generating one reasoning chain, generate $K$ independent chains (using temperature sampling for diversity), then take the majority vote on the final answers.
Chain 1: ... → 11
Chain 2: ... → 11
Chain 3: ... → 9 (wrong path)
Chain 4: ... → 11
Chain 5: ... → 11
Majority vote: 11 ✓
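The voting step can be sketched in a few lines of Python. Here `sample_answer` is a hypothetical callable that would run one temperature-sampled chain and return its final answer; in this sketch it is replaced by canned answers matching the five chains above.

```python
from collections import Counter


def self_consistency(sample_answer, question, k=5):
    """Sample k chains and return (majority answer, vote counts)."""
    answers = [sample_answer(question) for _ in range(k)]
    # Chains that produced no parseable answer (None) do not vote.
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, votes
    winner, _ = votes.most_common(1)[0]
    return winner, votes


# Canned final answers standing in for five sampled reasoning chains.
canned = iter(["11", "11", "9", "11", "11"])
winner, votes = self_consistency(lambda q: next(canned), "How many balls?", k=5)
print(winner, dict(votes))
```

Only the final answers are compared; the reasoning text of each chain is discarded before the vote.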
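With $K=40$ samples, self-consistency improved GSM8K accuracy from about 57% to about 74% with PaLM 540B (and from about 60% to about 78% with code-davinci-002). The gain is robust because wrong paths tend to scatter across different wrong answers, while correct paths converge on the same one.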
Cost: 40x more compute. For high-stakes decisions (medical, legal, scientific), this is often worth it.
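The intuition that scattered wrong answers lose the vote can be checked with a toy simulation. This is an idealized sketch, not a reproduction of any reported result: it assumes every chain is independent, is correct with the same probability, and otherwise picks uniformly among a handful of wrong answers. All parameter names and values are illustrative assumptions. Real chains share the same model and prompt, so their errors are correlated and real gains are smaller than this simulation suggests.

```python
import random
from collections import Counter


def simulate_vote_accuracy(k, p_correct=0.57, n_wrong_answers=5,
                           trials=2000, seed=0):
    """Fraction of trials in which the majority vote over k chains is correct."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        answers = []
        for _ in range(k):
            if rng.random() < p_correct:
                answers.append("correct")
            else:
                # Wrong chains scatter across several different wrong answers.
                answers.append(f"wrong-{rng.randrange(n_wrong_answers)}")
        winner, _ = Counter(answers).most_common(1)[0]
        hits += winner == "correct"
    return hits / trials


for k in (1, 5, 40):
    print(k, simulate_vote_accuracy(k))
```

The same loop also makes the cost visible: each trial at `k=40` draws forty answers where `k=1` draws one.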
Least-to-Most Prompting and Decomposition
Some problems are too complex for direct CoT. Least-to-most prompting (Zhou et al., 2022) tackles them in stages:
- Decompose the question into simpler sub-questions
- Solve the simpler sub-questions first
- Use their solutions to answer the original question
Question: What is the boiling point of the most abundant element in Earth's crust?
Step 1: What is the most abundant element in Earth's crust? → Oxygen
Step 2: What is the boiling point of oxygen? → -183°C
Answer: -183°C
This enables solving “compositional” problems where the answers to earlier sub-questions inform later ones. On some symbolic reasoning benchmarks, such as SCAN, least-to-most reaches near-perfect accuracy where standard CoT falls far short.
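The decompose-then-solve loop can be sketched in Python as follows. `decompose` and `answer_subquestion` are hypothetical callables that would each wrap a model call; here they are replaced by canned stubs, and the question is phrased in terms of Earth's crust so that oxygen is the unambiguous first answer. The point of the sketch is the growing context: each sub-question is asked with all earlier question and answer pairs in front of it.

```python
def least_to_most(question, decompose, answer_subquestion):
    """Solve sub-questions in order, feeding earlier answers into later prompts."""
    solved = []  # (sub-question, answer) pairs, easiest first
    for sub in decompose(question):
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in solved)
        prompt = f"{context}\nQ: {sub}\nA:".lstrip()
        solved.append((sub, answer_subquestion(prompt)))
    return solved[-1][1], solved


def fake_decompose(question):
    """Stub for the decomposition call (a model would produce this list)."""
    return [
        "What is the most abundant element in Earth's crust?",
        "What is the boiling point of that element?",
    ]


def fake_answer(prompt):
    """Stub for the solving call; the second answer needs the first in context."""
    if prompt.endswith("What is the boiling point of that element?\nA:"):
        return "-183°C" if "A: Oxygen" in prompt else "unknown"
    return "Oxygen"


final, trace = least_to_most(
    "What is the boiling point of the most abundant element in Earth's crust?",
    fake_decompose,
    fake_answer,
)
print(final)
for sub, ans in trace:
    print(sub, "->", ans)
```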
Tree of Thoughts
Yao et al. (2023) “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” extended linear chains to tree-structured search.
For a problem with multiple possible reasoning paths:
- Generate multiple “thought” branches at each step
- Evaluate each branch (using the LLM itself as evaluator)
- Select the most promising branch to continue
- Backtrack if a branch leads to a dead end
This enables deliberate planning — the model explores multiple approaches and commits to the best one rather than following one chain sequentially.
ToT showed its largest gains on tasks where single-path reasoning often fails, such as creative writing and puzzles. On Game of 24 (combine four numbers with arithmetic operations to reach 24), GPT-4 solved 4% of problems with standard CoT and 74% with ToT.
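The generate, evaluate, select, and backtrack loop can be illustrated on Game of 24 with a toy Python sketch. It is not the paper's implementation: candidate thoughts are enumerated exhaustively instead of being proposed by an LLM, `score` is a hand-written heuristic standing in for the LLM evaluator, and depth-first search with backtracking is only one possible search strategy. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

TARGET = Fraction(24)


def expand(state):
    """Generate thoughts: every way to combine two of the remaining numbers."""
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [x for k, x in enumerate(state) if k not in (i, j)]
        candidates = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                      (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            candidates.append((a / b, f"{a} / {b}"))
        if a != 0:
            candidates.append((b / a, f"{b} / {a}"))
        for value, text in candidates:
            yield tuple(rest + [value]), f"{text} = {value}"


def score(state):
    """Evaluate a branch: closer to 24 is better (stand-in for an LLM judge)."""
    return -min(abs(x - TARGET) for x in state)


def solve(state, path=()):
    """Try the most promising branch first; return None to signal a dead end."""
    if len(state) == 1:
        return list(path) if state[0] == TARGET else None
    children = sorted(expand(state), key=lambda c: score(c[0]), reverse=True)
    for child, step in children:
        result = solve(child, path + (step,))
        if result is not None:
            return result
    return None  # every child failed, so the caller backtracks


def solve_24(numbers):
    """Return a list of steps that reaches 24, or None if no sequence exists."""
    return solve(tuple(Fraction(n) for n in numbers))


print(solve_24([4, 9, 10, 13]))
```

Because this sketch enumerates every branch, it finds a solution whenever one exists. An LLM-driven version would propose only a few thoughts per step and prune low-scoring ones, trading that guarantee for far fewer model calls.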
o1 and the Reasoning Model Paradigm
OpenAI’s o1 (2024) represented a fundamental shift: instead of chain-of-thought as a prompting technique, it’s baked into the training and inference process.
o1 is trained with reinforcement learning to produce extended “reasoning tokens” — an internal scratchpad where the model thinks through problems — before generating the final response.
Key differences from prompting-based CoT:
- Longer chains: o1 can reason for thousands of tokens before answering
- Self-correction: o1 explicitly reconsiders and revises during its chain
- Scaling: More “thinking” time → better performance (test-time compute scaling)
- Hidden reasoning: The reasoning chain is hidden from users (only the answer is shown)
On competition math (AIME 2024), GPT-4o scored about 13% and o1-preview about 56%. On GPQA (graduate-level science), GPT-4o scored 53% and o1 scored 78%.
The “o1 insight”: some of what looks like greater intelligence can be obtained by spending more thinking time on hard problems. This suggests a new scaling axis beyond training compute, namely inference-time compute, and motivates the “reasoning model” paradigm where models reason extensively before answering.
One thing to remember: Chain-of-thought matters beyond prompting. It revealed that LLMs have latent reasoning capabilities that surface when the reasoning is written out step by step, and this insight is now built into training and inference to create more capable reasoning systems.