Chain-of-Thought Reasoning — Core Concepts

Few-shot and zero-shot CoT, self-consistency sampling, tree of thoughts, and how o1-style reasoning models take chain-of-thought from prompting trick to architectural paradigm.

The Paper That Changed Everything

Wei et al. (Google, 2022) “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” showed that providing a few examples of step-by-step reasoning dramatically improved LLM performance on multi-step problems.

The key comparison on GSM8K (grade-school math word problems), as reported for PaLM 540B:

Standard prompting (just Q→A examples): ~18% accuracy
CoT prompting (Q + reasoning steps → A examples): ~57% accuracy

This 3x improvement required zero additional training — just changing the prompt format.

The result looked too good to be true, but the gains held across arithmetic, commonsense, and symbolic reasoning benchmarks and across several model families. The common interpretation is that the capability was already latent in large models and CoT prompting elicited it.

Few-Shot CoT vs. Zero-Shot CoT

Few-shot CoT: Provide 4–8 examples of (question + reasoning chain + answer) before the actual question. The model learns the reasoning format from examples.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls?
A: Roger starts with 5. 2 cans × 3 = 6 more. Total: 5 + 6 = 11. Answer: 11.

Q: The cafeteria had 23 apples. 20 used for lunch. 6 more bought. How many?
A: [LLM generates reasoning chain here]
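The prompt layout above can be assembled programmatically. The following is a minimal sketch, not a reference implementation: the `Exemplar` class and `build_few_shot_cot_prompt` function are hypothetical names introduced here for illustration, and the `Q:` / `A:` / `Answer:` layout simply mirrors the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One worked example: question, reasoning chain, final answer (hypothetical helper)."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Join worked examples and the new question into one few-shot CoT prompt."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} Answer: {ex.answer}."
        for ex in exemplars
    ]
    # The final block leaves the answer open so the model writes its own chain.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


demo = [
    Exemplar(
        question="Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls?",
        reasoning="Roger starts with 5. 2 cans × 3 = 6 more. Total: 5 + 6 = 11.",
        answer="11",
    )
]
prompt = build_few_shot_cot_prompt(
    demo, "The cafeteria had 23 apples. 20 used for lunch. 6 more bought. How many?"
)
print(prompt)
```

In practice the exemplar list would hold the 4–8 worked examples mentioned above; one is used here to keep the sketch short.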

Zero-shot CoT (Kojima et al., 2022): Add “Let’s think step by step” to any question, without examples. This single phrase triggers extended reasoning in large models.

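Like few-shot CoT, the zero-shot variant helped mainly in sufficiently large models in the original experiments, on the order of 100B parameters (GPT-3 scale and above). Smaller models of that era gained little from CoT and often produced fluent but illogical chains, which is why Wei et al. described chain-of-thought as an emergent ability of model scale.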
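A zero-shot CoT call needs only two pieces: a prompt that ends with the trigger phrase and a way to pull the final answer out of the generated chain. The sketch below is an illustration under stated assumptions: `generate` stands for any text-completion function (hypothetical, stubbed here), and taking the last number in the output is a simple heuristic chosen for this example rather than the answer-extraction step used in the paper.

```python
import re
from typing import Callable, Optional

TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Format a question so the model continues with a reasoning chain."""
    return f"Q: {question}\nA: {TRIGGER}"


def extract_final_number(text: str) -> Optional[str]:
    """Heuristic (assumption): treat the last number in the chain as the answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def answer_zero_shot(question: str, generate: Callable[[str], str]) -> Optional[str]:
    """Run one zero-shot CoT query through a caller-supplied completion function."""
    chain = generate(zero_shot_cot_prompt(question))
    return extract_final_number(chain)


def fake_generate(prompt: str) -> str:
    """Stub standing in for a real model call; returns a canned chain."""
    return " 23 - 20 = 3 apples remain. 3 + 6 = 9. The answer is 9."


result = answer_zero_shot(
    "The cafeteria had 23 apples. 20 used for lunch. 6 more bought. How many?",
    fake_generate,
)
print(result)
```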

Self-Consistency: Sampling Multiple Reasoning Paths

A single CoT answer can be wrong if the model takes a wrong reasoning branch. Wang et al. (2022) “Self-Consistency Improves Chain-of-Thought Reasoning in Language Models” addressed this.

Instead of generating one reasoning chain, generate $K$ independent chains (using temperature sampling for diversity), then take the majority vote on the final answers.

Chain 1: ... → 11
Chain 2: ... → 11  
Chain 3: ... → 9   (wrong path)
Chain 4: ... → 11
Chain 5: ... → 11

Majority vote: 11 ✓

With $K=40$ samples, self-consistency improved GSM8K accuracy from about 57% to about 74% on PaLM 540B. The gain comes from the fact that wrong paths tend to scatter across different wrong answers, while correct paths converge on the same one.

Cost: roughly 40x more inference compute per question. For high-stakes decisions (medical, legal, scientific), this is often worth it.
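The voting step itself is small. Here is a minimal sketch, assuming a hypothetical `sample_answer` callable that runs one temperature-sampled chain and returns its extracted final answer (or `None` if no answer could be parsed); the agreement ratio returned alongside the winner is an addition for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def self_consistency(
    sample_answer: Callable[[], Optional[str]], k: int = 40
) -> Tuple[Optional[str], float]:
    """Sample k chains and return (majority answer, share of votes it received)."""
    answers = [a for a in (sample_answer() for _ in range(k)) if a is not None]
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Replay the five chains from the example above instead of calling a model.
chains = iter(["11", "11", "9", "11", "11"])
answer, agreement = self_consistency(lambda: next(chains), k=5)
print(answer, agreement)
```

A low agreement ratio can serve as a rough signal that the question deserves more samples or human review, although that use is a design suggestion rather than something taken from the paper.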

Least-to-Most Prompting and Decomposition

Some problems are too complex for direct CoT. Least-to-most prompting (Zhou et al., 2022):

1. Decompose the question into simpler sub-questions
2. Solve the simpler sub-questions first
3. Use their solutions to answer the original question

Question: What is the boiling point of the most abundant element in Earth's crust?
Step 1: What is the most abundant element in Earth's crust? → Oxygen
Step 2: What is the boiling point of oxygen? → -183°C
Answer: -183°C

This enables solving “compositional” problems where the answer to sub-questions informs later sub-questions. On some symbolic reasoning benchmarks, least-to-most achieves near-perfect accuracy where standard CoT plateaus.
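The decompose-then-solve loop can be sketched as follows. This is an illustration under assumptions, not the paper's code: `decompose` and `solve` are hypothetical callables that would each wrap a model call, and both are stubbed here so the example runs on its own. The important detail is that every earlier sub-question and its answer is placed in the prompt for the next one.

```python
from typing import Callable, List, Tuple


def least_to_most(
    question: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Solve sub-questions in order, feeding earlier answers into later prompts."""
    solved: List[Tuple[str, str]] = []
    for sub_question in decompose(question):
        history = "".join(f"Q: {q}\nA: {a}\n" for q, a in solved)
        answer = solve(f"{history}Q: {sub_question}\nA:")
        solved.append((sub_question, answer))
    return solved


def fake_decompose(question: str) -> List[str]:
    """Stub for the decomposition call (hypothetical canned output)."""
    return [
        "What is the most abundant element in Earth's crust?",
        "What is the boiling point of that element?",
    ]


def fake_solve(prompt: str) -> str:
    """Stub for the solving call; the second answer depends on the first being in context."""
    current = prompt.rsplit("Q: ", 1)[-1]
    if "most abundant" in current:
        return "Oxygen"
    if "boiling point" in current and "Oxygen" in prompt:
        return "-183°C"
    return "unknown"


steps = least_to_most(
    "What is the boiling point of the most abundant element in Earth's crust?",
    fake_decompose,
    fake_solve,
)
final_answer = steps[-1][1]
print(final_answer)
```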

Tree of Thoughts

Yao et al. (2023) “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” extended linear chains to tree-structured search.

For a problem with multiple possible reasoning paths:

1. Generate multiple “thought” branches at each step
2. Evaluate each branch (using the LLM itself as evaluator)
3. Keep the most promising branch or branches to continue
4. Backtrack or prune if a branch leads to a dead end

This enables deliberate planning — the model explores multiple approaches and commits to the best one rather than following one chain sequentially.

ToT improved performance on tasks where single-path reasoning often fails, including creative writing and puzzles. On the Game of 24 (combine four numbers with arithmetic operations to reach 24), GPT-4 went from 4% success with standard CoT to 74% with ToT.
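The search loop can be sketched as a breadth-first search with a fixed beam. This is a minimal illustration, not the authors' implementation: in the paper both the thought generator and the evaluator are LLM calls, whereas `propose`, `evaluate`, and `is_goal` here are hypothetical callables, and the toy arithmetic task (reach 24 from 1 using +1, ×2, ×3) is invented for the example. Dropping low-scoring states at each level plays the role of abandoning dead ends.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional


def tree_of_thoughts_bfs(
    root: Hashable,
    propose: Callable[[Hashable], Iterable[Hashable]],
    evaluate: Callable[[Hashable], float],
    is_goal: Callable[[Hashable], bool],
    beam_width: int = 3,
    max_depth: int = 4,
) -> Optional[List[Hashable]]:
    """Return the path of states from root to a goal, or None if none is found."""
    frontier: Dict[Hashable, List[Hashable]] = {root: [root]}  # state -> path to it
    for _ in range(max_depth):
        candidates: Dict[Hashable, List[Hashable]] = {}
        for state, path in frontier.items():
            for nxt in propose(state):
                if is_goal(nxt):
                    return path + [nxt]
                # Keep only the first path found to each distinct state.
                candidates.setdefault(nxt, path + [nxt])
        if not candidates:
            return None
        # Keep the beam_width highest-scoring states; the rest are pruned.
        best = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
        frontier = {state: candidates[state] for state in best}
    return None


TARGET = 24


def propose(x):
    return [x + 1, x * 2, x * 3]


def evaluate(x):
    return -abs(TARGET - x)  # crude stand-in for an LLM's judgment of a state


def is_goal(x):
    return x == TARGET


wide = tree_of_thoughts_bfs(1, propose, evaluate, is_goal, beam_width=3)
greedy = tree_of_thoughts_bfs(1, propose, evaluate, is_goal, beam_width=1)
print(wide)    # [1, 3, 6, 12, 24]
print(greedy)  # None
```

With a beam of 1 the search follows the locally best-looking state (1 → 3 → 9 → 27) and overshoots, while a beam of 3 keeps the less attractive state 12 alive long enough to reach 24. That is the single-path failure mode the method is meant to address, shown on a deliberately tiny problem.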

o1 and the Reasoning Model Paradigm

OpenAI’s o1 (2024) represented a fundamental shift: instead of chain-of-thought as a prompting technique, it’s baked into the training and inference process.

o1 is trained with reinforcement learning to produce extended “reasoning tokens” — an internal scratchpad where the model thinks through problems — before generating the final response.

Key differences from prompting-based CoT:

Longer chains: o1 can reason for thousands of tokens before answering
Self-correction: o1 explicitly reconsiders and revises during its chain
Scaling: More “thinking” time → better performance (test-time compute scaling)
Hidden reasoning: The raw reasoning chain is hidden from users (only a summary and the final answer are shown)

On competition math (AIME 2024): GPT-4o → ~13% accuracy; o1-preview → ~56% accuracy. On GPQA (graduate-level science): GPT-4o → 53%; o1 → 78%.

The “o1 insight”: part of what looks like intelligence can be bought with more thinking time on hard problems. This suggests a new scaling axis beyond training compute, namely inference-time compute, and motivates the “reasoning model” paradigm where models reason extensively before answering.

One thing to remember: Chain-of-thought’s importance extends beyond prompting. It revealed that LLMs have latent reasoning capabilities that need explicit, sequential expression, and this insight is now being applied at the training and inference level to build more capable reasoning systems.
