Few-Shot Learning — Core Concepts
The N-way K-shot Problem
Few-shot learning is typically framed as N-way K-shot classification:
- N: Number of new classes to distinguish (e.g., 5-way = distinguish between 5 new classes)
- K: Number of labeled examples per class available at test time
Concrete example: 5-way 1-shot. You’re given 1 photo each of 5 animals the model never saw in training (aardvark, tapir, capybara, pangolin, platypus). You are then shown a test photo: which of these 5 is it?
A model with no knowledge of these animals can do no better than chance, which is 20% accuracy (1 in 5). A good few-shot learner reuses visual feature representations learned from other animals (cats, dogs, horses) to judge visual similarity, and can reach 70–80% accuracy.
The key constraint: the 5 labeled examples (the support set) only become available at test time. The model cannot be trained on these classes in advance; it has to classify, or adapt, from that handful of examples alone.
Two Paradigms: Metric Learning and Meta-Learning
Metric Learning: Prototypical Networks
Prototypical networks (Snell et al., 2017, “Prototypical Networks for Few-Shot Learning”) are one of the simplest effective approaches.
Core idea: Each class is represented by the mean of its support set examples in embedding space (its “prototype”). Classification is nearest-prototype lookup.
For class $c$ with support examples $\{x_1^c, \dots, x_K^c\}$: $$\text{prototype}_c = \frac{1}{K} \sum_{i=1}^K f_\phi(x_i^c)$$
Classification of query $x$: $$p(y = c \mid x) = \frac{\exp(-d(f_\phi(x), \text{prototype}_c))}{\sum_{c'} \exp(-d(f_\phi(x), \text{prototype}_{c'}))}$$
Here $f_\phi$ is a learned encoder and $d$ is a distance function; Snell et al. use squared Euclidean distance.
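The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes the embeddings $f_\phi(x)$ have already been computed, uses squared Euclidean distance for $d$, and the helper names and the toy 2-D embeddings are made up for the example.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototype_probs(query, support):
    """support maps class label -> list of embedded support examples."""
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = {c: mean_vector(embs) for c, embs in support.items()}
    # Negative distance to each prototype acts as the class logit.
    logits = {c: -squared_euclidean(query, p) for c, p in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Toy 2-way 2-shot task with hand-written (hypothetical) embeddings.
support = {
    "tapir": [[0.0, 0.0], [0.2, 0.0]],
    "pangolin": [[1.0, 1.0], [1.0, 0.8]],
}
probs = prototype_probs([0.1, 0.1], support)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

In a real system the embeddings would come from the trained encoder, and the same softmax over negative distances would serve as the training loss on query examples.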
Training is episodic: each episode simulates a few-shot task. It randomly selects N classes from the training classes, then K support examples and Q query examples for each of them. The encoder learns representations in which examples of the same class cluster together and different classes are well separated.
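Episode construction is mostly bookkeeping, and a short sketch may make it concrete. The function name `sample_episode`, its parameters, and the toy dataset of string placeholders are assumptions for illustration; a real pipeline would sample image tensors instead.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, q_queries, rng):
    """Sample one N-way K-shot episode.

    data_by_class maps class label -> list of examples.
    Returns (support, query), each mapping label -> list of examples.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        # Draw support and query examples together so they never overlap.
        picked = rng.sample(data_by_class[c], k_shot + q_queries)
        support[c] = picked[:k_shot]
        query[c] = picked[k_shot:]
    return support, query

rng = random.Random(0)
# Toy training set: 10 classes with 20 placeholder examples each.
toy = {f"class_{i}": [f"img_{i}_{j}" for j in range(20)] for i in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, q_queries=3, rng=rng)
```

The encoder is then updated on the loss computed over the query examples, and a fresh episode is drawn for the next step.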
Why it works: With a good encoder (learned from many training episodes), similar images have similar embeddings. The prototype (mean embedding) represents the “center of gravity” of each class. When K > 1, averaging makes nearest-prototype classification less sensitive to any single noisy example.
Results: Prototypical networks on miniImageNet 5-way 5-shot: ~65% accuracy with standard CNNs, ~85%+ with ViT encoders. Human performance on the same task: ~90%.
Meta-Learning: MAML
Model-Agnostic Meta-Learning (MAML, Finn et al., 2017) learns not a good model, but a good initialization for fast adaptation.
Training objective: Find parameters $\theta$ such that after one or a few gradient steps on any new task, the model performs well:
$$\min_\theta \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\right)$$
Inner loop: take a gradient step on the specific task with learning rate $\alpha$. Outer loop: update $\theta$ based on the task loss measured after the inner-loop step.
The outer loop optimizes $\theta$ for the outcome of adaptation, not for current performance. As a result, $\theta$ need not be a good general model (it may perform poorly before adaptation), but it is a good starting point that reaches high performance after one or a few gradient steps.
MAML requires second-order derivatives (gradient through gradient), making it expensive. FOMAML (first-order approximation) drops second-order terms and performs nearly as well in practice.
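To see where the second-order term comes from, consider a deliberately tiny setup: a scalar parameter and tasks whose loss is $\frac{1}{2}(\theta - a_i)^2$ for a task-specific target $a_i$. The sketch below is only a toy under those assumptions (the targets, the outer learning rate `beta`, and averaging over tasks rather than summing are all choices made for the example). With this loss the Hessian is 1, so backpropagating through the inner step multiplies the gradient by $1 - \alpha$; the first-order variant simply drops that factor.

```python
def inner_step(theta, target, alpha):
    """One gradient step on the task loss 0.5 * (theta - target) ** 2."""
    grad = theta - target
    return theta - alpha * grad

def meta_gradient(theta, targets, alpha, first_order=False):
    """Gradient of the post-adaptation loss with respect to the initial theta."""
    total = 0.0
    for target in targets:
        adapted = inner_step(theta, target, alpha)
        grad_at_adapted = adapted - target
        # Full MAML differentiates through the inner step:
        # d(adapted)/d(theta) = 1 - alpha * hessian, and the hessian is 1 here.
        # FOMAML treats that derivative as 1.
        jacobian = 1.0 if first_order else (1.0 - alpha)
        total += grad_at_adapted * jacobian
    return total / len(targets)

theta, alpha, beta = 5.0, 0.1, 0.5   # beta: outer-loop learning rate (assumed)
targets = [-1.0, 0.0, 2.0, 3.0]      # one hypothetical target per task
for _ in range(200):
    theta -= beta * meta_gradient(theta, targets, alpha)
print(theta)
```

On this toy problem the learned initialization settles at the mean of the task targets, the point from which a single step gets closest to every task on average. In a neural network the Jacobian is a matrix involving the Hessian of the loss, which is why full MAML is expensive.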
Reptile (Nichol et al., OpenAI, 2018): A simpler meta-learning algorithm. Repeatedly sample a task, run a few steps of SGD on it starting from $\theta$ to obtain task-specific parameters $\tilde{\theta}_i$, then move $\theta$ toward them:
$$\theta \leftarrow \theta + \epsilon(\tilde{\theta}_i - \theta)$$
Reptile is easier to implement, doesn’t require second-order derivatives, and performs comparably to MAML on most benchmarks.
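The Reptile update is short enough to write out in full. The sketch below uses a toy setting that is an assumption of this example, not part of the algorithm: a scalar parameter, tasks with loss $\frac{1}{2}(\theta - a_i)^2$, and a small fixed number of inner SGD steps. The hyperparameter values are illustrative.

```python
import random

def adapt(theta, target, alpha, steps):
    """Run a few SGD steps on the task loss 0.5 * (theta - target) ** 2."""
    for _ in range(steps):
        theta -= alpha * (theta - target)
    return theta

def reptile(theta, targets, alpha=0.1, epsilon=0.1,
            inner_steps=5, iterations=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        target = rng.choice(targets)          # sample a task
        adapted = adapt(theta, target, alpha, inner_steps)
        # Move the initialization a fraction epsilon toward the adapted weights.
        theta = theta + epsilon * (adapted - theta)
    return theta

theta = reptile(5.0, [-1.0, 0.0, 2.0, 3.0])
print(theta)
```

No gradient is taken through the inner loop at any point; the outer update only needs the difference between the adapted and the initial parameters.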
In-Context Learning: Few-Shot Without Gradients
GPT-3 (Brown et al., 2020) demonstrated a new paradigm: few-shot learning entirely through the prompt, without any gradient updates at test time.
For example, a few-shot translation prompt:
English: I love pizza. French: J'aime la pizza.
English: Where is the library? French: Où est la bibliothèque?
English: The cat is sleeping. French:
GPT-3 completes this with “Le chat dort.” The two examples in the context window were enough to specify the task; the ability to translate comes from pretraining.
This “in-context learning” (ICL) isn’t the same as standard meta-learning — no gradient updates occur. The model uses its frozen parameters to perform the new task.
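Assembling such a prompt is plain string formatting. The helper below is a hypothetical sketch that reproduces the layout of the translation example above; the function name and its parameters are assumptions, and it only builds the text, it does not call any model.

```python
def build_few_shot_prompt(examples, query, source="English", target="French"):
    """Format (source, target) example pairs followed by an unanswered query."""
    lines = [f"{source}: {src} {target}: {tgt}" for src, tgt in examples]
    # The final line stops after the target label so the model completes it.
    lines.append(f"{source}: {query} {target}:")
    return "\n".join(lines)

examples = [
    ("I love pizza.", "J'aime la pizza."),
    ("Where is the library?", "Où est la bibliothèque?"),
]
prompt = build_few_shot_prompt(examples, "The cat is sleeping.")
print(prompt)
```

Keeping every example in exactly the same format as the final, incomplete line matters: the model continues the pattern it sees.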
What enables ICL? Mechanistic interpretability research suggests ICL relies on “induction heads” — attention mechanisms that identify patterns of the form “A appears, then B appears; A appears again, therefore predict B.” When shown translation examples, these heads identify the pattern: “[English phrase] → [French phrase]”, and apply it to the test phrase.
ICL ability grows with model scale. GPT-2-scale models show limited ICL; GPT-3-scale models generalize from in-context examples dramatically better.
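The “A, B, ... A, therefore B” pattern can be written as an explicit rule over a token list. The sketch below is a caricature of the behavior attributed to induction heads, not a description of how attention computes it; the function name is made up for the example.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the previous
    occurrence of the current (last) token, if there is one."""
    last = tokens[-1]
    # Scan backwards over earlier positions for the most recent match.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # the current token has not appeared before

print(induction_predict(["A", "B", "C", "A"]))
```

A transformer implements a soft, learned version of this lookup, and the matching can be on abstract features instead of exact token identity.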
Practical Few-Shot Strategies
Prompt engineering as few-shot learning: For LLMs, carefully chosen few-shot examples dramatically affect output quality. Selecting examples that:
- Span the input distribution
- Include edge cases relevant to the task
- Follow the exact input-output format you want
can improve accuracy by 10–30% vs. zero-shot prompting on many tasks.
Retrieval-augmented few-shot: Retrieve the most relevant examples from a database dynamically, rather than using fixed examples. For each query, embed it, find its k nearest neighbors in a database of labeled examples, and use those as the few-shot context. This is particularly effective for tasks with diverse input types.
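The retrieval step can be sketched with cosine similarity over precomputed embeddings. Everything concrete here is an assumption for illustration: the 2-D vectors stand in for the output of a real embedding model, and the support-ticket texts and labels are invented.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_examples(query_emb, database, k):
    """database: list of (embedding, input_text, label) rows.
    Returns the k most similar (input_text, label) pairs, best first."""
    ranked = sorted(database, key=lambda row: cosine(query_emb, row[0]),
                    reverse=True)
    return [(text, label) for _, text, label in ranked[:k]]

# Hypothetical labeled example store with hand-written 2-D embeddings.
database = [
    ([1.0, 0.0], "refund not received", "billing"),
    ([0.9, 0.1], "charged twice", "billing"),
    ([0.0, 1.0], "app crashes on launch", "bug"),
    ([0.1, 0.9], "login button does nothing", "bug"),
]
shots = retrieve_examples([0.8, 0.2], database, k=2)
print(shots)
```

The retrieved pairs are then formatted into the prompt ahead of the query. For large databases the exhaustive sort would normally be replaced by an approximate nearest-neighbor index.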
Parameter-efficient fine-tuning (LoRA, prefix tuning): For production few-shot tasks, 10–100 examples often justify light fine-tuning with LoRA (Low-Rank Adaptation). Adding adapters that update only 0.1% of parameters can significantly outperform pure in-context learning.
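A quick calculation shows why low-rank adapters are so small. LoRA adds two low-rank factors to a weight matrix of shape d_out × d_in, which amounts to rank × (d_out + d_in) trainable parameters. The matrix size and rank below are illustrative assumptions, not figures from any particular model.

```python
def lora_adapter_params(d_out, d_in, rank):
    """Trainable parameters in the two low-rank factors for one weight matrix."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8          # assumed, illustrative sizes
full_params = d_out * d_in
adapter_params = lora_adapter_params(d_out, d_in, rank)
fraction = adapter_params / full_params
print(adapter_params, full_params, f"{fraction:.2%}")
```

This is the fraction for a single adapted matrix. The fraction of the whole model is smaller still when adapters are attached to only some of its weight matrices.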
One thing to remember: Few-shot learning is the convergence of two ideas — rich pretrained representations (that encode general visual/linguistic structure) and efficient adaptation mechanisms (that use those representations to generalize from minimal examples).
See Also
- Contrastive Learning: How AI learns what things are like each other, and what they're not, without any labels, creating the representations behind image search and face recognition.
- Data Augmentation: How AI systems make do with less data by creating variations of what they have; the training trick that prevented ImageNet models from memorizing training examples.
- LoRA Fine-Tuning: How AI companies adapt massive models to specific tasks by training only a tiny fraction of the parameters; the technique making custom AI affordable.
- Reinforcement Learning Fundamentals: How AI learns from trial, error, and rewards; the technique that beat the world chess champion, solved protein folding, and is now teaching robots to walk.
- Self-Supervised Learning: How AI learned to teach itself from unlabeled data; the technique that let GPT and BERT learn from the entire internet without any human labeling.