Few-Shot Learning — Core Concepts

N-way K-shot classification, prototypical networks, MAML meta-learning, and how in-context learning in LLMs achieves few-shot performance without gradient updates.

The N-way K-shot Problem

Few-shot learning is typically framed as N-way K-shot classification:

N: Number of new classes to distinguish (e.g., 5-way = distinguish between 5 new classes)
K: Number of labeled examples per class available at test time

Concrete example: 5-way 1-shot. You’re given 1 photo each of 5 animals you’ve never seen in training (aardvark, tapir, capybara, pangolin, platypus). Then shown a test photo: which of these 5 is it?

A model with no knowledge of these animals can only guess, which gives 20% accuracy (1 in 5). A good few-shot learner reuses visual feature representations learned from other animals (cats, dogs, horses) to judge visual similarity, and can reach accuracies in the 70–80% range.

The key constraint: the 5 labeled examples (the support set) are available only at test time. The model cannot be trained from scratch on them; it has to classify using what it already knows, plus at most a few cheap adaptation steps.

Two Paradigms: Metric Learning and Meta-Learning

Metric Learning: Prototypical Networks

Snell et al. (2017) “Prototypical Networks for Few-Shot Learning” is one of the simplest effective approaches.

Core idea: Each class is represented by the mean of its support set examples in embedding space (its “prototype”). Classification is nearest-prototype lookup.

For class $c$ with support examples $\{x_1^c, \dots, x_K^c\}$: $$\text{prototype}_c = \frac{1}{K} \sum_{i=1}^K f_\phi(x_i^c)$$

Classification of query $x$: $$p(y = c \mid x) = \frac{\exp(-d(f_\phi(x), \text{prototype}_c))}{\sum_{c'} \exp(-d(f_\phi(x), \text{prototype}_{c'}))}$$

Here $f_\phi$ is a learned encoder and $d$ is a distance function; Snell et al. use squared Euclidean distance.
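The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the 2-D vectors stand in for encoder outputs $f_\phi(x)$, the class names and numbers are made up, and $d$ is assumed to be squared Euclidean distance.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def squared_euclidean(a, b):
    """Squared Euclidean distance (assumed choice of d for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototype_probabilities(support, query):
    """support: dict class -> list of embedded support examples.
    query: one embedded query example.
    Returns dict class -> p(y = c | x), a softmax over negative distances."""
    prototypes = {c: mean_vector(vs) for c, vs in support.items()}
    logits = {c: -squared_euclidean(query, p) for c, p in prototypes.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(l - top) for c, l in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Toy 2-way 2-shot episode with hand-made embeddings (illustrative only).
support = {
    "tapir": [[0.0, 0.0], [0.0, 2.0]],     # prototype is (0, 1)
    "pangolin": [[4.0, 0.0], [4.0, 2.0]],  # prototype is (4, 1)
}
probs = prototype_probabilities(support, query=[1.0, 1.0])
prediction = max(probs, key=probs.get)
```

The query sits at squared distance 1 from the tapir prototype and 9 from the pangolin prototype, so nearly all probability mass goes to tapir.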

Training is episodic: each training episode simulates a few-shot task. An episode randomly selects N classes from the training classes, then K support examples and Q query examples for each. The encoder learns representations in which classes form tight, well-separated clusters.
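Episode construction might look like the sketch below. The function name `sample_episode`, the toy dataset, and the choice to draw Q query examples per class are assumptions made for illustration.

```python
import random

def sample_episode(dataset, n_way, k_shot, q_queries, rng):
    """dataset: dict class_label -> list of examples.
    Returns (support, query), each a list of (example, label) pairs."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Draw K + Q distinct examples so support and query never overlap.
        picked = rng.sample(dataset[label], k_shot + q_queries)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Hypothetical toy dataset: 10 training classes with 20 example ids each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(
    toy, n_way=5, k_shot=1, q_queries=3, rng=random.Random(0)
)
```

In a training loop, the support pairs would be used to build prototypes and the query pairs to compute the loss that updates the encoder.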

Why it works: With a good encoder (learned from many training episodes), similar images have similar embeddings. The prototype (mean embedding) represents the “center of gravity” of each class. Nearest-prototype classification is robust to single-example noise.

Results: Prototypical networks on miniImageNet 5-way 5-shot: ~65% accuracy with standard CNNs, ~85%+ with ViT encoders. Human performance on the same task: ~90%.

Meta-Learning: MAML

Model-Agnostic Meta-Learning (MAML, Finn et al., 2017) learns not a good model, but a good initialization for fast adaptation.

Training objective: Find parameters $\theta$ such that after one or a few gradient steps on any new task, the model performs well:

$$\min_\theta \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\right)$$

Inner loop: a gradient step on the specific task with learning rate $\alpha$. Outer loop: update $\theta$ based on performance after the inner-loop step.

The outer loop optimizes $\theta$ for the outcome of adaptation, not for current performance. The result: $\theta$ is not necessarily a good general model (it may perform poorly before adaptation), but it is a good starting point that reaches high performance after one or a few gradient steps.

MAML requires second-order derivatives (gradient through gradient), making it expensive. FOMAML (first-order approximation) drops second-order terms and performs nearly as well in practice.
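To make the difference between MAML and FOMAML concrete, here is a toy sketch in which all gradients can be written by hand. Everything specific is an assumption chosen for illustration: the model is $f_\theta(x) = \theta x$, each task is a regression problem $y = a x$ with a random slope $a$, and `beta` is a hypothetical outer-loop learning rate. For this model the second-order term reduces to the scalar factor $1 - 2\alpha\,\overline{x^2}$.

```python
import random

def grad(theta, slope, xs):
    """Gradient of the mean squared error of f_theta(x) = theta * x
    against targets slope * x: 2 * (theta - slope) * mean(x^2)."""
    return 2.0 * (theta - slope) * sum(x * x for x in xs) / len(xs)

def meta_gradient(theta, slope, support_xs, query_xs, alpha, first_order=False):
    """Gradient of the post-adaptation query loss with respect to theta."""
    adapted = theta - alpha * grad(theta, slope, support_xs)  # inner loop
    outer = grad(adapted, slope, query_xs)  # d(query loss) / d(adapted)
    if first_order:
        # FOMAML: treat d(adapted)/d(theta) as 1, dropping second-order terms.
        return outer
    # Full MAML: chain rule through the inner gradient step.
    mean_sq = sum(x * x for x in support_xs) / len(support_xs)
    return outer * (1.0 - 2.0 * alpha * mean_sq)

def meta_train(first_order, steps=2000, alpha=0.1, beta=0.01, seed=0):
    """Outer loop: plain SGD on theta, one sampled task per step."""
    rng = random.Random(seed)
    theta = 5.0
    for _ in range(steps):
        slope = rng.uniform(-2.0, 2.0)  # sample a task
        support_xs = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        query_xs = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        theta -= beta * meta_gradient(
            theta, slope, support_xs, query_xs, alpha, first_order
        )
    return theta

theta_maml = meta_train(first_order=False)
theta_fomaml = meta_train(first_order=True)
```

In this toy setting both variants should end up close to 0, the center of the slope range, which is the initialization from which a single inner step gets closest to any sampled task on average. For a neural network the second-order factor becomes a Hessian-vector product, which is where the extra cost of full MAML comes from.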

Reptile (Nichol et al., OpenAI, 2018): A simpler meta-learning algorithm. Repeatedly sample a task, train on it for several SGD steps, then move $\theta$ toward the resulting task-specific parameters $\tilde{\theta}_i$ with step size $\epsilon$:

$$\theta \leftarrow \theta + \epsilon(\tilde{\theta}_i - \theta)$$

Reptile is easier to implement, doesn’t require second-order derivatives, and performs comparably to MAML on most benchmarks.
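The Reptile update can be sketched on a toy regression problem. The model $f_\theta(x) = \theta x$, the random-slope tasks, the fixed number of inner SGD steps, and the hyperparameter values are all assumptions made for illustration.

```python
import random

def grad(theta, slope, xs):
    """Gradient of the mean squared error of f_theta(x) = theta * x
    against targets slope * x."""
    return 2.0 * (theta - slope) * sum(x * x for x in xs) / len(xs)

def reptile_step(theta, slope, xs, inner_steps, alpha, epsilon):
    """One Reptile update: adapt to the task with plain SGD, then move
    theta a fraction epsilon of the way toward the adapted parameters."""
    task_theta = theta
    for _ in range(inner_steps):
        task_theta -= alpha * grad(task_theta, slope, xs)
    return theta + epsilon * (task_theta - theta)

def reptile(steps=2000, inner_steps=5, alpha=0.1, epsilon=0.05, seed=0):
    rng = random.Random(seed)
    theta = 5.0
    for _ in range(steps):
        slope = rng.uniform(-2.0, 2.0)  # sample a task
        xs = [rng.uniform(-1.0, 1.0) for _ in range(10)]
        theta = reptile_step(theta, slope, xs, inner_steps, alpha, epsilon)
    return theta

theta_reptile = reptile()
```

Only first-order gradients appear anywhere in the loop: the meta-update is just an interpolation between the current and the adapted parameters.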

In-Context Learning: Few-Shot Without Gradients

GPT-3 (Brown et al., 2020) demonstrated a new paradigm: few-shot learning entirely through the prompt, without any gradient updates at test time.

A few-shot translation prompt:

English: I love pizza. French: J'aime la pizza.
English: Where is the library? French: Où est la bibliothèque?
English: The cat is sleeping. French:

GPT-3 completes this with “Le chat dort.” The two examples in the context window tell the model which task to perform; the translation ability itself comes from pretraining.
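Assembling such a prompt programmatically is plain string formatting. The helper below is a hypothetical sketch (`build_few_shot_prompt` is not part of any library) that reproduces the layout of the prompt above and leaves the final answer slot open for the model.

```python
def build_few_shot_prompt(examples, query, source="English", target="French"):
    """Format (source_text, target_text) pairs as demonstrations, followed
    by the query with an empty target slot for the model to complete."""
    lines = [f"{source}: {src} {target}: {tgt}" for src, tgt in examples]
    lines.append(f"{source}: {query} {target}:")
    return "\n".join(lines)

examples = [
    ("I love pizza.", "J'aime la pizza."),
    ("Where is the library?", "Où est la bibliothèque?"),
]
prompt = build_few_shot_prompt(examples, "The cat is sleeping.")
```

The resulting string would be sent to the model as-is; no parameters change between one task and the next, only the prompt does.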

This “in-context learning” (ICL) isn’t the same as standard meta-learning — no gradient updates occur. The model uses its frozen parameters to perform the new task.

What enables ICL? Mechanistic interpretability research suggests that ICL relies in part on “induction heads”: attention heads that complete patterns of the form “A appears, then B appears; A appears again, therefore predict B.” Given translation examples, such heads can pick up the pattern “[English phrase] → [French phrase]” and apply it to the test phrase.

ICL ability grows with model scale. Small GPT-2-scale models have limited ICL ability; GPT-3-scale models generalize dramatically better.

Practical Few-Shot Strategies

Prompt engineering as few-shot learning: For LLMs, carefully chosen few-shot examples dramatically affect output quality. Selecting examples that:

Span the input distribution
Include edge cases relevant to the task
Follow the exact input-output format you want

can improve accuracy by 10–30% vs. zero-shot prompting on many tasks.

Retrieval-augmented few-shot: Retrieve the most relevant examples from a database dynamically, rather than using fixed examples. For each query, embed it, find its k nearest neighbors in a database of labeled examples, and use those as the few-shot context. This is particularly effective for tasks with diverse input types.
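The retrieval step can be sketched as a nearest-neighbor search by cosine similarity. The hand-made 2-D vectors below stand in for the output of a real embedding model, and the example texts and labels are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_examples(query_vec, database, k):
    """database: list of (embedding, input_text, label) rows.
    Returns the k (input_text, label) pairs most similar to the query."""
    ranked = sorted(database, key=lambda row: cosine(query_vec, row[0]), reverse=True)
    return [(text, label) for _, text, label in ranked[:k]]

# Hypothetical labeled example database with toy embeddings.
database = [
    ([1.0, 0.0], "Refund not received", "billing"),
    ([0.9, 0.1], "Charged twice this month", "billing"),
    ([0.0, 1.0], "App crashes on launch", "bug"),
    ([0.1, 0.9], "Login button does nothing", "bug"),
]
shots = retrieve_examples([0.2, 1.0], database, k=2)
```

Here the query vector lies close to the two "bug" examples, so those become the few-shot context. A production system would typically replace the full sort with an approximate nearest-neighbor index.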

Parameter-efficient fine-tuning (LoRA, prefix tuning): For production few-shot tasks, 10–100 examples often justify light fine-tuning with LoRA (Low-Rank Adaptation). Training low-rank adapters that amount to roughly 0.1% of the model's parameters can significantly outperform pure in-context learning.
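A quick calculation shows where the small parameter fraction comes from. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of shapes d_out × r and r × d_in. The dimensions below (4096 × 4096, rank 8) are assumed values chosen for illustration, not figures from any particular model.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Ratio of LoRA adapter parameters to the frozen matrix they adapt."""
    full = d_in * d_out            # parameters in the frozen weight matrix
    lora = rank * (d_in + d_out)   # parameters in the two low-rank factors
    return lora / full

fraction = lora_trainable_fraction(d_in=4096, d_out=4096, rank=8)
# fraction == 0.00390625, i.e. about 0.39% for this one matrix
```

The fraction for a whole model is usually lower than the per-matrix figure, because adapters are typically attached to only some of the weight matrices.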

One thing to remember: Few-shot learning is the convergence of two ideas — rich pretrained representations (that encode general visual/linguistic structure) and efficient adaptation mechanisms (that use those representations to generalize from minimal examples).
