LoRA Fine-Tuning — Core Concepts

The Core Observation: Fine-Tuning Is Low-Rank

Hu et al. (2021), “LoRA: Low-Rank Adaptation of Large Language Models”, start from a hypothesis: when a pretrained model is fine-tuned, the weight change matrix $\Delta W$ has low intrinsic rank.

Writing $\Delta W = W_{\text{fine-tuned}} - W_{\text{pretrained}}$, the paper's analysis of adapters trained on GPT-3 suggests that a handful of dominant singular directions account for most of the useful update, and that very small ranks already recover most of the quality. Fine-tuning does not need to modify a $d \times k$ weight matrix along all of its directions, only along a few principal ones.
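As a minimal illustration of what "low intrinsic rank" means, the sketch below builds a synthetic update matrix (not real fine-tuning weights) as a rank-4 product plus small noise, then checks how much of its squared Frobenius norm the top singular values capture. The sizes, the noise level, and the helper name `energy_captured` are assumptions chosen for the example.

```python
import numpy as np

def energy_captured(delta_w, r):
    """Fraction of the squared Frobenius norm held by the top-r singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    return float((s[:r] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
d, true_rank = 256, 4

# Synthetic stand-in for W_fine-tuned - W_pretrained: low-rank signal plus small noise.
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_w += 0.01 * rng.normal(size=(d, d))

for r in (1, 2, 4, 8):
    print(f"top-{r} singular values capture {energy_captured(delta_w, r):.4f} of the energy")
```

On a matrix like this, almost all of the energy sits in the first four singular values. Whether a real $\Delta W$ behaves this way is an empirical question that the sketch does not answer.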

This is related to intrinsic dimensionality (Aghajanyan et al., 2020): fine-tuning solutions lie in a surprisingly low-dimensional subspace of parameter space. For example, they report that RoBERTa reaches about 90% of its full fine-tuning performance on MRPC while optimizing only around 200 parameters in a random subspace, despite the model having hundreds of millions of parameters.
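Intrinsic dimension is usually measured by training in a random subspace: the full parameter vector is written as $\theta = \theta_0 + Pz$ with $P$ fixed and only $z$ trained. The following toy sketch shows that reparameterization on a linear regression problem. All sizes are illustrative, and the target is constructed to lie inside the span of $P$, so the subspace is sufficient here by design.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d_int, n = 1000, 20, 200        # full parameter count, subspace size, samples

theta0 = rng.normal(size=D)                    # frozen "pretrained" parameters
P = rng.normal(size=(D, d_int)) / np.sqrt(D)   # fixed random projection, never trained
z = np.zeros(d_int)                            # the only trainable parameters

# Toy task whose solution differs from theta0 only inside the span of P (an assumption).
X = rng.normal(size=(n, D))
y = X @ (theta0 + P @ rng.normal(size=d_int))

def loss(z):
    residual = X @ (theta0 + P @ z) - y
    return float(np.mean(residual ** 2))

XP = X @ P                                           # (n, d_int)
lr = 1.0 / (2.0 * np.linalg.norm(XP, 2) ** 2 / n)    # 1/L for this quadratic loss
initial = loss(z)
for _ in range(500):
    grad = 2.0 / n * XP.T @ (XP @ z + X @ theta0 - y)
    z -= lr * grad
final = loss(z)

print(f"trainable parameters: {z.size} of {D}; loss {initial:.4f} -> {final:.2e}")
```

Only the 20 entries of `z` are updated, yet the loss is driven to nearly zero. In the real measurements the subspace size is increased until a target fraction of full fine-tuning performance is reached.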

LoRA Mathematics

For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA represents the weight update as:

$$W = W_0 + \Delta W = W_0 + BA$$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

Initialization: $A$ is initialized with random Gaussian; $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training — the model starts with its pretrained behavior.

Scaling: The LoRA output is scaled by $\alpha / r$, where $\alpha$ is a hyperparameter: $$h = W_0 x + \frac{\alpha}{r} BA x$$

Scaling by $\alpha / r$ keeps the magnitude of the update roughly comparable as $r$ changes, which reduces the need to retune the learning rate. Setting $\alpha = r$ (a common default) gives a scale factor of 1, i.e. no scaling.
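The initialization and scaling rules can be sketched as a small NumPy layer. This is a forward-pass illustration only, with no training loop. The class name `LoRALinear` and the standard deviation used for $A$ are assumptions made for the example.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 plus a low-rank update B A scaled by alpha / r."""

    def __init__(self, w0, r, alpha, rng):
        d, k = w0.shape
        self.w0 = w0                                  # frozen, shape (d, k)
        self.A = rng.normal(0.0, 0.01, size=(r, k))   # Gaussian init (std is an assumption)
        self.B = np.zeros((d, r))                     # zero init, so B A = 0 at the start
        self.scale = alpha / r

    def __call__(self, x):
        # h = W0 x + (alpha / r) * B (A x); computing A x first avoids forming the d x k product.
        return self.w0 @ x + self.scale * (self.B @ (self.A @ x))

rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 32))
layer = LoRALinear(w0, r=4, alpha=8, rng=rng)
x = rng.normal(size=32)

print(np.allclose(layer(x), w0 @ x))  # True, because B starts at zero
```

Because $B$ starts at zero while $A$ does not, the first gradient step can move $B$ away from zero even though the initial output is unchanged.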

During inference: The LoRA matrices can be merged into $W_0$ as $W' = W_0 + \frac{\alpha}{r} BA$. The merged model has the same shape and runs at the same speed as the original, so the adapter adds no inference latency.
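A quick numerical check of the merge, as a sketch with made-up sizes: the two-branch forward pass and the single merged matrix should give the same output, provided the merge includes the same $\alpha / r$ factor used in the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8

w0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))      # nonzero, as it would be after training
x = rng.normal(size=k)

h_adapter = w0 @ x + (alpha / r) * (B @ (A @ x))   # two-branch forward pass used in training
w_merged = w0 + (alpha / r) * (B @ A)              # fold the scaled update into one matrix
h_merged = w_merged @ x                            # single matmul, same shape as w0

print(np.allclose(h_adapter, h_merged))
print(np.allclose(w_merged - (alpha / r) * (B @ A), w0))  # subtracting the update restores w0
```

The second check shows that merging is reversible up to floating-point error, which is what makes it practical to swap adapters on a shared base model.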

Parameter count: For a $4096 \times 4096$ matrix with $r=16$: $16 \times 4096 \times 2 = 131,072$ trainable parameters, vs. $16,777,216$ for full fine-tuning. A 128x reduction.
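The same arithmetic can be written as a short helper, here called `lora_param_count` (a name chosen for this example), and repeated for other ranks on the same matrix:

```python
def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096
full = d * k
for r in (4, 8, 16, 32, 64):
    lora = lora_param_count(d, k, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} full ({full // lora}x fewer)")
```

The reduction factor halves each time the rank doubles, from 512x at $r=4$ to 32x at $r=64$ for this matrix size.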

Choosing Rank and Target Modules

Rank selection: $r \in \{4, 8, 16, 32, 64\}$ are common choices.

  • $r=4$: Very few parameters, sufficient for simple tasks
  • $r=16$: Good balance for most tasks
  • $r=64+$: More capacity, closer to full fine-tuning but more parameters

The original paper found small ranks such as $r=4$ or $r=8$ sufficient for most of the NLP tasks it evaluated. For harder adaptations, such as a pronounced style or domain shift, $r=16$ to $r=32$ is often a safer choice.

Target modules: LoRA can be applied to any matrix. In transformers, common choices:

  • Attention: Q, K, V, O projection matrices (most common)
  • MLP: up/down/gate projection matrices
  • Embeddings (rarely)

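The original paper adapted only the attention weights and left the MLP layers frozen; under a fixed parameter budget, adapting $W_q$ and $W_v$ together worked best. Later work, notably QLoRA, reports that applying LoRA to all linear layers (attention and MLP) is needed to match full fine-tuning. Adapting only the attention matrices is often sufficient for instruction following.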
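To make the cost of each choice concrete, the sketch below selects target matrices by name suffix and counts the LoRA parameters each selection would add. The module names and shapes describe a hypothetical single transformer block; they are illustrative only and not taken from any specific model.

```python
def select_lora_targets(module_shapes, target_suffixes):
    """Return the modules whose last name component is one of the target suffixes."""
    return {
        name: shape
        for name, shape in module_shapes.items()
        if name.split(".")[-1] in target_suffixes
    }

def trainable_params(selected, r):
    # Each adapted (d x k) matrix adds r * (d + k) LoRA parameters.
    return sum(r * (d + k) for d, k in selected.values())

# Hypothetical block: names and sizes are assumptions for illustration.
hidden, ffn = 4096, 11008
block = {
    "layers.0.self_attn.q_proj": (hidden, hidden),
    "layers.0.self_attn.k_proj": (hidden, hidden),
    "layers.0.self_attn.v_proj": (hidden, hidden),
    "layers.0.self_attn.o_proj": (hidden, hidden),
    "layers.0.mlp.gate_proj": (ffn, hidden),
    "layers.0.mlp.up_proj": (ffn, hidden),
    "layers.0.mlp.down_proj": (hidden, ffn),
}

attention_only = select_lora_targets(block, {"q_proj", "v_proj"})
all_linear = select_lora_targets(block, {name.split(".")[-1] for name in block})

print(len(attention_only), trainable_params(attention_only, r=16))
print(len(all_linear), trainable_params(all_linear, r=16))
```

With these assumed shapes, adapting every linear layer costs a little under five times as many trainable parameters per block as adapting only the query and value projections.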

QLoRA: 4-bit Base Model + LoRA

Dettmers et al. (2023) “QLoRA: Efficient Finetuning of Quantized LLMs” enabled fine-tuning 65B parameter models on a single 48GB GPU:

  1. Quantize the base model to NF4 (4-bit Normal Float, optimal for normally distributed weights)
  2. Apply LoRA adapters in 16-bit precision (BF16) on top
  3. Backpropagate through the frozen 4-bit weights, dequantizing them to BF16 on the fly for each matrix multiplication (the paper's “double dequantization” step)
  4. Only the LoRA adapter parameters are updated
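The steps above can be sketched in NumPy as a forward pass only. The codebook below is a simplified stand-in built from evenly spaced normal quantiles; it is not the exact NF4 table used by QLoRA. Indices are stored one per byte rather than packed two per byte. The function names, the block size of 64, and the matrix sizes are assumptions for the example.

```python
import numpy as np
from statistics import NormalDist

def normal_quantile_codebook(bits=4):
    """2**bits levels at evenly spaced normal quantiles, rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()

def quantize_blockwise(w, codebook, block_size=64):
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)   # one scale per block
    normalized = blocks / absmax                         # values now lie in [-1, 1]
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, absmax

def dequantize_blockwise(idx, absmax, codebook, shape):
    return (codebook[idx] * absmax).reshape(shape)

rng = np.random.default_rng(0)
d, k, r, alpha = 128, 128, 8, 16
W0 = rng.normal(0.0, 0.02, size=(d, k))          # stand-in for a pretrained weight

codebook = normal_quantile_codebook()
idx, absmax = quantize_blockwise(W0, codebook)   # step 1: store 4-bit indices plus scales

A = rng.normal(0.0, 0.01, size=(r, k))           # step 2: adapters kept in higher precision
B = np.zeros((d, r))

x = rng.normal(size=k)
W_deq = dequantize_blockwise(idx, absmax, codebook, W0.shape)  # step 3: dequantize on the fly
h = W_deq @ x + (alpha / r) * (B @ (A @ x))      # step 4 would update only A and B

rel_err = np.linalg.norm(W_deq - W0) / np.linalg.norm(W0)
print(f"relative quantization error: {rel_err:.3f}")
```

The base weights are never updated. In training, gradients would flow through the dequantized matrix to reach `A` and `B`, which this sketch does not implement.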

Key innovations:

  • NF4 quantization: Optimize the 4-bit quantization grid for the normal distribution (which weights approximately follow). Better quality than linear INT4 for the same bit count.
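  • Double quantization: Quantize the quantization constants themselves, which saves about 0.37 bits per parameter (roughly 3 GB on a 65B model)
  • Paged optimizers: Use NVIDIA unified memory to page optimizer states between GPU and CPU memory automatically, which avoids out-of-memory errors during memory spikes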
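The saving from double quantization is simple arithmetic on the storage overhead of the per-block scaling constants. The sketch below assumes the block sizes reported in the QLoRA paper: 64 weights per block, 8-bit first-level constants, and 256 constants per second-level block with a 32-bit second-level constant. The function names are made up for this example.

```python
def overhead_bits_per_param(block_size=64, const_bits=32):
    """Memory spent on quantization constants, in bits per weight."""
    return const_bits / block_size

def double_quant_overhead(block_size=64, const_bits=8,
                          second_block=256, second_const_bits=32):
    """Overhead when the first-level constants are themselves quantized in blocks."""
    return const_bits / block_size + second_const_bits / (block_size * second_block)

plain = overhead_bits_per_param()
double = double_quant_overhead()
print(f"plain: {plain:.3f}  double: {double:.3f}  saved: {plain - double:.3f} bits/param")

params = 65e9
print(f"saved on a 65B model: {(plain - double) * params / 8 / 1e9:.2f} GB")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter.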

Results: Guanaco-65B, a publicly released model fine-tuned with QLoRA in about 24 hours on a single 48GB GPU, reached 99.3% of ChatGPT's performance level on the Vicuna benchmark as judged by GPT-4. This brought fine-tuning of 65B-scale models within reach of academic labs.

Alternatives to LoRA

Prefix tuning (Li & Liang, 2021): Prepend a sequence of learnable “prefix” vectors to the keys and values of each transformer layer. The main model is frozen; only the prefix parameters are trained. It often uses fewer parameters than LoRA, but every token attends to the prefix, so it lengthens the effective sequence and cannot be merged into the weights.

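Adapter layers (Houlsby et al., 2019): Insert small feed-forward modules inside each transformer layer, after the attention and feed-forward sublayers. Each adapter is linear → nonlinearity → linear with a residual connection. Because of the nonlinearity, adapters cannot be merged into the base weights, so they add inference latency. LoRA has largely replaced them in practice for this reason.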
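A minimal sketch of a bottleneck adapter shows why it cannot be folded into a weight matrix. The ReLU nonlinearity, the zero initialization of the up-projection, and the sizes are assumptions for this example; the original adapters use a near-identity initialization.

```python
import numpy as np

def adapter(h, w_down, w_up):
    """Bottleneck adapter: down-project, ReLU, up-project, then add the residual."""
    return h + w_up @ np.maximum(w_down @ h, 0.0)

rng = np.random.default_rng(0)
d, bottleneck = 64, 8
w_down = rng.normal(size=(bottleneck, d))
w_up = np.zeros((d, bottleneck))          # zero init makes the adapter start as the identity

h = rng.normal(size=d)
print(np.allclose(adapter(h, w_down, w_up), h))   # identity at initialization

# After training w_up is nonzero. Any linear map f satisfies f(-h) == -f(h);
# the ReLU breaks this, so no single matrix M gives adapter(h) == M @ h for all h.
w_up = rng.normal(size=(d, bottleneck))
is_odd = np.allclose(adapter(-h, w_down, w_up), -adapter(h, w_down, w_up))
print(is_odd)
```

The second check fails for a trained adapter, so the extra matrix multiplications have to stay in the forward pass at inference time.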

Prompt tuning (Lester et al., 2021): Add trainable vectors only at the input embedding layer. It is the simplest method and has the fewest parameters, but it approaches full fine-tuning quality only at large scale (around T5-XXL, 11B parameters) and is less flexible than LoRA.
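In code, prompt tuning amounts to concatenating a small trainable matrix in front of the frozen token embeddings. The sketch below uses illustrative sizes and a hypothetical helper `embed_with_prompt`; it covers only the embedding step, not training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, prompt_len = 100, 16, 5    # illustrative sizes

embedding = rng.normal(size=(vocab, d_model))                     # frozen token embeddings
soft_prompt = rng.normal(0.0, 0.02, size=(prompt_len, d_model))   # the only trainable tensor

def embed_with_prompt(token_ids):
    """Prepend the trainable prompt vectors to the frozen token embeddings."""
    tokens = embedding[token_ids]                                  # (seq_len, d_model)
    return np.concatenate([soft_prompt, tokens], axis=0)

inputs = embed_with_prompt(np.array([3, 14, 15, 92]))
print(inputs.shape)        # (9, 16): 5 prompt vectors followed by 4 token embeddings
print(soft_prompt.size)    # 80 trainable values
```

The prompt consumes part of the context window, which is one practical cost of the method.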

IA3 (Liu et al., 2022): Learn vectors that rescale keys, values, and intermediate feed-forward activations element-wise. It uses even fewer parameters than LoRA (on the order of 0.01% of the model). It was introduced for few-shot parameter-efficient fine-tuning and suits settings where many tasks share one base model.
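The element-wise rescaling in IA3 can be sketched for a single projection. The sizes and the "trained" values of the vector `l` are stand-ins. The last check illustrates that, for a single task, the vector can be folded into the rows of the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 32, 64
W = rng.normal(size=(d_out, d_in))     # frozen projection (for example, keys or values)
l = np.ones(d_out)                     # learned scaling vector, initialized to ones

x = rng.normal(size=d_in)
print(np.allclose(l * (W @ x), W @ x))   # an all-ones vector leaves the layer unchanged

l = rng.normal(1.0, 0.1, size=d_out)     # stand-in for trained values
scaled = l * (W @ x)                     # element-wise rescaling of the activations
folded = (l[:, None] * W) @ x            # same result with the vector folded into W's rows
print(np.allclose(scaled, folded))
print(l.size, W.size)                    # trainable vs frozen parameter count for this layer
```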

When LoRA Works and When It Doesn’t

LoRA excels at:

  • Instruction following / chat behavior adaptation
  • Domain-specific language/terminology
  • Output format customization
  • Style transfer (more formal, more concise)

LoRA struggles with:

  • Learning genuinely new knowledge not in pretraining (a low-rank adapter has limited capacity to store new facts; it is better at steering how existing knowledge is used)
  • Large behavioral shifts (safety fine-tuning with fundamental value changes)
  • Tasks that would need architectural changes, such as some very long-horizon tasks (LoRA only adjusts existing weight matrices)

Full fine-tuning is still better when:

  • You have enough data (>1M examples)
  • The task requires deep behavioral adaptation
  • Compute isn’t a constraint

Empirically, on many NLP benchmarks with moderate data, full fine-tuning can outperform LoRA by roughly 1–5%. In low-data regimes (<10k examples), LoRA often matches or exceeds full fine-tuning, plausibly because the low-rank constraint acts as a regularizer.

One thing to remember: LoRA works because fine-tuning updates lie largely in a low-dimensional subspace of the weight space. Exploiting this structure cuts trainable parameters by two orders of magnitude or more (128x for the $4096 \times 4096$ example at $r=16$) with only modest quality loss.

