LoRA Fine-Tuning — Core Concepts

How Low-Rank Adaptation works mathematically, rank selection, which layers to adapt, QLoRA for 4-bit fine-tuning, and when LoRA outperforms full fine-tuning.

The Core Observation: Fine-Tuning Is Low-Rank

Hu et al. (2021) “LoRA: Low-Rank Adaptation of Large Language Models” built on an empirical observation: when fine-tuning pretrained models, the weight change matrix $\Delta W$ has low intrinsic rank.

Writing the update as $\Delta W = W_{\text{fine-tuned}} - W_{\text{pretrained}}$, their analysis (carried out on GPT-3 in the paper) indicates that a small number of dominant singular directions account for most of the task-relevant change. Fine-tuning doesn’t need to modify the weight matrix along every direction of a $d \times d$ matrix, only along a few principal ones.
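To make "low intrinsic rank" concrete, here is a minimal sketch that builds a synthetic update matrix and counts how many singular directions carry 99% of its energy. The matrix is made up for illustration (a rank-4 signal plus small noise); it is not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 256, 4

# Hypothetical weight update: a rank-4 signal plus small dense noise.
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_w += 0.01 * rng.normal(size=(d, d))

# Fraction of the squared Frobenius norm captured by the top-i singular values.
singular_values = np.linalg.svd(delta_w, compute_uv=False)
energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)

# Smallest number of directions that captures 99% of the energy.
rank_99 = int(np.searchsorted(energy, 0.99) + 1)
print(f"directions needed for 99% of the energy: {rank_99} of {d}")
```

The same measurement can be applied to a real difference of checkpoints, where the spectrum usually decays more gradually than in this toy.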

This is related to intrinsic dimensionality (Aghajanyan et al., 2020): fine-tuning solutions lie in a surprisingly low-dimensional subspace of parameter space. For RoBERTa on MRPC, training only about 200 parameters, randomly projected back into the full parameter space, reaches roughly 90% of full fine-tuning performance, even though the model has hundreds of millions of weights.
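The subspace idea can be sketched on a toy problem: freeze a full parameter vector, draw a fixed random projection, and train only a small vector z so that the parameters are theta_0 + P z. Everything here (the linear model, the sizes, the names `theta_0`, `P`, `z`) is an assumption chosen for illustration, and the toy is solved in closed form rather than by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, full_dim, subspace_dim = 20, 1000, 50

X = rng.normal(size=(n_examples, full_dim))
y = rng.normal(size=n_examples)

theta_0 = rng.normal(size=full_dim)  # stands in for frozen pretrained weights
P = rng.normal(size=(full_dim, subspace_dim)) / np.sqrt(subspace_dim)  # fixed random projection

# Parameters are theta_0 + P @ z, and only z (50 numbers) is trained.
# For this linear toy, the best z is a least-squares solution.
z, *_ = np.linalg.lstsq(X @ P, y - X @ theta_0, rcond=None)

theta = theta_0 + P @ z
loss_before = float(np.mean((X @ theta_0 - y) ** 2))
loss_after = float(np.mean((X @ theta - y) ** 2))
print(f"trainable parameters: {z.size} of {full_dim}")
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.2e}")
```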

LoRA Mathematics

For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA represents the weight update as:

$$W = W_0 + \Delta W = W_0 + BA$$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

Initialization: $A$ is initialized with random Gaussian; $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training — the model starts with its pretrained behavior.

Scaling: The LoRA output is scaled by $\alpha / r$, where $\alpha$ is a hyperparameter: $$h = W_0 x + \frac{\alpha}{r} BA x$$

Scaling by $\alpha / r$ keeps the effective learning rate somewhat constant when changing $r$. Setting $\alpha = r$ (common default) gives no scaling.
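The definitions above (frozen $W_0$, Gaussian $A$, zero $B$, and the $\alpha / r$ scale) can be sketched as a small NumPy layer. The class name `LoRALinear` and the 0.01 standard deviation for $A$ are assumptions for this example, not values from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, w0, r, alpha, rng):
        d, k = w0.shape
        self.w0 = w0                                   # frozen, shape (d, k)
        self.A = rng.normal(scale=0.01, size=(r, k))   # Gaussian init (std is an assumption)
        self.B = np.zeros((d, r))                      # zero init, so B @ A = 0 at the start
        self.scale = alpha / r

    def __call__(self, x):
        # h = W0 x + (alpha / r) * B (A x); the d x k product B @ A is never formed.
        return self.w0 @ x + self.scale * (self.B @ (self.A @ x))

rng = np.random.default_rng(0)
w0 = rng.normal(size=(6, 4))
layer = LoRALinear(w0, r=2, alpha=4, rng=rng)
x = rng.normal(size=4)

# At initialization the layer reproduces the pretrained output exactly.
print(np.allclose(layer(x), w0 @ x))
```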

During inference: The LoRA matrices can be merged into $W_0$, with the scaling folded in: $W' = W_0 + \frac{\alpha}{r} BA$. The merged model has the same shape and runs at the same speed as the original, so there is no inference overhead.
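A quick numerical check of the merge, assuming the $\alpha / r$ scale from the forward pass is folded into the merged matrix. The sizes and random values are placeholders standing in for trained adapters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 8, 16

w0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))   # stands in for a trained A
B = rng.normal(size=(d, r))   # stands in for a trained (nonzero) B
x = rng.normal(size=(k, 5))   # batch of 5 input vectors

# Adapter kept separate: two extra small matmuls per forward pass.
unmerged = w0 @ x + (alpha / r) * (B @ (A @ x))

# Adapter merged once into a single matrix with the original shape.
w_merged = w0 + (alpha / r) * (B @ A)
merged = w_merged @ x

print(w_merged.shape == w0.shape, np.allclose(unmerged, merged))
```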

Parameter count: For a $4096 \times 4096$ matrix with $r=16$: $16 \times 4096 \times 2 = 131,072$ trainable parameters, vs. $16,777,216$ for full fine-tuning. A 128x reduction.
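The same arithmetic as a small helper, which also covers non-square matrices; the function name is made up for this example.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters for B (d x r) plus A (r x k)."""
    return r * (d + k)

full_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, r=16)
reduction = full_params // lora_params

print(lora_params, full_params, reduction)
```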

Choosing Rank and Target Modules

Rank selection: $r \in \{4, 8, 16, 32, 64\}$ are common choices.

$r=4$: Very few parameters, sufficient for simple tasks
$r=16$: Good balance for most tasks
$r=64+$: More capacity, closer to full fine-tuning but more parameters

In the original paper’s experiments, very small ranks ($r=4$ or $r=8$, and sometimes lower) were sufficient on the NLP tasks tested. For complex tasks requiring style adaptation or domain shift, $r=16$–$r=32$ is often a better starting point.

Target modules: LoRA can be applied to any matrix. In transformers, common choices:

Attention: Q, K, V, O projection matrices (most common)
MLP: up/down/gate projection matrices
Embeddings (rarely)

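The original paper adapted only the attention matrices and kept the MLP layers frozen; under a fixed parameter budget, adapting $W_q$ and $W_v$ together worked best. Later work, notably QLoRA, found that applying LoRA to all linear layers (attention and MLP) is needed to match full fine-tuning, though attention-only adapters are often sufficient for instruction following.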
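To see how the choice of target modules changes the adapter size, here is a rough count for a hypothetical decoder. The dimensions (hidden size 4096, MLP size 11008, 32 layers) and the module names are assumptions that resemble a typical 7B-class model; they are not taken from the text above.

```python
# Hypothetical model dimensions, for illustration only.
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32

# (output dim, input dim) of each linear layer inside one transformer block.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}

def lora_trainable_params(target_modules, r):
    """Total LoRA parameters when every listed module gets a rank-r adapter."""
    per_layer = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS

attention_only = lora_trainable_params(["q_proj", "k_proj", "v_proj", "o_proj"], r=16)
all_linear = lora_trainable_params(list(MODULE_SHAPES), r=16)

print(f"attention only: {attention_only:,}")
print(f"attention + MLP: {all_linear:,}")
```

Under these assumptions, adding the MLP projections a little more than doubles the adapter size, while the total stays well below 1% of a 7B-parameter model.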

QLoRA: 4-bit Base Model + LoRA

Dettmers et al. (2023) “QLoRA: Efficient Finetuning of Quantized LLMs” enabled fine-tuning 65B parameter models on a single 48GB GPU:

Quantize the base model to NF4 (4-bit Normal Float, optimal for normally distributed weights)
Apply LoRA adapters in full precision (BF16) on top
Backpropagate through the frozen 4-bit weights, dequantizing them to BF16 on the fly for each matrix multiply (the paper’s “double dequantization” step)
Only the LoRA adapter parameters are updated
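The quantize and dequantize steps can be sketched with a blockwise 4-bit codebook. This is a simplified stand-in, not the real NF4 table: the 16 levels here come from evenly spaced normal quantiles (the 0.02 to 0.98 range is an arbitrary assumption), and the function names are hypothetical. Actual NF4 uses a specific asymmetric table with an exact zero.

```python
import numpy as np
from statistics import NormalDist

def normal_quantile_codebook(bits=4):
    """16 levels in [-1, 1] placed at evenly spaced quantiles of N(0, 1)."""
    probs = np.linspace(0.02, 0.98, 2**bits)
    levels = np.array([NormalDist().inv_cdf(p) for p in probs])
    return levels / np.max(np.abs(levels))

def quantize_blockwise(w, codebook, block_size=64):
    """Store one absmax scale per block plus a 4-bit code per weight.

    Assumes w.size is divisible by block_size and no block is all zeros.
    """
    blocks = w.reshape(-1, block_size)
    absmax = np.max(np.abs(blocks), axis=1, keepdims=True)
    normalized = blocks / absmax
    codes = np.argmin(np.abs(normalized[..., None] - codebook), axis=-1)
    return codes.astype(np.uint8), absmax

def dequantize_blockwise(codes, absmax, codebook, shape):
    """Look up each code and rescale by its block's absmax."""
    return (codebook[codes] * absmax).reshape(shape)

rng = np.random.default_rng(0)
codebook = normal_quantile_codebook()
w = rng.normal(scale=0.02, size=(128, 64))  # made-up weight matrix

codes, absmax = quantize_blockwise(w, codebook)
w_hat = dequantize_blockwise(codes, absmax, codebook, w.shape)
print("max abs reconstruction error:", float(np.max(np.abs(w - w_hat))))
```

In a QLoRA-style forward pass, the dequantized matrix would play the role of the frozen base weight while the BF16 adapters carry all the gradients.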

Key innovations:

NF4 quantization: Optimize the 4-bit quantization grid for the normal distribution (which weights approximately follow). Better quality than linear INT4 for the same bit count.
Double quantization: Quantize the quantization constants themselves, cutting their overhead from 0.5 to about 0.127 bits per parameter (a saving of roughly 0.37 bits per parameter)
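Paged optimizers: Use NVIDIA unified memory to page optimizer states between GPU and CPU, absorbing memory spikes (for example, from long sequences under gradient checkpointing) instead of running out of memory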
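The overhead of the quantization constants, and the resulting footprint of a 65B base model, can be worked out directly. The block sizes below (64 weights per block, 256 constants per second-level block, 8-bit second-level quantization) are the settings I understand the QLoRA paper to use; treat them as assumptions. The estimate ignores adapters, activations, and optimizer state.

```python
BLOCK_SIZE = 64          # weights sharing one quantization constant
SECOND_LEVEL_BLOCK = 256 # constants sharing one second-level constant

# Plain blockwise quantization: one 32-bit constant per 64 weights.
overhead_plain = 32 / BLOCK_SIZE

# Double quantization: 8-bit constants, plus one 32-bit constant per 256 of them.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

bits_saved = overhead_plain - overhead_double

params = 65e9
base_model_gb = params * (4 + overhead_double) / 8 / 1e9
bf16_gb = params * 16 / 8 / 1e9

print(f"overhead: {overhead_plain:.3f} -> {overhead_double:.3f} bits/param")
print(f"saved: {bits_saved:.3f} bits/param")
print(f"65B base weights: {base_model_gb:.1f} GB in 4-bit vs {bf16_gb:.0f} GB in BF16")
```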

Results: Guanaco-65B, a publicly released model fine-tuned with QLoRA on a single 48GB GPU for 24 hours, reached 99.3% of ChatGPT’s performance on the Vicuna benchmark as judged by GPT-4. QLoRA brought 65B-scale fine-tuning within reach of academic labs.

Alternatives to LoRA

Prefix tuning (Li & Liang, 2021): Prepend a sequence of learnable “prefix” vectors to each transformer layer’s keys and values. The main model is frozen; only the prefix parameters are trained. It typically uses fewer parameters than LoRA, but every token attends to the prefix at every layer, so the prefix adds attention cost at inference and cannot be merged into the weights.
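A minimal single-head sketch of the mechanism: learned prefix rows are concatenated in front of the keys and values, and the queries attend over both. The function names and shapes are assumptions for illustration; causal masking and the reparameterization used during training are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(q, k, v, prefix_k, prefix_v):
    """Single-head attention where trainable prefix rows extend K and V."""
    k_full = np.concatenate([prefix_k, k], axis=0)   # (prefix_len + seq, dim)
    v_full = np.concatenate([prefix_v, v], axis=0)
    scores = q @ k_full.T / np.sqrt(q.shape[-1])     # (seq, prefix_len + seq)
    return softmax(scores) @ v_full

rng = np.random.default_rng(0)
seq_len, prefix_len, dim = 6, 3, 8
q, k, v = (rng.normal(size=(seq_len, dim)) for _ in range(3))
prefix_k = rng.normal(size=(prefix_len, dim))   # trainable
prefix_v = rng.normal(size=(prefix_len, dim))   # trainable

out = attention_with_prefix(q, k, v, prefix_k, prefix_v)
print(out.shape)
```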

Adapter layers (Houlsby et al., 2019): Insert small bottleneck feed-forward modules inside each transformer layer, after the attention and feed-forward sublayers. Each adapter: linear → nonlinearity → linear with a residual connection. Adds inference latency (can’t merge into base model). LoRA has largely replaced this approach where inference latency matters.
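The adapter structure described above fits in a few lines. ReLU and the zero-initialized up-projection are assumptions chosen so the module starts as an identity; the nonlinearity between the two linear maps is also why it cannot be folded into a neighboring weight matrix.

```python
import numpy as np

def adapter(x, w_down, w_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    hidden = np.maximum(w_down @ x, 0.0)   # ReLU (assumed nonlinearity)
    return x + w_up @ hidden

rng = np.random.default_rng(0)
dim, bottleneck = 32, 4
w_down = rng.normal(scale=0.02, size=(bottleneck, dim))   # trainable
w_up = np.zeros((dim, bottleneck))                        # trainable, zero at init

x = rng.normal(size=dim)
y = adapter(x, w_down, w_up)
adapter_params = w_down.size + w_up.size
print(np.allclose(y, x), adapter_params)
```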

Prompt tuning (Lester et al., 2021): Add trainable tokens only to the input embedding. Simplest, fewest parameters. It closes the gap with full fine-tuning only at large scale (around T5-XXL, 11B parameters) and is less flexible than LoRA.
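In code, prompt tuning amounts to concatenating a small trainable matrix in front of the frozen token embeddings. The vocabulary size, embedding width, and prompt length below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, prompt_len = 100, 16, 5

embedding_table = rng.normal(size=(vocab_size, dim))        # frozen
soft_prompt = rng.normal(scale=0.02, size=(prompt_len, dim))  # the only trainable tensor

def embed_with_prompt(token_ids):
    """Prepend the soft prompt to the embedded input tokens."""
    token_embeddings = embedding_table[token_ids]
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

inputs = embed_with_prompt(np.array([3, 14, 15]))
print(inputs.shape, soft_prompt.size)
```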

IA3 (Liu et al., 2022): Learn per-element scaling vectors for keys, values, and feed-forward activations. Even fewer parameters than LoRA (typically 0.01% of model). Designed for efficient multi-task learning where many tasks share a base model.
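A sketch of the rescaling, assuming one learned vector each for keys, values, and the feed-forward hidden activations, all initialized to ones so the model starts unchanged. The names `l_k`, `l_v`, `l_ff` and the sizes are illustrative.

```python
import numpy as np

def ia3_rescale(keys, values, ffn_hidden, l_k, l_v, l_ff):
    """Elementwise rescaling; each vector broadcasts over the sequence axis."""
    return keys * l_k, values * l_v, ffn_hidden * l_ff

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 16, 64
keys = rng.normal(size=(seq_len, d_model))
values = rng.normal(size=(seq_len, d_model))
ffn_hidden = rng.normal(size=(seq_len, d_ff))

# Trainable vectors, initialized to ones (identity at the start of training).
l_k, l_v, l_ff = np.ones(d_model), np.ones(d_model), np.ones(d_ff)

scaled_k, scaled_v, scaled_ff = ia3_rescale(keys, values, ffn_hidden, l_k, l_v, l_ff)
params_per_layer = l_k.size + l_v.size + l_ff.size
print(np.array_equal(scaled_k, keys), params_per_layer)
```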

When LoRA Works and When It Doesn’t

LoRA excels at:

Instruction following / chat behavior adaptation
Domain-specific language/terminology
Output format customization
Style transfer (more formal, more concise)

LoRA struggles with:

Learning genuinely new knowledge not in pretraining (a low-rank adapter has limited capacity to store facts the base model lacks; it mostly redirects how existing knowledge is retrieved)
Large behavioral shifts (safety fine-tuning with fundamental value changes)
Very long-horizon tasks requiring deep architectural changes

Full fine-tuning is still better when:

You have enough data (>1M examples)
The task requires deep behavioral adaptation
Compute isn’t a constraint

Empirically, on many NLP benchmarks with moderate data, full fine-tuning outperforms LoRA by 1–5%. For low-data regimes (<10k examples), LoRA often matches or exceeds full fine-tuning (due to stronger regularization).

One thing to remember: LoRA works because fine-tuning updates lie in a low-dimensional subspace of the weight space — exploiting this structure makes adaptation 100x more parameter-efficient with only modest quality loss.
