LoRA Fine-Tuning — Deep Dive
Intrinsic Dimensionality: The Theoretical Foundation
Aghajanyan et al. (2020) “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning” established the theoretical basis for parameter-efficient fine-tuning.
Intrinsic dimension $d_{90}$: The smallest $d$ such that fine-tuning in a $d$-dimensional random subspace achieves 90% of full fine-tuning performance.
Experimental protocol: project full parameter updates into a random $d$-dimensional subspace: $$\theta = \theta_0 + P \theta_d$$
Where $P \in \mathbb{R}^{N \times d}$ is a random projection matrix (frozen), and $\theta_d \in \mathbb{R}^d$ are the only trainable parameters.
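The protocol above can be sketched on a toy problem. This is a minimal illustration, not the paper's setup: it assumes a dense Gaussian matrix for $P$, a quadratic stand-in loss, and made-up sizes `N` and `d`. Only `theta_d` is ever updated; the gradient reaches it through $P^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 1000, 20                               # toy sizes (assumed): full and subspace dimension
theta_0 = rng.normal(size=N)                  # stand-in for pretrained weights, frozen
P = rng.normal(size=(N, d)) / np.sqrt(N)      # frozen random projection
theta_d = np.zeros(d)                         # the only trainable parameters

# Toy objective: L(theta) = 0.5 * ||theta - target||^2, with a target that
# differs from theta_0 only inside the span of P, so the subspace can reach it.
target = theta_0 + P @ rng.normal(size=d)

def loss(theta_d):
    theta = theta_0 + P @ theta_d
    return 0.5 * np.sum((theta - target) ** 2)

initial_loss = loss(theta_d)
for _ in range(500):
    theta = theta_0 + P @ theta_d
    grad_theta = theta - target               # gradient w.r.t. the full parameters
    grad_theta_d = P.T @ grad_theta           # chain rule through theta = theta_0 + P theta_d
    theta_d -= 0.5 * grad_theta_d             # gradient step on theta_d only
final_loss = loss(theta_d)

print(initial_loss, final_loss)
```

Measuring $d_{90}$ then amounts to repeating such a run for increasing $d$ and recording the smallest value that reaches 90% of the full fine-tuning metric.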
Results:
- RoBERTa on MRPC: about 200 trainable parameters reach 90% of full fine-tuning performance ($d_{90} \approx 200$)
- Even for complex tasks (MultiNLI, STS-B): $d_{90} < 1000$
- GPT-2 on NLP tasks: $d_{90} < 200$
These values are orders of magnitude below the model parameter counts (hundreds of millions to billions), which supports the view that good fine-tuning solutions lie in a low-dimensional subspace of parameter space.
Why this happens: Pretrained models already encode useful representations. Fine-tuning doesn’t need to reinvent general language understanding — it only needs to select and compose existing capabilities appropriately for the target task. This task-specific composition requires far fewer degrees of freedom than the full parameter space.
LoRA+: Fixing the Learning Rate Asymmetry
Hayou et al. (2024) “LoRA+: Efficient Low Rank Adaptation of Large Models” identified a suboptimality in standard LoRA training.
In LoRA, both $A$ and $B$ use the same learning rate $\eta$. But they play different roles:
- $A$ maps the high-dimensional input to the low-dimensional subspace
- $B$ maps from the subspace back to the output dimension
For a layer output $h = W_0 x + BAx$ with upstream gradient $g = \partial L / \partial h$, the chain rule gives $\nabla_B L = g (Ax)^\top$ and $\nabla_A L = B^\top g \, x^\top$. The $B$ matrix therefore receives gradients of magnitude $O(\|A\|)$, while $A$ receives gradients of magnitude $O(\|B\|)$.
Since $B$ is initialized to zero, $\|B\| \approx 0$ early in training, so $\nabla_A$ is near zero and $A$ trains very slowly at first.
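The asymmetry at initialization can be checked numerically. The sketch below assumes a single input vector, toy dimensions, and a random stand-in `g` for the upstream gradient; it only evaluates the two gradient expressions for $h = W_0 x + BAx$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2              # toy dimensions (assumed)

A = rng.normal(size=(r, d_in))        # random init, as in standard LoRA
B = np.zeros((d_out, r))              # zero init, so BA = 0 at step 0
x = rng.normal(size=d_in)             # one input vector
g = rng.normal(size=d_out)            # stand-in for dL/dh from upstream

grad_B = np.outer(g, A @ x)           # dL/dB = g (A x)^T, scales with A
grad_A = np.outer(B.T @ g, x)         # dL/dA = B^T g x^T, scales with B

print(np.linalg.norm(grad_B))         # nonzero: B starts learning immediately
print(np.linalg.norm(grad_A))         # zero while B is still zero
```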
LoRA+ fix: Use different learning rates for $A$ and $B$: $$\eta_A = \eta, \quad \eta_B = \lambda \eta \text{ where } \lambda > 1$$
Empirically, $\lambda \in [2, 16]$ works best (task and model dependent). LoRA+ consistently improves performance by 1–2% over standard LoRA with no added compute cost.
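One way to apply two learning rates is to split parameters into optimizer groups by name. The helper below is a hypothetical sketch, not part of any library: `loraplus_param_groups` is an invented name, the default `ratio=16.0` is just one value from the range above, and it assumes adapter parameters carry `lora_A` or `lora_B` in their names, as PEFT-style naming does.

```python
def loraplus_param_groups(named_params, lr, ratio=16.0):
    """Split LoRA parameters into optimizer groups with different learning rates.

    named_params: iterable of (name, parameter) pairs
    lr:           learning rate for the A matrices (eta_A)
    ratio:        lambda = eta_B / eta_A, expected to be > 1
    """
    a_params, b_params, other = [], [], []
    for name, param in named_params:
        if "lora_B" in name:
            b_params.append(param)
        elif "lora_A" in name:
            a_params.append(param)
        else:
            other.append(param)   # any other trainable parameter keeps the base rate

    groups = [
        {"params": a_params, "lr": lr},
        {"params": b_params, "lr": lr * ratio},
    ]
    if other:
        groups.append({"params": other, "lr": lr})
    return groups
```

With PyTorch, the returned list can be passed directly to an optimizer that accepts per-parameter groups, for example `torch.optim.AdamW(loraplus_param_groups(((n, p) for n, p in model.named_parameters() if p.requires_grad), lr=1e-4))`.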
DoRA: Weight Decomposition Low-Rank Adaptation
Liu et al. (2024) “DoRA: Weight-Decomposed Low-Rank Adaptation” decomposes the pretrained weight matrix into magnitude and direction components:
$$W_0 = m \cdot \frac{V}{\|V\|_c}$$
Where $m = \|W_0\|_c$ is the column-wise magnitude vector and $V / \|V\|_c$, with $V = W_0$, is the column-wise normalized direction matrix. During fine-tuning:
$$W' = m' \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c}$$
Where $\Delta V$ is the LoRA update ($BA$) applied to the direction component, and $m'$ is a learned magnitude vector with one entry per column of $W_0$, so it adds very few parameters.
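The decomposition can be illustrated in a few lines of NumPy. This is a sketch under assumptions: toy dimensions, weights stored as `(d_out, d_in)` with norms taken over each column, and a forward-only computation with no training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2                      # toy dimensions (assumed)

W0 = rng.normal(size=(d_out, d_in))           # frozen pretrained weight

def column_norms(M):
    # One norm per column, kept as a (1, d_in) row so it broadcasts over rows.
    return np.linalg.norm(M, axis=0, keepdims=True)

m = column_norms(W0)                          # magnitude vector, initialized from W0
B = np.zeros((d_out, r))                      # LoRA factors for the direction update
A = rng.normal(size=(r, d_in))

def dora_weight(m, B, A):
    V = W0 + B @ A                            # direction before normalization
    return m * (V / column_norms(V))          # rescale each unit column by its magnitude

W_init = dora_weight(m, B, A)                 # B = 0 and m unchanged: recovers W0

B_new = 0.1 * rng.normal(size=(d_out, r))
W_dir = dora_weight(m, B_new, A)              # direction changes, column norms stay at m
W_mag = dora_weight(2.0 * m, B, A)            # magnitude doubles, direction unchanged
```

The last two lines show the decoupling: changing $BA$ alone leaves every column norm equal to $m$, and changing $m$ alone rescales columns without rotating them.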
Why this helps: Full fine-tuning updates both magnitude and direction freely. Standard LoRA changes magnitude and direction together, since a single update $BA$ affects both. DoRA decouples them: direction is updated via LoRA, while magnitude is trained directly as a small set of extra parameters.
Analysis shows DoRA learning patterns more closely resemble full fine-tuning than standard LoRA, particularly in terms of how magnitude and direction evolve during training. DoRA consistently outperforms LoRA by 1–5% on instruction tuning benchmarks (Commonsense Reasoning, Math reasoning) with the same rank $r$.
Multi-Adapter Serving
Production systems serving many users with different fine-tuned LoRA adapters face a serving challenge: different adapters are needed per request.
Naive approach: Load one adapter at a time → high latency per switch.
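S-LoRA (Sheng et al., 2023): Keeps all adapters in host memory and dynamically fetches the ones needed by the currently running requests into GPU memory, managing adapter weights and the KV cache in one unified memory pool. Key insight: LoRA adapter weights are small. For example, a rank-16 adapter on the query and value projections of Llama-7B has about 8.4M parameters, roughly 17 MB in fp16, so dozens can fit in GPU memory simultaneously.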
Punica (Chen et al., 2023): Custom CUDA kernels for batched LoRA computation. For a batch whose requests use adapters $[1, 2, 1, 3]$, the kernel computes the LoRA contribution $B_i A_i x_j$ for every request $j$, where $i$ is the adapter that request uses, in a single launch using segmented gather matrix-vector multiplication (SGMV).
This enables serving thousands of different LoRA adapters on the same base model with minimal overhead — critical for LoRA-as-a-service platforms.
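What a batched multi-adapter kernel computes can be written as a NumPy reference. This is only a numerical sketch with toy sizes, not Punica's CUDA implementation: it gathers each request's adapter by index and applies all of them in one vectorized call, then compares against a per-request loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n_adapters, d_in, d_out, r = 3, 8, 6, 2       # toy sizes (assumed)

# Stacked adapter weights: A_stack[i] and B_stack[i] belong to adapter i.
A_stack = rng.normal(size=(n_adapters, r, d_in))
B_stack = rng.normal(size=(n_adapters, d_out, r))

# One adapter index per request: adapters 1, 2, 1, 3 written 0-based.
adapter_ids = np.array([0, 1, 0, 2])
X = rng.normal(size=(len(adapter_ids), d_in)) # one input row per request

def batched_lora_delta(X, adapter_ids):
    A = A_stack[adapter_ids]                  # gather: (batch, r, d_in)
    B = B_stack[adapter_ids]                  # gather: (batch, d_out, r)
    low = np.einsum("brd,bd->br", A, X)       # A_i x_j for every request
    return np.einsum("bor,br->bo", B, low)    # B_i (A_i x_j)

delta = batched_lora_delta(X, adapter_ids)

# Reference: the naive loop that handles one request at a time.
reference = np.stack([B_stack[i] @ (A_stack[i] @ x)
                      for i, x in zip(adapter_ids, X)])
print(np.allclose(delta, reference))
```

The base-model term $W_0 x$ is shared by all requests and is computed once for the whole batch; only the small low-rank term differs per request.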
Adapter merging: When multiple LoRA adapters target the same task, they can be merged via a weighted combination: $$\Delta W_{\text{merged}} = \sum_i \lambda_i B_i A_i$$
Task arithmetic (Ilharco et al., 2023) showed that task vectors (fine-tuned weights minus pretrained weights) can be combined via addition or negation to create composite models. LoRA gives fine-grained control over this: add adapters from different tasks with different weights, or subtract an adapter to “unlearn” task-specific behavior.
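The weighted sum is straightforward to express directly. The sketch below uses toy shapes and made-up weights, including a negative one to subtract an adapter. It also shows that the merged update equals a single adapter built by concatenating the factors, so merging $k$ rank-$r$ adapters yields an update of rank at most $kr$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2                      # toy dimensions (assumed)

# Two adapters, each stored as a (B, A) pair.
adapters = [(rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
            for _ in range(2)]
weights = [0.7, -0.3]                         # a negative weight subtracts that adapter

def merge_adapters(adapters, weights):
    # Delta W_merged = sum_i lambda_i * B_i A_i
    return sum(w * (B @ A) for w, (B, A) in zip(weights, adapters))

delta_merged = merge_adapters(adapters, weights)

# Equivalent factored form: fold each weight into B_i, then concatenate.
B_cat = np.concatenate([w * B for w, (B, _) in zip(weights, adapters)], axis=1)
A_cat = np.concatenate([A for _, A in adapters], axis=0)
print(np.allclose(delta_merged, B_cat @ A_cat))
```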
Practical Recipes for LLM Instruction Tuning with LoRA
Setup (for Llama-3.1-8B instruction tuning):
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                # Rank of the update matrices
    lora_alpha=32,       # Scaling factor is lora_alpha / r = 2
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",         # Leave bias terms frozen
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
```
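The trainable-parameter count can be cross-checked by hand, since each adapted linear layer adds $r \cdot (d_{in} + d_{out})$ parameters. The sketch below assumes the published Llama-3.1-8B shapes (hidden size 4096, MLP size 14336, 32 layers, 32 attention heads, 8 key/value heads); these values are not taken from the setup above, so verify them against `model.config`. The function name `lora_param_count` is illustrative.

```python
# Assumed Llama-3.1-8B shapes; check model.config before relying on them.
hidden, intermediate, n_layers = 4096, 14336, 32
n_heads, n_kv_heads = 32, 8
head_dim = hidden // n_heads
kv_dim = n_kv_heads * head_dim        # grouped-query attention shrinks k_proj and v_proj

# (in_features, out_features) of each targeted linear layer
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_param_count(r):
    # A is (r, d_in) and B is (d_out, r), so each module adds r * (d_in + d_out).
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * n_layers

print(f"{lora_param_count(16):,}")
```

The count grows linearly in the rank, so doubling `r` doubles the adapter size.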
Training recommendations:
- Learning rate: 1e-4 to 3e-4 (higher than full FT due to fewer parameters)
- Warmup: 3-5% of total steps
- Weight decay: 0.01 or 0 (LoRA adapters are small, less prone to overfit)
- Batch size: as large as memory allows; use gradient accumulation to raise the effective batch size for more stable updates
- Data formatting: chat template matching the base model’s expected format
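The warmup and batch-size recommendations translate into concrete step counts once the dataset size is known. The helper below is a hypothetical sketch: the function names, the example dataset size, and the cosine decay after warmup are assumptions, not part of the recipe above.

```python
import math

def plan_schedule(num_examples, per_device_batch, grad_accum,
                  num_devices=1, epochs=1, warmup_frac=0.03):
    """Derive optimizer step counts from dataset size and batch settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_frac)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

def lr_at(step, plan, peak_lr=2e-4):
    """Linear warmup to peak_lr, then cosine decay to zero (decay shape is an assumption)."""
    warmup, total = plan["warmup_steps"], plan["total_steps"]
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example values (assumed): 50k examples, batch 4 with 8 accumulation steps, 2 epochs.
plan = plan_schedule(num_examples=50_000, per_device_batch=4, grad_accum=8, epochs=2)
print(plan)
print(lr_at(0, plan), lr_at(plan["warmup_steps"], plan))
```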
Evaluation: Compare the LoRA fine-tuned model against the base model on a held-out evaluation set. If using instruction data, also run MMLU or a similar general benchmark to check for catastrophic forgetting of pretrained capabilities.
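Merging: After validation, fold the adapter into the base weights and save the result:
```python
merged_model = model.merge_and_unload()  # Merges LoRA into the base weights and removes the adapter wrappers
merged_model.save_pretrained("./final_model")
```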
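Conceptually, merging replaces the two-path computation with a single weight matrix. The sketch below shows this for one linear layer, assuming the standard LoRA scaling of `alpha / r` and toy dimensions; it illustrates the arithmetic and is not a description of PEFT internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 4, 8            # toy values (assumed)

W0 = rng.normal(size=(d_out, d_in))           # frozen base weight
A = rng.normal(size=(r, d_in))                # trained LoRA factors
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)
scaling = alpha / r

# Unmerged: base path plus the low-rank adapter path.
y_adapter = W0 @ x + scaling * (B @ (A @ x))

# Merged: fold the update into the base weight once, then use one matmul.
W_merged = W0 + scaling * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))
```

After merging, inference costs the same as the base model because the extra low-rank path is gone.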
One thing to remember: LoRA’s elegance is that it operates on the right abstraction — the low-rank nature of fine-tuning updates — rather than being a heuristic approximation, which explains its consistently strong performance across models, tasks, and scales.