LoRA Fine-Tuning — Deep Dive

Intrinsic dimensionality theory, LoRA+ asymmetric learning rates, DoRA weight decomposition, merged vs. multi-adapter serving, and LoRA for instruction tuning at scale.

Intrinsic Dimensionality: The Theoretical Foundation

Aghajanyan et al. (2020) “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning” established the theoretical basis for parameter-efficient fine-tuning.

Intrinsic dimension $d_{90}$: The smallest $d$ such that fine-tuning in a $d$-dimensional random subspace achieves 90% of full fine-tuning performance.

Experimental protocol: project full parameter updates into a random $d$-dimensional subspace: $$\theta = \theta_0 + P \theta_d$$

Where $P \in \mathbb{R}^{N \times d}$ is a random projection matrix (frozen), and $\theta_d \in \mathbb{R}^d$ are the only trainable parameters.
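The protocol above can be sketched on a toy problem. This is a minimal illustration, not the paper's setup: the least-squares objective, the sizes `N` and `d`, the dense Gaussian `P`, and the step size are all assumptions chosen so the example runs quickly. The point is only that gradients reach $\theta_d$ through $P^\top$ while $\theta_0$ and $P$ stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 200, 10                             # full parameter count, subspace dimension (toy values)
theta_0 = rng.normal(size=N)               # stand-in for pretrained weights (frozen)
P = rng.normal(size=(N, d)) / np.sqrt(N)   # frozen random projection
theta_d = np.zeros(d)                      # the only trainable parameters

# Toy objective: L(theta) = mean((X @ theta - y)^2)
X = rng.normal(size=(64, N))
y = rng.normal(size=64)

def loss(theta):
    return float(np.mean((X @ theta - y) ** 2))

def grad_theta(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

initial_loss = loss(theta_0 + P @ theta_d)
for _ in range(200):
    theta = theta_0 + P @ theta_d
    # Chain rule: dL/dtheta_d = P^T dL/dtheta
    theta_d -= 0.05 * (P.T @ grad_theta(theta))
final_loss = loss(theta_0 + P @ theta_d)
```

Sweeping `d` and recording the smallest value that reaches 90% of the full-parameter result would give an estimate of $d_{90}$ for this toy objective.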

Results:

RoBERTa fine-tuning on MRPC: $d_{90} \approx 200$
Even for complex tasks (MultiNLI, STS-B): $d_{90} < 1000$
GPT-2 on NLP tasks: $d_{90} < 200$

These values are dramatically lower than model parameter counts (millions to billions), which suggests that good fine-tuning solutions lie in low-dimensional subspaces of the full parameter space.

Why this happens: Pretrained models already encode useful representations. Fine-tuning doesn’t need to reinvent general language understanding — it only needs to select and compose existing capabilities appropriately for the target task. This task-specific composition requires far fewer degrees of freedom than the full parameter space.

LoRA+: Fixing the Learning Rate Asymmetry

Hayou et al. (2024) “LoRA+: Efficient Low Rank Adaptation of Large Models” identified a suboptimality in standard LoRA training.

In LoRA, both $A$ and $B$ use the same learning rate $\eta$. But they play different roles:

$A$ maps the high-dimensional input to the low-dimensional subspace
$B$ maps from the subspace back to the output dimension

Gradient analysis: write the adapted layer as $h = W_0 x + BAx$ and let $g = \partial L / \partial h$. Then $\nabla_B L = g\,(Ax)^\top$ and $\nabla_A L = B^\top g\, x^\top$, so the gradient for $B$ scales with $\|A\|$ while the gradient for $A$ scales with $\|B\|$.

Since $B$ is initialized to zero, $\|B\| \approx 0$ early in training, so $\nabla_A$ is near zero and $A$ trains very slowly at first.
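The asymmetry at initialization can be checked numerically. The sketch below assumes a single adapted layer $h = W_0 x + BAx$, a random vector `g` standing in for the upstream gradient $\partial L / \partial h$, and toy dimensions; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 4

A = rng.normal(size=(r, d_in)) * 0.01   # A: small random init
B = np.zeros((d_out, r))                # B: zero init, so BA = 0 at step 0
x = rng.normal(size=d_in)               # one input vector
g = rng.normal(size=d_out)              # stand-in for dL/dh from upstream

grad_B = np.outer(g, A @ x)             # dL/dB = g (Ax)^T
grad_A = np.outer(B.T @ g, x)           # dL/dA = B^T g x^T

norm_grad_A = np.linalg.norm(grad_A)    # exactly zero while B is zero
norm_grad_B = np.linalg.norm(grad_B)    # nonzero from the first step
```

At step 0 only $B$ receives a gradient signal; $A$ starts to move once $B$ has left zero.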

LoRA+ fix: Use different learning rates for $A$ and $B$: $$\eta_A = \eta, \quad \eta_B = \lambda \eta \text{ where } \lambda > 1$$

Empirically, $\lambda \in [2, 16]$ works best (task and model dependent). LoRA+ consistently improves performance by 1–2% over standard LoRA with no added compute cost.
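One way to apply two learning rates is through optimizer parameter groups. The helper below is a hypothetical sketch: the function name, the `"lora_B"` substring match, and the example ratio of 8 are assumptions for illustration, and the demo uses strings in place of tensors so it runs without a deep learning framework.

```python
def lora_plus_param_groups(named_params, lr, ratio):
    """Split (name, param) pairs into two optimizer groups.

    Parameters whose name contains "lora_B" get lr * ratio; all others get lr.
    The "lora_B" naming is an assumption about how the adapter library
    names its weights.
    """
    group_a, group_b = [], []
    for name, param in named_params:
        (group_b if "lora_B" in name else group_a).append(param)
    return [
        {"params": group_a, "lr": lr},
        {"params": group_b, "lr": lr * ratio},
    ]

# Hypothetical parameter names, with strings standing in for tensors.
demo = [
    ("layers.0.q_proj.lora_A.weight", "A0"),
    ("layers.0.q_proj.lora_B.weight", "B0"),
    ("layers.1.q_proj.lora_A.weight", "A1"),
    ("layers.1.q_proj.lora_B.weight", "B1"),
]
groups = lora_plus_param_groups(demo, lr=1e-4, ratio=8.0)
```

With PyTorch, the same list-of-dicts structure can be passed to an optimizer such as `torch.optim.AdamW`, feeding in only the trainable entries of `model.named_parameters()`.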

DoRA: Weight Decomposition Low-Rank Adaptation

Liu et al. (2024) “DoRA: Weight-Decomposed Low-Rank Adaptation” decomposes the pretrained weight matrix into magnitude and direction components:

$$W_0 = m \cdot \frac{V}{\|V\|_c}$$

Where $m = \|W_0\|_c$ is the column-wise magnitude vector and $V / \|V\|_c$ is the column-wise normalized direction matrix, with $V$ initialized to $W_0$. During fine-tuning:

$$W' = m' \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c}$$

Where $\Delta V$ is the LoRA update ($BA$) applied to the direction component, and $m'$ is a learned magnitude vector (small, with one entry per column of $W_0$).

Why this helps: Full fine-tuning updates both magnitude and direction freely. Standard LoRA couples the two, because the same low-rank update $BA$ changes the magnitude and the direction of each column together. DoRA decouples them: direction is updated via LoRA, and magnitude is updated freely through a small number of extra parameters.

Analysis shows DoRA learning patterns more closely resemble full fine-tuning than standard LoRA, particularly in terms of how magnitude and direction evolve during training. DoRA consistently outperforms LoRA by 1–5% on instruction tuning benchmarks (Commonsense Reasoning, Math reasoning) with the same rank $r$.
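The decomposition can be written in a few lines. This is a minimal sketch of the reparameterized weight only, with toy shapes and a made-up update to $B$ and $m$; it assumes the column-wise norm is the L2 norm of each column and omits training details.

```python
import numpy as np

def column_norms(M):
    # L2 norm of each column, shape (1, in_features)
    return np.linalg.norm(M, axis=0, keepdims=True)

def dora_weight(W0, B, A, m):
    V = W0 + B @ A                      # direction before normalization
    return m * V / column_norms(V)      # rescale each column to magnitude m

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))                # zero init, so V = W0 at step 0
m = column_norms(W0)                    # magnitude initialized from W0

W_init = dora_weight(W0, B, A, m)       # reproduces W0 exactly at init

# Hypothetical state after some training steps
B_new = rng.normal(size=(d_out, r)) * 0.1
m_new = m * 1.5
W_new = dora_weight(W0, B_new, A, m_new)
```

Whatever the low-rank update does to the direction, the column magnitudes of `W_new` are set only by `m_new`, which is the decoupling described above.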

Multi-Adapter Serving

Production systems serving many users with different fine-tuned LoRA adapters face a serving challenge: different adapters are needed per request.

Naive approach: Load one adapter at a time → high latency per switch.

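S-LoRA (Sheng et al., 2023): Keep all adapters in host (CPU) memory, fetch only the adapters needed by the currently running requests into GPU memory, and manage adapter weights and KV cache tensors in one unified memory pool (Unified Paging). Key insight: LoRA adapter weights are small relative to the base model. For example, a rank-16 adapter on the query and value projections of Llama-7B has about 8.4M parameters, roughly 17 MB in fp16, so dozens can fit in GPU memory simultaneously.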
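Adapter footprint depends on the rank, on which modules are adapted, and on the storage dtype. The back-of-envelope estimate below assumes a 7B Llama-style model with hidden size 4096 and 32 layers, adapters on `q_proj` and `v_proj` only, and 2 bytes per parameter; change any of these and the result changes proportionally.

```python
hidden = 4096          # assumed hidden size for a 7B Llama-style model
layers = 32            # assumed number of transformer layers
rank = 16
adapted_per_layer = 2  # assumption: only q_proj and v_proj carry adapters
bytes_per_param = 2    # fp16 / bf16 storage

# Each adapted square projection adds A (rank x hidden) and B (hidden x rank).
params_per_module = rank * (hidden + hidden)
total_params = params_per_module * adapted_per_layer * layers
size_mib = total_params * bytes_per_param / 2**20
```

Under these assumptions the adapter holds 8,388,608 parameters and occupies 16 MiB, which is still tiny next to the base model's weights.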

Punica (Chen et al., 2023): Custom CUDA kernels for batched LoRA computation. For a batch of inputs $x_1, \dots, x_4$ whose requests use adapters $[1, 2, 1, 3]$, compute the LoRA contribution $B_i A_i x_j$ (where $i$ is the adapter assigned to request $j$) for all requests simultaneously using Segmented Gather Matrix-Vector multiplication (SGMV).

This enables serving thousands of different LoRA adapters on the same base model with minimal overhead — critical for LoRA-as-a-service platforms.
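The semantics of a mixed-adapter batch can be shown with a plain reference implementation. This is not the Punica kernel: it is a NumPy sketch that groups requests by adapter and applies each adapter to its segment, with hypothetical names (`A_bank`, `B_bank`, `adapter_ids`) and toy shapes.

```python
import numpy as np

def batched_lora_delta(X, adapter_ids, A_bank, B_bank):
    """Compute B_i A_i x_j for every row x_j of X, where i = adapter_ids[j]."""
    out = np.zeros((X.shape[0], B_bank.shape[1]))
    for i in np.unique(adapter_ids):
        rows = np.nonzero(adapter_ids == i)[0]      # requests in this segment
        # (n_rows, d_in) @ (d_in, r) @ (r, d_out)
        out[rows] = X[rows] @ A_bank[i].T @ B_bank[i].T
    return out

rng = np.random.default_rng(0)
n_adapters, d_in, d_out, r = 3, 16, 12, 4
A_bank = rng.normal(size=(n_adapters, r, d_in))
B_bank = rng.normal(size=(n_adapters, d_out, r))
X = rng.normal(size=(4, d_in))
adapter_ids = np.array([0, 1, 0, 2])    # the pattern [1, 2, 1, 3], zero-based

delta = batched_lora_delta(X, adapter_ids, A_bank, B_bank)
```

The base-model matmul is shared by the whole batch; only this small per-request term differs, which is why a fused kernel for it keeps overhead low.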

Adapter merging: When multiple LoRA adapters target the same task, they can be merged via weighted averaging: $$\Delta W_{merged} = \sum_i \lambda_i B_i A_i$$

Task arithmetic (Ilharco et al., 2023) showed that task vectors (fine-tuned weights minus pretrained weights) can be combined via addition or negation to create composite models. LoRA makes this fine-grained: add adapters from different tasks with different weights, or subtract an adapter to “unlearn” task-specific behavior.
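The weighted sum of adapters can be computed directly, or expressed as one adapter whose rank is the sum of the individual ranks. The sketch below uses random matrices and illustrative weights (including a negative one for subtraction); the ranks, shapes, and weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 10, 12
ranks = [2, 4]
weights = [0.7, -0.3]    # illustrative lambdas; a negative weight subtracts an adapter

As = [rng.normal(size=(r, d_in)) for r in ranks]
Bs = [rng.normal(size=(d_out, r)) for r in ranks]

# Direct form: delta_W = sum_i lambda_i * B_i @ A_i
delta_merged = sum(lam * B @ A for lam, B, A in zip(weights, Bs, As))

# Equivalent single adapter of rank sum(ranks), built by concatenation
B_cat = np.concatenate([lam * B for lam, B in zip(weights, Bs)], axis=1)
A_cat = np.concatenate(As, axis=0)
```

The concatenated form keeps the merged update in factored shape, so it can still be served as an adapter instead of being folded into the base weights.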

Practical Recipes for LLM Instruction Tuning with LoRA

Setup (for Llama-3.1-8B instruction tuning):

from peft import get_peft_model, LoraConfig
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Alpha; effective scaling is lora_alpha / r = 2
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on these seven modules, expect about 41.9M trainable parameters
# out of roughly 8.07B total (about 0.52%)
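The trainable-parameter count can be checked by hand, since each adapted linear layer adds `r * (in_features + out_features)` parameters. The dimensions below are assumptions about the Llama-3.1-8B architecture (hidden size 4096, MLP size 14336, a 1024-wide key/value projection from grouped-query attention, 32 layers), and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(shapes, rank, n_layers):
    """shapes: (in_features, out_features) for each adapted module in one layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

# Assumed Llama-3.1-8B dimensions
hidden, mlp, kv, n_layers = 4096, 14336, 1024, 32
shapes = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, mlp),     # gate_proj
    (hidden, mlp),     # up_proj
    (mlp, hidden),     # down_proj
]
trainable_r16 = lora_param_count(shapes, rank=16, n_layers=n_layers)
trainable_r32 = lora_param_count(shapes, rank=32, n_layers=n_layers)
```

Under these assumptions rank 16 gives 41,943,040 trainable parameters and rank 32 gives 83,886,080, so the count scales linearly with the rank.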

Training recommendations:

Learning rate: 1e-4 to 3e-4 (higher than full FT due to fewer parameters)
Warmup: 3-5% of total steps
Weight decay: 0.01 or 0 (LoRA adapters are small, less prone to overfit)
Batch size: maximize for stability
Data formatting: chat template matching the base model’s expected format
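The warmup recommendation translates into a simple schedule. The function below is a hypothetical sketch: it assumes linear warmup to a peak learning rate followed by a constant rate, and omits any decay phase; the peak of 2e-4 and the 3% ratio are picked from the ranges above.

```python
def warmup_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then constant (decay omitted for brevity)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# 1000 total steps with 3% warmup gives 30 warmup steps
schedule = [warmup_lr(s, total_steps=1000) for s in range(1000)]
```

In practice a decay phase (linear or cosine) usually follows the warmup; training libraries typically expose this through a warmup ratio or warmup steps setting.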

Evaluation: Compare the LoRA fine-tuned model against the base model on a held-out evaluation set. If using instruction data, run MMLU or a similar general benchmark to check for catastrophic forgetting of pretrained capabilities.

Merging: After validation:

merged_model = model.merge_and_unload()  # Merges LoRA into base weights
merged_model.save_pretrained("./final_model")
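Merging is lossless for a linear layer because the adapter path is itself linear. The NumPy sketch below illustrates the identity on one toy layer, assuming the usual `alpha / r` scaling; it illustrates the math, not the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8
scaling = alpha / r                            # same role as lora_alpha / r

W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)

y_adapter = W0 @ x + scaling * (B @ (A @ x))   # base path plus adapter path
W_merged = W0 + scaling * (B @ A)              # fold the adapter into the weight
y_merged = W_merged @ x
```

After merging, inference costs the same as the base model, at the price of no longer being able to swap the adapter out at request time.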

One thing to remember: LoRA works because it targets the right abstraction, namely the low-rank structure of fine-tuning updates, which helps explain its strong performance across models, tasks, and scales.
