Fine-Tuning — Deep Dive

LoRA internals, RLHF vs DPO tradeoffs, catastrophic forgetting solutions, and what the fine-tuning landscape actually looks like in production in 2026.

What “Updating Weights” Actually Means

During pre-training, a language model learns to predict the next token by adjusting billions of floating-point numbers (weights) via gradient descent. After training, those weights are frozen and shipped. Fine-tuning resumes the gradient updates on a specific dataset.

But fine-tuning isn’t free to change weights in any direction. It’s constrained by the optimizer state, the learning rate schedule, and crucially by the loss signal you define. Get the loss signal wrong and you’ll get a model that optimizes exactly for the wrong thing.

The standard fine-tuning setup for instruction-following:

Dataset format: Convert examples to a consistent template. For most modern instruction-tuned models, this means prompt/response pairs formatted with special tokens (<|user|>, <|assistant|>, etc.). Mistral format, ChatML, and Alpaca format are the common options. Mixing formats in one training run is a common source of degradation.
Loss masking: You only want to compute loss on the model’s output tokens, not on the prompt. If you compute loss on both, the model wastes capacity learning to reproduce the input, which doesn’t help. In practice: set labels = -100 for all input token positions in PyTorch (the standard ignored index for cross-entropy).
Learning rate: Fine-tuning requires much lower learning rates than pre-training — typically 1e-5 to 5e-5 vs 3e-4 for pre-training. Too high and you overwrite important pre-trained representations (catastrophic forgetting). Too low and you don’t update meaningfully.
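The loss-masking step is easiest to see with the index arithmetic written out. This is a minimal sketch in plain Python; the function name build_labels and the toy token ids are illustrative assumptions, not part of any library.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) for one prompt/response example.

    Prompt positions get IGNORE_INDEX so they contribute nothing to the loss;
    response positions keep their token ids as training targets.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Toy token ids (hypothetical, not produced by a real tokenizer).
input_ids, labels = build_labels([101, 7, 8, 9], [42, 43, 2])
print(input_ids)
print(labels)
```

In a real pipeline the two lists would be padded, batched, and converted to tensors; padding positions are usually masked with the same value.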

LoRA Internals

LoRA’s core claim: the weight updates needed for fine-tuning are low-rank. That is, the change ΔW can be decomposed as the product of two small matrices:

ΔW = B × A

Where W has shape (d_out × d_in), A has shape (r × d_in), and B has shape (d_out × r), with r << min(d_in, d_out). This is the low-rank decomposition.

In practice: instead of updating W directly, you freeze W and train A and B. During inference, you compute W_effective = W + B × A × α/r, where α is a scaling hyperparameter.
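The merge formula can be checked by hand on a tiny example. The sketch below uses plain Python lists instead of a tensor library; the helper names (matmul, merge_lora) and all matrix values are made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank."""
    r = len(A)            # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(B, A)  # shape (d_out, d_in), same as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: d_out = 2, d_in = 3, r = 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # (r, d_in)
B = [[0.5], [-0.5]]     # (d_out, r)

W_eff = merge_lora(W, A, B, alpha=2.0)
print(W_eff)
```

Because the merged matrix has the same shape as W, a merged adapter adds no inference latency; keeping A and B separate instead lets several adapters share one frozen base model.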

The rank r controls the expressivity/efficiency tradeoff. Common values range from 4 to 64:

r=4: Very parameter-efficient. Good for simple stylistic changes. Might not be expressive enough for complex task adaptation.
r=16: Common default. Covers most fine-tuning tasks.
r=64: Approaches full fine-tuning expressivity at significantly lower cost.
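To make the tradeoff concrete, here is the trainable-parameter count for a single adapted matrix at each of those ranks. The hidden size of 4096 is an assumption chosen for illustration; real models vary.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r


d = 4096        # assumed hidden size; the projection is taken to be square
full = d * d    # parameters in the frozen matrix itself
for r in (4, 16, 64):
    n = lora_params(d, d, r)
    print(f"r={r}: {n} trainable params, {n / full:.4f} of the full matrix")
```

The count grows linearly in r, so going from r=4 to r=64 multiplies the adapter size by 16 while still training only a few percent of the matrix.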

Which layers get LoRA adapters matters. The original paper applied LoRA only to the attention projections (Q, K, V, and output) and reported most of its results with adapters on the query and value projections. Later work found that also applying it to the feed-forward layers (the up/down projection matrices in the MLP blocks) often helps, especially for domain adaptation tasks. The lm_head (output projection) is sometimes included for significant behavioral shifts.

QLoRA: 4-bit Base + LoRA Adapters

QLoRA (Dettmers et al., 2023) made fine-tuning much larger models tractable on consumer hardware:

Quantize the frozen base model to 4-bit NF4 (4-bit NormalFloat), a quantization scheme designed for normally distributed values, which neural network weights approximate reasonably well.
Store the base weights in 4-bit and dequantize them to BF16 on the fly whenever a layer is computed, in both the forward and the backward pass.
Compute gradients only for the LoRA adapters, which stay in BF16; the 4-bit base weights are never updated.
Use double quantization (quantizing the quantization constants themselves) to reduce memory overhead further.
Use paged optimizers to offload optimizer states to CPU when GPU memory is full.

This gets you fine-tuning of Llama 65B on a single 48GB GPU or Llama 7B on a 16GB consumer GPU. Storing the base weights in 4 bits instead of 16 cuts their memory footprint roughly 4x compared with BF16.
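The memory figures follow from simple arithmetic on bits per parameter. The sketch below counts base-model weights only; it deliberately ignores quantization constants, adapter weights, activations, and optimizer state, so real usage is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in (("7B", 7e9), ("65B", 65e9)):
    bf16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: {bf16:.1f} GB in BF16, {four_bit:.1f} GB in 4-bit")
```

At 4 bits, a 65B model's weights fit under 48 GB with room left for adapters and activations, which is why that card size is the usual reference point.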

The accuracy tradeoff is modest — typically within 1-2% of full-precision fine-tuning on standard benchmarks. For most production use cases, this is acceptable.

Adapter Selection in 2025-2026

LoRA has largely won the PEFT competition, but alternatives exist for specific cases:

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations): Even more parameter-efficient than LoRA (3x fewer parameters). It learns vectors that rescale activations instead of adding updates to the weights. Good when GPU memory is extremely constrained.
Prompt tuning: Prepend learnable “soft tokens” to the input embedding space. The model weights are entirely frozen. Extremely efficient but underperforms LoRA for most tasks.
Prefix tuning: Prepend learned prefix vectors to each layer’s key/value matrices. More expressive than prompt tuning, fewer parameters than LoRA.

For production fine-tuning in 2026, the default is LoRA/QLoRA unless you have a specific reason to deviate.

Instruction Tuning: The Real Complexity

Getting a base model to follow instructions well is harder than it looks, and the dataset quality matters enormously.

The LIMA Result

The LIMA paper (Meta, 2023) showed that a Llama 65B fine-tuned on just 1,000 carefully curated examples matched or exceeded models trained on orders of magnitude more data. The conclusion: data quality > data quantity, especially for instruction following.

This was somewhat overstated, since 1,000 examples isn’t enough for coverage across many domains, but the core finding holds: 1,000 high-quality, diverse, well-formatted examples beat 100,000 noisy examples scraped from the internet.

What “high quality” means in practice:

The response genuinely answers the question
The format is appropriate to the question type
The instruction diversity covers many different request types (explanation, summarization, translation, coding, creative writing, etc.)
No repetition of similar examples that trains the model to be monotone
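The last criterion is the easiest to automate. One simple way to catch repeated instructions is a word-overlap (Jaccard) filter; this is only a sketch, the 0.8 threshold is an arbitrary assumption, and production pipelines often use embeddings or MinHash instead.

```python
def token_set(text):
    """Lowercased whitespace tokens of an instruction."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(instructions, threshold=0.8):
    """Keep an instruction only if it is not too similar to one already kept."""
    kept, kept_sets = [], []
    for text in instructions:
        tokens = token_set(text)
        if all(jaccard(tokens, seen) < threshold for seen in kept_sets):
            kept.append(text)
            kept_sets.append(tokens)
    return kept


examples = [
    "Summarize this article in two sentences",
    "summarize this article in two sentences",
    "Translate this sentence into French",
]
filtered = drop_near_duplicates(examples)
print(filtered)
```

The pairwise comparison is quadratic in dataset size, which is fine for a LIMA-scale set of a thousand examples but not for millions.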

RLHF Pipeline Details

The full RLHF pipeline from the InstructGPT paper (the recipe ChatGPT was built on) has three stages:

Stage 1 — Supervised Fine-Tuning (SFT): Collect ~13,000 human-written prompt/response pairs (OpenAI used contractors for this). Fine-tune GPT-3 on these. This gives you a model that at least tries to be helpful.

Stage 2 — Reward Model Training: Sample several model outputs for each of ~33,000 prompts and have humans rank them from best to worst. Train a separate model (often smaller than the policy, typically the same architecture with a scalar output head replacing the language model head) to predict human preference from a (prompt, response) pair. This is your reward model.

Stage 3 — RL Fine-Tuning (PPO): Use the reward model as a reward signal and fine-tune the SFT model with PPO. Add a KL penalty term against the original SFT model to prevent the model from “reward hacking” its way to outputs that score high but are actually gibberish or manipulative.

The KL penalty is critical. Without it, the model finds adversarial outputs that fool the reward model while being useless. The coefficient of this penalty is a major hyperparameter — too small and you get reward hacking, too large and the RL training makes no progress.
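One common way to write the penalized reward is the reward model score minus a coefficient times the log-ratio between the policy and the frozen SFT model on the sampled response. The sketch below assumes per-token log-probabilities are already available; the function name and all numbers are illustrative.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta):
    """Reward model score minus beta * (log pi(y|x) - log pi_ref(y|x)).

    The log-ratio on one sampled response is a single-sample estimate of the
    KL divergence between the policy and the reference model.
    """
    log_ratio = sum(policy_logprobs) - sum(ref_logprobs)
    return rm_score - beta * log_ratio


# Hypothetical per-token log-probs for a two-token response.
reward = kl_penalized_reward(
    rm_score=1.0,
    policy_logprobs=[-1.0, -0.5],
    ref_logprobs=[-1.5, -1.0],
    beta=0.1,
)
print(reward)
```

A response the policy likes much more than the reference model did pays a larger penalty, which is what discourages drifting toward reward-model exploits.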

DPO: Simpler, Often Just as Good

Direct Preference Optimization (Rafailov et al., 2023) reframes the RLHF objective to eliminate the explicit reward model and RL training loop.

The key insight: the optimal RLHF policy has a closed-form relationship to the reference model (SFT model). You can reparametrize the reward function in terms of the language model itself and optimize the preference objective directly via supervised learning.

The DPO loss:

L_DPO = -E[log σ(β(log π_θ(y_w|x) - log π_ref(y_w|x)) - β(log π_θ(y_l|x) - log π_ref(y_l|x)))]

Where y_w is the preferred response, y_l is the rejected response, π_θ is the policy being trained, π_ref is the reference model, and β controls the KL constraint.

In plain terms: increase the probability of preferred outputs relative to the reference model; decrease the probability of rejected outputs relative to the reference model. The β term keeps you from drifting too far from the reference.
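Written as code, the per-example loss is only a few lines. This sketch takes sequence-level log-probabilities as plain floats (in practice they are summed token log-probs from the two models); the function names and the default beta of 0.1 are assumptions for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen = beta * (policy_logp_w - ref_logp_w)     # implicit reward of y_w
    rejected = beta * (policy_logp_l - ref_logp_l)   # implicit reward of y_l
    return -log_sigmoid(chosen - rejected)


# Policy identical to the reference: the margin is zero, so the loss is log(2).
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))
# Policy has shifted probability toward the preferred response: loss is lower.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Averaging this value over a batch of preference pairs gives the expectation in the formula.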

DPO requires:

The SFT model (reference model, frozen)
Preference pairs (prompt + winning response + losing response)
A plain supervised training loop (no RL, no PPO, no separate reward model)

The practical advantage is enormous: DPO is much simpler to implement and tune. Several papers have shown it matches PPO-based RLHF on standard benchmarks. It doesn’t always match on harder reasoning tasks — some evidence that PPO’s RL dynamics provide benefits DPO can’t replicate — but for most commercial fine-tuning tasks, DPO is the default today.

SimPO (Simple Preference Optimization, 2024) is a newer variant that removes the reference model entirely, using the average per-token log-likelihood of a response as the implicit reward. It needs less memory because no reference model is loaded, and it often matches DPO on instruction following.
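A sketch of the SimPO objective for one preference pair follows. The target margin gamma is not mentioned in the text; it is included here as an assumption based on the published SimPO formulation, and setting gamma=0 drops it. The default values for beta and gamma are placeholders, not recommended settings.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=1.0):
    """SimPO-style loss: length-normalized log-probs, no reference model."""
    reward_w = beta * logp_w / len_w   # average log-prob of preferred response
    reward_l = beta * logp_l / len_l   # average log-prob of rejected response
    return -log_sigmoid(reward_w - reward_l - gamma)


# Hypothetical pair: both responses are 10 tokens long.
loss = simpo_loss(logp_w=-10.0, len_w=10, logp_l=-30.0, len_l=10)
print(loss)
```

Dividing by length keeps the implicit reward from favoring short responses simply because they accumulate less negative log-probability.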

Catastrophic Forgetting: Still Unsolved

The fundamental tension in fine-tuning: every gradient update that moves the model toward your task pulls it slightly away from what it learned during pre-training. Update too aggressively and the model “forgets” its general capabilities.

This is most visible when you fine-tune on a narrow domain. A model fine-tuned heavily on legal documents starts losing its ability to write poetry or do arithmetic. The information was in the weights; the weights were overwritten.

Approaches to Mitigation

Learning rate warmup + decay: Gradual learning rate schedule prevents large destructive updates early in training. Standard practice.
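A typical schedule is linear warmup followed by cosine decay. This is a minimal sketch; the peak rate, warmup length, and function name are illustrative assumptions rather than recommended values.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-5, warmup_steps=100, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp up linearly; (step + 1) avoids a zero learning rate at step 0.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```

Most training frameworks ship an equivalent scheduler, so in practice this is a configuration choice rather than code you write.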

Replay / data mixing: Mix a small amount of pre-training data into your fine-tuning dataset. The model continues to “practice” the general distribution. Effective but requires access to pre-training data, which proprietary models don’t expose.
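Mixing can be as simple as sampling general-purpose examples into the task dataset at a fixed ratio. The sketch below assumes a 10% replay fraction purely for illustration; the right ratio depends on the task and has to be found empirically.

```python
import random


def mix_datasets(task_examples, general_examples, replay_fraction=0.1, seed=0):
    """Return a shuffled list where about replay_fraction of items are general data."""
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = replay_fraction for n_general.
    n_general = round(len(task_examples) * replay_fraction / (1 - replay_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder examples tagged by source so the mix can be inspected.
task = [("task", i) for i in range(90)]
general = [("general", i) for i in range(100)]
mixed = mix_datasets(task, general, replay_fraction=0.1)
print(len(mixed), sum(1 for kind, _ in mixed if kind == "general"))
```

When the original pre-training data is unavailable, a public general-purpose instruction or web corpus is a common stand-in, with the caveat that it only approximates the original distribution.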

LoRA for preservation: Because LoRA freezes the original weights and only trains adapters, it inherently limits forgetting. The frozen weights retain pre-training knowledge; the adapters capture task-specific adjustments. This is underrated as a forgetting mitigation strategy.

Elastic Weight Consolidation (EWC): Penalizes changes to weights that were important for previous tasks, using the Fisher information matrix as a measure of importance. Works in principle, computationally expensive at scale.
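The EWC penalty itself is a weighted squared distance from the pre-fine-tuning weights. The sketch below uses flat lists and a diagonal Fisher approximation; the values are made up, and estimating the Fisher terms (the expensive part) is not shown.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    anchor_params are the weights before fine-tuning; fisher_diag holds the
    diagonal Fisher information, i.e. how important each weight was.
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# Two toy weights: the first moved by 1.0, the second did not move.
penalty = ewc_penalty(
    params=[1.0, 2.0],
    anchor_params=[0.0, 2.0],
    fisher_diag=[4.0, 100.0],
    lam=0.5,
)
print(penalty)
```

The penalty is added to the task loss, so weights with high Fisher values are expensive to move while unimportant weights stay free to adapt.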

Sequential training order: Start with more general fine-tuning (instruction following) before more specific fine-tuning (domain specialization). General → specific is more robust than the reverse.

In practice: LoRA + low learning rate + validation on general benchmarks (MMLU, HellaSwag) during training is sufficient for most production fine-tuning. Watch the general benchmark numbers; if they drop by more than 5-10%, you’re forgetting.
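That check is easy to wire into a training loop. The sketch below interprets the threshold as a relative drop against the pre-fine-tuning baseline, which is an assumption since the rule of thumb does not say relative or absolute; the scores are hypothetical.

```python
def forgetting_report(baseline, current, max_relative_drop=0.05):
    """Return benchmarks whose score fell by more than max_relative_drop."""
    flagged = {}
    for name, base_score in baseline.items():
        drop = (base_score - current[name]) / base_score
        if drop > max_relative_drop:
            flagged[name] = drop
    return flagged


# Hypothetical scores before fine-tuning and at the current checkpoint.
baseline = {"mmlu": 0.60, "hellaswag": 0.80}
current = {"mmlu": 0.54, "hellaswag": 0.79}
flagged = forgetting_report(baseline, current)
print(flagged)
```

Running this at every checkpoint, rather than only at the end, shows when forgetting starts and which checkpoint to roll back to.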

Evaluation: The Actual Hard Part

Training loss going down does not mean your model is getting better at what you care about. This is the most important operational lesson in fine-tuning.

Automated metrics (ROUGE for summarization, pass@k for code generation, accuracy on classification) give you fast signal but often miss what matters. A model that scores 2% higher on ROUGE-L might produce outputs that human evaluators prefer significantly less.
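For reference, pass@k is usually computed with the unbiased estimator from the Codex evaluation: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: draw size.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 passed the unit tests.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The per-problem values are averaged over the benchmark to get the reported score.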

LLM-as-judge has become standard. You prompt a capable model (GPT-4, Claude) to rate your model’s outputs against a rubric. Faster than human eval, correlates reasonably well with human preference for most tasks, but introduces its own biases (length bias: longer outputs score higher; position bias: outputs shown first rate higher).
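Position bias in particular has a cheap mitigation: ask the judge twice with the order swapped and only count a win when both verdicts agree. In this sketch the judge is a hypothetical callable returning "first" or "second"; a real one would wrap an API call and parse the reply. The two stand-in judges exist only to show the behavior.

```python
def debiased_preference(judge, prompt, output_a, output_b):
    """Query the judge in both presentation orders; return 'a', 'b', or 'tie'."""
    verdict_ab = judge(prompt, output_a, output_b)   # A shown first
    verdict_ba = judge(prompt, output_b, output_a)   # B shown first
    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "a"
    if not a_wins_ab and not a_wins_ba:
        return "b"
    return "tie"  # verdict flipped with the order, so treat it as no signal


def always_first(prompt, first, second):
    """Stand-in judge with pure position bias."""
    return "first"


def prefers_longer(prompt, first, second):
    """Stand-in judge with pure length bias."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "q", "x", "y"))
print(debiased_preference(prefers_longer, "q", "long answer", "short"))
```

The swap neutralizes position bias but not length bias, as the second stand-in shows; that one has to be handled in the rubric or by controlling output length.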

Human evaluation remains the ground truth but is expensive and slow. Budget for at least 500 rated examples per major checkpoint if accuracy matters for your use case.

The most important thing to measure: your actual task, not proxy benchmarks. If you’re fine-tuning a customer service model, evaluate on customer service conversations. Generic benchmarks tell you almost nothing about whether you’ll succeed.

The Commercial Landscape in 2026

Fine-tuning has split into two tiers:

Managed fine-tuning (OpenAI, Anthropic, Google, Azure OpenAI): Upload your data, pay per training token, get a fine-tuned model endpoint. Simple, no infrastructure required, but expensive per token and you don’t own the weights. OpenAI charges around $0.003/1K tokens for fine-tuning GPT-4o mini as of early 2025. Good for most business use cases.

Self-hosted fine-tuning (Llama 3, Mistral, Gemma, Phi): Hugging Face + transformers + TRL/PEFT libraries + your own GPU cluster or cloud instances. Full control, no per-token fees, but requires ML engineering expertise. The open-source ecosystem has matured significantly — Axolotl and Unsloth are popular training frameworks that abstract most of the LoRA/QLoRA complexity. A fine-tuned 8B model often performs comparably to a general-purpose 70B+ model on specific tasks at a fraction of the inference cost.

The business math usually works out like this: if you’re making more than ~10 million API calls per month on a specific task, it’s almost always cheaper to fine-tune an open-source model and self-host than to use a commercial API. Under that threshold, the engineering overhead typically isn’t worth it.
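The break-even point depends entirely on your own numbers, so it is worth computing rather than assuming. Every input in the sketch below is a placeholder (token counts, API price, GPU rate, engineering overhead); substitute real quotes before drawing conclusions.

```python
def monthly_costs(calls_per_month, tokens_per_call, api_price_per_1k_tokens,
                  gpu_hourly_rate, gpus, fixed_engineering_per_month):
    """Return (api_cost, self_hosted_cost) in dollars per 30-day month."""
    api_cost = calls_per_month * tokens_per_call / 1000 * api_price_per_1k_tokens
    self_hosted_cost = gpu_hourly_rate * gpus * 24 * 30 + fixed_engineering_per_month
    return api_cost, self_hosted_cost


# Placeholder inputs, not real prices.
for calls in (1_000_000, 10_000_000):
    api_cost, hosted_cost = monthly_costs(
        calls_per_month=calls,
        tokens_per_call=1000,
        api_price_per_1k_tokens=0.001,
        gpu_hourly_rate=2.0,
        gpus=2,
        fixed_engineering_per_month=5000,
    )
    print(calls, round(api_cost), round(hosted_cost))
```

API cost scales with traffic while self-hosting is mostly fixed, so the crossover moves with tokens per call and with how many GPUs the traffic actually needs.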

One Thing to Remember

The hardest part of fine-tuning isn’t the training — it’s knowing what you actually want the model to learn, and building data that teaches exactly that without side effects. The ML is mostly solved; the data curation and evaluation rarely are.
