LoRA Fine-Tuning in Python — Core Concepts

Understand how LoRA adapts large models efficiently by training small weight matrices, and how to apply it in Python with PEFT and diffusers.

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that modifies the behavior of a pretrained model by injecting small trainable matrices into its layers, while keeping the original weights frozen. Instead of updating millions or billions of parameters, LoRA trains only a small fraction of them, typically around 0.1% to 1% of the original parameter count.

The core idea

In a standard neural network layer, the weight matrix W might be 4096 × 4096, roughly 16.8 million parameters. Full fine-tuning adjusts all of them. LoRA instead adds two small matrices: A (rank × 4096), which projects the input down to rank dimensions, and B (4096 × rank), which projects it back up. Here rank is a small number like 4, 8, or 16.

In the forward pass, during both training and inference, the output becomes: output = W·x + B·A·x. W stays frozen; only A and B receive gradient updates.
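
The formula can be checked on a toy layer. The sketch below is illustrative only: it uses plain Python lists instead of tensors, a 4-dimensional layer with rank 2, and hypothetical helper names (`matvec`, `lora_forward`). It assumes the initialization described in the original LoRA paper, where B starts at zero so the adapted layer initially reproduces the base layer exactly.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """Compute W·x + B·(A·x): the frozen path plus the low-rank path."""
    base = matvec(W, x)                  # frozen pretrained weights
    low_rank = matvec(B, matvec(A, x))   # project down to `rank`, then back up
    return [b + l for b, l in zip(base, low_rank)]

random.seed(0)
d, rank = 4, 2
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # d rows, d columns, frozen
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(rank)]  # rank rows, d columns, trainable
B = [[0.0] * rank for _ in range(d)]                               # d rows, rank columns, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero, the LoRA path contributes nothing yet.
assert lora_forward(W, A, B, x) == matvec(W, x)

# Once training moves B away from zero, the output shifts.
B[0][0] = 0.5
shift = lora_forward(W, A, B, x)[0] - matvec(W, x)[0]
print(shift)  # equals 0.5 * (A·x)[0], up to floating-point rounding
```

Computing A·x first and then applying B keeps the extra work proportional to rank, which is why the low-rank path is cheap compared with the base layer.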

The product B·A has the same dimensions as W, but because rank is small, A and B together contain far fewer parameters. With rank 8, you go from about 16.8 million trainable parameters to 65,536 (2 × 4096 × 8), a 256x reduction.
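
The arithmetic is easy to reproduce. This is a small sketch with hypothetical function names; it only counts matrix entries and does not touch a real model.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Entries in a dense weight matrix."""
    return d_in * d_out

def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Entries in the two low-rank factors: rank * d_in plus d_out * rank."""
    return rank * d_in + d_out * rank

full = full_param_count(4096, 4096)
for rank in (4, 8, 16):
    lora = lora_param_count(4096, 4096, rank)
    print(f"rank={rank:>2}: {lora:>7,} trainable vs {full:,} full ({full // lora}x fewer)")
```

For rank 8 this gives 65,536 trainable parameters against 16,777,216, a 256x reduction. Doubling the rank doubles the adapter size and halves the reduction.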

Why it works

Large language models and image generators are heavily over-parameterized. The original LoRA paper hypothesizes, and its experiments support, that the weight changes needed for task-specific adaptation have low intrinsic rank. LoRA exploits this by constraining updates to a low-rank decomposition. The constraint also acts as a regularizer, so on small datasets LoRA can match and sometimes exceed full fine-tuning.

LoRA for image generation

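Training a LoRA for Stable Diffusion typically targets the attention layers of the U-Net, especially the cross-attention layers where text conditioning meets image features. Once trained, the adapter is loaded on top of the base pipeline with diffusers: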

from diffusers import StableDiffusionPipeline
import torch

# Load base model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load a trained LoRA
pipe.load_lora_weights("path/to/lora/weights")

# Generate with LoRA influence
image = pipe(
    "a portrait in the style of <trained_concept>",
    num_inference_steps=30,
).images[0]

# Remove LoRA to restore base model behavior
pipe.unload_lora_weights()

LoRA for language models with PEFT

The peft library from Hugging Face provides a clean interface for applying LoRA to most transformer models:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                      # rank
    lora_alpha=32,             # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expected output with r=16 on q_proj and v_proj (exact formatting varies by peft version):
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243

Key parameters

Rank (r): Controls the capacity of the adaptation. Rank 4 works for simple style changes; rank 16–64 for complex behavioral shifts. Higher rank means more parameters and longer training.

Alpha: A scaling factor applied to the LoRA output. The effective scaling is alpha / rank. Common practice sets alpha to twice the rank value.
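
To make the alpha / rank relationship concrete, here is a minimal sketch. The function name `lora_scaling` is hypothetical, and it assumes the standard LoRA scaling rule rather than any variant.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B·A (standard LoRA: alpha / rank)."""
    return alpha / rank

# Keeping alpha at twice the rank holds the multiplier at 2.0 as rank changes.
for rank in (4, 8, 16, 64):
    print(rank, lora_scaling(alpha=2 * rank, rank=rank))

# Holding alpha fixed while raising the rank shrinks the multiplier instead.
for rank in (4, 8, 16, 64):
    print(rank, lora_scaling(alpha=32, rank=rank))
```

This is one reason alpha is usually adjusted together with rank: changing rank alone also changes how strongly the adapter's output is weighted.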

Target modules: Which layers get LoRA adapters. For attention-based models, query and value projections are standard targets. Adding key projections and feed-forward layers increases capacity but also training time.
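
The cost of adding target modules can be estimated before loading any model. The sketch below hard-codes layer shapes for a Llama-2-7B-style decoder (hidden size 4096, MLP size 11008, 32 blocks). Those shapes and the helper name `lora_trainable_params` are assumptions for illustration, not values read from a checkpoint.

```python
# (in_features, out_features) for each linear layer in one decoder block.
# Assumed shapes for a Llama-2-7B-style architecture.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
NUM_BLOCKS = 32
RANK = 16

def lora_trainable_params(target_modules, rank):
    """Sum rank * (in + out) over every targeted layer in every block."""
    per_block = sum(rank * sum(LAYER_SHAPES[name]) for name in target_modules)
    return per_block * NUM_BLOCKS

attention_only = lora_trainable_params(["q_proj", "v_proj"], RANK)
all_linear = lora_trainable_params(list(LAYER_SHAPES), RANK)

print(f"q_proj + v_proj:   {attention_only:,}")
print(f"all linear layers: {all_linear:,}")
print(f"ratio: {all_linear / attention_only:.1f}x")
```

Under these assumptions, targeting every linear layer trains roughly 4.8 times as many parameters as targeting query and value projections alone, and that ratio does not depend on the rank.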

Combining multiple LoRAs

LoRAs can be stacked with different weights:

pipe.load_lora_weights("style_lora", adapter_name="style")
pipe.load_lora_weights("lighting_lora", adapter_name="lighting")
pipe.set_adapters(["style", "lighting"], adapter_weights=[0.8, 0.5])

This lets you mix a style LoRA at 80% strength with a lighting LoRA at 50% strength, composing effects that neither achieves alone.
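
Conceptually, each adapter contributes its own low-rank term, scaled by its weight. The following is a toy sketch of that idea in plain Python, not the diffusers implementation; the helper names and all matrix values are made up.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def combined_forward(W, adapters, weights, x):
    """Base output plus each adapter's low-rank output, scaled by its weight.

    `adapters` maps a name to an (A, B) pair; `weights` maps a name to a float.
    """
    out = matvec(W, x)
    for name, (A, B) in adapters.items():
        delta = matvec(B, matvec(A, x))
        out = [o + weights[name] * d for o, d in zip(out, delta)]
    return out

# Toy 2-dimensional layer with two rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "style":    ([[1.0, 0.0]], [[1.0], [0.0]]),  # nudges output[0] using x[0]
    "lighting": ([[0.0, 1.0]], [[0.0], [2.0]]),  # nudges output[1] using x[1]
}
x = [1.0, 1.0]

print(combined_forward(W, adapters, {"style": 0.8, "lighting": 0.5}, x))
print(combined_forward(W, adapters, {"style": 0.0, "lighting": 0.0}, x))  # base behavior
```

Setting a weight to zero switches that adapter off without unloading it, which is convenient when comparing combinations.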

Common misconception

LoRA does not compress or simplify the base model. The original model runs exactly as before — LoRA adds a small parallel path. At inference time, the LoRA matrices can be merged into the base weights for zero overhead, but the model itself does not become smaller.
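
A toy example makes the merge concrete. This sketch uses plain lists and hypothetical helpers (`matmul`, `merge_lora`) with made-up values. It checks that folding B·A into W gives the same output as the parallel path, and that the merged matrix has the same shape as W.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def merge_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A), a matrix with the same shape as W."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 3x3 layer with a rank-1 adapter.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A = [[1.0, 0.0, 2.0]]        # 1 row, 3 columns
B = [[0.5], [0.0], [1.0]]    # 3 rows, 1 column
x = [1.0, 2.0, 3.0]

merged = merge_lora(W, A, B)
parallel = [b + l for b, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Same output either way, and the merged matrix is no smaller than W.
assert all(abs(m - p) < 1e-9 for m, p in zip(matvec(merged, x), parallel))
assert len(merged) == len(W) and len(merged[0]) == len(W[0])
```

On real models, PEFT's `merge_and_unload()` and the diffusers pipeline method `fuse_lora()` perform this kind of merge for you.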

When to use LoRA vs. full fine-tuning

Scenario	LoRA	Full fine-tuning
Small dataset (< 10k examples)	Preferred	Risk of overfitting
Consumer GPU (8–16 GB)	Fits easily	Often impossible
Multiple specialized models	Store small adapter files	Store full model copies
Maximum quality, unlimited budget	Good but not always best	Can squeeze more performance
Quick iteration	Minutes to hours	Hours to days
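
The storage row is simple to quantify. The sketch below uses illustrative round numbers (a 7-billion-parameter base model, 8-million-parameter adapters, 10 tasks) rather than measurements of any specific checkpoint, and it assumes 2 bytes per parameter as in fp16 or bf16.

```python
def size_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate size of raw weights in GiB."""
    return num_params * bytes_per_param / 1024**3

# Illustrative values only.
base_params = 7_000_000_000
adapter_params = 8_000_000
num_tasks = 10

full_copies = num_tasks * size_gib(base_params)
base_plus_adapters = size_gib(base_params) + num_tasks * size_gib(adapter_params)

print(f"{num_tasks} full fine-tunes:    {full_copies:.1f} GiB")
print(f"1 base + {num_tasks} adapters: {base_plus_adapters:.1f} GiB")
```

With LoRA, the total barely grows as tasks are added, because each new task costs one small adapter file rather than another full copy of the model.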

One thing to remember: LoRA decomposes weight updates into two small matrices that capture task-specific changes in a fraction of the space, making fine-tuning accessible on consumer hardware while keeping the original model intact.
