LoRA Fine-Tuning in Python — Core Concepts
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that modifies the behavior of a pretrained model by injecting small trainable matrices into its layers, while keeping the original weights frozen. Instead of updating millions or billions of parameters, LoRA trains a small fraction of them, typically 0.1% to 1% of the original parameter count.
The core idea
In a standard neural network layer, the weight matrix W might be 4096 × 4096, roughly 16.8 million parameters. Full fine-tuning adjusts all of them. LoRA instead freezes W and adds two small trainable matrices: A (rank × 4096) and B (4096 × rank), where rank is a small number like 4, 8, or 16.
In the forward pass, during both training and inference, the output becomes: output = W·x + B·A·x
The product B·A has the same dimensions as W, but because rank is small, A and B together contain far fewer parameters. With rank 8, you go from about 16.8 million trainable parameters to 65,536 (2 × 4096 × 8), a 256x reduction.
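The shapes and the parameter arithmetic above can be checked with a small dependency-free sketch. It assumes A has shape rank × d_in and B has shape d_out × rank, which is what the formula W·x + B·A·x requires. The helper names (`matvec`, `lora_forward`, `lora_param_count`) and the toy sizes are illustrative and not taken from any library. B is initialized to zero, as in the original LoRA paper, so the adapter changes nothing before training.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def lora_forward(W, A, B, x):
    """Compute W·x + B·(A·x) without forming the full B·A matrix."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # A maps d_in -> rank, B maps rank -> d_out
    return [b + l for b, l in zip(base, low_rank)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank


rng = random.Random(0)
d, rank = 6, 2  # toy sizes so the example runs instantly
W = random_matrix(d, d, rng)           # frozen pretrained weights
A = random_matrix(rank, d, rng)        # trainable, random init
B = [[0.0] * rank for _ in range(d)]   # trainable, zero init: adapter starts as a no-op
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]

y = lora_forward(W, A, B, x)

full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(len(y), full, lora, full // lora)  # 6 16777216 65536 256
```

Computing B·(A·x) rather than (B·A)·x keeps the extra work proportional to the rank, which is why the adapter adds little cost while it remains unmerged.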
Why it works
Large language models and image generators are heavily over-parameterized. The original LoRA paper (Hu et al., 2021) hypothesizes that the weight changes needed for task-specific adaptation have a low intrinsic rank, and its experiments support this. LoRA exploits the idea by constraining updates to a low-rank decomposition. The constraint also limits how far the model can drift from its pretrained weights, which can act as a regularizer, so on small datasets LoRA often matches full fine-tuning and sometimes beats it.
LoRA for image generation
Training a LoRA for Stable Diffusion typically targets the cross-attention layers of the U-Net — the layers where text conditioning meets image features:

    from diffusers import StableDiffusionPipeline
    import torch

    # Load base model
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Load a trained LoRA
    pipe.load_lora_weights("path/to/lora/weights")

    # Generate with LoRA influence
    image = pipe(
        "a portrait in the style of <trained_concept>",
        num_inference_steps=30,
    ).images[0]

    # Remove LoRA to restore base model behavior
    pipe.unload_lora_weights()

LoRA for language models with PEFT
The peft library from Hugging Face provides a clean interface for applying LoRA to any transformer:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    lora_config = LoraConfig(
        r=16,                                 # rank
        lora_alpha=32,                        # scaling factor
        target_modules=["q_proj", "v_proj"],  # which layers to adapt
        lora_dropout=0.05,
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Expected output for r=16 (exact formatting varies by peft version):
    # trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243

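The trainable-parameter count that `print_trainable_parameters()` reports can be sanity-checked by hand: each adapted projection contributes rank × (d_in + d_out) parameters. The sketch below assumes the Llama-2-7B architecture has 32 decoder layers and that `q_proj` and `v_proj` are both 4096 × 4096. The function name and constants are illustrative, not part of peft.

```python
def lora_trainable_params(num_layers, targets, rank):
    """Sum rank * (d_in + d_out) over every adapted projection in every layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in targets.values())
    return num_layers * per_layer


# Assumed Llama-2-7B shapes: 32 decoder layers, 4096 x 4096 query and value projections
LLAMA2_7B_LAYERS = 32
LLAMA2_7B_TARGETS = {"q_proj": (4096, 4096), "v_proj": (4096, 4096)}

for r in (8, 16):
    print(r, lora_trainable_params(LLAMA2_7B_LAYERS, LLAMA2_7B_TARGETS, r))
# 8 4194304
# 16 8388608
```

Under these assumptions, rank 16 on the query and value projections gives 8,388,608 trainable parameters and rank 8 gives 4,194,304, so the count doubles when the rank doubles.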
Key parameters
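Rank (r): Controls the capacity of the adaptation. As a rule of thumb, rank 4 to 8 is often enough for simple style changes, while rank 16 to 64 suits more complex behavioral shifts. Higher rank means more trainable parameters, larger adapter files, and more memory and time during training.
Alpha: A scaling factor applied to the LoRA output. The update B·A·x is multiplied by alpha / rank before it is added to the base output. Common practice sets alpha to twice the rank, which keeps this multiplier at 2 as you vary the rank.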
Target modules: Which layers get LoRA adapters. For attention-based models, query and value projections are standard targets. Adding key projections and feed-forward layers increases capacity but also training time.
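The interaction between rank and alpha is easiest to see numerically. This small sketch uses a hypothetical helper, `lora_scaling`, to compare the "alpha equals twice the rank" convention with a fixed alpha of 32 across several ranks.

```python
def lora_scaling(alpha, rank):
    """Multiplier applied to the low-rank update: alpha / rank."""
    return alpha / rank


print("rank  alpha=2*rank  alpha=32")
for rank in (4, 8, 16, 64):
    print(rank, lora_scaling(2 * rank, rank), lora_scaling(32, rank))
# 4 2.0 8.0
# 8 2.0 4.0
# 16 2.0 2.0
# 64 2.0 0.5
```

With a fixed alpha, raising the rank shrinks the multiplier, so changing the rank alone also changes how strongly the adapter affects the output. Tying alpha to the rank avoids that coupling.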
Combining multiple LoRAs
LoRAs can be stacked with different weights:

    pipe.load_lora_weights("style_lora", adapter_name="style")
    pipe.load_lora_weights("lighting_lora", adapter_name="lighting")
    pipe.set_adapters(["style", "lighting"], adapter_weights=[0.8, 0.5])

This lets you mix a style LoRA at 80% strength with a lighting LoRA at 50% strength, composing effects that neither achieves alone.
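Conceptually, weighted stacking can be pictured as adding each adapter's low-rank update to the base output, scaled by its weight. The sketch below models only that arithmetic with toy 2 × 2 matrices. It assumes the weights scale each update linearly and is not a description of how diffusers implements `set_adapters` internally. All names and values are illustrative.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def combined_forward(W, adapters, weights, x):
    """Base output plus each adapter's low-rank update, scaled by its weight."""
    y = matvec(W, x)
    for name, (A, B) in adapters.items():
        delta = matvec(B, matvec(A, x))  # this adapter's B·A·x
        y = [yi + weights[name] * di for yi, di in zip(y, delta)]
    return y


W = [[1.0, 0.0], [0.0, 1.0]]  # toy frozen base weights (identity)
adapters = {
    # Each entry is (A, B) with rank 1: A is 1 x 2, B is 2 x 1.
    "style": ([[1.0, 0.0]], [[1.0], [0.0]]),     # adds x[0] to output[0]
    "lighting": ([[0.0, 1.0]], [[0.0], [2.0]]),  # adds 2 * x[1] to output[1]
}
weights = {"style": 0.8, "lighting": 0.5}

x = [10.0, 10.0]
y = combined_forward(W, adapters, weights, x)
print(y)
```

Here the style adapter contributes 0.8 × 10 to the first output and the lighting adapter contributes 0.5 × 20 to the second, so each weight acts as an independent strength control for its adapter.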
Common misconception
LoRA does not compress or simplify the base model. The original model runs exactly as before — LoRA adds a small parallel path. At inference time, the LoRA matrices can be merged into the base weights for zero overhead, but the model itself does not become smaller.
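The merge step can be illustrated with plain Python lists: folding scale × B·A into W gives a matrix of the same shape as W whose output matches the parallel path up to floating-point rounding. This is a minimal sketch with toy sizes. The helper names (`matmul`, `merge_lora`) and the chosen alpha and rank are assumptions for illustration, not a library API.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def merge_lora(W, A, B, scale):
    """Return W + scale * (B·A), a matrix with the same shape as W."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


rng = random.Random(0)
d, rank, alpha = 5, 2, 4  # toy sizes
scale = alpha / rank
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
A = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(rank)]
B = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(d)]
x = [rng.uniform(-1, 1) for _ in range(d)]

# Unmerged: base path plus the scaled low-rank side path
update = matvec(B, matvec(A, x))
y_parallel = [b + scale * u for b, u in zip(matvec(W, x), update)]

# Merged: a single matrix multiply, no extra work at inference
W_merged = merge_lora(W, A, B, scale)
y_merged = matvec(W_merged, x)

# Largest difference between the two paths (floating-point rounding only)
print(max(abs(p - m) for p, m in zip(y_parallel, y_merged)))
```

The merged matrix has exactly as many entries as W, which is why merging removes the inference overhead without making the model any smaller.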
When to use LoRA vs. full fine-tuning
| Scenario | LoRA | Full fine-tuning |
|---|---|---|
| Small dataset (< 10k examples) | Preferred | Risk of overfitting |
| Consumer GPU (8–16 GB) | Usually fits, sometimes with a quantized base model | Often impossible |
| Multiple specialized models | Store small adapter files | Store full model copies |
| Maximum quality, unlimited budget | Good but not always best | Can squeeze more performance |
| Quick iteration | Minutes to hours | Hours to days |
One thing to remember: LoRA decomposes weight updates into two small matrices that capture task-specific changes in a fraction of the space, making fine-tuning accessible on consumer hardware while keeping the original model intact.