ControlNet Image Control in Python — Core Concepts

Understand how ControlNet adds spatial conditioning to diffusion models using edge maps, depth maps, and poses, and how to use it in Python with diffusers.

ControlNet is an architectural extension for diffusion models that adds spatial conditioning — the ability to guide image generation using structural inputs like edge maps, depth maps, human poses, or segmentation masks. It solves one of the biggest frustrations with text-to-image models: you can describe what you want, but you cannot easily describe where things should go.

How ControlNet works

Standard Stable Diffusion takes a text prompt and random noise, then iteratively denoises to produce an image. ControlNet adds a trainable copy of the U-Net's encoder blocks alongside the original, whose weights stay frozen, and connects the two through “zero convolution” layers. This parallel path processes your control image and injects spatial information into the denoising process at multiple scales.

The key insight: the zero convolutions start with weights initialized to zero, meaning ControlNet initially has no effect. During training, these weights gradually learn to incorporate the control signal without disrupting the pretrained model’s existing capabilities.
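
To make the zero-initialization idea concrete, here is a toy sketch in plain Python with no deep learning framework. It is not how ControlNet is actually implemented; the helper names (conv1x1, add_residual) and all the numbers are made up for illustration. It only shows that a 1x1 convolution whose weights and bias are zero contributes nothing when its output is added to the frozen model's features.

```python
def conv1x1(features, weights, bias):
    """Apply a 1x1 convolution to a [channels][height][width] nested list.

    A 1x1 convolution mixes channels independently at every pixel:
    out[o][y][x] = bias[o] + sum_i weights[o][i] * features[i][y][x]
    """
    in_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                bias[o] + sum(weights[o][i] * features[i][y][x] for i in range(in_channels))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weights))
    ]


def add_residual(base, residual):
    """Element-wise sum of two equally shaped [C][H][W] nested lists."""
    return [
        [[b + r for b, r in zip(base_row, res_row)] for base_row, res_row in zip(base_ch, res_ch)]
        for base_ch, res_ch in zip(base, residual)
    ]


# Toy 2-channel, 2x2 feature maps (made-up numbers).
frozen_features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
control_features = [[[0.5, -1.0], [2.0, 0.0]], [[9.0, 1.5], [-3.0, 4.0]]]

# Zero convolution at initialization: weights and bias are all zero,
# so the control branch adds nothing to the frozen features.
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
combined = add_residual(frozen_features, conv1x1(control_features, zero_weights, zero_bias))

# After training has moved the weights away from zero (hypothetical values),
# the control signal starts to shift the features.
trained_weights = [[0.1, 0.0], [0.0, 0.1]]
combined_after_training = add_residual(
    frozen_features, conv1x1(control_features, trained_weights, zero_bias)
)
```

With the zero weights, combined equals frozen_features exactly, which is why an untrained ControlNet leaves the base model's output untouched.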

Control types

Different ControlNet models accept different kinds of guidance:

Canny edges: Extracts hard edges from a reference image. Good for maintaining the structural outline of objects while completely changing their appearance.

Depth maps: Encodes the relative distance of objects from the camera. Useful for preserving 3D spatial relationships — foreground objects stay in front, backgrounds stay behind.

OpenPose: Detects human body keypoints (joints, facial landmarks). Perfect for controlling character poses without specifying their appearance.

Segmentation maps: Color-coded regions indicating what should go where — sky here, building there, road at the bottom. Gives compositional control without fixing the exact shapes.

Scribbles and sketches: Rough hand-drawn lines that the model interprets loosely. The most forgiving control type.
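
Each control type above needs its own ControlNet checkpoint. The lookup below is a small sketch of how that choice might be organized. The repository IDs are, to the best of my knowledge, the original Stable Diffusion 1.5 ControlNet releases on the Hugging Face Hub, but verify them before relying on this; the dictionary and the checkpoint_for helper are hypothetical names, not part of diffusers.

```python
# Control type -> ControlNet checkpoint for Stable Diffusion 1.5.
# Assumption: these Hub repository IDs are still available under these names.
CONTROLNET_CHECKPOINTS = {
    "canny": "lllyasviel/sd-controlnet-canny",
    "depth": "lllyasviel/sd-controlnet-depth",
    "openpose": "lllyasviel/sd-controlnet-openpose",
    "seg": "lllyasviel/sd-controlnet-seg",
    "scribble": "lllyasviel/sd-controlnet-scribble",
}


def checkpoint_for(control_type: str) -> str:
    """Return the checkpoint ID for a control type (hypothetical helper)."""
    key = control_type.strip().lower()
    if key not in CONTROLNET_CHECKPOINTS:
        known = ", ".join(sorted(CONTROLNET_CHECKPOINTS))
        raise ValueError(f"Unknown control type {control_type!r}; expected one of: {known}")
    return CONTROLNET_CHECKPOINTS[key]
```

The returned string is what you would pass to ControlNetModel.from_pretrained. The control image must be of the matching kind: a depth checkpoint expects a depth map, not an edge map.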

Using ControlNet in Python

from diffusers import (
    StableDiffusionControlNetPipeline,
    ControlNetModel,
)
from diffusers.utils import load_image
import torch

# Load the ControlNet trained on Canny edge maps
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

# Attach it to a Stable Diffusion 1.5 pipeline (half precision, CUDA GPU)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The control image must already be an edge map, not a raw photo
control_image = load_image("my_canny_edges.png")

result = pipe(
    "a futuristic cityscape at sunset, detailed architecture",
    image=control_image,
    num_inference_steps=30,
).images[0]

# The result is a PIL image
result.save("cityscape.png")

Conditioning scale

The controlnet_conditioning_scale parameter (default 1.0, typically set between 0.0 and 2.0) controls how strictly the model follows your control image:

0.3–0.5: Loose guidance, model has creative freedom
0.7–1.0: Balanced, generally the sweet spot
1.2–1.5: Very strict adherence to the control structure
Above 1.5: Can produce artifacts as the model is forced too hard
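
A practical way to find the right value for a given control image is to generate the same prompt at several scales and compare the results. The function below is a minimal sketch of that idea. The name sweep_conditioning_scale is hypothetical, and it assumes pipe behaves like the StableDiffusionControlNetPipeline used above: callable with image and controlnet_conditioning_scale keywords, returning an object with an images list.

```python
def sweep_conditioning_scale(pipe, prompt, control_image, scales=(0.5, 1.0, 1.5), **pipe_kwargs):
    """Generate one image per conditioning scale and return {scale: image}.

    Extra keyword arguments (for example num_inference_steps) are passed
    through to the pipeline unchanged.
    """
    results = {}
    for scale in scales:
        output = pipe(
            prompt,
            image=control_image,
            controlnet_conditioning_scale=scale,
            **pipe_kwargs,
        )
        # Keep only the first image of each call.
        results[scale] = output.images[0]
    return results
```

For a fair side-by-side comparison you would likely also want to fix the random seed for every call, which this sketch leaves out.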

Preprocessing control images

Raw photos need preprocessing before ControlNet can use them. The controlnet_aux library handles this:

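from controlnet_aux import CannyDetector, OpenposeDetector
from diffusers.utils import load_image

# Source photo to extract structure from (placeholder path)
source_image = load_image("my_photo.png")

# Canny edge map; lower thresholds keep more, fainter edges
canny = CannyDetector()
edges = canny(source_image, low_threshold=100, high_threshold=200)

# OpenPose skeleton; the annotator weights are downloaded from the Hub
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose = openpose(source_image)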

Common misconception

ControlNet does not modify or edit your source photo. It extracts structural information (edges, depth, pose) and uses that as a blueprint for generating an entirely new image. The output will match the structure but can look completely different in style, color, and content.

Multi-ControlNet

You can combine multiple control types in a single generation. For example, a depth map can fix the spatial layout while a pose image fixes the characters' poses:

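from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch

# One ControlNet per control type, in the same order as the control images
depth_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pose_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
controlnets = [depth_controlnet, pose_controlnet]
# Preprocessed control images (placeholder paths)
control_images = [load_image("my_depth_map.png"), load_image("my_pose.png")]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a dancer on a rooftop at dusk"  # example prompt
result = pipe(
    prompt,
    image=control_images,
    controlnet_conditioning_scale=[0.8, 1.0],  # depth, then pose
).images[0]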
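
With Multi-ControlNet, the models, control images, and scales are matched by position, so a mismatched list is an easy mistake to make. The check below is a small sketch of a guard you could run before calling the pipeline. The function name is hypothetical and not part of diffusers. Expanding a single number to every ControlNet mirrors how the pipeline treats a scalar scale, as I understand it.

```python
def check_multi_controlnet_inputs(controlnets, control_images, scales):
    """Validate positional alignment and return one scale per ControlNet."""
    count = len(controlnets)
    if len(control_images) != count:
        raise ValueError(
            f"Got {count} ControlNets but {len(control_images)} control images"
        )
    # A single number applies the same scale to every ControlNet.
    if isinstance(scales, (int, float)):
        return [float(scales)] * count
    scales = list(scales)
    if len(scales) != count:
        raise ValueError(f"Got {count} ControlNets but {len(scales)} scales")
    return [float(scale) for scale in scales]
```

Calling it with two ControlNets, two control images, and [0.8, 1.0] returns the scales unchanged, while passing three scales raises a ValueError before any GPU time is spent.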

One thing to remember: ControlNet bridges the gap between “describe what you want” and “show where you want it” by injecting spatial structure from edge maps, depth maps, or poses into the diffusion process, giving you precise compositional control.
