YOLO Object Detection in Python — Deep Dive

Train, optimize, and deploy custom YOLO models in Python — from dataset prep through TensorRT export and edge inference.

Shipping a YOLO model to production requires more than calling model.predict(). This guide covers the full lifecycle: labeling strategy, training configuration, architecture internals, export optimization, and monitoring in deployment.

Architecture internals (YOLOv8)

YOLOv8 is composed of three blocks:

Backbone — CSPDarknet

The backbone extracts features at multiple scales using Cross-Stage Partial (CSP) blocks. CSP splits the feature map into two paths: one passes through a stack of bottleneck layers, the other skips directly to the output. Concatenating both paths preserves gradient flow while cutting computation; the CSPNet authors report roughly 20% less computation than the equivalent backbone without the split.

Neck — PANet with C2f modules

The Path Aggregation Network (PANet) fuses features from different backbone scales. Low-resolution, semantically rich features merge with high-resolution, spatially precise ones via top-down and bottom-up pathways. YOLOv8 replaces the older C3 CSP bottleneck used in YOLOv5 with C2f (a faster CSP bottleneck with two convolutions), which concatenates the output of every inner bottleneck for better gradient flow.

Head — Decoupled anchor-free

YOLOv8 abandons anchor boxes entirely. The detection head splits into two parallel branches: one predicts bounding box regression (4 values per prediction), the other predicts class probabilities. This decoupling improves convergence because localization and classification gradients no longer interfere in a shared output layer.

The regression branch outputs distances from the grid cell center to the four box edges (left, top, right, bottom), using Distribution Focal Loss to model each distance as a probability distribution over discrete bins rather than a single value.
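To make the distance-as-distribution idea concrete, here is a minimal sketch in plain Python of how one box could be decoded. It assumes 16 bins per edge (the `reg_max` default in the Ultralytics code) and distances expressed in grid-cell units. The names `decode_distance`, `ltrb_to_xyxy`, and `one_hot_logits` are illustrative and not part of any library.

```python
import math

REG_MAX = 16  # assumed number of discrete bins per box edge


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_distance(bin_logits):
    """Expected distance: sum of (bin index * bin probability)."""
    probs = softmax(bin_logits)
    return sum(i * p for i, p in enumerate(probs))


def ltrb_to_xyxy(anchor_x, anchor_y, ltrb, stride):
    """Turn (left, top, right, bottom) distances in grid units into pixel corners."""
    left, top, right, bottom = ltrb
    return (
        (anchor_x - left) * stride,
        (anchor_y - top) * stride,
        (anchor_x + right) * stride,
        (anchor_y + bottom) * stride,
    )


def one_hot_logits(bin_index, sharpness=50.0):
    """Hypothetical helper: logits sharply peaked on a single bin."""
    return [sharpness if i == bin_index else 0.0 for i in range(REG_MAX)]


# Four edges, each peaked on one bin, for an anchor at grid position (10.5, 6.5).
edge_bins = [2, 3, 4, 1]
distances = [decode_distance(one_hot_logits(b)) for b in edge_bins]
box = ltrb_to_xyxy(10.5, 6.5, distances, stride=8)
```

When the logits are spread over neighboring bins instead of peaked on one, the expected value lands between bin indices, which is how an ambiguous edge ends up with a fractional distance.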

Dataset preparation

Labeling workflow

Use CVAT, Label Studio, or Roboflow for annotation. Export in YOLO format:

# labels/train/000001.txt
# class x_center y_center width height (all normalized 0-1)
0 0.4531 0.6200 0.1250 0.3400
2 0.7800 0.3100 0.0900 0.1800
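Because the coordinates are normalized, the same label line maps to different pixel boxes depending on image size. The following sketch converts one line to pixel corners; `yolo_to_pixel_box` is a hypothetical helper name and the 1280x720 image size is an assumed example.

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Parse 'class xc yc w h' (normalized) into (class_id, x1, y1, x2, y2) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    x1 = (xc - w / 2) * img_w
    y1 = (yc - h / 2) * img_h
    x2 = (xc + w / 2) * img_w
    y2 = (yc + h / 2) * img_h
    return class_id, round(x1), round(y1), round(x2), round(y2)


# First label line from the example above, on an assumed 1280x720 image.
box = yolo_to_pixel_box("0 0.4531 0.6200 0.1250 0.3400", img_w=1280, img_h=720)
print(box)  # (0, 500, 324, 660, 569)
```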

Dataset YAML

path: /data/project
train: images/train
val: images/val
test: images/test

names:
  0: hardhat
  1: vest
  2: person

Data quality checklist

Class balance: If one class has 10× more instances, use class-weighted loss or oversample the minority.
Small objects: Ensure your training resolution is high enough. YOLOv8 defaults to 640px; bumping to 1280 helps for tiny objects but roughly quadruples activation memory and compute per image, because the pixel count grows fourfold.
Negative samples: Include images with no objects to reduce false positives.
Label consistency: Audit with scripts — overlapping boxes, zero-area boxes, and out-of-bounds coordinates break training silently.
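Part of this checklist can be automated. The sketch below audits the lines of a single label file for malformed rows, zero-area boxes, and out-of-bounds coordinates, and tallies instances per class so imbalance is visible. The function name and issue strings are illustrative; overlap checks are left out for brevity.

```python
from collections import Counter


def audit_label_lines(lines, num_classes, tol=1e-6):
    """Return (issues, class_counts) for the lines of one YOLO label file."""
    issues = []
    class_counts = Counter()
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(parts) != 5:
            issues.append((lineno, "expected 5 fields"))
            continue
        try:
            class_id = int(parts[0])
            xc, yc, w, h = (float(v) for v in parts[1:])
        except ValueError:
            issues.append((lineno, "non-numeric field"))
            continue
        if not 0 <= class_id < num_classes:
            issues.append((lineno, "unknown class id"))
        if w <= 0 or h <= 0:
            issues.append((lineno, "zero-area box"))
        # A box is out of bounds if any edge falls outside the [0, 1] range.
        if (xc - w / 2 < -tol or yc - h / 2 < -tol
                or xc + w / 2 > 1 + tol or yc + h / 2 > 1 + tol):
            issues.append((lineno, "box outside image"))
        class_counts[class_id] += 1
    return issues, class_counts


sample = [
    "0 0.4531 0.6200 0.1250 0.3400",
    "2 0.7800 0.3100 0.0900 0.1800",
    "1 0.5000 0.5000 0.0000 0.2000",  # zero width
    "0 0.9800 0.5000 0.1000 0.2000",  # spills past the right edge
]
issues, counts = audit_label_lines(sample, num_classes=3)
```

Running this over every file in labels/train and summing the counters gives a quick per-class histogram before any training time is spent.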

Training configuration

from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # medium backbone

results = model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
    lrf=0.01,        # final LR = lr0 * lrf
    warmup_epochs=3,
    mosaic=1.0,       # mosaic augmentation probability
    mixup=0.1,
    close_mosaic=10,  # disable mosaic for last 10 epochs
    device="0",
)

Key training decisions

Image size: Larger sizes improve small-object detection at the cost of memory and speed. 640 is the sweet spot for most use cases; 1280 for aerial or medical imagery.

Mosaic augmentation: Combines four images into one, forcing the model to see objects at unusual scales and positions. Disabling it for the last 10 epochs lets the model fine-tune on clean, unmodified images.

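Learning rate schedule: By default, Ultralytics decays the learning rate linearly from lr0 down to lr0 * lrf; pass cos_lr=True to use cosine annealing instead. Warmup ramps the rate up over the first warmup_epochs, which prevents early gradient spikes.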
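The decay from lr0 to lr0 * lrf can follow a straight line or a cosine curve. The sketch below computes both shapes in plain Python and ignores warmup. The function names are illustrative, and the formulas approximate the scheduler's behavior rather than reproduce library code.

```python
import math


def linear_factor(epoch, epochs, lrf):
    """Multiplier that falls in a straight line from 1.0 to lrf."""
    return max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf


def cosine_factor(epoch, epochs, lrf):
    """Multiplier that follows half a cosine wave from 1.0 to lrf."""
    return ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1.0) + 1.0


def learning_rate(epoch, epochs=100, lr0=0.01, lrf=0.01, cosine=False):
    """Learning rate at a given epoch for the chosen decay shape."""
    factor = cosine_factor if cosine else linear_factor
    return lr0 * factor(epoch, epochs, lrf)


# Compare both shapes at the start, a quarter of the way in, and the end.
for epoch in (0, 25, 100):
    print(epoch, learning_rate(epoch), learning_rate(epoch, cosine=True))
```

Both curves start at lr0 and end at lr0 * lrf; the cosine shape holds the rate higher early on and drops faster late in training.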

Early stopping: Setting patience=20 stops training once the validation metric has gone 20 epochs without improving, which limits overfitting and saves compute.
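The patience rule is simple enough to sketch in a few lines. This is an illustrative stand-in rather than the trainer's implementation; the `EarlyStopper` class and the validation scores are hypothetical.

```python
class EarlyStopper:
    """Stop when the tracked metric has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = 0

    def should_stop(self, epoch, score):
        # Only a strict improvement resets the patience window.
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience


# Hypothetical validation mAP: improves for five epochs, then plateaus.
val_map = [0.30, 0.41, 0.48, 0.52, 0.55] + [0.55] * 40
stopper = EarlyStopper(patience=20)
stopped_at = None
for epoch, score in enumerate(val_map):
    if stopper.should_stop(epoch, score):
        stopped_at = epoch
        break

print(stopped_at)  # 24: twenty epochs after the best epoch (4)
```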

Loss function breakdown

YOLOv8 combines three losses:

Box loss (CIoU): Complete IoU considers overlap, center distance, and aspect ratio, so it still provides a useful gradient when boxes barely overlap or do not overlap at all, unlike a plain IoU loss.
Classification loss (BCE): Binary cross-entropy per class allows multi-label predictions.
Distribution Focal Loss (DFL): Treats box regression as a distribution over discrete bins, improving localization of ambiguous edges.

The total loss is a weighted sum. With the Ultralytics default gains, box loss is weighted highest (box=7.5), followed by DFL (dfl=1.5) and classification (cls=0.5).
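As a worked illustration of the box term and the weighted sum, the sketch below computes CIoU for two boxes in plain Python and combines three scalar losses with the gains quoted above. It follows the standard CIoU definition; the function names are illustrative, and real training code operates on batched tensors.

```python
import math


def ciou(box_a, box_b, eps=1e-7):
    """Complete IoU between two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1

    # Plain IoU: intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter + eps
    iou = inter / union

    # Squared center distance over the squared diagonal of the enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    diag2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4

    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / diag2 - alpha * v


def total_loss(box_loss, cls_loss, dfl_loss, w_box=7.5, w_cls=0.5, w_dfl=1.5):
    """Weighted sum of the three loss terms."""
    return w_box * box_loss + w_cls * cls_loss + w_dfl * dfl_loss


pred = (50.0, 50.0, 150.0, 150.0)
target = (60.0, 60.0, 160.0, 160.0)
box_loss = 1.0 - ciou(pred, target)
```

For these two boxes the plain IoU is about 0.68 and the center-distance penalty is about 0.008, so the box loss comes out near 0.33.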

Export and optimization

ONNX export

model.export(format="onnx", imgsz=640, simplify=True, opset=17)

TensorRT for NVIDIA GPUs

model.export(format="engine", imgsz=640, half=True, device=0)

FP16 TensorRT engines typically achieve 2–3× speedup over PyTorch FP32 with negligible accuracy loss. INT8 quantization needs a calibration dataset:

model.export(format="engine", imgsz=640, int8=True, data="dataset.yaml")

Edge deployment

Target	Format	Typical FPS
NVIDIA Jetson Orin	TensorRT FP16	90–120
Raspberry Pi 5	NCNN INT8	8–15
iPhone 15	CoreML FP16	30–45
Browser	TF.js	10–20

Inference pipeline in production

import cv2
from ultralytics import YOLO

model = YOLO("best.engine")  # TensorRT engine
cap = cv2.VideoCapture("rtsp://camera-feed")

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame, conf=0.4, iou=0.45, verbose=False)

    for box in results[0].boxes:
        x1, y1, x2, y2 = box.xyxy[0].int().tolist()
        label = model.names[int(box.cls)]
        conf = float(box.conf)  # box.conf is a 1-element tensor; convert before formatting
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {conf:.2f}",
                    (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

cap.release()  # free the stream once the loop ends
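The conf and iou arguments in the call above control confidence filtering and non-maximum suppression (NMS). The library runs NMS internally; the pure-Python sketch below only illustrates what the two thresholds do. It is class-agnostic for brevity, and the detections are made-up values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thres=0.4, iou_thres=0.45):
    """Greedy NMS over (box, score) pairs; returns the kept pairs, best first."""
    candidates = sorted(
        (d for d in detections if d[1] >= conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) <= iou_thres for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((100, 100, 200, 200), 0.90),
    ((105, 105, 205, 205), 0.80),  # near-duplicate of the first box (IoU about 0.82)
    ((150, 100, 250, 200), 0.70),  # neighbor overlapping the first box (IoU about 0.33)
    ((400, 400, 450, 450), 0.30),  # below the confidence threshold
]
strict = nms(detections, iou_thres=0.3)
loose = nms(detections, iou_thres=0.7)
```

With iou_thres=0.3 the overlapping neighbor is suppressed along with the duplicate, leaving one box; at 0.7 the neighbor survives and two boxes remain.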

Tracking integration

For video, pair YOLO with a tracker to maintain object identities across frames:

results = model.track(frame, persist=True, tracker="bytetrack.yaml")
for box in results[0].boxes:
    if box.id is None:  # no track ID assigned to this detection yet
        continue
    track_id = int(box.id)  # consistent ID across frames

Monitoring and drift detection

Track these metrics in production:

Inference latency (p95): alert if it exceeds the per-frame budget implied by your target FPS (about 33 ms at 30 FPS).
Detection count per frame: sudden drops may indicate model degradation or camera issues.
Confidence distribution: a shift toward lower confidence suggests domain drift.
Class distribution over time: seasonal changes (e.g., snow covering objects) change what the model sees.

Log predictions to a time-series database. Run weekly evaluations against a small labeled canary set to catch accuracy regression before users notice.
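Two of these checks need nothing beyond the standard library. The sketch below flags a p95 latency above the frame budget and a drop in mean confidence against a baseline window. The helper names, the 0.10 drift threshold, and all sample numbers are assumptions for illustration; a mean shift is a crude drift signal, and a proper distribution test may suit production better.

```python
import statistics


def p95(values):
    """95th percentile via statistics.quantiles (needs at least two samples)."""
    return statistics.quantiles(values, n=100)[94]


def latency_alert(latencies_ms, target_fps):
    """True if p95 latency exceeds the per-frame budget for the target FPS."""
    budget_ms = 1000.0 / target_fps
    return p95(latencies_ms) > budget_ms


def confidence_shift(baseline, recent):
    """Drop in mean confidence from baseline to the recent window (positive = lower)."""
    return statistics.mean(baseline) - statistics.mean(recent)


# Hypothetical numbers for illustration only.
latencies_ms = [20.0] * 90 + [45.0] * 10
baseline_conf = [0.82, 0.79, 0.88, 0.91, 0.85]
recent_conf = [0.61, 0.58, 0.72, 0.66, 0.63]

slow = latency_alert(latencies_ms, target_fps=30)  # budget is about 33.3 ms
drifted = confidence_shift(baseline_conf, recent_conf) > 0.10
```

Here the slowest tenth of frames pushes p95 to 45 ms, above the 30 FPS budget, even though the mean latency is only 22.5 ms, which is why the percentile is the metric to alert on.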

Common pitfalls

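Training on resized images but deploying at native resolution. YOLO letterboxes every input to imgsz at inference, so a camera with a different resolution or aspect ratio changes both the apparent object scale and the padding the model sees. Always validate with actual deployment images.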
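Ignoring NMS tuning. The Ultralytics default IoU threshold is 0.7. Lowering it (0.3–0.5) suppresses more overlapping boxes, which removes duplicate detections but can also delete true neighbors in dense scenes (packed shelves, crowds). Tune it on validation images from the target scene instead of trusting the default.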
Over-augmenting. Heavy mosaic + mixup + rotation can confuse the model if your real data is clean and consistent.
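The resolution pitfall is easier to see with numbers. This sketch computes the scale and padding needed to fit an image into a square model input while preserving aspect ratio. It assumes padding to a full 640x640 square; depending on the backend, the library may pad only to the nearest stride multiple. The helper names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale and padding that fit a src_w x src_h image into a dst x dst square."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) / 2, (dst - new_h) / 2
    return scale, new_w, new_h, pad_x, pad_y


def to_source_coords(x, y, scale, pad_x, pad_y):
    """Map a point from the letterboxed image back to the original image."""
    return (x - pad_x) / scale, (y - pad_y) / scale


landscape = letterbox_params(1280, 720)   # 16:9 frame: padded top and bottom
portrait = letterbox_params(1080, 1920)   # 9:16 frame: padded left and right
```

A 100-pixel-wide object shrinks to 50 pixels in the landscape case but to about 33 pixels in the portrait case, so a model validated only on one camera format can lose small objects on the other.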

The one thing to remember: YOLO’s speed comes from its single-pass architecture, but production success depends on dataset quality, export optimization for your target hardware, and continuous monitoring for drift.
