Python Computer Vision for Autonomous Vehicles — Deep Dive

Build autonomous vehicle perception with Python: YOLOv8 detection, BEV transforms, monocular depth networks, multi-object tracking, and end-to-end learned driving models.

Camera Calibration and Geometry

Before any perception algorithm runs, cameras must be calibrated. Calibration estimates the intrinsic parameters (focal length, principal point, distortion coefficients) that relate 3D world points to 2D image pixels.

import cv2
import numpy as np

def calibrate_camera(images, pattern_size=(9, 6), square_size=0.025):
    """Calibrate using a checkerboard pattern."""
    obj_points = []  # 3D points in world coordinates
    img_points = []  # 2D points in image coordinates

    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
    objp *= square_size

    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        ret, corners = cv2.findChessboardCorners(gray, pattern_size)
        if ret:
            corners_refined = cv2.cornerSubPix(
                gray, corners, (11, 11), (-1, -1),
                (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
            )
            obj_points.append(objp)
            img_points.append(corners_refined)

    ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None
    )
    return K, dist  # K: 3x3 intrinsic matrix, dist: distortion coefficients

The intrinsic matrix K maps a 3D point [X, Y, Z] to a pixel [u, v]:

[u]     [fx  0  cx] [X]
[v]  =  [0  fy  cy] [Y] / Z
[1]     [0   0   1] [Z]

Object Detection with YOLOv8

YOLOv8 provides a clean Python API for training and inference:

from ultralytics import YOLO

# Load pretrained model (COCO classes include car, person, bicycle, etc.)
model = YOLO('yolov8m.pt')  # medium variant, good speed/accuracy balance

# Inference on a driving frame
results = model('driving_frame.jpg', conf=0.25)

for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        confidence = box.conf[0].item()
        class_id = int(box.cls[0].item())
        class_name = model.names[class_id]
        print(f"{class_name}: ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}) conf={confidence:.2f}")

Training on Driving-Specific Data

COCO pretrained models miss driving-specific classes like traffic cones, construction barriers, and emergency vehicles. Fine-tuning on driving datasets:

# Fine-tune on a custom driving dataset (YOLO format)
model = YOLO('yolov8m.pt')
model.train(
    data='driving_dataset.yaml',
    epochs=100,
    imgsz=1280,      # higher resolution for distant objects
    batch=8,
    augment=True,
    mosaic=1.0,       # mosaic augmentation (combines 4 images)
    mixup=0.1,
    hsv_h=0.015,
    hsv_s=0.7,
    hsv_v=0.4,
)

The driving dataset YAML defines classes:

names:
  0: car
  1: truck
  2: bus
  3: pedestrian
  4: cyclist
  5: motorcycle
  6: traffic_sign
  7: traffic_light
  8: cone

Bird’s Eye View (BEV) Transformation

Autonomous vehicles increasingly use Bird’s Eye View representations — projecting camera images onto a top-down map. This makes planning easier because distances and sizes are consistent regardless of camera perspective.

Inverse Perspective Mapping (IPM)

For flat ground, a homography transforms the front camera view to top-down:

def compute_bev_homography(K, extrinsic, ground_height=0):
    """Compute homography for flat-ground BEV projection."""
    R = extrinsic[:3, :3]
    t = extrinsic[:3, 3]

    # Ground plane normal in camera frame
    n = R @ np.array([0, 0, 1])  # world z-up
    d = -ground_height + t[2]

    H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
    return H

def apply_bev(image, H, output_size=(600, 600)):
    return cv2.warpPerspective(image, H, output_size)

Learned BEV: BEVFormer

Modern approaches use transformer networks to lift 2D image features into 3D BEV space. BEVFormer attends to image features using 3D reference points and spatial cross-attention, producing a BEV feature map that captures both geometry and semantics. Training requires 3D annotations from datasets like nuScenes.

Monocular Depth Estimation

Depth Anything V2, a foundation model for monocular depth, produces dense depth maps:

import torch
from depth_anything_v2.dpt import DepthAnythingV2

model = DepthAnythingV2(encoder='vitl', features=256, out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load('depth_anything_v2_vitl.pth'))
model.eval()

# Predict relative depth
image = cv2.imread('frame.jpg')
depth = model.infer_image(image)  # returns HxW float array

# For metric depth, scale using known object sizes or camera height
camera_height = 1.5  # meters above ground
ground_depth_px = depth[ground_row, :]  # depth values at ground level
scale = camera_height / np.median(ground_depth_px)
metric_depth = depth * scale

Stereo Depth with OpenCV

When two cameras are available, geometric depth is more reliable:

def compute_stereo_depth(left_img, right_img, K, baseline):
    """Semi-global block matching for stereo depth."""
    stereo = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,
        blockSize=5,
        P1=8 * 3 * 5**2,
        P2=32 * 3 * 5**2,
        disp12MaxDiff=1,
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=32
    )

    gray_l = cv2.cvtColor(left_img, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(right_img, cv2.COLOR_BGR2GRAY)

    disparity = stereo.compute(gray_l, gray_r).astype(np.float32) / 16.0

    # Depth = focal_length * baseline / disparity
    focal = K[0, 0]
    depth = np.where(disparity > 0, focal * baseline / disparity, 0)
    return depth

Multi-Object Tracking: ByteTrack

ByteTrack associates detections across frames using both high and low confidence detections:

class SimpleTracker:
    """Simplified ByteTrack-style tracker."""
    def __init__(self, max_age=30):
        self.tracks = {}
        self.next_id = 0
        self.max_age = max_age

    def update(self, detections):
        """detections: list of (x1, y1, x2, y2, confidence, class_id)"""
        # Split into high and low confidence
        high_dets = [d for d in detections if d[4] > 0.5]
        low_dets = [d for d in detections if 0.1 < d[4] <= 0.5]

        # Predict track positions using simple velocity model
        for track_id, track in self.tracks.items():
            track['predicted'] = track['bbox'] + track.get('velocity', np.zeros(4))
            track['age'] += 1

        # Match high-confidence detections to tracks (IoU matching)
        matched, unmatched_tracks, unmatched_dets = self._match(
            high_dets, list(self.tracks.keys()), threshold=0.3
        )

        # Update matched tracks
        for det_idx, track_id in matched:
            self._update_track(track_id, high_dets[det_idx])

        # Try matching low-confidence detections to remaining tracks
        if low_dets and unmatched_tracks:
            matched_low, still_unmatched, _ = self._match(
                low_dets, unmatched_tracks, threshold=0.5
            )
            for det_idx, track_id in matched_low:
                self._update_track(track_id, low_dets[det_idx])
            unmatched_tracks = still_unmatched

        # Create new tracks from unmatched high-confidence detections
        for det_idx in unmatched_dets:
            self._create_track(high_dets[det_idx])

        # Remove stale tracks
        self.tracks = {
            k: v for k, v in self.tracks.items() if v['age'] < self.max_age
        }

        return {k: v['bbox'] for k, v in self.tracks.items()}

    def _update_track(self, track_id, detection):
        old_bbox = self.tracks[track_id]['bbox']
        new_bbox = np.array(detection[:4])
        self.tracks[track_id]['velocity'] = new_bbox - old_bbox
        self.tracks[track_id]['bbox'] = new_bbox
        self.tracks[track_id]['age'] = 0

    def _create_track(self, detection):
        self.tracks[self.next_id] = {
            'bbox': np.array(detection[:4]),
            'velocity': np.zeros(4),
            'age': 0,
            'class_id': detection[5]
        }
        self.next_id += 1

Lane Detection with Post-Processing

Lane detection outputs curves that need converting to waypoints for the planner:

def fit_lane_polynomial(lane_points, image_height, degree=2):
    """Fit a polynomial to detected lane points."""
    ys = np.array([p[1] for p in lane_points])
    xs = np.array([p[0] for p in lane_points])

    # Fit x = f(y) because lanes are roughly vertical in image
    coeffs = np.polyfit(ys, xs, degree)
    poly = np.poly1d(coeffs)

    # Generate smooth lane curve
    y_range = np.linspace(ys.min(), image_height, 100)
    x_smooth = poly(y_range)

    return list(zip(x_smooth, y_range)), coeffs

def compute_lane_curvature(coeffs, y_eval):
    """Compute radius of curvature at a given y position."""
    # For x = ay^2 + by + c
    a, b = coeffs[0], coeffs[1]
    curvature_radius = ((1 + (2*a*y_eval + b)**2)**1.5) / abs(2*a)
    return curvature_radius

End-to-End Learned Driving

The frontier of autonomous driving research replaces the modular pipeline (detect → track → plan → control) with a single neural network that maps camera images directly to driving commands.

Tesla’s FSD uses a hybrid approach: a neural network backbone processes all cameras simultaneously, produces BEV features, and feeds into both learned and rule-based planning modules. UniAD (Unified Autonomous Driving) chains perception, prediction, and planning in a single transformer architecture.

Training end-to-end models requires massive datasets of human driving demonstrations with recorded steering, throttle, and brake. The model learns to imitate expert drivers, with additional safety constraints from simulation testing.

Inference Performance

Real-time requirements for autonomous driving at 30 FPS with 8 cameras:

Component	Model	GPU Time per Frame	Hardware
Detection	YOLOv8-M	6 ms	NVIDIA Orin
Segmentation	SegFormer-B2	12 ms	NVIDIA Orin
Depth	Depth Anything V2-S	8 ms	NVIDIA Orin
Tracking	ByteTrack	1 ms	CPU
BEV Transform	BEVFormer-tiny	35 ms	NVIDIA Orin
Total budget	—	< 33 ms	—

Models are exported to TensorRT (NVIDIA) or ONNX Runtime for inference optimization. Python handles training and offline evaluation; C++ handles real-time inference on the vehicle.

One thing to remember: Autonomous vehicle vision combines detection (YOLO), segmentation (SegFormer), depth estimation (stereo or monocular networks), and tracking (ByteTrack) into a pipeline that must process all cameras within 33 milliseconds — and Python is where these models are built, trained, and validated before deployment.

pythoncomputer-visionautonomous-vehiclesdeep-learningopencv