Python Computer Vision for Autonomous Vehicles — Deep Dive
Camera Calibration and Geometry
Before any perception algorithm runs, cameras must be calibrated. Calibration estimates the intrinsic parameters (focal length, principal point, distortion coefficients) that relate 3D world points to 2D image pixels.
import cv2
import numpy as np
def calibrate_camera(images, pattern_size=(9, 6), square_size=0.025):
"""Calibrate using a checkerboard pattern."""
obj_points = [] # 3D points in world coordinates
img_points = [] # 2D points in image coordinates
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
objp *= square_size
for img in images:
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
ret, corners = cv2.findChessboardCorners(gray, pattern_size)
if ret:
corners_refined = cv2.cornerSubPix(
gray, corners, (11, 11), (-1, -1),
(cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
)
obj_points.append(objp)
img_points.append(corners_refined)
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
obj_points, img_points, gray.shape[::-1], None, None
)
return K, dist # K: 3x3 intrinsic matrix, dist: distortion coefficients
The intrinsic matrix K maps a 3D point [X, Y, Z] to a pixel [u, v]:
[u] [fx 0 cx] [X]
[v] = [0 fy cy] [Y] / Z
[1] [0 0 1] [Z]
Object Detection with YOLOv8
YOLOv8 provides a clean Python API for training and inference:
from ultralytics import YOLO
# Load pretrained model (COCO classes include car, person, bicycle, etc.)
model = YOLO('yolov8m.pt') # medium variant, good speed/accuracy balance
# Inference on a driving frame
results = model('driving_frame.jpg', conf=0.25)
for result in results:
for box in result.boxes:
x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
confidence = box.conf[0].item()
class_id = int(box.cls[0].item())
class_name = model.names[class_id]
print(f"{class_name}: ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}) conf={confidence:.2f}")
Training on Driving-Specific Data
COCO pretrained models miss driving-specific classes like traffic cones, construction barriers, and emergency vehicles. Fine-tuning on driving datasets:
# Fine-tune on a custom driving dataset (YOLO format)
model = YOLO('yolov8m.pt')
model.train(
data='driving_dataset.yaml',
epochs=100,
imgsz=1280, # higher resolution for distant objects
batch=8,
augment=True,
mosaic=1.0, # mosaic augmentation (combines 4 images)
mixup=0.1,
hsv_h=0.015,
hsv_s=0.7,
hsv_v=0.4,
)
The driving dataset YAML defines classes:
names:
0: car
1: truck
2: bus
3: pedestrian
4: cyclist
5: motorcycle
6: traffic_sign
7: traffic_light
8: cone
Bird’s Eye View (BEV) Transformation
Autonomous vehicles increasingly use Bird’s Eye View representations — projecting camera images onto a top-down map. This makes planning easier because distances and sizes are consistent regardless of camera perspective.
Inverse Perspective Mapping (IPM)
For flat ground, a homography transforms the front camera view to top-down:
def compute_bev_homography(K, extrinsic, ground_height=0):
"""Compute homography for flat-ground BEV projection."""
R = extrinsic[:3, :3]
t = extrinsic[:3, 3]
# Ground plane normal in camera frame
n = R @ np.array([0, 0, 1]) # world z-up
d = -ground_height + t[2]
H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
return H
def apply_bev(image, H, output_size=(600, 600)):
return cv2.warpPerspective(image, H, output_size)
Learned BEV: BEVFormer
Modern approaches use transformer networks to lift 2D image features into 3D BEV space. BEVFormer attends to image features using 3D reference points and spatial cross-attention, producing a BEV feature map that captures both geometry and semantics. Training requires 3D annotations from datasets like nuScenes.
Monocular Depth Estimation
Depth Anything V2, a foundation model for monocular depth, produces dense depth maps:
import torch
from depth_anything_v2.dpt import DepthAnythingV2
model = DepthAnythingV2(encoder='vitl', features=256, out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load('depth_anything_v2_vitl.pth'))
model.eval()
# Predict relative depth
image = cv2.imread('frame.jpg')
depth = model.infer_image(image) # returns HxW float array
# For metric depth, scale using known object sizes or camera height
camera_height = 1.5 # meters above ground
ground_depth_px = depth[ground_row, :] # depth values at ground level
scale = camera_height / np.median(ground_depth_px)
metric_depth = depth * scale
Stereo Depth with OpenCV
When two cameras are available, geometric depth is more reliable:
def compute_stereo_depth(left_img, right_img, K, baseline):
"""Semi-global block matching for stereo depth."""
stereo = cv2.StereoSGBM_create(
minDisparity=0,
numDisparities=128,
blockSize=5,
P1=8 * 3 * 5**2,
P2=32 * 3 * 5**2,
disp12MaxDiff=1,
uniquenessRatio=10,
speckleWindowSize=100,
speckleRange=32
)
gray_l = cv2.cvtColor(left_img, cv2.COLOR_BGR2GRAY)
gray_r = cv2.cvtColor(right_img, cv2.COLOR_BGR2GRAY)
disparity = stereo.compute(gray_l, gray_r).astype(np.float32) / 16.0
# Depth = focal_length * baseline / disparity
focal = K[0, 0]
depth = np.where(disparity > 0, focal * baseline / disparity, 0)
return depth
Multi-Object Tracking: ByteTrack
ByteTrack associates detections across frames using both high and low confidence detections:
class SimpleTracker:
"""Simplified ByteTrack-style tracker."""
def __init__(self, max_age=30):
self.tracks = {}
self.next_id = 0
self.max_age = max_age
def update(self, detections):
"""detections: list of (x1, y1, x2, y2, confidence, class_id)"""
# Split into high and low confidence
high_dets = [d for d in detections if d[4] > 0.5]
low_dets = [d for d in detections if 0.1 < d[4] <= 0.5]
# Predict track positions using simple velocity model
for track_id, track in self.tracks.items():
track['predicted'] = track['bbox'] + track.get('velocity', np.zeros(4))
track['age'] += 1
# Match high-confidence detections to tracks (IoU matching)
matched, unmatched_tracks, unmatched_dets = self._match(
high_dets, list(self.tracks.keys()), threshold=0.3
)
# Update matched tracks
for det_idx, track_id in matched:
self._update_track(track_id, high_dets[det_idx])
# Try matching low-confidence detections to remaining tracks
if low_dets and unmatched_tracks:
matched_low, still_unmatched, _ = self._match(
low_dets, unmatched_tracks, threshold=0.5
)
for det_idx, track_id in matched_low:
self._update_track(track_id, low_dets[det_idx])
unmatched_tracks = still_unmatched
# Create new tracks from unmatched high-confidence detections
for det_idx in unmatched_dets:
self._create_track(high_dets[det_idx])
# Remove stale tracks
self.tracks = {
k: v for k, v in self.tracks.items() if v['age'] < self.max_age
}
return {k: v['bbox'] for k, v in self.tracks.items()}
def _update_track(self, track_id, detection):
old_bbox = self.tracks[track_id]['bbox']
new_bbox = np.array(detection[:4])
self.tracks[track_id]['velocity'] = new_bbox - old_bbox
self.tracks[track_id]['bbox'] = new_bbox
self.tracks[track_id]['age'] = 0
def _create_track(self, detection):
self.tracks[self.next_id] = {
'bbox': np.array(detection[:4]),
'velocity': np.zeros(4),
'age': 0,
'class_id': detection[5]
}
self.next_id += 1
Lane Detection with Post-Processing
Lane detection outputs curves that need converting to waypoints for the planner:
def fit_lane_polynomial(lane_points, image_height, degree=2):
"""Fit a polynomial to detected lane points."""
ys = np.array([p[1] for p in lane_points])
xs = np.array([p[0] for p in lane_points])
# Fit x = f(y) because lanes are roughly vertical in image
coeffs = np.polyfit(ys, xs, degree)
poly = np.poly1d(coeffs)
# Generate smooth lane curve
y_range = np.linspace(ys.min(), image_height, 100)
x_smooth = poly(y_range)
return list(zip(x_smooth, y_range)), coeffs
def compute_lane_curvature(coeffs, y_eval):
"""Compute radius of curvature at a given y position."""
# For x = ay^2 + by + c
a, b = coeffs[0], coeffs[1]
curvature_radius = ((1 + (2*a*y_eval + b)**2)**1.5) / abs(2*a)
return curvature_radius
End-to-End Learned Driving
The frontier of autonomous driving research replaces the modular pipeline (detect → track → plan → control) with a single neural network that maps camera images directly to driving commands.
Tesla’s FSD uses a hybrid approach: a neural network backbone processes all cameras simultaneously, produces BEV features, and feeds into both learned and rule-based planning modules. UniAD (Unified Autonomous Driving) chains perception, prediction, and planning in a single transformer architecture.
Training end-to-end models requires massive datasets of human driving demonstrations with recorded steering, throttle, and brake. The model learns to imitate expert drivers, with additional safety constraints from simulation testing.
Inference Performance
Real-time requirements for autonomous driving at 30 FPS with 8 cameras:
| Component | Model | GPU Time per Frame | Hardware |
|---|---|---|---|
| Detection | YOLOv8-M | 6 ms | NVIDIA Orin |
| Segmentation | SegFormer-B2 | 12 ms | NVIDIA Orin |
| Depth | Depth Anything V2-S | 8 ms | NVIDIA Orin |
| Tracking | ByteTrack | 1 ms | CPU |
| BEV Transform | BEVFormer-tiny | 35 ms | NVIDIA Orin |
| Total budget | — | < 33 ms | — |
Models are exported to TensorRT (NVIDIA) or ONNX Runtime for inference optimization. Python handles training and offline evaluation; C++ handles real-time inference on the vehicle.
One thing to remember: Autonomous vehicle vision combines detection (YOLO), segmentation (SegFormer), depth estimation (stereo or monocular networks), and tracking (ByteTrack) into a pipeline that must process all cameras within 33 milliseconds — and Python is where these models are built, trained, and validated before deployment.
See Also
- Python Behavior Trees Robotics How robots make decisions using a tree-shaped rulebook that keeps them organized, like a flowchart that tells a robot what to do in every situation.
- Python Bluetooth Ble How Python connects to fitness trackers, smart locks, and wireless sensors using the invisible radio signals all around you.
- Python Circuitpython Hardware Why CircuitPython makes wiring up LEDs, sensors, and motors as easy as plugging in a USB drive.
- Python Home Assistant Automation How Python turns your home into a smart home that reacts to you automatically, like a helpful invisible butler.
- Python Lidar Point Cloud Processing How self-driving cars use millions of laser dots to build a 3D picture of the world around them, and how Python helps make sense of it all.