Coral TPU Inference with Python — Deep Dive

Build production Coral TPU pipelines in Python with PyCoral, model compilation, multi-TPU setups, and thermal management strategies.

Setting Up the Coral Environment

Installation

# Add Coral package repository
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | \
  sudo tee /etc/apt/sources.list.d/coral-edgetpu.list

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
  sudo apt-key add -

sudo apt update
sudo apt install libedgetpu1-std python3-pycoral

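# Or install PyCoral via pip (it is not published on PyPI, so use Coral's package index)
python3 -m pip install --extra-index-url https://google-coral.github.io/py-repo/ pycoral~=2.0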

The libedgetpu1-std package provides the standard-clock runtime. The alternative, libedgetpu1-max, runs the TPU at maximum clock speed: faster inference, but more heat and higher power draw. Only one of the two packages can be installed at a time.

Model Compilation

Preparing the INT8 Model

import tensorflow as tf

# calibration_images: list of JPEG paths sampled from real deployment data
def representative_dataset():
    for image_path in calibration_images[:300]:
        img = tf.io.read_file(image_path)
        img = tf.image.decode_jpeg(img, channels=3)
        img = tf.image.resize(img, [224, 224])
        # Use the same preprocessing the model was trained with
        img = tf.cast(img, tf.float32) / 255.0
        yield [tf.expand_dims(img, 0)]

converter = tf.lite.TFLiteConverter.from_saved_model("efficientnet_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
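
Full-integer conversion attaches a scale and a zero point to each tensor, so that real_value = scale * (quantized - zero_point). The sketch below works through that mapping in plain Python. The scale and zero point are made-up illustrative values, not parameters read from a real model, and the helper names are hypothetical.

```python
def quantize(x, scale, zero_point):
    """Map a real value to a uint8 code using the affine quantization scheme."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))  # clamp to the uint8 range


def dequantize(q, scale, zero_point):
    """Recover the approximate real value from a uint8 code."""
    return scale * (q - zero_point)


# Illustrative parameters (assumed): one step is 0.25, real 0.0 maps to code 128
SCALE, ZERO_POINT = 0.25, 128

q = quantize(1.0, SCALE, ZERO_POINT)
x = dequantize(q, SCALE, ZERO_POINT)
print(q, x)
```

For a converted model, the real parameters can be read from the "quantization" entry of interpreter.get_input_details() and get_output_details(), which is useful when turning uint8 outputs back into scores.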

Running the Edge TPU Compiler

# Install the compiler
sudo apt install edgetpu-compiler

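# Compile the model (-s also prints the per-operation TPU/CPU mapping)
edgetpu_compiler -s model_quant.tflite

# Output: model_quant_edgetpu.tflite and model_quant_edgetpu.log
# The summary reports how many ops mapped to the TPU vs the CPU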

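The compiler output tells you which operations run on the TPU. Watch the summary line “Number of operations that will run on CPU”: any value above zero is a potential bottleneck, because the first unsupported operation and every operation after it execute on the host CPU.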
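
A build script can check the compiler summary automatically instead of relying on someone reading it. The following is a minimal sketch that parses the operation counts from captured compiler output. The exact wording of the summary lines is an assumption based on typical edgetpu_compiler output and may differ between compiler versions; count_ops and fully_mapped are hypothetical helper names.

```python
import re


def count_ops(compiler_output):
    """Return (tpu_ops, cpu_ops) parsed from edgetpu_compiler summary text."""
    tpu = re.search(r"Number of operations that will run on Edge TPU:\s*(\d+)", compiler_output)
    cpu = re.search(r"Number of operations that will run on CPU:\s*(\d+)", compiler_output)
    if tpu is None or cpu is None:
        raise ValueError("operation counts not found in compiler output")
    return int(tpu.group(1)), int(cpu.group(1))


def fully_mapped(compiler_output):
    """True only when at least one op runs on the TPU and none fall back to the CPU."""
    tpu_ops, cpu_ops = count_ops(compiler_output)
    return tpu_ops > 0 and cpu_ops == 0


# Sample text in the assumed summary format
sample = """Number of operations that will run on Edge TPU: 64
Number of operations that will run on CPU: 1
"""
print(count_ops(sample), fully_mapped(sample))
```

In a CI job, the same function could be fed the contents of the .log file the compiler writes next to the compiled model, failing the build when fully_mapped returns False.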

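Segmenting a Model for Pipelining

For models whose parameters do not fit in a single Edge TPU’s roughly 8 MB of on-chip SRAM, the compiler can split the model into segments that run on separate TPUs. (This is distinct from co-compilation, which lets several models share one TPU’s parameter cache.)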

edgetpu_compiler model_quant.tflite --num_segments=2
# Produces: model_quant_segment_0_of_2_edgetpu.tflite
#           model_quant_segment_1_of_2_edgetpu.tflite

PyCoral API: Classification

from pycoral.adapters import classify
from pycoral.adapters import common
from pycoral.utils.dataset import read_label_file
from pycoral.utils.edgetpu import make_interpreter
from PIL import Image

# Create interpreter with Edge TPU delegate
interpreter = make_interpreter("model_quant_edgetpu.tflite")
interpreter.allocate_tensors()

# Load labels
labels = read_label_file("labels.txt")

# Prepare input
image = Image.open("test_image.jpg")
size = common.input_size(interpreter)
image = image.resize(size, Image.LANCZOS)

# Run inference
common.set_input(interpreter, image)
interpreter.invoke()

# Get top-5 results
classes = classify.get_classes(interpreter, top_k=5, score_threshold=0.1)

for c in classes:
    print(f"{labels.get(c.id, 'unknown')}: {c.score:.4f}")

PyCoral API: Object Detection

from pycoral.adapters import detect
from pycoral.adapters import common
from pycoral.utils.edgetpu import make_interpreter
from PIL import Image, ImageDraw

interpreter = make_interpreter("ssd_mobilenet_v2_edgetpu.tflite")
interpreter.allocate_tensors()

image = Image.open("street_scene.jpg")
_, scale = common.set_resized_input(
    interpreter,
    image.size,
    lambda size: image.resize(size, Image.LANCZOS)
)

interpreter.invoke()

objects = detect.get_objects(interpreter, score_threshold=0.4, image_scale=scale)

draw = ImageDraw.Draw(image)
for obj in objects:
    bbox = obj.bbox
    draw.rectangle(
        [(bbox.xmin, bbox.ymin), (bbox.xmax, bbox.ymax)],
        outline="red",
        width=2
    )
    draw.text((bbox.xmin, bbox.ymin - 15), f"{obj.id}: {obj.score:.2f}")

image.save("detections.jpg")

Multi-TPU Pipeline

When connecting multiple USB Accelerators, each needs explicit device assignment:

from pycoral.utils.edgetpu import list_edge_tpus, make_interpreter

# List connected TPUs
tpus = list_edge_tpus()
print(f"Found {len(tpus)} Edge TPU(s): {tpus}")

# Assign models to specific TPUs
interpreter_classify = make_interpreter(
    "classifier_edgetpu.tflite",
    device="usb:0"
)
interpreter_detect = make_interpreter(
    "detector_edgetpu.tflite",
    device="usb:1"
)

interpreter_classify.allocate_tensors()
interpreter_detect.allocate_tensors()
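
Hard-coding "usb:0" and "usb:1" breaks as soon as a device is missing or a PCIe module is added. The sketch below derives device strings from the list that list_edge_tpus() returns, where each entry is a dict with a "type" and a "path". It assumes the enumeration order matches the per-type index that make_interpreter expects, which should be verified on the target system; device_strings and assign_models are hypothetical helpers.

```python
from collections import defaultdict


def device_strings(tpus):
    """Turn list_edge_tpus()-style dicts into strings such as 'usb:0' or 'pci:0'."""
    counters = defaultdict(int)
    names = []
    for tpu in tpus:
        kind = tpu["type"]
        names.append(f"{kind}:{counters[kind]}")  # index counts devices of the same type
        counters[kind] += 1
    return names


def assign_models(model_paths, tpus):
    """Pair each model with its own device, failing early if there are too few TPUs."""
    devices = device_strings(tpus)
    if len(model_paths) > len(devices):
        raise RuntimeError(
            f"need {len(model_paths)} Edge TPUs but found {len(devices)}"
        )
    return dict(zip(model_paths, devices))


# Example with a hand-written device list (paths are placeholders)
found = [{"type": "usb", "path": "/sys/bus/usb/devices/1-1"},
         {"type": "usb", "path": "/sys/bus/usb/devices/1-2"}]
plan = assign_models(["classifier_edgetpu.tflite", "detector_edgetpu.tflite"], found)
print(plan)
```

Each entry of the resulting mapping can then be passed to make_interpreter(model_path, device=device).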

Pipelined Model Across TPUs

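For segmented models, each segment runs on a different TPU:

from pycoral.pipeline.pipelined_model_runner import PipelinedModelRunner
from pycoral.utils.edgetpu import make_interpreter

num_segments = 2  # must match the --num_segments value used at compile time

interpreters = []
for i in range(num_segments):
    interp = make_interpreter(
        f"model_quant_segment_{i}_of_{num_segments}_edgetpu.tflite",
        device=f"usb:{i}"
    )
    interp.allocate_tensors()
    interpreters.append(interp)

runner = PipelinedModelRunner(interpreters)

# Push input keyed by the first segment's input tensor name; output arrives asynchronously.
# input_data is a NumPy array matching that tensor's shape and dtype.
input_name = interpreters[0].get_input_details()[0]["name"]
runner.push({input_name: input_data})
output = runner.pop()  # dict mapping output tensor names to arrays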
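
A single push followed by a pop does not keep all segments busy. For throughput, inputs are usually pushed from one thread while results are popped from another. The sketch below shows that pattern. It relies on the documented runner behavior that pushing an empty dict signals the end of input and that pop() then returns None; run_pipeline is a hypothetical helper, and it accepts any object with compatible push and pop methods.

```python
import threading


def run_pipeline(runner, inputs):
    """Feed every input dict through a PipelinedModelRunner-like object and collect outputs."""

    def producer():
        for tensors in inputs:
            runner.push(tensors)
        runner.push({})  # empty dict signals that no more inputs will follow

    thread = threading.Thread(target=producer)
    thread.start()

    results = []
    while True:
        out = runner.pop()
        if out is None:  # None marks the end of the output stream
            break
        results.append(out)

    thread.join()
    return results
```

Results come back in the order the inputs were pushed, so they can be zipped with the original frames or file names afterwards.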

Thermal Management

The Edge TPU generates significant heat during sustained inference. Without cooling:

0-30 seconds: Full performance (~4 TOPS)
30-60 seconds: Throttles to ~3 TOPS
60+ seconds: Stabilizes around 2-2.5 TOPS

Monitoring Temperature

from pathlib import Path

def get_tpu_temp(path="/sys/class/apex/apex_0/temp"):
    """Read Edge TPU temperature in degrees Celsius (Dev Board and PCIe modules only).

    The Apex driver exposes the Edge TPU sensor at the default path. Note that
    /sys/class/thermal/thermal_zone0/temp reports the host SoC, not the TPU.
    """
    return int(Path(path).read_text().strip()) / 1000  # Convert millidegrees to degrees

# For USB Accelerator, monitor inference latency as a proxy
import time

def benchmark_with_thermal_awareness(interpreter, input_data, duration_sec=120):
    input_index = interpreter.get_input_details()[0]["index"]
    latencies = []
    start_time = time.time()

    while time.time() - start_time < duration_sec:
        t0 = time.perf_counter()
        interpreter.set_tensor(input_index, input_data)
        interpreter.invoke()
        latencies.append(time.perf_counter() - t0)

    # Detect throttling: compare first 10% vs last 10%
    k = max(1, len(latencies) // 10)  # same window size for both; guards short runs
    early_avg = sum(latencies[:k]) / k
    late_avg = sum(latencies[-k:]) / k

    print(f"Early avg: {early_avg*1000:.2f}ms")
    print(f"Late avg:  {late_avg*1000:.2f}ms")
    print(f"Throttle factor: {late_avg/early_avg:.2f}x")

Mitigation Strategies

Duty cycling — run inference in bursts with cooldown periods
Heat sinks — attach aluminum or copper heat sinks to the USB Accelerator
Active cooling — small fan pointed at the device (most effective)
libedgetpu1-std vs libedgetpu1-max — use standard clock for sustained workloads
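
Duty cycling is the easiest of these to apply in software. The following is a minimal sketch: it runs inference in fixed-size bursts and sleeps between them. The burst size and cooldown are illustrative defaults, not measured values, and should be tuned against the latency benchmark on the actual hardware; run_duty_cycled and its parameters are hypothetical.

```python
import time


def run_duty_cycled(infer, inputs, burst_size=50, cooldown_sec=2.0, sleep=time.sleep):
    """Call infer(item) for each input, pausing after every burst_size calls."""
    items = list(inputs)
    results = []
    for i, item in enumerate(items, start=1):
        results.append(infer(item))
        # Pause between bursts, but not after the final item
        if i % burst_size == 0 and i < len(items):
            sleep(cooldown_sec)
    return results
```

Here infer would typically wrap set_input, invoke, and result extraction for one frame. The sleep parameter exists so the pause can be replaced in tests or swapped for an event-loop friendly wait.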

Transfer Learning on Edge TPU

Coral supports on-device transfer learning via the ImprintingEngine:

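from PIL import Image
from pycoral.adapters import classify, common
from pycoral.learn.imprinting.engine import ImprintingEngine
from pycoral.utils.edgetpu import make_interpreter

# Base model must be a specially prepared "imprinting" model (one with an L2-norm layer)
engine = ImprintingEngine("mobilenet_v1_1.0_224_l2norm_quant_edgetpu.tflite", keep_classes=False)

# The engine exposes the embedding extractor, which runs on the Edge TPU
extractor = make_interpreter(engine.serialize_extractor_model())
extractor.allocate_tensors()
size = common.input_size(extractor)

# Train new classes from a few examples
# training_data maps integer class IDs (starting at 0) to lists of image paths
for class_id, image_paths in training_data.items():
    for path in image_paths:
        img = Image.open(path).convert("RGB").resize(size, Image.LANCZOS)
        common.set_input(extractor, img)
        extractor.invoke()
        embedding = classify.get_scores(extractor)
        engine.train(embedding, class_id=class_id)

# Save the retrained model
with open("retrained_edgetpu.tflite", "wb") as f:
    f.write(engine.serialize_model())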

This doesn’t run backpropagation. It extracts L2-normalized feature embeddings and imprints each class’s averaged embedding as the weights of the final layer, which behaves like a nearest-class-mean classifier. It works with as few as 1-5 images per class.
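
To make the nearest-class-mean idea concrete, here is a small pure-Python sketch. It is not the ImprintingEngine implementation; it only illustrates the principle using hand-written 2-D vectors in place of real embeddings, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def class_means(examples):
    """examples: {class_id: [embedding, ...]} -> {class_id: normalized mean embedding}."""
    means = {}
    for class_id, embeddings in examples.items():
        normalized = [l2_normalize(e) for e in embeddings]
        dim = len(normalized[0])
        mean = [sum(e[d] for e in normalized) / len(normalized) for d in range(dim)]
        means[class_id] = l2_normalize(mean)
    return means


def predict(means, embedding):
    """Pick the class whose mean has the highest cosine similarity to the embedding."""
    query = l2_normalize(embedding)
    return max(means, key=lambda c: sum(a * b for a, b in zip(means[c], query)))


# Toy embeddings (made up): class 0 points along x, class 1 along y
examples = {0: [[1.0, 0.1], [0.9, 0.0]], 1: [[0.0, 1.0], [0.2, 0.8]]}
means = class_means(examples)
print(predict(means, [0.8, 0.1]), predict(means, [0.1, 0.7]))
```

Because adding a class only requires averaging a few embeddings, new classes can be learned on the device in a fraction of a second, which is why so few images per class are enough.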

Production Checklist

Model runs 100% on TPU (no CPU fallback ops)
Calibration dataset represents real deployment data
Accuracy validated against original float model (target: <2% drop)
Latency benchmarked at sustained load (not just single inference)
Thermal solution tested (heat sink or fan for continuous operation)
USB bandwidth checked if sharing with cameras or other peripherals
Error handling for TPU disconnection/reconnection
Fallback to CPU inference if TPU unavailable
Model versioning and OTA update mechanism in place
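
The CPU fallback item can be sketched as follows. A model compiled for the Edge TPU cannot run on the plain CPU interpreter, so the fallback needs the uncompiled quantized model as a separate file. The exception types caught here are an assumption about how a missing TPU or runtime surfaces; load_interpreter and its make_tpu/make_cpu parameters are hypothetical and exist mainly so the logic can be tested without hardware.

```python
def load_interpreter(tpu_model, cpu_model, make_tpu=None, make_cpu=None):
    """Return (interpreter, backend), preferring the Edge TPU and falling back to CPU."""
    if make_tpu is None:
        # Imported lazily so this module loads even where PyCoral is absent
        from pycoral.utils.edgetpu import make_interpreter as make_tpu
    if make_cpu is None:
        from tflite_runtime.interpreter import Interpreter as make_cpu

    try:
        interpreter = make_tpu(tpu_model)
        backend = "edgetpu"
    except (ValueError, RuntimeError, OSError):
        # Assumed failure modes: delegate failed to load, or no device attached
        interpreter = make_cpu(model_path=cpu_model)
        backend = "cpu"

    interpreter.allocate_tensors()
    return interpreter, backend
```

A typical call would be load_interpreter("model_quant_edgetpu.tflite", "model_quant.tflite"), logging the returned backend so that silent CPU fallback shows up in monitoring.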

The one thing to remember: Coral TPU development in Python follows a strict pipeline — full INT8 quantization, Edge TPU compilation, and PyCoral API integration — where the critical success factor is ensuring all model operations map to the TPU hardware and thermal constraints are managed for sustained workloads.
