Edge AI — Core Concepts

Why Edge AI Requires Specialized Hardware

General-purpose CPUs are flexible but inefficient for neural network inference. Matrix multiplication (the core operation in neural networks) is embarrassingly parallel: it consists of millions of independent multiply-accumulate operations that can be computed simultaneously. A CPU, with only a handful of cores and narrow vector units, works through them largely sequentially.
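
To make the scale concrete, here is a minimal sketch of a reference matrix multiply that also counts multiply-accumulate (MAC) operations. The function name and the counter are illustrative only; real inference libraries use heavily optimized kernels rather than a triple loop.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix and count the MACs used."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):                 # each (i, j) cell depends on no other cell
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]   # one multiply-accumulate
                macs += 1
            out[i][j] = acc
    return out, macs


result, macs = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, macs)
```

The MAC count is always m * k * n, so a single 1024 x 1024 by 1024 x 1024 product needs about 1.07 billion of them. Because no output cell depends on another, hardware with many parallel MAC units can compute them side by side.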

GPUs accelerate this with thousands of cores, but desktop and data-center GPUs consume hundreds of watts, which rules them out for battery-powered devices. Dedicated neural processing units (NPUs) or digital signal processors (DSPs) are designed specifically for neural network operations:

  • Apple Neural Engine (ANE): 16 cores, 11 TOPS (trillion operations per second) in M1 chips and 15.8 TOPS in M2. Handles Core ML model inference, Face ID, Siri, and the camera's computational photography.
  • Qualcomm Hexagon DSP + AI Engine: Used in Snapdragon processors for Android. Supports INT8 quantized models efficiently, powers on-device voice and vision features.
  • Google Tensor G3 (Pixel phones): Custom TPU-derived architecture. Used for Google Photos organization, speech recognition, and on-device translation.

The key advantage: these chips are highly power-efficient for specific inference workloads. The Apple ANE reportedly delivers 11 TOPS at <1W, compared with a GPU that might deliver 100 TOPS but at 200-300W.
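
The comparison is easier to see as throughput per watt. The sketch below uses only the figures quoted above; treating "<1W" as exactly 1W and taking 250W as the midpoint of the GPU range are simplifying assumptions, and the helper name is illustrative.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Throughput per unit of power: higher means more work per joule."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tops / watts


npu = tops_per_watt(11, 1.0)      # NPU: 11 TOPS, power assumed to be 1 W
gpu = tops_per_watt(100, 250.0)   # GPU: 100 TOPS at an assumed 250 W

print(f"NPU: {npu:.1f} TOPS/W, GPU: {gpu:.1f} TOPS/W, ratio: {npu / gpu:.1f}x")
```

Under these assumptions the NPU does roughly 27 times more work per watt, even though the GPU is about nine times faster in absolute terms.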

The Model Optimization Pipeline for Edge

Cloud AI inference can tolerate large, slow models. Edge AI requires aggressive optimization:

Quantization (see model-quantization topic): INT8 typically, INT4 for very constrained devices. Apple’s Core ML and TFLite support per-channel quantization with hardware-accelerated INT8 inference.
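
A minimal sketch of per-channel quantization, assuming the symmetric INT8 scheme (one scale per output channel, no zero point). The function names are illustrative and do not correspond to the Core ML or TFLite APIs.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per output channel.

    weights: list of rows, one row of floats per output channel.
    Returns (int8_rows, scales) where real_value ~= int8_value * scale.
    """
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map INT8 values back to approximate floats."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [
    [0.02, -0.01, 0.005],   # channel with small weights
    [1.50, -0.75, 0.30],    # channel with large weights
]
q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
```

With a single scale for the whole tensor, the small-weight channel would collapse onto only a couple of integer levels. Giving each channel its own scale lets both channels use the full INT8 range.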

Pruning (see model-pruning topic): Structured pruning removes entire filters or attention heads, so the smaller dense model runs faster directly, without requiring sparse kernels.
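
One common way to choose which filters to remove is to rank them by L1 norm and drop the smallest. The sketch below assumes that criterion; the function name and the `keep_ratio` parameter are hypothetical, and a real pipeline would fine-tune the model afterwards.

```python
def prune_filters(filters, keep_ratio):
    """Structured pruning: drop whole filters with the smallest L1 norm.

    filters: list of filters, each a flat list of weights.
    Returns (kept_filters, kept_indices); indices stay in original order so
    the matching input channels of the next layer can be removed as well.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(filters) * keep_ratio))
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept_indices = sorted(ranked[:n_keep])
    return [filters[i] for i in kept_indices], kept_indices


filters = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
kept, indices = prune_filters(filters, keep_ratio=0.5)
print(indices)
```

Because whole filters disappear, the result is simply a smaller dense layer, which is why no sparse kernels are needed.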

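Knowledge Distillation (see knowledge-distillation topic): Train a large teacher model, then distill it into a smaller student architecture. DistilBERT is a distilled version of BERT; compact architectures such as MobileNet and EfficientNet-Lite were designed for efficiency in their own right and are common student models for edge deployment.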
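
The core of distillation is a loss that pulls the student's output distribution toward the teacher's. A minimal sketch, assuming the standard temperature-softened formulation (KL divergence scaled by the square of the temperature); the function names and the default temperature are illustrative choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
print(distillation_loss([3.5, 1.2, 0.4], teacher))   # close student: small loss
print(distillation_loss([0.5, 1.0, 4.0], teacher))   # wrong ranking: larger loss
```

In practice this term is usually combined with the ordinary cross-entropy loss on the ground-truth labels.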

Architecture optimization: Use neural architecture search (NAS) to find designs optimal for specific hardware. MnasNet and MobileNetV3 were found via hardware-aware NAS that targeted measured inference latency on Pixel phones.

Operator fusion: Merge multiple operations (Conv + BatchNorm + ReLU) into a single hardware-efficient kernel. Reduces memory bandwidth use (often the bottleneck for edge inference) by avoiding intermediate tensor writes.
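
Part of the Conv + BatchNorm + ReLU fusion can happen ahead of time, because inference-mode batch norm is just a per-channel scale and shift that can be folded into the convolution's weights and bias. A minimal sketch for one output channel, where a dot product stands in for the convolution and all function names are illustrative:

```python
import math


def conv_bn_relu_unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps, each producing an intermediate value."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + b           # conv
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta   # batch norm (inference)
    return max(0.0, y)                                     # ReLU


def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics into the conv weights and bias."""
    s = gamma / math.sqrt(var + eps)
    return [wi * s for wi in w], (b - mean) * s + beta


def conv_relu_fused(x, w_folded, b_folded):
    """Single pass: folded conv followed immediately by ReLU."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w_folded, x)) + b_folded)


x = [0.5, -1.0, 2.0]
w, b = [0.2, 0.4, -0.1], 0.05
bn = dict(gamma=1.5, beta=0.1, mean=0.3, var=0.25)

w_f, b_f = fold_batchnorm(w, b, **bn)
print(conv_bn_relu_unfused(x, w, b, **bn), conv_relu_fused(x, w_f, b_f))
```

Both paths give the same result up to floating-point rounding, but the fused path never materializes the conv or batch-norm outputs as separate tensors.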

TinyML: AI in Microcontrollers

The extreme edge: microcontrollers with kilobytes of RAM and no operating system. TinyML makes AI possible on these devices.

Resource constraints:

  • RAM: 64KB–512KB (vs. GBs in phones)
  • Flash storage: 256KB–4MB
  • No GPU or NPU
  • Clock speed: 32MHz–240MHz
  • Power: microwatts to milliwatts

Representative hardware:

  • Arduino Nano 33 BLE: 256KB RAM, Cortex-M4 (64MHz)
  • STM32H7: 1MB RAM, Cortex-M7 (480MHz) — much more capable
  • Raspberry Pi Pico: 264KB RAM, dual Cortex-M0+

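TensorFlow Lite Micro (TFLM): A port of TFLite for microcontrollers that runs without dynamic memory allocation (bare-metal firmware often has no heap allocator, or avoids one to prevent fragmentation). All tensors live in a fixed-size arena that the application reserves up front, so memory must be budgeted when the firmware is built.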
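
The kind of memory planning this involves can be sketched as a greedy offset assignment: tensors whose lifetimes do not overlap may share the same bytes of a fixed buffer. This is an illustration in Python under simplified assumptions, not TFLM's actual planner or API; the tensor names, sizes, and lifetimes are made up.

```python
def plan_arena(tensors):
    """Assign buffer offsets so tensors that are live together never overlap.

    tensors: list of (name, size_bytes, first_op, last_op).
    Returns (offsets_by_name, arena_size_bytes).
    """
    placed = []   # (offset, size, first_op, last_op)
    offsets = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Byte ranges of already-placed tensors that are live at the same time.
        busy = sorted((o, o + s) for o, s, f, l in placed
                      if not (l < first or last < f))
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break                  # fits in the gap before this range
            offset = max(offset, end)
        offsets[name] = offset
        placed.append((offset, size, first, last))
    arena_size = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, arena_size


# A small chain of three ops; each activation lives from producer to consumer.
tensors = [
    ("input", 4096, 0, 0),
    ("act1", 8192, 0, 1),
    ("act2", 4096, 1, 2),
    ("output", 1024, 2, 2),
]
offsets, arena_size = plan_arena(tensors)
print(offsets, arena_size)
```

Here the four tensors total 17408 bytes, but the plan needs an arena of only 12288 bytes because `input` and `act2` are never live at the same time and can reuse the same region.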

Use cases:

  • Keyword spotting: “Hey Siri”, “OK Google” — always-on, <1mW, detecting specific audio patterns. Classic example: yes/no classification from audio spectrograms.
  • Gesture recognition: Accelerometer/gyroscope data → gesture classification on the wrist (smartwatch, fitness tracker)
  • Predictive maintenance: Vibration sensor on industrial motors → detect anomalous patterns indicating imminent failure
  • Thermal camera + TinyML: Fire detection in early-stage scenarios on standalone sensors

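MCUNet (MIT, 2020): NAS specifically for microcontrollers. Jointly designs the model architecture (TinyNAS) and the inference engine with its memory scheduling (TinyEngine) to fit within SRAM constraints. It was the first to exceed 70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, a result previously considered out of reach with only a few hundred kilobytes of SRAM.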

Edge vs. Cloud Trade-offs in Practice

The decision of where to run AI (edge vs. cloud) involves multiple factors:

Factor      | Edge                    | Cloud
Latency     | <10ms                   | 50-500ms
Privacy     | High (data stays local) | Lower (data transmitted)
Cost        | One-time hardware       | Per-inference pricing
Model size  | Constrained (MB)        | Unconstrained (GB)
Updates     | Manual OTA needed       | Instant
Reliability | Works offline           | Requires connectivity
Battery     | High concern            | No concern

Hybrid approaches: Common in practice. Keyword spotting always runs on device (low power, real-time), while complex NLP processing is sent to the cloud when connected. When offline, the system degrades gracefully to device-only capabilities.
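
The routing policy described here can be sketched in a few lines. The request kinds and backend labels below are hypothetical names chosen for illustration; a real system would route on richer signals such as battery level or model confidence.

```python
def route_request(kind, online):
    """Return which backend should serve a request.

    kind: "keyword" for wake-word detection, "nlp" for complex language requests.
    online: whether a network connection is currently available.
    """
    if kind == "keyword":
        return "device"                       # always local: low power, real-time
    if kind == "nlp":
        # Prefer the cloud model, but fall back to a smaller local one offline.
        return "cloud" if online else "device-fallback"
    raise ValueError(f"unknown request kind: {kind!r}")


for kind, online in [("keyword", False), ("nlp", True), ("nlp", False)]:
    print(kind, online, "->", route_request(kind, online))
```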

Split computing: Divide the model between device and cloud. Run the early layers on device to extract features, send the compact feature representation (not the raw data) to the cloud, and continue inference there. This reduces data transmission and provides some privacy benefit.
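
A toy sketch of the idea, assuming a model that is a simple chain of layers. The three "layers" are stand-in functions invented for illustration; the point is only that the split produces the same output as running the whole model in one place while transmitting fewer values than the raw input.

```python
def run_layers(layers, x):
    """Apply a chain of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def split_inference(layers, split_at, x):
    """Run layers[:split_at] 'on device', then layers[split_at:] 'in the cloud'.

    Returns (output, transmitted), where transmitted is the intermediate
    feature vector that would cross the network instead of the raw input.
    """
    features = run_layers(layers[:split_at], x)        # device side
    output = run_layers(layers[split_at:], features)   # cloud side
    return output, features


# Toy stand-ins for real layers: pooling shrinks the data early on.
def avg_pool4(v):
    return [sum(v[i:i + 4]) / 4 for i in range(0, len(v), 4)]


def relu(v):
    return [max(0.0, a) for a in v]


def total(v):
    return [sum(v)]


layers = [avg_pool4, relu, total]
raw = [float(i - 8) for i in range(16)]   # 16 raw "sensor" values
output, sent = split_inference(layers, 2, raw)
print(f"raw values: {len(raw)}, transmitted values: {len(sent)}, output: {output}")
```

Where to split is a design choice: a later split point usually means a smaller payload but more computation on the device.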

On-Device LLMs: The Frontier

2023–2024 saw the first capable LLMs running on consumer devices:

Apple iPhone 15 Pro / M-series Macs: Apple Intelligence (2024) runs multiple LLM-based features on-device:

  • On-device model: ~3B parameters, INT4 quantized
  • Sensitive requests (personal email context) processed entirely on device
  • Server model used for complex requests with “Private Cloud Compute”

Qualcomm Snapdragon (8 Gen 3 for phones, X Elite for Windows PCs): Qualcomm claims on-device inference of 10B+ parameter models (INT4). Targets premium Android phones and Windows PCs for AI features.

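llama.cpp on consumer hardware: 4-bit quantized 7B parameter models run at interactive speeds on Apple-silicon MacBooks, on the order of 10-20 tokens/second on a base M1 with 8GB RAM and 30+ tokens/second on M1 Pro/Max chips with higher memory bandwidth. 13B models run at reasonable speed on 16GB devices. This demonstrates that the gap between cloud and edge capability is closing.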
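
A quick back-of-the-envelope check shows why quantization is what makes these RAM sizes workable. The sketch below counts the weights only; it ignores the KV cache, activations, and the per-block scale factors that real quantized formats add, so actual usage is somewhat higher. The helper name is illustrative.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_footprint_gb(n_params, 16)
    int4 = weight_footprint_gb(n_params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

At 16 bits per weight a 7B model needs about 14 GB for its weights, which does not fit in 8GB of RAM. At 4 bits it needs about 3.5 GB, and a 13B model about 6.5 GB.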

One thing to remember: Edge AI is not just about convenience — it’s increasingly a requirement for privacy-sensitive, latency-critical, and connectivity-independent applications, and dedicated AI hardware (NPUs, DSPs) is making it viable at increasing capability levels.
