GPU Computing — Deep Dive
Roofline Model: When Are You Memory or Compute Bound?
The roofline model provides a visual tool for understanding hardware performance limits.
Arithmetic intensity $I$ = FLOPs / bytes of memory traffic
Attainable performance is bounded by: $$P = \min(P_{max}, I \cdot B_{max})$$
Where $P_{max}$ is peak FLOPS and $B_{max}$ is memory bandwidth.
For H100 SXM:
- $P_{max}$ = 989 TFLOPS (FP16 Tensor Cores)
- $B_{max}$ = 3.35 TB/s HBM3
Ridge point: $I_{ridge} = P_{max} / B_{max} = (989 \times 10^{12}) / (3.35 \times 10^{12}) \approx 295$ FLOPs/byte
Interpretation: Operations with arithmetic intensity above 295 FLOPs/byte are compute-bound (limited by FLOPS); operations below 295 FLOPs/byte are memory-bound (limited by bandwidth).
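The bound above can be evaluated directly. The following is a minimal sketch using the H100 SXM figures quoted in this section; the function name `attainable_flops` and the sample intensities are illustrative choices, not part of any library.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound min(P_max, I * B_max), in FLOP/s."""
    return min(peak_flops, intensity * bandwidth)

# H100 SXM figures quoted in the text.
P_MAX = 989e12   # FP16 Tensor Core peak, FLOP/s
B_MAX = 3.35e12  # HBM3 bandwidth, bytes/s

ridge = P_MAX / B_MAX  # FLOPs/byte at which the two roofs meet

for intensity in (1, 32, 295, 2048):  # sample intensities, chosen for illustration
    bound = attainable_flops(intensity, P_MAX, B_MAX)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    print(f"I = {intensity:>5} FLOPs/byte -> {bound / 1e12:7.1f} TFLOPS ({regime})")
```

Below the ridge point the bound grows linearly with intensity; above it the bound stays flat at $P_{max}$.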
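Matrix multiplication (large batch): for $N \times N$ FP16 matrices, $2N^3$ FLOPs over roughly $6N^2$ bytes of traffic (read A and B, write C, once each) gives $I \approx N/3$. For $N=4096$: $I \approx 1365 \gg 295$ → compute bound. Large matrix multiplications can therefore run close to Tensor Core peak.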
LayerNorm / elementwise ops: $I \approx O(1)$ FLOPs/byte → memory bound.
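The two cases can be compared with a rough estimate. This sketch assumes FP16 storage (2 bytes per element) and that every matrix crosses HBM exactly once; other counting conventions change the constant but not the conclusion. The function names are hypothetical.

```python
BYTES_PER_ELEMENT = 2  # assumption: FP16 storage

def gemm_intensity(n):
    """FLOPs/byte for an n x n x n matmul where each matrix crosses HBM once."""
    flops = 2 * n**3                        # one multiply and one add per inner step
    traffic = 3 * n**2 * BYTES_PER_ELEMENT  # read A, read B, write C
    return flops / traffic

def elementwise_intensity(flops_per_element=1):
    """FLOPs/byte for y = f(x): read one element, write one element."""
    return flops_per_element / (2 * BYTES_PER_ELEMENT)

print(f"GEMM, n=4096: {gemm_intensity(4096):.0f} FLOPs/byte")
print(f"Elementwise:  {elementwise_intensity():.2f} FLOPs/byte")
```

Under these assumptions the matmul sits far to the right of the ridge point and the elementwise op far to the left.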
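Attention (standard implementation): For sequence length $n$ and head dimension $d$:
- FLOPs: about $4n^2 d$ ($2n^2 d$ each for $QK^T$ and for the multiplication by $V$)
- HBM traffic: dominated by the $n \times n$ matrix, which is written and read once as raw scores and once as softmax probabilities (about $4n^2$ elements, or $8n^2$ bytes in FP16), so $I \approx d/2$, independent of $n$
- At $n=2048, d=128$: intensity ≈ 64 FLOPs/byte, well below 295 → memory bound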
This is why attention is slower than it theoretically needs to be: at common sequence lengths, it’s memory-limited.
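A back-of-the-envelope estimate makes the point concrete. The sketch below assumes FP16, counts both matrix multiplications, and assumes the $n \times n$ matrix is written and read twice (scores, then probabilities); the exact constant depends on these choices, but the result stays at a few tens of FLOPs/byte. The function name is hypothetical.

```python
def standard_attention_intensity(n, d, bytes_per_element=2):
    """Rough FLOPs/byte for unfused attention on one head."""
    flops = 4 * n * n * d  # QK^T and PV, 2*n^2*d each
    # n x n matrix: write scores, read scores, write probabilities, read probabilities
    score_traffic = 4 * n * n * bytes_per_element
    # Q, K, V read once; output written once
    io_traffic = 4 * n * d * bytes_per_element
    return flops / (score_traffic + io_traffic)

for n in (512, 2048, 8192):
    print(f"n = {n:>5}, d = 128: {standard_attention_intensity(n, 128):.1f} FLOPs/byte")
```

Increasing $n$ barely moves the estimate, because both FLOPs and traffic grow as $n^2$.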
FlashAttention: Tiling for Memory Efficiency
FlashAttention (Dao et al., 2022, 2023) starts from the observation that standard attention is memory-bound because of its HBM read/write pattern. The solution is tiling, which keeps intermediate results in on-chip SRAM.
Standard attention:
- Load Q, K from HBM → compute QK^T → write to HBM: $O(nd + n^2)$ bytes
- Load QK^T from HBM → softmax → write to HBM: $O(n^2)$ bytes
- Load softmax(QK^T), V from HBM → compute attention output → write to HBM: $O(n^2 + nd)$ bytes
Total HBM traffic: $O(nd + n^2)$, dominated by the $n^2$ term whenever $n \gg d$
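The three steps above can be turned into a byte count. This is a sketch for a single head, assuming FP16 and that no steps are fused; the function name and the example sizes are illustrative.

```python
def standard_attention_hbm_bytes(n, d, bytes_per_element=2):
    """Per-step HBM traffic in bytes for one head of unfused attention."""
    e = bytes_per_element
    step1 = (2 * n * d + n * n) * e      # read Q, K; write S = QK^T
    step2 = (n * n + n * n) * e          # read S; write P = softmax(S)
    step3 = (n * n + n * d + n * d) * e  # read P, V; write the output
    return step1, step2, step3

n, d = 8192, 128
steps = standard_attention_hbm_bytes(n, d)
total = sum(steps)
quadratic = 4 * n * n * 2  # bytes that belong to the n x n intermediates
print(f"total traffic: {total / 1e6:.0f} MB per head")
print(f"share from n x n intermediates: {quadratic / total:.1%}")
```

At this size almost all of the traffic comes from the intermediate matrices rather than from Q, K, V, or the output.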
FlashAttention tiling:
- Process blocks of Q, K, V that fit in SRAM (up to 228 KB of shared memory per SM on H100, 164 KB on A100)
- For each Q block × K block: compute partial softmax + attention output in SRAM
- Use the “online softmax” trick to update running max/sum without needing the full softmax row
- Write final output to HBM once
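HBM traffic: $O(n^2 d^2 / M)$ for SRAM size $M$, which is several times less than standard attention for typical $d$ and $M$, with no intermediate $n^2$ matrices written to HBM.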
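The online softmax update in the steps above can be sketched in plain Python for a single query row. This is an illustration of the rescaling idea only, not the FlashAttention kernel: it omits the $1/\sqrt{d}$ scaling, masking, and all GPU-specific tiling, and the function names and sample data are made up for the example.

```python
import math

def attention_row_reference(q, keys, values):
    """softmax(q . K^T) V for one query row, materializing every score."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_row_online(q, keys, values, block_size=2):
    """Same result, visiting K/V in blocks and keeping only running statistics."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = attention_row_reference(q, keys, values)
online = attention_row_online(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(ref, online)))  # difference is at rounding level
```

The blocked version never holds more than `block_size` scores at a time, which is what lets the real kernel keep its working set in SRAM.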
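Performance: The original FlashAttention paper reports up to 7.6x faster attention computation than the standard PyTorch implementation on A100 GPUs (measured on GPT-2). Together with memory use that grows linearly rather than quadratically in sequence length, this helps make long-context models (32K, 128K, 1M tokens) practical.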
FlashAttention-2/3: Further optimizations for GPU architecture-specific tile sizes, warp scheduling, and Tensor Core utilization. FA3 specifically uses the WGMMA (Warpgroup Matrix Multiply-Accumulate) instructions in Hopper (H100), reaching up to ~75% of H100’s FP16 peak for attention, compared with the 25-40% of A100 peak that the FlashAttention-2 paper reports for FA1.
CUDA Kernel Optimization Techniques
Writing efficient CUDA kernels requires exploiting the memory hierarchy:
Coalesced memory access: Threads in a warp should access contiguous memory addresses. If consecutive threads read 4-byte elements at a stride of 4 elements (byte addresses $[0, 16, 32, 48, …]$), only $1/4$ of each fetched memory sector is useful data, so effective bandwidth drops to about $1/4$. Transposing access patterns or staging strided accesses through shared memory recovers this.
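The effect of stride on bandwidth can be estimated with a small model. This sketch assumes memory is fetched in 32-byte sectors, a 32-thread warp, aligned 4-byte elements, and a stride measured in elements; it counts sectors only and ignores caching. The function name is hypothetical.

```python
def sector_efficiency(stride_elements, element_bytes=4, warp_size=32, sector_bytes=32):
    """Fraction of fetched bytes that a warp actually uses for a strided load."""
    addresses = [t * stride_elements * element_bytes for t in range(warp_size)]
    sectors = {addr // sector_bytes for addr in addresses}  # distinct sectors touched
    useful = warp_size * element_bytes
    fetched = len(sectors) * sector_bytes
    return useful / fetched

for stride in (1, 2, 4, 8):
    print(f"stride {stride}: {sector_efficiency(stride):.3f} of fetched bytes used")
```

Once the stride reaches a full sector (8 elements here), every thread pulls in its own sector and efficiency stops falling.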
Shared memory tiling: For matrix multiplication, load tiles of A and B into shared memory, multiply the tiles, repeat. With $T \times T$ tiles, each element is loaded from HBM $n/T$ times instead of $n$ times. This is the standard GEMM optimization.
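The saving from tiling can be seen by counting loads in a CPU stand-in. In this sketch, copying a tile into a local list plays the role of the shared-memory load, and `loads` counts simulated global-memory reads; it assumes square matrices and a tile size that divides `n`. All names are illustrative.

```python
def matmul_naive(a, b, n):
    """C = A @ B, reading each operand from 'global memory' every time it is used."""
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                loads += 2
            c[i][j] = acc
    return c, loads

def matmul_tiled(a, b, n, tile):
    """Same product; each tile is copied to a local buffer once per tile-level step."""
    assert n % tile == 0, "sketch assumes the tile size divides n"
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stand-in for the shared-memory load: one global read per element.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = c[i0 + i][j0 + j]
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] = acc
    return c, loads

n, tile = 8, 4
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
c_naive, loads_naive = matmul_naive(a, b, n)
c_tiled, loads_tiled = matmul_tiled(a, b, n, tile)
print(loads_naive, loads_tiled, loads_naive // loads_tiled)
```

The ratio of the two counters equals the tile size, which is why larger tiles (bounded by shared memory capacity) reduce global traffic further.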
Occupancy optimization: More concurrent warps per SM hide memory latency. Occupancy = active warps / maximum warps per SM. Trade-off: the register file and shared memory are split among resident warps, so higher occupancy leaves fewer registers per thread.
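The register side of this trade-off can be estimated with simple arithmetic. This sketch assumes 65,536 32-bit registers and a 64-warp limit per SM, and it ignores shared memory limits, the per-SM block limit, and register allocation granularity, so treat the output as an upper bound. The function name is hypothetical.

```python
def occupancy_from_registers(regs_per_thread, threads_per_block,
                             regs_per_sm=65536, max_warps_per_sm=64, warp_size=32):
    """Register-limited occupancy under the simplifying assumptions stated above."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warp_size * warps_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block
    active_warps = min(blocks_by_regs, blocks_by_warps) * warps_per_block
    return active_warps / max_warps_per_sm

for regs in (32, 64, 128, 255):
    occ = occupancy_from_registers(regs, threads_per_block=256)
    print(f"{regs:>3} registers/thread -> occupancy {occ:.3f}")
```

Doubling the registers per thread roughly halves the number of warps that can be resident at once.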
Register reuse: Holding frequently used values in registers rather than re-fetching them from shared memory. The compiler performs register allocation, but kernels can be structured to encourage reuse, for example by having each thread compute a small tile of outputs (register tiling).
Asynchronous data movement: H100’s TMA (Tensor Memory Accelerator) can move tiles between HBM and SRAM asynchronously — while SMs compute on one tile, TMA prefetches the next.
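The benefit of overlapping copies with compute can be approximated with an idealized timing model. This sketch assumes two buffers, fixed per-tile load and compute times, and perfect overlap; it models the scheduling idea only and says nothing about actual TMA behavior. The function names and numbers are illustrative.

```python
def serial_time(num_tiles, load_time, compute_time):
    """Load a tile, then compute on it, with no overlap."""
    return num_tiles * (load_time + compute_time)

def double_buffered_time(num_tiles, load_time, compute_time):
    """Tile i+1 loads while tile i computes; the first load and last compute are exposed."""
    if num_tiles == 0:
        return 0.0
    return load_time + (num_tiles - 1) * max(load_time, compute_time) + compute_time

tiles, load, compute = 10, 2.0, 3.0  # arbitrary time units
print(serial_time(tiles, load, compute), double_buffered_time(tiles, load, compute))
```

In this model the steady-state cost per tile is the larger of the two times, so prefetching helps most when load and compute times are comparable.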
Alternative AI Accelerators
Google TPU v4: Custom ASIC optimized for int8/bfloat16 matrix operations. Uses a systolic array architecture. TPU v4 pods (4096 chips connected with 10 Pb/s total bandwidth) trained PaLM 2. Key advantage: a high-bandwidth inter-chip interconnect (ICI) built into the chips, whereas NVIDIA clusters typically combine NVLink within a node with InfiniBand or Ethernet between nodes.
Cerebras CS-2: Built around the “Wafer Scale Engine”, a single chip the size of a wafer, with 850,000 processing cores and 40 GB of on-chip SRAM. Models whose weights fit in that SRAM (roughly 20B parameters at 16-bit precision, 40B at 8-bit) avoid the off-chip memory bottleneck, so memory-bound LLM operations are limited by SRAM bandwidth rather than HBM bandwidth.
Groq LPU (Language Processing Unit): Deterministic, compiler-scheduled execution with no branch prediction and no caches, which makes latency predictable. Reported record single-stream inference speed of 500+ tokens/second for 7B models, with the model sharded across many chips because each chip holds only a small amount of on-chip SRAM. Limited flexibility; excellent for production inference on fixed workloads.
AMD MI300X: 192 GB HBM3 (vs. H100’s 80 GB) with competitive FP16 performance. ROCm (AMD’s CUDA alternative) now supports most major frameworks, but performance gaps remain in some operations. Meta and Microsoft are deploying MI300X at scale for LLM inference.
The software gap: NVIDIA’s advantage is increasingly software, not just hardware. Vendor-specific optimizations in cuDNN (convolution algorithms), NCCL (collective operations), and TensorRT (inference optimization) are years ahead of competitors. Bridging this software gap requires sustained ecosystem investment, which AMD (ROCm), Intel (oneAPI), and others are making, though the gap has not yet closed.
One thing to remember: Understanding GPU performance requires understanding the roofline model. Whether your workload is memory-bound or compute-bound determines which hardware specifications actually matter, and much deep learning inference (token-by-token LLM decoding in particular) is memory-bound, which often makes HBM bandwidth more important than raw FLOPS.