GPU Computing — Deep Dive

Roofline model for GPU performance analysis, FlashAttention's tiling strategy, CUDA kernel optimization, HBM bandwidth limits, and the emerging AI accelerator landscape.

Roofline Model: When Are You Memory or Compute Bound?

The roofline model plots attainable performance against arithmetic intensity, which shows at a glance whether a kernel is limited by compute throughput or by memory bandwidth.

Arithmetic intensity $I$ = FLOPs / bytes of memory traffic

Attainable performance is bounded by: $$P = \min(P_{max}, I \cdot B_{max})$$

Where $P_{max}$ is peak FLOPS and $B_{max}$ is memory bandwidth.

For H100 SXM:

$P_{max}$ = 989 TFLOPS (FP16 Tensor Cores)
$B_{max}$ = 3.35 TB/s HBM3

Ridge point: $I_{ridge} = P_{max} / B_{max} = (989 \times 10^{12}) / (3.35 \times 10^{12}) \approx 295$ FLOPs/byte

Interpretation: Operations with arithmetic intensity above 295 FLOPs/byte are compute-bound (limited by FLOPS). Operations below 295 FLOPs/byte are memory-bound (limited by bandwidth).
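The bound above is easy to evaluate directly. The following is a minimal sketch that uses the H100 SXM figures quoted in this section as default values; the function names are illustrative and not part of any library.

```python
def attainable_tflops(intensity, peak_tflops=989.0, bandwidth_tbs=3.35):
    """Roofline bound min(P_max, I * B_max), in TFLOPS.

    intensity is in FLOPs/byte and bandwidth in TB/s, so their
    product is already in TFLOPS.
    """
    return min(peak_tflops, intensity * bandwidth_tbs)


def ridge_point(peak_tflops=989.0, bandwidth_tbs=3.35):
    """Arithmetic intensity at which the two roofs meet."""
    return peak_tflops / bandwidth_tbs


def classify(intensity, peak_tflops=989.0, bandwidth_tbs=3.35):
    """Label a kernel by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_tflops, bandwidth_tbs):
        return "memory-bound"
    return "compute-bound"
```

With these defaults, a kernel at 32 FLOPs/byte can reach at most about 107 TFLOPS, roughly a ninth of the compute peak, no matter how well its arithmetic is tuned.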

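Matrix multiplication (large batch): for square $N \times N$ matrices in FP16, a GEMM performs $2N^3$ FLOPs and, ideally, moves each of the three matrices through HBM once ($3N^2$ elements, $6N^2$ bytes), so $I \approx N/3$ FLOPs/byte. For $N=4096$: $I \approx 1365 \gg 295$ → compute bound. Large matrix multiplications can therefore keep the Tensor Cores busy.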
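A rough way to see how matrix shape changes the picture is to compute the intensity for a general $m \times k$ by $k \times n$ product. This sketch assumes FP16 storage (2 bytes per element) and the ideal case where each matrix crosses HBM exactly once; both are assumptions, and the exact constant depends on what traffic you count.

```python
def gemm_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes each of A, B, and C moves through HBM exactly once,
    which is the best case for a well-tiled kernel.
    """
    flops = 2 * m * n * k                      # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved
```

Under these assumptions a 4096-cubed GEMM lands near 1365 FLOPs/byte, while the same weight matrix applied to a single row (`m = 1`, as in batch-1 decoding) lands near 1 FLOP/byte, far below the ridge point.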

LayerNorm / elementwise ops: $I \approx O(1)$ FLOPs/byte → memory bound.

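Attention (standard implementation): For sequence length $n$ and head dimension $d$:

FLOPs: about $4n^2 d$ ($2n^2 d$ each for $QK^T$ and for multiplying the softmax output by $V$)
HBM traffic: the $n \times n$ score matrix is written and read back, then the softmax output is written and read back, which is about $4n^2$ elements on top of $4nd$ elements for Q, K, V, and the output
Intensity in FP16: $4n^2 d / (8n^2 + 8nd) \approx d/2$ FLOPs/byte when $n \gg d$
At $n=2048, d=128$: intensity ≈ 60 → memory bound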

This is why attention is slower than it theoretically needs to be: at common sequence lengths, it’s memory-limited.
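The attention estimate can be reproduced with a few lines of arithmetic. This sketch uses one particular accounting, stated here as an assumption: both matrix multiplications are counted, and the $n \times n$ score and probability matrices each make one round trip through HBM in FP16. Other accounting choices change the constant but not the conclusion that intensity scales with $d$, not with $n$.

```python
def attention_intensity(n, d, bytes_per_elem=2):
    """FLOPs per byte for unfused single-head attention.

    Counts 2*n*n*d FLOPs for Q @ K^T and the same for P @ V.
    Traffic: Q, K, V, O (n*d elements each) plus one write and one
    read for both the score matrix and the softmax output (n*n each).
    """
    flops = 4 * n * n * d
    elems_moved = 4 * n * d + 4 * n * n
    return flops / (elems_moved * bytes_per_elem)
```

For `n = 2048, d = 128` this gives roughly 60 FLOPs/byte, and the value approaches `d / 2` as `n` grows, so longer sequences do not move standard attention toward the compute-bound regime.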

FlashAttention: Tiling for Memory Efficiency

Dao et al. (2022, 2023) recognized in “FlashAttention” that standard attention is memory-bound because of its HBM read/write pattern. The solution: tile the computation so that intermediate results stay in SRAM.

Standard attention:

Load Q, K from HBM → compute $QK^T$ → write to HBM: $O(nd + n^2)$ bytes
Load $QK^T$ from HBM → softmax → write to HBM: $O(n^2)$ bytes
Load softmax($QK^T$), V from HBM → compute attention output: $O(n^2 + nd)$ bytes

Total HBM traffic: $O(nd + n^2)$, dominated by the $n^2$ term for long sequences

FlashAttention tiling:

Process blocks of Q, K, V that fit in on-chip SRAM (up to 228 KB of shared memory per SM on H100, up to 164 KB on A100)
For each Q block × K block: compute partial softmax + attention output in SRAM
Use the “online softmax” trick to update running max/sum without needing the full softmax row
Write final output to HBM once

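HBM traffic: $O(n^2 d^2 / M)$, where $M$ is the SRAM size in elements; no intermediate $n \times n$ matrices are written to HBM. Because $d^2$ is typically many times smaller than $M$, this is many times less traffic than standard attention.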
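The online softmax update is the part that is easiest to get wrong, so a small reference version helps. The sketch below is plain Python for checking the arithmetic, not a GPU kernel. It processes one query row at a time against blocks of K and V, whereas the real kernel also blocks over Q, and it omits the usual $1/\sqrt{d}$ scaling for brevity. The function names and the `block` parameter are illustrative.

```python
import math


def naive_attention(Q, K, V):
    """Reference: materialize a full row of scores before the softmax."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but only ever holds `block` scores at a time."""
    d_v = len(V[0])
    out = []
    for q in Q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * d_v                      # unnormalized output row
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max
            # (exp(-inf) is 0.0, so the first block starts from zero).
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])  # normalize once at the end
    return out
```

The two functions agree to floating-point tolerance for any block size, which is the property that lets the kernel skip writing the score matrix to HBM.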

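Performance: Dao et al. report up to 7.6x faster attention computation than a standard PyTorch implementation on an A100 (measured on GPT-2). Avoiding the $n \times n$ intermediates also makes attention memory linear in sequence length, which is what makes long-context models (32K, 128K, 1M tokens) practically feasible.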

FlashAttention-2/3: Further optimizations for architecture-specific tile sizes, work partitioning across warps, and Tensor Core utilization. FA3 specifically uses the WGMMA (Warpgroup Matrix Multiply-Accumulate) instructions on Hopper (H100), reaching about 75% of H100’s FP16 peak for attention, compared with roughly 25-40% of peak for FA1 on A100.

CUDA Kernel Optimization Techniques

Writing efficient CUDA kernels requires exploiting the memory hierarchy:

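Coalesced memory access: Threads in a warp should access contiguous memory addresses. If the 32 threads of a warp read consecutive 4-byte words (addresses $[0, 4, 8, 12, …]$), the hardware combines the reads into a few wide transactions. If they instead read every fourth word (addresses $[0, 16, 32, 48, …]$), only $1/4$ of the bytes fetched are used, so only $1/4$ of the bandwidth does useful work. Transposing access patterns or staging strided accesses through shared memory recovers this.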
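The cost of a strided pattern can be estimated by counting how many memory sectors one warp touches. This is a simplified model, assuming 32 threads per warp, aligned 4-byte loads, and 32-byte sectors as the unit of transfer; it ignores caching effects, and the function name is illustrative.

```python
def sector_efficiency(stride_bytes, word_bytes=4, warp_size=32, sector_bytes=32):
    """Fraction of fetched bytes that the warp actually uses.

    Lane i reads word_bytes bytes starting at i * stride_bytes.
    Every distinct sector touched is fetched in full.
    """
    sectors = {(lane * stride_bytes) // sector_bytes for lane in range(warp_size)}
    useful_bytes = warp_size * word_bytes
    fetched_bytes = len(sectors) * sector_bytes
    return useful_bytes / fetched_bytes
```

In this model a 4-byte stride (consecutive words) gives an efficiency of 1.0, a 16-byte stride gives 0.25, and any stride of 32 bytes or more bottoms out at 0.125 because every lane pulls in its own sector.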

Shared memory tiling: For matrix multiplication, load tiles of A and B into shared memory, multiply the tiles, repeat. With $T \times T$ tiles, each element is loaded from HBM $n/T$ times instead of $n$ times. This is the standard GEMM optimization.
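The effect of tile size on global-memory traffic can be checked with a small counting model. The sketch below is plain Python that mimics the loop structure of a tiled GEMM and counts elements staged into the simulated shared-memory tiles; it is meant to illustrate the traffic pattern, not to be fast. The names are illustrative.

```python
def tiled_matmul(A, B, tile):
    """Multiply square matrices tile by tile, counting global loads."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    global_loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one tile of B ("shared memory").
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                global_loads += sum(len(r) for r in a_tile)
                global_loads += sum(len(r) for r in b_tile)
                # All further reads hit the staged tiles only.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        C[i0 + i][j0 + j] += sum(
                            a_row[k] * b_tile[k][j] for k in range(len(b_tile)))
    return C, global_loads
```

For a 4 by 4 problem, `tile=1` counts 128 loads, `tile=2` counts 64, and `tile=4` counts 32, matching the $2n^3/T$ scaling: doubling the tile edge halves the global traffic.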

Occupancy optimization: More concurrent warps per SM hide memory latency. Occupancy = active warps / maximum warps. Trade-off: the register file and shared memory are split among resident warps, so higher occupancy leaves fewer registers per thread.
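The register side of this trade-off can be approximated with simple division. The sketch assumes 65,536 32-bit registers and at most 64 resident warps per SM, which are treated here as illustrative defaults; it ignores register allocation granularity, shared memory limits, and block-size constraints, all of which a real occupancy calculator accounts for.

```python
def occupancy_from_registers(regs_per_thread, regs_per_sm=65536,
                             max_warps_per_sm=64, warp_size=32):
    """Upper bound on occupancy when registers are the only limit."""
    regs_per_warp = regs_per_thread * warp_size
    warps_that_fit = regs_per_sm // regs_per_warp
    active_warps = min(max_warps_per_sm, warps_that_fit)
    return active_warps / max_warps_per_sm
```

Under these assumptions a kernel using 32 registers per thread can reach full occupancy, 64 registers caps it at 0.5, and 128 registers caps it at 0.25. Lower occupancy is not automatically worse, since the extra registers may remove enough memory traffic to compensate.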

Register reuse: Hold frequently used values in registers rather than re-fetching them from shared memory. Hand-written CUDA can structure loops so that this reuse is explicit; otherwise the compiler typically handles it.

Asynchronous data movement: H100’s TMA (Tensor Memory Accelerator) can move tiles between HBM and SRAM asynchronously — while SMs compute on one tile, TMA prefetches the next.
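The benefit of overlapping copies with compute can be estimated with a two-stage pipeline model. This is a back-of-the-envelope sketch, assuming fixed per-tile load and compute times and a double buffer so that exactly one tile can be in flight while another is being processed; the function name and parameters are illustrative.

```python
def pipeline_time(num_tiles, load_time, compute_time, overlapped):
    """Total time to process num_tiles tiles, with or without overlap."""
    if not overlapped:
        # Each tile is loaded, then computed, strictly in sequence.
        return num_tiles * (load_time + compute_time)
    # The first load is exposed, the last compute is exposed, and every
    # step in between costs whichever of the two stages is slower.
    steady_state = (num_tiles - 1) * max(load_time, compute_time)
    return load_time + steady_state + compute_time
```

With 8 tiles and equal load and compute times of 1 unit each, the serial schedule takes 16 units and the overlapped one takes 9. When one stage is much slower than the other, overlap hides the faster stage almost completely but cannot shorten the slower one.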

Alternative AI Accelerators

Google TPU v4: Custom ASIC optimized for int8/bfloat16 matrix operations. Uses a systolic array architecture. TPU v4 pods (4096 chips connected with 10 Pb/s total bandwidth) trained PaLM 2. Key advantage: a high-bandwidth inter-chip interconnect (ICI) built into every chip, whereas NVIDIA clusters depend on InfiniBand to scale beyond an NVLink domain.

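Cerebras CS-2: Built around the “Wafer Scale Engine”, a single chip the size of a wafer, with 850,000 processing cores and 40 GB of on-chip SRAM. Models whose weights fit in that 40 GB (about 20B parameters at 16-bit precision, or 40B at 8-bit) avoid the off-chip memory bottleneck. Because weights are served from SRAM, most LLM operations run against on-chip bandwidth rather than HBM bandwidth.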

Groq LPU (Language Processing Unit): Deterministic, compiler-scheduled execution with no branch prediction and no cache misses, so latency is predictable. Reported record inference speed: 500+ tokens/second for 7B models. Limited flexibility; excellent for production inference on fixed workloads.

AMD MI300X: 192 GB HBM3 (vs. H100’s 80 GB) with competitive FP16 performance. ROCm (AMD’s CUDA alternative) now supports most major frameworks but with performance gaps in some operations. Meta and Microsoft are deploying MI300X at scale for LLM inference.

The software gap: NVIDIA’s advantage is increasingly software, not just hardware. Vendor-specific optimizations in cuDNN (convolution algorithms), NCCL (collective operations), and TensorRT (inference optimization) are years ahead of competitors. Closing this gap requires sustained ecosystem investment, which AMD (ROCm), Intel (oneAPI), and others are making, but the gap has not yet closed.

One thing to remember: Understanding GPU performance requires understanding the roofline model — whether your workload is memory-bound or compute-bound determines which hardware specifications actually matter, and most deep learning inference is memory-bound, making HBM bandwidth often more important than raw FLOPS.
