GPU Computing — Core Concepts
The CUDA Architecture
NVIDIA’s GPU architecture organizes computation in a hierarchy:
- Thread: The basic unit. Each thread executes the kernel’s instructions on its own data.
- Warp: 32 threads executing the same instruction simultaneously (SIMT, Single Instruction Multiple Threads).
- Block: Multiple warps scheduled together on one SM, with access to the same shared memory.
- Grid: Multiple blocks, making up the entire kernel launch.
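To make the hierarchy concrete, here is a minimal Python sketch of the index arithmetic a one-dimensional kernel launch implies. The function names (`locate_thread`, `warps_per_block`, `blocks_for`) are hypothetical helpers for illustration, not part of any CUDA API; only the warp size of 32 comes from the text.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def locate_thread(block_idx, thread_idx, block_dim):
    """Return (global thread id, warp id within the block, lane within the warp)."""
    global_id = block_idx * block_dim + thread_idx
    warp_id = thread_idx // WARP_SIZE
    lane = thread_idx % WARP_SIZE
    return global_id, warp_id, lane


def warps_per_block(block_dim):
    """Warps needed for one block; a partly filled warp still counts as a warp."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


def blocks_for(n_elements, block_dim):
    """Blocks needed in the grid so that every element gets one thread."""
    return (n_elements + block_dim - 1) // block_dim


# Thread 70 of block 2, with 256 threads per block
print(locate_thread(block_idx=2, thread_idx=70, block_dim=256))  # (582, 2, 6)
print(warps_per_block(256), blocks_for(1000, 256))               # 8 4
```

The rounding-up in `blocks_for` is why real kernels usually guard with a bounds check: the last block may contain threads with no element to process.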
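A typical modern GPU (H100 SXM) has:
- 132 streaming multiprocessors (SMs) enabled, out of 144 on the full GH100 die
- Each SM: 128 FP32 CUDA cores + 4 Tensor Cores + 256 KB of combined L1 cache and shared memory (up to 228 KB configurable as shared memory)
- Total CUDA cores: 16,896 (18,432 on the full die)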
- Total FP32 throughput: 67 TFLOPS
- Total FP16 throughput: 134 TFLOPS
- Tensor Core throughput: 989 TFLOPS dense for BF16 and FP16 alike (about 1,979 TFLOPS with structured sparsity)
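The FP32 figure in this list can be sanity-checked from the core count. The sketch below assumes 132 enabled SMs (the SXM configuration; the full die has 144), one fused multiply-add (2 FLOPs) per CUDA core per cycle, and a boost clock of about 1.98 GHz. The clock value is an assumption, not something stated in the text.

```python
def peak_tflops(sm_count, cores_per_sm, clock_ghz, flops_per_core_per_cycle=2):
    """Peak throughput in TFLOPS; one FMA per core per cycle counts as 2 FLOPs."""
    gflops = sm_count * cores_per_sm * flops_per_core_per_cycle * clock_ghz
    return gflops / 1000.0


full_die_cores = 144 * 128  # 18,432 cores on the full die
enabled_cores = 132 * 128   # 16,896 cores with 132 SMs enabled

# Assumed boost clock of 1.98 GHz (not taken from the text)
h100_sxm = peak_tflops(sm_count=132, cores_per_sm=128, clock_ghz=1.98)
print(full_die_cores, enabled_cores, f"{h100_sxm:.1f} TFLOPS")
```

Under these assumptions the result lands at roughly 67 TFLOPS, which matches the quoted FP32 figure.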
Memory hierarchy:
- Registers: per-thread, ~256 KB per SM, <1 ns latency
- Shared memory (L1): per-block, 228 KB per SM, ~5 ns latency
- L2 cache: 50 MB (H100 SXM), shared across SMs
- HBM3 (main GPU memory): 80 GB, ~400 ns, 3.35 TB/s bandwidth
Optimized CUDA code keeps data in registers and shared memory and minimizes HBM3 accesses, because memory bandwidth rather than arithmetic is often the bottleneck.
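One common way to reason about this bottleneck is a roofline estimate. This is a rough sketch using the 67 TFLOPS FP32 and 3.35 TB/s figures quoted above. The byte counts assume FP32 data and that each operand is moved to or from HBM exactly once, which is an idealization.

```python
PEAK_FLOPS = 67e12       # FP32 peak, FLOP/s
HBM_BANDWIDTH = 3.35e12  # HBM3 bandwidth, bytes/s


def attainable_flops(intensity):
    """Roofline bound: min(compute peak, bandwidth * FLOPs per byte moved)."""
    return min(PEAK_FLOPS, HBM_BANDWIDTH * intensity)


def matmul_intensity(n):
    """N x N FP32 matmul: 2*N^3 FLOPs over 3 matrices of 4*N^2 bytes each."""
    return 2 * n**3 / (12 * n**2)


# Below this many FLOPs per byte, a kernel is bandwidth-bound
ridge = PEAK_FLOPS / HBM_BANDWIDTH

# Vector add c = a + b in FP32: 1 FLOP per 12 bytes (two loads, one store)
vec_add_intensity = 1 / 12

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"vector add:  {attainable_flops(vec_add_intensity) / 1e12:.2f} TFLOPS")
print(f"matmul 4096: {attainable_flops(matmul_intensity(4096)) / 1e12:.0f} TFLOPS")
```

Elementwise kernels sit far below the ridge point and cannot approach peak arithmetic throughput, while large matrix multiplications sit far above it.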
Tensor Cores: Specialized Matrix Multiplication
Standard CUDA cores do general floating-point operations, one scalar result at a time. Tensor Cores (introduced in Volta, 2017) are hardware units that each perform a small matrix multiply-accumulate (MMA) operation per clock cycle:
$$D = A \times B + C$$
where $A$, $B$, $C$, $D$ are 4×4 (Volta) or larger (Ampere, Hopper) matrices. Tensor Cores achieve 4–16x higher throughput than CUDA cores for matrix operations because each instruction computes a whole tile product rather than a single scalar multiply-add.
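The MMA primitive can be sketched in plain Python to show how a large matrix product decomposes into tile-sized $D = A \times B + C$ steps. This illustrates the arithmetic only, not how the hardware or any CUDA API is programmed. The function names are hypothetical, and the matrix size is assumed to be a multiple of the tile size.

```python
def mma(a, b, c):
    """D = A x B + C for square tiles given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j]
             for j in range(n)] for i in range(n)]


def tiled_matmul(a, b, tile=4):
    """Multiply square matrices by accumulating tile-sized MMA steps."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            acc = [[0.0] * tile for _ in range(tile)]  # accumulator tile C
            for k0 in range(0, n, tile):
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                acc = mma(a_tile, b_tile, acc)
            for i in range(tile):
                out[i0 + i][j0:j0 + tile] = acc[i]
    return out


def naive_matmul(a, b):
    """Reference triple-loop product for comparison."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


a = [[float(i + j) for j in range(8)] for i in range(8)]
b = [[float(i * j % 5) for j in range(8)] for i in range(8)]
print(tiled_matmul(a, b) == naive_matmul(a, b))  # True
```

Each call to `mma` stands in for one Tensor Core operation; the accumulator tile stays local across the inner loop, mirroring how results are kept in registers between MMA steps.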
Supported precisions in H100 Tensor Cores (H100 SXM, dense throughput; structured sparsity doubles the non-FP64 figures):
- FP64: 67 TFLOPS
- TF32 (19-bit): 494 TFLOPS
- FP16/BF16: 989 TFLOPS
- FP8: 1,979 TFLOPS
- INT8: 1,979 TOPS
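The accuracy tradeoff: TF32 (introduced with Ampere and used by default for some operations in frameworks such as PyTorch and TensorFlow) keeps FP32’s 8-bit exponent but uses a 10-bit mantissa instead of FP32’s 23 bits, so it is faster but slightly less precise. Mixed precision training uses FP16/BF16 for most operations and FP32 for accumulation and the master copy of the weights.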
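The precision loss from a 10-bit mantissa can be seen by emulating it on ordinary floats. The sketch below truncates the low 13 mantissa bits of an FP32 value. This is a simplification: real hardware is assumed to round rather than truncate, and `to_tf32` is a hypothetical helper, not a library function.

```python
import struct


def to_tf32(x):
    """Emulate TF32 precision by truncating an FP32 mantissa from 23 to 10 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 lowest mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_tf32(3.14159265))      # 3.140625
print(to_tf32(1.0 + 2 ** -10))  # smallest step above 1.0 that survives
print(to_tf32(1.0 + 2 ** -11))  # 1.0: the increment is lost
```

The relative error stays below about $2^{-10}$ (roughly 0.1%), which is why TF32 is usually acceptable for matrix multiplies whose results are accumulated in FP32.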
NVLink and GPU-to-GPU Communication
A single GPU’s memory (80 GB in H100) limits model size. Large models require multiple GPUs. The speed of inter-GPU communication becomes critical.
NVLink 4.0 (H100): 900 GB/s bidirectional bandwidth per GPU for GPU-to-GPU communication (vs. PCIe 5.0’s 128 GB/s).
NVSwitch: A network switch chip that enables all-to-all NVLink connectivity between up to 8 GPUs in a node (DGX H100). Each of the 8 GPUs can communicate with every other at 900 GB/s simultaneously. Total bisection bandwidth: 3.6 TB/s.
This high bandwidth enables tensor parallelism — splitting a single large matrix multiplication across multiple GPUs. Without fast interconnects, the communication overhead would exceed the computation savings.
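A back-of-the-envelope calculation shows why the interconnect matters. The sketch below uses the headline bidirectional bandwidths quoted above and ignores latency and protocol overhead, so the times are optimistic lower bounds. The 10 GB payload is an arbitrary illustrative size.

```python
NVLINK_GB_PER_S = 900.0  # NVLink 4.0, bidirectional per GPU
PCIE5_GB_PER_S = 128.0   # PCIe 5.0, bidirectional


def transfer_ms(gigabytes, link_gb_per_s):
    """Idealized transfer time in milliseconds (bandwidth only, no latency)."""
    return gigabytes / link_gb_per_s * 1000.0


def bisection_tb_per_s(gpus, per_gpu_gb_per_s):
    """Bandwidth across a cut that splits the GPUs into two equal halves."""
    return (gpus // 2) * per_gpu_gb_per_s / 1000.0


payload_gb = 10.0  # hypothetical activation or gradient volume
print(f"NVLink: {transfer_ms(payload_gb, NVLINK_GB_PER_S):.1f} ms")
print(f"PCIe:   {transfer_ms(payload_gb, PCIE5_GB_PER_S):.1f} ms")
print(f"bisection: {bisection_tb_per_s(8, NVLINK_GB_PER_S):.1f} TB/s")
```

Tensor parallelism exchanges data inside every layer, so a roughly 7x gap in transfer time per exchange compounds quickly over a full forward and backward pass.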
Scale beyond a node: InfiniBand connects nodes (up to 400 Gb/s per port with NDR; the previous HDR generation runs at 200 Gb/s). Google TPUs use custom inter-chip interconnects. For training runs across thousands of GPUs, the network topology determines achievable parallelism efficiency.
The Three Parallelism Strategies
Training very large models (GPT-4 scale) typically combines all three:
Data Parallelism: Copy the model to N GPUs, split the batch across GPUs, average gradients after each step. Scales well; requires gradient synchronization (all-reduce) after each batch — becomes a communication bottleneck at scale.
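The gradient-averaging step can be sketched with a toy one-parameter model. This assumes a scalar model $y = wx$ with squared-error loss and a batch that divides evenly across workers. `all_reduce_mean` is a stand-in written for this example, not the NCCL or PyTorch API.

```python
def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, batch, n_workers, lr=0.01):
    """One SGD step; returns the updated weight held by each replica."""
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_workers)]
    grads = [local_gradient(w, shard) for shard in shards]  # computed independently
    synced = all_reduce_mean(grads)                         # the communication step
    return [w - lr * g for g in synced]


batch = [(float(x), 3.0 * x) for x in range(1, 9)]
single = data_parallel_step(0.0, batch, n_workers=1)[0]
replicas = data_parallel_step(0.0, batch, n_workers=4)
print(single, replicas)
```

With equal-sized shards, averaging the per-shard gradients gives the same update as one worker processing the whole batch, so all replicas stay in sync.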
Tensor Parallelism (Megatron-LM, Shoeybi et al., 2019): Split individual weight matrices across GPUs. For a $[d_{model} \times d_{model}]$ weight matrix across $N$ GPUs, each GPU holds a $[d_{model} \times d_{model}/N]$ slice of the columns. In Megatron-LM this costs two all-reduce operations per transformer layer in the forward pass and two more in the backward pass. Suitable within a node, where NVLink bandwidth is available.
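The column split can be illustrated with plain lists. This sketch covers only the column-parallel half: each simulated GPU multiplies the full input by its own slice of columns, and the slices are concatenated afterwards. Megatron-LM pairs this with a row-parallel layer and an all-reduce, which is omitted here. Function names are hypothetical, and the column count is assumed to be divisible by the GPU count.

```python
def matmul(x, w):
    """x: [batch x d_in], w: [d_in x d_out], both as lists of rows."""
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


def split_columns(w, n_gpus):
    """Give each GPU a contiguous block of d_out / n_gpus columns."""
    cols = len(w[0]) // n_gpus
    return [[row[g * cols:(g + 1) * cols] for row in w] for g in range(n_gpus)]


def column_parallel_matmul(x, w, n_gpus):
    """Each GPU computes x @ w_shard; the outputs are gathered column-wise."""
    shards = split_columns(w, n_gpus)
    partials = [matmul(x, shard) for shard in shards]  # no communication needed
    return [[v for p in partials for v in p[i]] for i in range(len(x))]


x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 2.0]]
w = [[float((i + 1) * (j + 1)) for j in range(4)] for i in range(4)]
print(column_parallel_matmul(x, w, n_gpus=2) == matmul(x, w))  # True
```

Every GPU needs the full input `x` but only $1/N$ of the weights, which is what lets a matrix too large for one device fit across several.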
Pipeline Parallelism (GPipe, Huang et al., 2019): Assign different transformer layers to different GPUs. Micro-batching keeps all GPUs active simultaneously (though with some “pipeline bubble” waste). Suitable across nodes (lower communication bandwidth).
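The size of the pipeline bubble can be estimated with the usual GPipe-style approximation: with $p$ stages and $m$ micro-batches, the idle fraction is $(p-1)/(m+p-1)$. The sketch assumes every stage takes the same time per micro-batch, which real layer assignments only approximate.

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch, three of four GPUs sit idle at any moment; raising the micro-batch count shrinks the bubble, at the cost of holding more activations in memory.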
Modern systems like Megatron-DeepSpeed combine all three — 3D parallelism — to distribute both data and model across thousands of GPUs.
NVIDIA’s AI Hardware Dominance
In 2023, NVIDIA’s A100/H100 GPUs were estimated to hold more than 80% of the market for AI training workloads. This dominance has several mutually reinforcing causes:
CUDA ecosystem lock-in: 17 years of CUDA tooling — cuDNN (neural network primitives), cuBLAS (BLAS operations), NCCL (collective communications), and thousands of CUDA-optimized kernels in PyTorch, TensorFlow, and JAX. Switching to non-NVIDIA hardware requires reimplementing or porting all of this.
Aggressive roadmap: NVIDIA releases new GPU generations every 2–3 years with substantial performance improvements. NVIDIA claimed up to a 6x performance improvement for the H100 over the A100 on transformer workloads, a best-case figure that depends on precision and model size. Competitors struggle to catch up before the next generation.
Software integration: ML frameworks are deeply integrated with NVIDIA hardware, and GPU support in PyTorch and JAX is primarily NVIDIA-first. AMD ROCm and Intel oneAPI lag in framework compatibility and performance.
Customer lock-in dynamics: Companies that build infrastructure around NVIDIA’s CUDA face high switching costs. Training runs take weeks; discovering performance differences mid-training is expensive.
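The 2023 AI boom pushed reported H100 prices to around $40,000 per unit, with wait times of 12 months or more. NVIDIA’s data center revenue grew from about $15B in fiscal 2023 (roughly calendar 2022) to about $47.5B in fiscal 2024 (roughly calendar 2023), driven largely by AI demand.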
One thing to remember: GPU computing’s dominance in AI comes from a combination of raw parallel arithmetic performance and the CUDA software ecosystem — any challenger needs to match both, which is why NVIDIA has maintained dominance despite billions invested by AMD, Intel, Google, and others.