GPU Computing — Core Concepts

CUDA architecture, tensor cores, memory hierarchy, GPU clusters with NVLink, and how NVIDIA achieved monopoly-like control over the AI hardware stack by 2023.

The CUDA Architecture

NVIDIA’s GPU architecture organizes computation in a hierarchy:

Thread: The basic unit of execution. Each thread runs the kernel's instruction stream on its own data.
Warp: 32 threads that execute the same instruction in lockstep (SIMT, Single Instruction, Multiple Threads).
Block: A group of warps scheduled on the same SM, with access to that SM's shared memory.
Grid: All the blocks in a single kernel launch.
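To make the hierarchy concrete, the following Python sketch maps a flat thread index onto its block, warp, and lane. The function name `locate_thread` and the launch size of 256 threads per block are illustrative assumptions, not part of any CUDA API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def locate_thread(global_id: int, threads_per_block: int) -> dict:
    """Map a flat thread index to its position in the grid hierarchy."""
    thread_in_block = global_id % threads_per_block
    return {
        "block": global_id // threads_per_block,
        "thread_in_block": thread_in_block,
        "warp_in_block": thread_in_block // WARP_SIZE,
        "lane": thread_in_block % WARP_SIZE,  # position inside the warp
    }


# Hypothetical launch: 256 threads per block, look up thread 300.
print(locate_thread(300, 256))
```

In real CUDA code the same quantities come from the built-in `blockIdx`, `blockDim`, and `threadIdx` variables.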

A typical modern GPU (H100) has:

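132 streaming multiprocessors (SMs) enabled on the SXM part (the full GH100 die has 144)
Each SM: 128 CUDA cores + 4 Tensor Cores + up to 228 KB of shared memory
Total CUDA cores: 16,896 (18,432 on the full die)
Total FP32 throughput: 67 TFLOPS
Total FP16 throughput (without Tensor Cores): 134 TFLOPS
Tensor Core throughput: 989 TFLOPS dense for BF16 or FP16; the roughly 3.96 PFLOPS headline figure applies to FP8 with structured sparsity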
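The core count and FP32 figure can be cross-checked with simple arithmetic. This is a rough sketch: the 1.98 GHz boost clock is an assumed value, and the 132-SM configuration reflects the shipping H100 SXM part, while 144 SMs is the full GH100 die.

```python
CORES_PER_SM = 128  # FP32 CUDA cores per SM


def cuda_cores(sm_count: int) -> int:
    """Total CUDA cores for a given number of enabled SMs."""
    return sm_count * CORES_PER_SM


def peak_fp32_tflops(sm_count: int, clock_ghz: float) -> float:
    """Peak FP32 rate: one fused multiply-add (2 FLOPs) per core per clock."""
    return cuda_cores(sm_count) * 2 * clock_ghz / 1000.0


print(cuda_cores(144))  # full GH100 die
print(cuda_cores(132))  # H100 SXM as shipped
# Assumed boost clock of 1.98 GHz; lands close to the quoted 67 TFLOPS.
print(round(peak_fp32_tflops(132, 1.98), 1))
```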

Memory hierarchy:

Registers: per-thread, ~256 KB per SM, <1 ns latency
Shared memory (L1): per-block, up to 228 KB per SM (carved out of a 256 KB combined L1/shared memory), ~5 ns latency
L2 cache: 50 MB (H100 SXM), shared across SMs
HBM3 (main GPU memory): 80 GB, ~400 ns, 3.35 TB/s bandwidth

Optimized CUDA code keeps data in registers and shared memory and minimizes HBM3 accesses, because memory bandwidth, not arithmetic throughput, is often the bottleneck.
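One common way to reason about this bottleneck is a roofline estimate: a kernel's attainable throughput is capped either by peak arithmetic or by bandwidth times its arithmetic intensity (FLOPs per byte moved). The sketch below uses the 3.35 TB/s HBM3 bandwidth quoted above and the 989 TFLOPS dense FP16/BF16 Tensor Core figure; the function name and the example kernel are illustrative assumptions.

```python
PEAK_TFLOPS = 989.0     # dense FP16/BF16 Tensor Core throughput
BANDWIDTH_TB_S = 3.35   # HBM3 bandwidth


def attainable_tflops(flops_per_byte: float) -> float:
    """Roofline bound: limited by compute or by memory traffic."""
    return min(PEAK_TFLOPS, flops_per_byte * BANDWIDTH_TB_S)


# Intensity needed before the kernel stops being bandwidth-bound.
ridge_point = PEAK_TFLOPS / BANDWIDTH_TB_S
print(round(ridge_point, 1))  # FLOPs per byte

# Element-wise FP16 add: 1 FLOP per 6 bytes (two 2-byte reads, one write).
print(round(attainable_tflops(1 / 6), 3))
```

Under these assumptions a kernel needs roughly 295 FLOPs per byte of HBM traffic to saturate the Tensor Cores, which is why element-wise operations run far below peak and large matrix multiplications do not.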

Tensor Cores: Specialized Matrix Multiplication

Standard CUDA cores do general floating-point operations. Tensor Cores (introduced in Volta, 2017) are hardware units that each perform a small matrix multiply-accumulate (MMA) operation per clock cycle:

$$D = A \times B + C$$

Where A, B, C, D are 4×4 (Volta) or larger (Ampere, Hopper) matrices. Tensor Cores achieve 4–16x higher throughput than CUDA cores for matrix operations because one instruction performs a whole tile of fused multiply-adds rather than a single scalar operation per thread.
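As a reference for what one MMA computes, here is a plain Python version of $D = A \times B + C$ on 4×4 tiles. It is only a functional sketch: the hardware performs all 64 multiply-adds of the tile as one operation, whereas this loop does them one at a time.

```python
def mma(a, b, c):
    """Return D = A x B + C for small dense matrices given as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
b_tile = [[i * 4 + j for j in range(4)] for i in range(4)]
ones = [[1] * 4 for _ in range(4)]

# With A = identity, the result is simply B + C.
for row in mma(identity, b_tile, ones):
    print(row)
```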

Supported precisions in H100 Tensor Cores (SXM, dense throughput without sparsity):

FP64: 67 TFLOPS
TF32 (19-bit): 494 TFLOPS
FP16/BF16: 989 TFLOPS
FP8: 1979 TFLOPS
INT8: 1979 TOPS

The accuracy tradeoff: TF32 (available since Ampere, and enabled by default for some FP32 operations in several frameworks) keeps FP32's 8-bit exponent but uses a 10-bit mantissa instead of 23 bits, so it is faster but slightly less precise. Mixed precision training uses FP16/BF16 for most operations and FP32 for accumulation and the master copy of the weights.
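The precision loss from a 10-bit mantissa can be illustrated by clearing the low 13 mantissa bits of an FP32 value. This is a simplified sketch that truncates, where real hardware applies its own rounding, and the function name `truncate_to_tf32` is hypothetical.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Keep FP32's sign and 8-bit exponent but only 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # zero the 13 least significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 2**-10 is the smallest step above 1.0 that survives; 2**-11 is lost.
print(truncate_to_tf32(1.0 + 2 ** -10))
print(truncate_to_tf32(1.0 + 2 ** -11))
```

The relative spacing of representable values near 1.0 grows from about 1.2e-7 in FP32 to about 9.8e-4 in TF32, which is why accumulation is still done in FP32.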

NVLink and GPU-to-GPU Communication

A single GPU’s memory (80 GB in H100) limits model size. Large models require multiple GPUs. The speed of inter-GPU communication becomes critical.

NVLink 4.0 (H100): 900 GB/s bidirectional bandwidth per GPU for GPU-to-GPU communication (vs. PCIe 5.0’s 128 GB/s).
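A back-of-the-envelope comparison shows what the gap means in practice. The sketch treats the quoted figures as ideal sustained rates and ignores latency and protocol overhead; both numbers are bidirectional aggregates, so one-way transfers take about twice as long on either link and the ratio stays the same. The 10 GB payload is an arbitrary example.

```python
NVLINK_GB_S = 900.0  # NVLink 4.0, per GPU
PCIE5_GB_S = 128.0   # PCIe 5.0 x16


def transfer_ms(gigabytes: float, link_gb_per_s: float) -> float:
    """Idealized time in milliseconds to move a payload over a link."""
    return gigabytes / link_gb_per_s * 1000.0


payload_gb = 10.0  # hypothetical activation or gradient payload
print(round(transfer_ms(payload_gb, NVLINK_GB_S), 1))
print(round(transfer_ms(payload_gb, PCIE5_GB_S), 1))
print(round(NVLINK_GB_S / PCIE5_GB_S, 1))  # speedup factor
```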

NVSwitch: A network switch chip that enables all-to-all NVLink connectivity between up to 8 GPUs in a node (DGX H100). Each of the 8 GPUs can communicate with every other at 900 GB/s simultaneously. Total bisection bandwidth: 3.6 TB/s.

This high bandwidth enables tensor parallelism — splitting a single large matrix multiplication across multiple GPUs. Without fast interconnects, the communication overhead would exceed the computation savings.

Scale beyond a node: InfiniBand connects nodes (200 Gb/s per port with HDR, 400 Gb/s with NDR). Google TPUs use custom inter-chip interconnects. For training runs across thousands of GPUs, the network topology determines achievable parallelism efficiency.

The Three Parallelism Strategies

Training very large models (GPT-4 scale) typically combines all three:

Data Parallelism: Copy the model to N GPUs, split the batch across GPUs, average gradients after each step. Scales well; requires gradient synchronization (all-reduce) after each batch — becomes a communication bottleneck at scale.
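The gradient-averaging step can be checked on a toy problem. The sketch below fits a one-parameter model $y = w x$ with mean squared error, splits the batch across hypothetical workers, and averages their local gradients in place of a real all-reduce. It also estimates per-worker traffic for a ring all-reduce, one common implementation that is assumed here rather than taken from the text.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, n_workers):
    """Each worker computes a gradient on its shard; results are averaged."""
    shard = len(xs) // n_workers  # assumes the batch divides evenly
    local = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_workers)
    ]
    return sum(local) / n_workers  # stand-in for the all-reduce average


def ring_allreduce_bytes(payload_bytes, n_workers):
    """Bytes each worker sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (n_workers - 1) / n_workers * payload_bytes


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
print(grad_mse(0.5, xs, ys))               # full-batch gradient
print(data_parallel_grad(0.5, xs, ys, 4))  # same value from 4 shards
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so data parallelism changes where the work happens but not the update. The ring formula shows why communication matters: per-worker traffic approaches twice the gradient size as the worker count grows.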

Tensor Parallelism (Megatron-LM, Shoeybi et al., 2019): Split individual weight matrices across GPUs. For a $[d_{model} \times d_{model}]$ weight matrix across $N$ GPUs, each GPU holds a $[d_{model} \times d_{model}/N]$ slice of the columns. Requires two all-reduce operations per transformer layer in the forward pass and two more in the backward pass. Suitable within a node (NVLink bandwidth).
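The column split can be verified with a small example. In this sketch each hypothetical GPU multiplies the input by its own slice of columns, and concatenating the partial outputs reproduces the full product. It shows only the column-wise half of the scheme; Megatron-LM pairs it with a row-wise split of the following matrix, which is where the all-reduce comes in.

```python
def matmul(x, w):
    """Plain matrix product of x (rows x inner) and w (inner x cols)."""
    inner, cols = len(w), len(w[0])
    return [[sum(row[k] * w[k][j] for k in range(inner)) for j in range(cols)]
            for row in x]


def split_columns(w, n_gpus):
    """Give each hypothetical GPU a contiguous slice of w's columns."""
    step = len(w[0]) // n_gpus  # assumes the column count divides evenly
    return [[row[g * step:(g + 1) * step] for row in w] for g in range(n_gpus)]


def column_parallel_matmul(x, w, n_gpus):
    """Compute x @ w shard by shard, then concatenate the column blocks."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_gpus)]
    # Stand-in for the gather step: join each row's partial outputs.
    return [[v for p in partials for v in p[r]] for r in range(len(x))]


x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
w = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
print(column_parallel_matmul(x, w, 2) == matmul(x, w))
```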

Pipeline Parallelism (GPipe, Huang et al., 2019): Assign different transformer layers to different GPUs. Micro-batching keeps all GPUs active simultaneously (though with some “pipeline bubble” waste). Suitable across nodes (lower communication bandwidth).
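The size of the pipeline bubble follows from a simple schedule. Assuming every stage takes one time step per micro-batch (a simplification that ignores the backward pass and uneven layers), the idle fraction with $p$ stages and $m$ micro-batches is $(p - 1)/(m + p - 1)$. The sketch below computes it both from the formula and by counting busy slots in a simulated schedule.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Closed-form idle fraction for an idealized forward-only pipeline."""
    return (stages - 1) / (micro_batches + stages - 1)


def simulated_idle_fraction(stages: int, micro_batches: int) -> float:
    """Count busy slots when stage s handles micro-batch m at step s + m."""
    total_steps = micro_batches + stages - 1
    busy = 0
    for step in range(total_steps):
        for stage in range(stages):
            if 0 <= step - stage < micro_batches:
                busy += 1
    return 1 - busy / (stages * total_steps)


# Hypothetical 4-stage pipeline with 1 and then 16 micro-batches.
print(bubble_fraction(4, 1))
print(round(bubble_fraction(4, 16), 3))
```

Under this model a single batch leaves three of four GPUs idle at any time, while 16 micro-batches cut the waste to about 16%, which is the motivation for micro-batching.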

Modern systems like Megatron-DeepSpeed combine all three — 3D parallelism — to distribute both data and model across thousands of GPUs.

NVIDIA’s AI Hardware Dominance

In 2023, NVIDIA’s A100/H100 GPUs held an estimated 80%+ market share in AI training workloads. This near-monopoly has several mutually reinforcing causes:

CUDA ecosystem lock-in: 17 years of CUDA tooling — cuDNN (neural network primitives), cuBLAS (BLAS operations), NCCL (collective communications), and thousands of CUDA-optimized kernels in PyTorch, TensorFlow, and JAX. Switching to non-NVIDIA hardware requires reimplementing or porting all of this.

Aggressive roadmap: NVIDIA releases new GPU generations every 2–3 years with substantial performance improvements. By NVIDIA’s own figures, the H100 delivered up to a 6x performance improvement over the A100 on transformer workloads. Competitors struggle to catch up before the next generation.

Software integration: GPU support in ML frameworks such as PyTorch and JAX is primarily NVIDIA-first. AMD ROCm and Intel oneAPI lag in framework compatibility and performance.

Customer lock-in dynamics: Companies that build infrastructure around NVIDIA’s CUDA face high switching costs. Training runs take weeks; discovering performance differences mid-training is expensive.

The 2023 AI boom pushed reported H100 prices to around $40,000/unit and wait times past 12 months. NVIDIA’s data center revenue grew from $15B (fiscal 2023) to $47.5B (fiscal 2024), driven largely by AI demand.

One thing to remember: GPU computing’s dominance in AI comes from a combination of raw parallel arithmetic performance and the CUDA software ecosystem — any challenger needs to match both, which is why NVIDIA has maintained dominance despite billions invested by AMD, Intel, Google, and others.
