Neural Networks — Deep Dive

From the math of backpropagation to the transformer architecture that powers GPT-4: a technical map of how modern neural networks actually work, and where they break.

The Mathematics of Learning

A neural network is, at its core, a parameterized function. Given input vector x, it produces output ŷ by composing many simpler functions through layers. The goal of training is to find parameter values (weights W and biases b) that minimize a loss function L(ŷ, y) over a training dataset.

A fully connected layer computes the following, where each row of W holds the weights of one neuron:

z = W · x + b
a = f(z)

Where f is an activation function. Stacking many such computations across layers gives you a deep network.
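As a minimal sketch of the layer computation above, the following pure-Python function evaluates a = f(W · x + b) for a small layer. The weights, biases, and input are made-up values chosen for illustration, and ReLU is assumed as the activation f; real code would use a tensor library rather than nested lists.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def dense_forward(W, b, x, f):
    """Compute a = f(W . x + b) for one fully connected layer.

    W is a list of rows (one row of weights per neuron), b holds one
    bias per neuron, x is the input vector, f is the activation.
    """
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return [f(z_i) for z_i in z]

# Hypothetical 2-neuron layer acting on a 3-dimensional input.
W = [[0.5, -1.0, 2.0],
     [1.0,  1.0, 1.0]]
b = [0.0, -10.0]
x = [1.0, 2.0, 3.0]

a = dense_forward(W, b, x, relu)
print(a)  # first neuron: z = 4.5; second neuron: z = -4.0, clipped to 0 by ReLU
```

Stacking layers amounts to feeding the returned list into the next call to dense_forward.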

Activation Functions: More Than a Detail

The choice of activation function shapes what a network can learn. The universal approximation theorem (Cybenko, 1989) guarantees that a network with a single hidden layer, a suitable non-linear activation, and enough hidden units can approximate any continuous function on a compact (closed and bounded) domain to arbitrary accuracy. It says nothing, however, about how many units are needed or whether training will actually find those weights.

Function	Formula	When Used
Sigmoid	1 / (1 + e⁻ˣ)	Output layer for binary classification
Tanh	(eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)	Hidden layers in RNNs
ReLU	max(0, x)	Default for deep hidden layers
GELU	x · Φ(x)	Transformer blocks (GPT, BERT)
Softmax	exp(xᵢ) / Σⱼ exp(xⱼ)	Output layer for multi-class classification

ReLU’s dominance is largely practical: its gradient is exactly 1 for positive inputs, so it doesn’t saturate there (which mitigates vanishing gradients), it is fast to compute, and it produces sparse activations, which are often credited with helping generalization. However, “dying ReLU” (neurons whose input stays negative, so they output zero and receive zero gradient) is a real failure mode, mitigated by Leaky ReLU or ELU variants.
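The functions in the table can be written in a few lines each. The sketch below uses only Python's math module; the helper names (sigmoid_grad, relu_grad, leaky_relu and its slope of 0.01) are illustrative choices, not taken from any particular library. The gradient helpers make the saturation argument concrete: at x = 10 the sigmoid's gradient is nearly zero while ReLU's is still 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # A small negative-side slope keeps a gradient flowing for x < 0.
    return x if x > 0 else slope * x

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(xs):
    # Subtracting the max before exponentiating avoids overflow.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

print(sigmoid_grad(10.0))  # tiny: the sigmoid has saturated
print(relu_grad(10.0))     # 1.0: no saturation on the positive side
print(relu_grad(-3.0))     # 0.0: the "dying ReLU" regime
```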

Backpropagation in Detail

Backpropagation is the algorithm that computes gradients efficiently by applying the chain rule through the network’s computation graph.

During the forward pass, intermediate activations are stored. During the backward pass, you compute the gradient of the loss with respect to each weight:

∂L/∂W = ∂L/∂a · ∂a/∂z · ∂z/∂W

This chain telescopes backward through every layer. What makes it tractable is that each term can be computed locally — each layer only needs the gradient from the layer above, not the full network.
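To make the chain rule concrete, here is a small sketch for a single sigmoid neuron with a squared-error loss (both are assumptions made for this example; the text does not fix a loss or activation). The backward function multiplies the three local terms ∂L/∂a, ∂a/∂z, and ∂z/∂W, and a finite-difference check confirms the result numerically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, sigmoid(z)

def loss(a, y):
    return 0.5 * (a - y) ** 2

def backward(w, b, x, y):
    """Gradients of the loss w.r.t. w and b via the chain rule."""
    _, a = forward(w, b, x)
    dL_da = a - y                        # derivative of 0.5 * (a - y)^2
    da_dz = a * (1.0 - a)                # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    dL_dw = [dL_dz * x_i for x_i in x]   # dz/dw_i = x_i
    dL_db = dL_dz                        # dz/db = 1
    return dL_dw, dL_db

def numerical_grad_w(w, b, x, y, i, eps=1e-6):
    """Central finite difference for dL/dw_i, used only as a check."""
    w_plus, w_minus = list(w), list(w)
    w_plus[i] += eps
    w_minus[i] -= eps
    l_plus = loss(forward(w_plus, b, x)[1], y)
    l_minus = loss(forward(w_minus, b, x)[1], y)
    return (l_plus - l_minus) / (2.0 * eps)

# Made-up parameters and a single training example.
w, b, x, y = [0.3, -0.2], 0.1, [1.0, 2.0], 1.0
grad_w, grad_b = backward(w, b, x, y)
for i in range(len(w)):
    print(grad_w[i], numerical_grad_w(w, b, x, y, i))  # the two should agree closely
```

In a multi-layer network, dL_dz for one layer is what gets passed down to the layer below, which is the "local" property described above.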

Modern frameworks (PyTorch, JAX, TensorFlow) implement automatic differentiation: they record the forward computation as a graph and reverse it automatically. You write the forward pass; the library derives the gradients.

Gradient Descent Variants

Vanilla stochastic gradient descent (SGD) is rarely used in practice. Key improvements:

Momentum: accumulates a velocity vector in the direction of consistent gradients, helping escape local plateaus.
Adam: keeps per-parameter running averages of both gradients and squared gradients, which gives each parameter its own adaptive learning rate. It is the default for most language models, usually in its AdamW form with decoupled weight decay.
AdaFactor: memory-efficient alternative used in very large models (T5, PaLM).
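The update rules for momentum and Adam can be sketched in plain Python as follows. The hyperparameter defaults are the commonly quoted ones, and the test function f (a quadratic that is 100 times steeper in one coordinate) is a made-up example, chosen because it shows what "adaptive learning rate per parameter" buys: Adam moves both coordinates at nearly the same pace despite the difference in gradient scale.

```python
import math

def sgd_momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """One SGD-with-momentum update; returns new parameters and velocity."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = [beta1 * m_i + (1 - beta1) * g for m_i, g in zip(m, grad)]
    v = [beta2 * v_i + (1 - beta2) * g * g for v_i, g in zip(v, grad)]
    # Bias correction compensates for m and v starting at zero.
    m_hat = [m_i / (1 - beta1 ** t) for m_i in m]
    v_hat = [v_i / (1 - beta2 ** t) for v_i in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

def grad_f(theta):
    # Gradient of the illustrative function f = theta_0^2 + 100 * theta_1^2.
    return [2.0 * theta[0], 200.0 * theta[1]]

theta = [1.0, 1.0]
m, v = [0.0, 0.0], [0.0, 0.0]
for t in range(1, 501):
    theta, m, v = adam_step(theta, grad_f(theta), m, v, t, lr=0.05)
print(theta)  # both coordinates end up close to the minimum at 0
```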

The learning rate is arguably the most impactful hyperparameter. Too large: training diverges. Too small: training is slow and may get stuck. Modern practice uses learning rate schedules — warm-up phase, then cosine decay — rather than a single fixed value.
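A warm-up-then-cosine schedule of the kind described here might look like the sketch below. The function name lr_at_step and all default values (peak rate, warm-up length, total steps) are hypothetical; real training runs tune them per model.

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

for step in (0, 500, 999, 5500, 10000):
    print(step, lr_at_step(step))
```

The warm-up phase keeps early updates small while Adam's moment estimates are still noisy; the cosine tail lowers the rate smoothly so training settles rather than bouncing around a minimum.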

Convolutional Networks: Why Locality Works

CNNs replace the dense weight matrices of fully-connected layers with convolutional filters: small matrices (e.g., 3×3 or 5×5) that slide across the input, computing a dot product at each position. Two properties make this powerful:

Parameter sharing: the same filter applies everywhere in the image, massively reducing parameters compared to fully connected layers.
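Translation equivariance: a filter that detects horizontal edges responds to an edge wherever it appears, and shifting the input shifts the response by the same amount. (True invariance, where the output does not change at all, comes only approximately, from pooling and later layers.)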
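Both properties can be seen in a tiny example. The sketch below implements a stride-1 convolution without padding (strictly a cross-correlation, which is what CNN layers compute) and applies one hand-written 3×3 edge filter to two made-up 6×6 images whose edge sits at different heights. The shifted edge produces a correspondingly shifted response, and the filter needs 9 weights where a dense layer between the same input and output sizes would need 576.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def edge_image(edge_row, size=6):
    """Dark (0.0) rows above edge_row, bright (1.0) rows from edge_row down."""
    return [[1.0 if r >= edge_row else 0.0 for _ in range(size)]
            for r in range(size)]

# Horizontal-edge detector: bottom row minus top row.
kernel = [[-1.0, -1.0, -1.0],
          [ 0.0,  0.0,  0.0],
          [ 1.0,  1.0,  1.0]]

resp_a = conv2d(edge_image(2), kernel)  # edge between rows 1 and 2
resp_b = conv2d(edge_image(3), kernel)  # same edge, one row lower

conv_params = len(kernel) * len(kernel[0])   # 9 shared weights
dense_params = (6 * 6) * (4 * 4)             # 36 inputs fully connected to 16 outputs

for row_a, row_b in zip(resp_a, resp_b):
    print(row_a, row_b)
```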

Pooling layers (max pool, average pool) reduce spatial dimensions, building in partial invariance to small shifts and reducing computation.

Residual connections (introduced in ResNet, 2015) allow the gradient to flow directly to early layers via skip connections:

output = F(x) + x

This largely mitigated the vanishing gradient and degradation problems that had made very deep networks (100+ layers) hard to train, and skip connections are now nearly universal.
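Why the skip connection helps can be illustrated with a toy model. In the sketch below each "layer" is a scalar function F(x) = tanh(w · x), an assumption made purely to keep the arithmetic visible. Without the skip, the gradient with respect to the input is a product of 50 factors that are each at most 0.5, so it collapses toward zero. With the skip, each factor is F'(x) + 1, which is at least 1 here, so the gradient survives.

```python
import math

def layer(x, w):
    # Toy scalar layer F(x) = tanh(w * x).
    return math.tanh(w * x)

def layer_grad(x, w):
    # dF/dx = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)

def input_gradient(x, weights, residual):
    """d(output)/d(input) through a stack of scalar layers, via the chain rule."""
    grad = 1.0
    for w in weights:
        local = layer_grad(x, w)
        if residual:
            grad *= local + 1.0      # d/dx [F(x) + x] = F'(x) + 1
            x = layer(x, w) + x
        else:
            grad *= local
            x = layer(x, w)
    return grad

weights = [0.5] * 50                 # 50 layers with a made-up weight
plain = input_gradient(0.1, weights, residual=False)
skip = input_gradient(0.1, weights, residual=True)
print(plain)  # vanishingly small
print(skip)   # at least 1
```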

Transformers: The Architecture That Changed Everything

The transformer (Vaswani et al., “Attention Is All You Need”, 2017) displaced RNNs for sequence modeling and now underlies most frontier AI systems.

The core innovation is scaled dot-product attention:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Where Q (query), K (key), and V (value) are linear projections of the input, and dₖ is the dimension of the key vectors. Intuitively: for each position in the sequence, compute how relevant every position is to it, and take a weighted sum of their values. Dividing by √dₖ keeps the dot products from growing with dimension and pushing the softmax into saturation. All positions are processed in parallel, without the step-by-step recurrence of an RNN.
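The formula translates almost line for line into code. This is a dependency-free sketch for a single head on a made-up 3-token sequence with dₖ = 2; the Q, K, and V values are written by hand rather than produced by learned projections, and a real implementation would use batched matrix operations.

```python
import math

def softmax(row):
    m = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q and K are lists of d_k-dimensional vectors, V is a list of value
    vectors, one per sequence position. Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)            # relevance of every position to this query
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical inputs: 3 positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for w_row, out_row in zip(weights, outputs):
    print(w_row, out_row)
```

Each row of weights sums to 1, so every output is a weighted average of the value vectors.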

Multi-head attention runs this mechanism in parallel with different projections, learning different relationship types simultaneously (syntax vs. semantics, local vs. global dependencies).

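A transformer block stacks the following (shown in the post-norm order of the original paper; GPT-2 and many later models apply LayerNorm before each sub-layer instead):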

Multi-head self-attention
Add & LayerNorm (residual)
Feed-forward network (two linear layers with a non-linearity between them: ReLU in the original paper, GELU in BERT and GPT)
Add & LayerNorm

GPT models are decoder-only transformers with causal masking (each token can only attend to previous tokens). BERT is encoder-only with bidirectional attention. T5 uses a full encoder-decoder architecture.
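Causal masking is usually implemented by setting the attention scores for future positions to negative infinity before the softmax, so their weights come out as exactly zero. The sketch below shows this on a made-up 4×4 score matrix of all zeros; the function name causal_softmax is illustrative.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]   # equal raw scores everywhere
weights = causal_softmax(scores)
for row in weights:
    print(row)
# Row i spreads its weight evenly over positions 0..i and gives 0 to the rest.
```

Dropping the mask (attending to all positions) gives the bidirectional behaviour of an encoder such as BERT.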

Scaling Laws

Kaplan et al. (2020, OpenAI) showed empirically that language model loss falls as a power law in model size, dataset size, and compute budget, and does so predictably across many orders of magnitude. This gave the field a recipe: more parameters + more data + more compute = better models, with diminishing but consistent returns.
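"Diminishing but consistent returns" has a precise meaning under a power law. The sketch below uses the single-variable form L(N) = (N_c / N)^α for model size N, which applies when data and compute are not the bottleneck. The constants ALPHA and N_C are illustrative placeholders, not the paper's fitted values. Whatever the constants, every tenfold increase in parameters multiplies the loss by the same factor, 10^(−α).

```python
def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by L(N) = (N_c / N) ** alpha for a model with N parameters."""
    return (n_c / n_params) ** alpha

ALPHA = 0.08   # illustrative exponent, not a fitted value
N_C = 1e14     # illustrative scale constant, not a fitted value

sizes = [1e8, 1e9, 1e10, 1e11]
losses = [power_law_loss(n, N_C, ALPHA) for n in sizes]
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]

for n, loss in zip(sizes, losses):
    print(f"{n:.0e} parameters -> predicted loss {loss:.3f}")
print(ratios)  # the same ratio at every tenfold step
```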

GPT-3 (175B parameters) validated this. GPT-4 pushed it further; OpenAI has not disclosed its size or architecture, though unofficial estimates range from hundreds of billions to over a trillion parameters in a mixture-of-experts design.

Regularization and Generalization

A network with millions of parameters can memorize its training data. Making it generalize to new data is the core challenge.

Key techniques:

Dropout: randomly zero out activations during training (typically 10–50%), which forces the network not to rely on any single neuron. It can be interpreted as approximately training an ensemble of many thinned sub-networks that share weights.
Weight decay (L2 regularization): penalizes large weights, biasing the network toward simpler solutions.
Batch normalization: normalizes activations within a mini-batch, stabilizing training and acting as mild regularization.
Data augmentation: artificially expand the training set with transformed versions of examples (flips, crops, color jitter for images; synonym replacement for text).
Early stopping: halt training when validation loss stops improving, before the model starts memorizing.
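Of the techniques above, dropout is the easiest to show in a few lines. This sketch implements the common "inverted" variant, in which surviving activations are scaled up by 1 / (1 − p) during training so that nothing needs to change at inference time. The function signature and the seeded random generator are choices made for this example.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p during training
    and scale survivors by 1 / (1 - p); act as the identity at inference."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                 # seeded so the run is repeatable
hidden = [1.0] * 10000                 # made-up layer output
train_out = dropout(hidden, p=0.2, rng=rng)
eval_out = dropout(hidden, p=0.2, training=False)

print(sum(train_out) / len(train_out))  # close to 1.0: the expected value is preserved
print(eval_out == hidden)               # True: no change at inference
```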

Known Failure Modes

Adversarial Examples

Neural networks can be fooled by imperceptible perturbations to input data. Adding a carefully crafted noise pattern (invisible to humans) to an image of a panda can make a classifier confidently output “gibbon” (Goodfellow et al., 2014). This isn’t a quirk of a few weak models — it’s a fundamental property of the high-dimensional geometry neural networks operate in. Adversarial robustness remains an open research problem.
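The high-dimensional effect can be reproduced with a model far simpler than an image classifier. The sketch below applies the fast gradient sign method from the cited paper to a made-up 100-dimensional logistic regression model: each input coordinate moves by only 0.1, but the changes all push the logit the same way, so they add up to a shift of 10 and flip a confident prediction. The weights, input, and epsilon are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """Fast gradient sign method: x_adv = x + eps * sign(dL/dx).

    For logistic regression with cross-entropy loss, dL/dx_i = (p - y) * w_i.
    """
    p = predict(w, b, x)
    grad = [(p - y) * w_i for w_i in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [x_i + eps * sign(g) for x_i, g in zip(x, grad)]

# Hypothetical classifier and input, both invented for this example.
n = 100
w = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
b = 0.0
x = [0.05 if i % 2 == 0 else -0.05 for i in range(n)]
y = 1.0

x_adv = fgsm(w, b, x, y, eps=0.1)
print(predict(w, b, x))      # confident "class 1"
print(predict(w, b, x_adv))  # confident "class 0" after a small per-coordinate change
```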

Distribution Shift

A network trained on data from one distribution can fail silently when deployed in a slightly different context. ImageNet-trained classifiers with very high benchmark accuracy degraded significantly when tested on photos taken in different countries or with different camera hardware. Self-driving systems trained primarily in California weather performed worse in Michigan winters.

Shortcut Learning

Networks exploit correlations that work in the training set but don’t generalize. In one famous case, a skin cancer classifier was detecting the presence of rulers in photos (which doctors use when photographing suspicious lesions) rather than the lesion itself. The ruler correlated with “malignant” in training data because serious cases were more carefully photographed. The network learned the shortcut.

Hallucination in Language Models

Large language models generate fluent, confident text that can be factually wrong. The model learns to produce text that looks like correct answers — high-frequency, grammatically natural continuations — not to actually retrieve or verify facts. This is a structural issue, not a calibration bug.

The Interpretability Problem

Despite interpretability research (mechanistic interpretability, LIME, SHAP, saliency maps), neural networks remain largely opaque. You can identify which input pixels influenced a prediction, but not why the network learned to use them. Circuits-level analysis of small models has revealed recognizable components (curve detectors, frequency detectors in vision models), but this hasn’t scaled to production-size models.

This opacity creates real problems in regulated domains: credit scoring, medical diagnosis, criminal justice. The EU’s AI Act imposes transparency and human-oversight requirements on high-risk AI systems, leaving a gap between what regulation expects and what current interpretability techniques can deliver.

Production Considerations

Running large neural networks at scale requires significant engineering:

Quantization: reduce precision from FP32 to FP16 or INT8, cutting memory and speeding inference with acceptable accuracy loss.
Pruning: remove weights close to zero. Sparse networks can retain most performance at 50–90% fewer parameters.
Distillation: train a small “student” network to mimic a large “teacher” network. DistilBERT retains about 97% of BERT’s language-understanding performance while being 40% smaller and 60% faster.
KV-cache: transformer inference stores computed key/value matrices for previously processed tokens. Critical for making autoregressive generation fast enough for real-time use.
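As an illustration of the first item, here is a minimal sketch of symmetric per-tensor INT8 quantization: weights are divided by a single scale factor, rounded to integers in [−127, 127], and multiplied back at inference. The weight values are made up, and production toolchains typically add per-channel scales and calibration, which this sketch leaves out.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with q an integer in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [q_i * scale for q_i in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.61]   # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - r) for a, r in zip(weights, restored))

print(q)          # integers that fit in one byte each
print(max_error)  # bounded by half the scale factor
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error of at most scale / 2 per weight; note how the smallest weight (0.003) rounds to zero.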

One Thing to Remember

Neural networks are not intelligent systems that understand the world — they are extraordinary pattern-matching engines that find statistical regularities in whatever data they were trained on. Their power and their failure modes both flow from the same source: they optimize aggressively for the patterns in their training distribution, whether those patterns reflect genuine structure or accidental noise.
