Machine Learning — Deep Dive

From gradient descent internals to the bias-variance tradeoff, transformer architectures, and why scaling laws rewrote the rules of what's possible in ML.

Overview

Machine learning is, at its core, a numerical optimization problem. You define a function with adjustable parameters (anywhere from a handful to hundreds of billions). You define a way to measure how wrong the function’s outputs are. You then run an algorithm that nudges those parameters in the direction that reduces wrongness, iteratively, until the function is good enough.

That description covers everything from logistic regression to GPT-4. The differences lie in the architecture of the function, the optimization strategy, and the data regime.

The Optimization Loop in Detail

Loss Functions

The loss function is how “wrong” is quantified. The choice of loss function shapes what the model optimizes for — and getting it wrong can produce models that score well on paper but fail in production.

Mean Squared Error (MSE): For regression tasks. Penalizes large errors heavily. Sensitive to outliers.
```
L = (1/n) * Σ(yᵢ - ŷᵢ)²
```
Cross-Entropy Loss: For classification. Measures the divergence between the predicted probability distribution and the true label distribution.
```
L = -Σ yᵢ * log(ŷᵢ)
```
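When the model assigns probability 0.01 to the correct class, this loss is large (with the natural log, about 4.6). When it assigns 0.99, the loss is nearly zero (about 0.01). This drives the model toward confident, correct predictions.
RLHF Reward Models: In modern LLMs, a separate neural network is trained to score outputs according to human preference. That score serves as the reward in a reinforcement learning step (commonly PPO) rather than as a conventional loss, and the language model is updated to produce higher-scoring outputs. This is how ChatGPT was fine-tuned to be helpful rather than just statistically likely.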
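
The two closed-form losses above are small enough to write out directly. The following is a minimal sketch in plain Python; the function names and the `eps` guard against `log(0)` are illustrative choices, not part of the formulas.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum((y - y_hat)^2)."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one example: -sum(y_i * log(y_hat_i)).

    y_true is a one-hot (or soft) label vector, y_pred a probability vector.
    eps only prevents log(0); it is an implementation detail (assumption).
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

# One error of 10 outweighs three errors of 1: MSE is sensitive to outliers.
print("MSE with outlier:", mse([0, 0, 0, 0], [1, 1, 1, 10]))

# The correct class is index 0 in both cases.
print("CE, confident and wrong:", cross_entropy([1, 0], [0.01, 0.99]))
print("CE, confident and right:", cross_entropy([1, 0], [0.99, 0.01]))
```

The outlier example shows why MSE is often swapped for a more robust loss when a dataset contains extreme values.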

Gradient Descent and Backpropagation

Gradient descent requires knowing which direction to nudge each parameter. With millions of parameters, computing this naively would be intractable. Backpropagation solves this via the chain rule of calculus — it propagates the error signal backward through the network layers, computing each parameter’s contribution to the total loss in a single backward pass.
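
To make the chain rule concrete, here is a hand-written backward pass for a deliberately tiny network, `y_hat = w2 * relu(w1 * x + b1) + b2`, with squared-error loss. The network shape, parameter names, and input values are assumptions chosen for illustration. The gradients are checked against finite differences, a common sanity test for backpropagation code.

```python
def forward_backward(x, y, w1, b1, w2, b2):
    """Return the loss and its gradient with respect to each parameter."""
    # Forward pass: keep the intermediate values for reuse below.
    z = w1 * x + b1
    h = max(0.0, z)              # ReLU
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2

    # Backward pass: apply the chain rule from the loss back to the inputs.
    d_yhat = 2.0 * (y_hat - y)   # dL/dy_hat
    d_w2 = d_yhat * h
    d_b2 = d_yhat
    d_h = d_yhat * w2
    d_z = d_h * (1.0 if z > 0 else 0.0)   # derivative of ReLU
    d_w1 = d_z * x
    d_b1 = d_z
    return loss, {"w1": d_w1, "b1": d_b1, "w2": d_w2, "b2": d_b2}

def numerical_grad(x, y, params, name, eps=1e-6):
    """Central finite difference for one parameter (slow, used only to check)."""
    hi = dict(params)
    lo = dict(params)
    hi[name] += eps
    lo[name] -= eps
    loss_hi, _ = forward_backward(x, y, **hi)
    loss_lo, _ = forward_backward(x, y, **lo)
    return (loss_hi - loss_lo) / (2 * eps)

params = {"w1": 0.5, "b1": 0.1, "w2": -1.5, "b2": 0.2}
loss, grads = forward_backward(2.0, 1.0, **params)
for name in params:
    print(name, "backprop:", grads[name],
          "numerical:", numerical_grad(2.0, 1.0, params, name))
```

The finite-difference check needs two forward passes per parameter, which is exactly the cost backpropagation avoids: one backward pass yields every gradient at once.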

Mini-batch SGD: Rather than computing gradients over the entire dataset (expensive) or a single sample (noisy), modern training uses mini-batches of 32–1024 samples. This gives a noisy but fast estimate of the gradient, with the noise providing a useful regularization effect.
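
A minimal sketch of mini-batch SGD, fitting a one-variable linear model to synthetic data. The data-generating line (`y = 3x + 2` plus noise), the learning rate, and the batch size are all assumptions picked so the example converges quickly.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Fit y ≈ w*x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # new batch composition each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i]         # d/dw of err^2
                grad_b += 2 * err                 # d/db of err^2
            # Average over the batch: a noisy estimate of the full gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(512)]
ys = [3.0 * x + 2.0 + data_rng.gauss(0, 0.1) for x in xs]

w, b = minibatch_sgd(xs, ys)
print("learned w, b:", w, b)   # should land near the true values 3 and 2
```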

Adaptive optimizers: Adam (Adaptive Moment Estimation), introduced by Kingma & Ba in 2014, keeps running estimates of the mean and uncentered variance of each parameter’s gradient and uses them to scale each parameter’s step size individually. It’s the default optimizer for most deep learning work because it often converges faster than plain SGD and is less sensitive to the choice of learning rate.
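
The Adam update rule can be sketched in a few lines. The default hyperparameters below (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) follow the values suggested in the original paper; the test function, the learning rate of 0.1, and the step count are assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a function given its gradient, using the Adam update rule."""
    theta = list(theta)
    m = [0.0] * len(theta)   # running mean of gradients (first moment)
    v = [0.0] * len(theta)   # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)   # correct the bias toward zero
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def f(theta):
    """Badly scaled quadratic: the second coordinate is 100x steeper."""
    a, b = theta
    return (a - 3) ** 2 + 100 * (b + 1) ** 2

def grad_f(theta):
    a, b = theta
    return [2 * (a - 3), 200 * (b + 1)]

print("after 1 step:", adam_minimize(grad_f, [0.0, 0.0], steps=1))
print("after 1000 steps:", adam_minimize(grad_f, [0.0, 0.0]))
```

On the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` is the sign of the gradient, so both coordinates move by about `lr` even though their gradients differ by a factor of about 33. That per-parameter normalization is what makes Adam less sensitive to gradient scale.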

Key Model Families

Decision Trees and Ensembles

Decision trees split data by features recursively: “Is age > 30? If yes, go left. Is income > 50k? If yes, leaf node: approve loan.”

Individual trees overfit easily. Random Forests mitigate this by training hundreds of trees on bootstrap samples of the data with random subsets of features, then averaging their predictions (or taking a majority vote for classification). Gradient Boosted Trees (XGBoost, LightGBM) train trees sequentially, each one fit to the errors of the ensemble so far. These remain a dominant method for tabular data in industry, and many winning Kaggle solutions on structured data use gradient boosting.
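
The sequential error-correcting idea behind gradient boosting can be sketched with depth-1 trees (stumps) on a single feature. This is a teaching sketch under simplifying assumptions (squared-error loss, one feature, exhaustive split search) and not how XGBoost or LightGBM are implemented. All function names are illustrative.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, by squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_trees=50, learning_rate=0.3):
    """Each stump is fit to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def training_mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
model = fit_boosted(xs, ys)

baseline_mse = training_mse(lambda x: model[0], xs, ys)
boosted_mse = training_mse(lambda x: boosted_predict(model, x), xs, ys)
print("constant-prediction MSE:", baseline_mse)
print("boosted MSE:", boosted_mse)
```

For squared-error loss the residuals are the negative gradient of the loss with respect to the predictions, which is where the "gradient" in the name comes from.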

Neural Networks

Neural networks are composed of layers of neurons, where each neuron computes a weighted sum of its inputs and passes the result through a nonlinear activation function (ReLU, sigmoid, tanh).

```python
# One neuron, simplified
def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias term
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0, z)  # ReLU activation
```

The depth (number of layers) enables hierarchical feature learning. Early layers in a vision network detect edges. Middle layers detect shapes. Later layers detect objects. This hierarchy emerges from training — it is not programmed.

Transformers

The architecture introduced in “Attention Is All You Need” (Vaswani et al., 2017) now dominates NLP and is spreading to vision, protein folding, and code generation.

The key innovation is the self-attention mechanism, which lets every token in a sequence attend to every other token, weighted by learned relevance:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Where Q (queries), K (keys), and V (values) are linear projections of the input. The result is a context-aware representation of each token that incorporates information from the entire sequence — capturing long-range dependencies that RNNs struggled with.
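
The attention formula translates almost line for line into code. The sketch below is a single head in plain Python with no masking and no batching. As a simplification, `Q`, `K`, and `V` are passed in directly as lists of row vectors rather than computed from learned projections of an input.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One score per key: how relevant is each position to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

# A zero query scores every key equally, so the output is the mean of V.
print(attention([[0.0, 0.0]], K, V))

# A query strongly aligned with the first key returns nearly the first value.
print(attention([[100.0, 0.0]], K, V))
```

Each query's computation is independent of the others, which is why the whole operation reduces to a few matrix multiplications and parallelizes well on accelerators.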

Why it scaled: Transformers are highly parallelizable during training (unlike RNNs, which are sequential). This allowed training on far more data, which unlocked the scaling laws that changed the field.

The Bias-Variance Tradeoff

Every ML model navigates a fundamental tension:

High bias (underfitting): The model is too simple. It fails to capture the signal in training data. A linear model trying to fit a nonlinear relationship.
High variance (overfitting): The model is too complex. It captures noise as if it were signal. A 1000-node decision tree on 100 training examples.

The goal is the sweet spot: complex enough to learn the signal, simple enough to not memorize noise.

Bias and variance can be tuned via:

Model complexity (depth of network, number of parameters)
Regularization (L1/L2 penalties, dropout, early stopping)
Data augmentation (artificially expanding training data)
More data (the most reliable fix — more data typically reduces variance without increasing bias)
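
One way to see the tradeoff is k-nearest-neighbor regression, where a single knob (`k`) moves the model from high variance to high bias. The sketch below uses synthetic data (a noisy sine wave). The dataset, noise level, and choice of k values are assumptions for illustration.

```python
import math
import random

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of a k-NN model on the points (xs, ys)."""
    return sum((knn_predict(train_x, train_y, x, k) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(100, rng)
test_x, test_y = make_data(200, rng)

for k in (1, 10, 100):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(train_x, train_y, test_x, test_y, k)
    print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With `k=1` the model memorizes the training set (training error is zero) and fits its noise, which is the high-variance extreme. With `k=100`, equal to the training set size, every prediction is the global mean, which is the high-bias extreme. An intermediate `k` typically gives the lowest test error.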

Scaling Laws and the Data Regime

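A 2020 paper from OpenAI (Kaplan et al.) empirically showed that LLM test loss follows power laws with respect to model size (N), dataset size (D), and compute (C). Each relationship holds when the other two factors are not the bottleneck; the fitted exponents describe three separate trends and are not meant to be multiplied into a single formula:

L(N) ∝ N^(-0.076)
L(D) ∝ D^(-0.095)
L(C) ∝ C^(-0.050)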
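
Exponents this small are easier to interpret as ratios. The sketch below computes how much the loss changes when one resource is scaled and the others are assumed not to be limiting. The helper name is illustrative; the exponents are the ones quoted above.

```python
def loss_ratio(scale_factor, exponent):
    """Factor by which loss changes when a resource grows by scale_factor,
    assuming loss ∝ resource^(-exponent)."""
    return scale_factor ** (-exponent)

exponents = {"model size N": 0.076, "dataset size D": 0.095, "compute C": 0.050}
for name, alpha in exponents.items():
    print(f"{name}: 2x -> loss x{loss_ratio(2, alpha):.4f}, "
          f"10x -> loss x{loss_ratio(10, alpha):.4f}")
```

Doubling model size alone reduces loss by only about 5%, which is why meaningful gains under these laws require scaling by orders of magnitude.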

The key implication: given a fixed compute budget, there’s an optimal model size. Smaller models trained on more data often outperform larger models trained on less. Chinchilla (DeepMind, 2022) demonstrated this by showing that the large models of the time, including GPT-3 and Gopher, were significantly undertrained: at 70B parameters and roughly 1.4 trillion training tokens, Chinchilla was a quarter the size of Gopher (280B), saw about 4× more data, and beat both Gopher and GPT-3 (175B) on most benchmarks.

This reframed the scaling race: not just bigger models, but the right balance of model size and data.
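
The balance can be sketched with two widely quoted approximations, neither of which comes from the text above, so treat both as assumptions: training compute is roughly `6 * N * D` FLOPs, and the Chinchilla results are often summarized as roughly 20 training tokens per parameter. Under those assumptions, a compute budget determines both model size and dataset size.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6   # approximation: C ≈ 6 * N * D (assumption)
TOKENS_PER_PARAM = 20           # rough Chinchilla-style rule of thumb (assumption)

def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense transformer."""
    return FLOPS_PER_PARAM_PER_TOKEN * n_params * n_tokens

def compute_optimal(flops):
    """Split a compute budget into (parameters, tokens) under the rule of thumb.

    Solves C = 6 * N * D with D = 20 * N for N.
    """
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

budget = training_flops(70e9, 1.4e12)   # a 70B-parameter model on 1.4T tokens
n_params, n_tokens = compute_optimal(budget)
print(f"budget: {budget:.3g} FLOPs")
print(f"compute-optimal split: {n_params:.3g} params, {n_tokens:.3g} tokens")
```

Because both quantities scale with the square root of compute under this rule, a 100× larger budget calls for roughly 10× more parameters and 10× more tokens, not a 100× larger model.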

Tradeoffs and Failure Modes

Distribution Shift

Models fail when the real-world data distribution differs from the training data. Consider a loan default model trained on 2015–2019 data: it would struggle to predict defaults in 2020, because COVID created economic conditions outside its training distribution. There is no general algorithmic fix; the practical defenses are monitoring for drift and retraining.
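
Monitoring usually starts with comparing the distribution of each input feature in production against the training set. One simple option is the two-sample Kolmogorov–Smirnov statistic, sketched below in plain Python. The alert threshold and the example data are hypothetical; real thresholds depend on sample size and tolerance for false alarms.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The gap between two step functions is largest at one of the sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

DRIFT_THRESHOLD = 0.2   # hypothetical alert level, chosen for illustration

training_feature = [i / 100 for i in range(100)]
live_similar = [i / 100 + 0.005 for i in range(100)]
live_shifted = [i / 100 + 0.5 for i in range(100)]

for label, live in [("similar", live_similar), ("shifted", live_shifted)]:
    ks = ks_statistic(training_feature, live)
    status = "DRIFT" if ks > DRIFT_THRESHOLD else "ok"
    print(f"{label}: KS={ks:.2f} -> {status}")
```

A check like this only detects changes in the inputs. A change in the relationship between inputs and labels can leave every feature distribution intact, so tracking prediction quality on fresh labeled data remains necessary.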

Spurious Correlations

Models learn any pattern that works during training, including ones that shouldn’t generalize. Zech et al. (2018) found that a chest X-ray pneumonia classifier had learned to recognize which hospital and scanner type an image came from, and used that as a shortcut because disease prevalence differed between sites. The model was right for the wrong reason.

Calibration

A model that says “70% confidence” on a prediction should be right 70% of the time. Many models are miscalibrated, typically in the overconfident direction. This matters enormously in high-stakes domains. Temperature scaling, a post-hoc technique that rescales the model’s logits with a single learned parameter, is a common and inexpensive fix.
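
Calibration can be measured with expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence with its accuracy. The sketch below also shows the temperature-scaled softmax. In practice the temperature is fit on a held-out validation set, a step omitted here. The function names and the bin count are illustrative choices.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T. T > 1 softens the distribution, T < 1 sharpens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width bins.

    confidences: predicted probability of the chosen class, in (0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece

# Ten predictions at 90% confidence: calibrated if 9 are right, not if 5 are.
print(expected_calibration_error([0.9] * 10, [1] * 9 + [0]))
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))

# Raising the temperature lowers the top probability without changing the ranking.
print(softmax_with_temperature([2.0, 0.0], temperature=1.0))
print(softmax_with_temperature([2.0, 0.0], temperature=2.0))
```

Because dividing all logits by the same positive constant never changes which class scores highest, temperature scaling adjusts confidence without affecting accuracy.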

Interpretability vs. Performance

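A single decision tree is interpretable (you can trace the decision path), and tree ensembles such as gradient boosted trees remain easier to inspect than deep neural networks, though an ensemble of hundreds of trees is no longer something a person can read end to end. Deep neural networks are effectively black boxes. In regulated industries (healthcare, credit, hiring), this creates legal and ethical friction. LIME and SHAP are popular approximate explanation methods, but they explain approximations of the model, not the model itself.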
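
SHAP is built on Shapley values, which attribute a prediction to individual features by averaging each feature's marginal contribution over every order in which features could be revealed. For a handful of features this can be computed exactly, as sketched below. This is not the SHAP library: it is a brute-force illustration in which "missing" features are replaced by a fixed baseline value (one of several possible conventions), and the two toy models are hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for predict(x) relative to predict(baseline).

    Cost grows factorially with the number of features, so this is only
    feasible for very small inputs.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]                 # reveal feature i
            updated = predict(current)
            totals[i] += updated - previous   # marginal contribution of i
            previous = updated
    return [t / len(orderings) for t in totals]

def linear_model(features):
    """Hypothetical model: 2*a - 3*b + 0.5*c + 1."""
    a, b, c = features
    return 2 * a - 3 * b + 0.5 * c + 1

def interaction_model(features):
    """Hypothetical model with a pure interaction term."""
    a, b = features
    return a * b

print(shapley_values(linear_model, [1.0, 2.0, 4.0], [0.0, 0.0, 0.0]))
print(shapley_values(interaction_model, [2.0, 3.0], [0.0, 0.0]))
```

For the linear model each attribution is just the weight times the feature's change from baseline. For the interaction model the credit for the product is split evenly between the two features. In both cases the attributions sum to the difference between the prediction and the baseline prediction.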

Real-World Scale

Google Search uses ML for query understanding, result ranking, spam detection, and SafeSearch. RankBrain, introduced in 2015, added a learned model alongside hand-tuned ranking signals and was reported at the time to be the third most important signal in ranking.
Meta’s recommendation system processes ~100 million feature interactions per prediction to decide which post appears in your feed. The ad auction runs ML inference on billions of (user, ad) pairs per day.
Tesla’s Autopilot uses a vision transformer to process 8 camera feeds simultaneously, outputting a birds-eye 3D representation of the car’s surroundings — a pure perception task with no lidar.
AlphaFold2 achieved a median GDT_TS score of 92.4 on CASP14, a benchmark where a score around 90 is considered competitive with experimental methods. DeepMind and EMBL-EBI have since released predicted structures for nearly every protein in UniProt (~200 million proteins), a resource that would have taken conventional methods centuries to produce.

Current State and Open Problems

What’s mostly solved: Image classification, speech recognition, machine translation, board games, protein structure prediction.

Active frontiers:

Reasoning and planning: LLMs can mimic reasoning but often fail at multi-step logical tasks that require holding state.
Sample efficiency: Humans can learn to recognize a dog from a handful of examples. ML models trained from scratch typically need thousands or millions. Few-shot and meta-learning approaches are narrowing this gap.
Robustness: Making models that fail gracefully rather than being confidently wrong.
Multimodal grounding: Models that truly connect language to physical world experience, not just to other text.
