Machine Learning — Deep Dive
Overview
Machine learning is, at its core, a numerical optimization problem. You define a function with adjustable parameters, anywhere from a handful to billions. You define a way to measure how wrong the function’s outputs are. You then run an algorithm that iteratively nudges those parameters in the direction that reduces that error, until the function is good enough.
That description covers everything from logistic regression to GPT-4. The differences lie in the architecture of the function, the optimization strategy, and the data regime.
The Optimization Loop in Detail
Loss Functions
The loss function is how “wrong” is quantified. The choice of loss function shapes what the model optimizes for — and getting it wrong can produce models that score well on paper but fail in production.
- Mean Squared Error (MSE): For regression tasks. Penalizes large errors heavily. Sensitive to outliers.
  L = (1/n) * Σ(yᵢ - ŷᵢ)²
- Cross-Entropy Loss: For classification. Measures the divergence between the predicted probability distribution and the true label.
  L = -Σ yᵢ * log(ŷᵢ)
  When the model assigns probability 0.01 to the correct class, this loss is large (about 4.6). When it assigns 0.99, the loss is nearly zero. This drives the model toward confident, correct predictions.
- RLHF Reward Models: In modern LLMs, a separate neural network is trained to score outputs for human preference. That score acts as a reward which a reinforcement learning step then maximizes, so it plays the role a loss plays elsewhere. This is how ChatGPT was fine-tuned to be helpful rather than just statistically likely.
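The first two losses above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library implementation: the function names and the `eps` guard against `log(0)` are assumptions introduced here.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum((y_i - yhat_i)^2)."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one example: -sum(y_i * log(yhat_i)).

    y_true is a one-hot label, y_pred a predicted probability distribution.
    eps is an assumption of this sketch; it avoids taking log(0).
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

# The correct class is class 0 in both calls.
print(cross_entropy([1, 0], [0.01, 0.99]))  # large: the model was confidently wrong
print(cross_entropy([1, 0], [0.99, 0.01]))  # near zero: the model was confidently right
```

Because only the true class has a nonzero label, the cross-entropy of a one-hot example reduces to the negative log of the probability assigned to the correct class.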
Gradient Descent and Backpropagation
Gradient descent requires knowing which direction to nudge each parameter. With millions of parameters, computing this naively would be intractable. Backpropagation solves this via the chain rule of calculus — it propagates the error signal backward through the network layers, computing each parameter’s contribution to the total loss in a single backward pass.
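To make the chain rule concrete, here is a minimal sketch of a backward pass for a hypothetical two-parameter-layer scalar network (one ReLU hidden unit, squared-error loss). The network shape and all names are assumptions chosen for illustration. A finite-difference check is included to show the analytic gradients agree with a numerical estimate.

```python
def forward(params, x, y):
    """Forward pass: h = relu(w1*x + b1), y_hat = w2*h + b2, loss = (y_hat - y)^2."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = max(0.0, z)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    return loss, (z, h, y_hat)

def backward(params, x, y):
    """One backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (z, h, y_hat) = forward(params, x, y)
    d_yhat = 2.0 * (y_hat - y)            # dL/dy_hat
    d_w2 = d_yhat * h                     # dL/dw2
    d_b2 = d_yhat                         # dL/db2
    d_h = d_yhat * w2                     # error signal passed to the hidden unit
    d_z = d_h * (1.0 if z > 0 else 0.0)   # ReLU gates the signal
    d_w1 = d_z * x                        # dL/dw1
    d_b1 = d_z                            # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences: needs two forward passes per parameter."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads

params = [0.5, 0.1, -1.5, 0.2]
print(backward(params, 2.0, 1.0))
print(numeric_grad(params, 2.0, 1.0))
```

The numerical check costs two forward passes per parameter, which is the naive approach that becomes intractable at scale. The backward pass gets every gradient from a single sweep.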
Mini-batch SGD: Rather than computing gradients over the entire dataset (expensive) or a single sample (noisy), modern training uses mini-batches of 32–1024 samples. This gives a noisy but fast estimate of the gradient, with the noise providing a useful regularization effect.
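A minimal sketch of mini-batch SGD, fitting a one-variable linear model on synthetic data. The function name, the hyperparameters, and the data-generating line y = 3x + 1 are all assumptions chosen for illustration.

```python
import random

def minibatch_sgd(xs, ys, batch_size=32, lr=0.1, epochs=50, seed=0):
    """Fit y ~ w*x + b by averaging squared-error gradients over mini-batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random partition into batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i]
                grad_b += 2 * err
            # Step along the (noisy) average gradient of this batch.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# Synthetic data: y = 3x + 1 plus a little Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(256)]
ys = [3.0 * x + 1.0 + data_rng.gauss(0, 0.1) for x in xs]
print(minibatch_sgd(xs, ys))  # w and b should land close to 3 and 1
```

Each update sees only 32 of the 256 samples, so individual steps are noisy, but the estimates still settle near the true slope and intercept.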
Adaptive optimizers: Adam (Adaptive Moment Estimation), introduced by Kingma & Ba in 2014, keeps running estimates of the mean and uncentered variance of each parameter’s gradient and uses them to scale that parameter’s step size. It’s the default optimizer for most deep learning work because it often converges faster and is less sensitive to the choice of learning rate.
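The Adam update for a single scalar parameter can be sketched as follows. The default hyperparameters match those suggested in the Adam paper. The function name, the toy objective f(x) = (x - 3)^2, and the larger learning rate used in the demo are assumptions for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.05)
print(x)  # should end up close to 3
```

On the first step the update has magnitude close to `lr` regardless of the gradient's scale, which is part of why Adam is less sensitive to the learning rate than plain SGD.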
Key Model Families
Decision Trees and Ensembles
Decision trees split data by features recursively: “Is age > 30? If yes, go left. Is income > 50k? If yes, leaf node: approve loan.”
Individual trees overfit easily. Random Forests mitigate this by training hundreds of trees on random subsets of data and features, then averaging their predictions. Gradient Boosted Trees (XGBoost, LightGBM) train trees sequentially, each correcting the errors of the previous. These remain the dominant method for tabular data in industry — most Kaggle competition winners on structured data use gradient boosting.
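The "each tree corrects the errors of the previous" idea can be sketched with depth-1 trees (stumps) on a single feature. This is a toy illustration of boosting on squared error, not how XGBoost or LightGBM are implemented. All names and the learning rate are assumptions, and the data must contain at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Fit stumps sequentially, each one to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
model = boost(xs, ys)
print([round(boosted_predict(model, x), 1) for x in xs])
```

A random forest would instead fit each tree independently on a bootstrap sample and average the results. Boosting's sequential dependence on residuals is what distinguishes it.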
Neural Networks
Neural networks are composed of layers of neurons, where each neuron computes a weighted sum of its inputs and passes the result through a nonlinear activation function (ReLU, sigmoid, tanh).
# One neuron, simplified
def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0, z)  # ReLU activation
The depth (number of layers) enables hierarchical feature learning. Early layers in a vision network detect edges. Middle layers detect shapes. Later layers detect objects. This hierarchy emerges from training — it is not programmed.
Transformers
The architecture introduced in “Attention Is All You Need” (Vaswani et al., 2017) now dominates NLP and is spreading to vision, protein folding, and code generation.
The key innovation is the self-attention mechanism, which lets every token in a sequence attend to every other token, weighted by learned relevance:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Where Q (queries), K (keys), and V (values) are linear projections of the input. The result is a context-aware representation of each token that incorporates information from the entire sequence — capturing long-range dependencies that RNNs struggled with.
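The attention formula can be sketched in plain Python on small nested lists. This sketch takes Q, K, and V as given and leaves out the learned linear projections, multiple heads, and masking that real transformers use. The helper names are assumptions for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                      # one row of scores per query
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(q, k, v)
print(weights)  # each row sums to 1; each query weights its matching key most
print(output)
```

Each output row is a weighted average of the rows of V, so every token's representation mixes in information from every other token in the sequence.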
Why it scaled: Transformers are highly parallelizable during training (unlike RNNs, which are sequential). This allowed training on far more data, which unlocked the scaling laws that changed the field.
The Bias-Variance Tradeoff
Every ML model navigates a fundamental tension:
- High bias (underfitting): The model is too simple. It fails to capture the signal in training data. A linear model trying to fit a nonlinear relationship.
- High variance (overfitting): The model is too complex. It captures noise as if it were signal. A 1000-node decision tree on 100 training examples.
The goal is the sweet spot: complex enough to learn the signal, simple enough to not memorize noise.
Bias and variance can be tuned via:
- Model complexity (depth of network, number of parameters)
- Regularization (L1/L2 penalties, dropout, early stopping)
- Data augmentation (artificially expanding training data)
- More data (the most reliable fix — more data typically reduces variance without increasing bias)
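As a small worked example of the regularization lever, consider one-variable ridge regression with no intercept. Minimizing Σ(yᵢ - w·xᵢ)² + λ·w² has the closed-form solution w = Σxᵢyᵢ / (Σxᵢ² + λ). The function name and the toy data are assumptions for illustration.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form L2-regularized slope for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x
for lam in (0.0, 1.0, 14.0, 100.0):
    print(lam, ridge_1d(xs, ys, lam))  # the slope shrinks toward 0 as lam grows
```

A larger λ pulls the weight toward zero. That adds bias, since the fit no longer recovers the true slope of 2 exactly, in exchange for estimates that vary less from one noisy training sample to the next.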
Scaling Laws and the Data Regime
A 2020 paper from OpenAI (Kaplan et al.) empirically showed that LLM test loss follows power laws in model size (N), dataset size (D), and compute (C). These are three separate fits, each holding when the other two factors are not the bottleneck, rather than a single product:
L(N) ∝ N^(-0.076)
L(D) ∝ D^(-0.095)
L(C) ∝ C^(-0.050)
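The key implication: given a fixed compute budget, there’s an optimal model size. Kaplan et al. concluded that most of a growing budget should go into a larger model. Chinchilla (DeepMind, 2022) revised that, finding that parameters and training tokens should grow roughly in proportion and that the large models of the time, GPT-3 included, were significantly undertrained. Chinchilla itself, with 70B parameters trained on about 1.4T tokens, outperformed Gopher (280B parameters, about 300B tokens) at a similar compute cost, and also beat the 175B-parameter GPT-3 on most benchmarks.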
This reframed the scaling race: not just bigger models, but the right balance of model size and data.
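Some quick arithmetic shows what these exponents mean in practice. The sketch below uses the exponents quoted above, plus one assumption that does not come from this article: the common rule of thumb that training compute is roughly C ≈ 6·N·D FLOPs. The model figures are the publicly reported sizes of Chinchilla and Gopher.

```python
def loss_ratio(scale_factor, exponent):
    """Factor by which loss changes when one resource is multiplied by scale_factor,
    assuming loss follows a power law with the given exponent."""
    return scale_factor ** (-exponent)

def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute, C ~ 6 * N * D (an approximation, not exact)."""
    return 6 * n_params * n_tokens

# Doubling or 10x-ing model size with exponent 0.076 gives only a modest loss drop.
print(loss_ratio(2, 0.076))
print(loss_ratio(10, 0.076))

# Two very different shapes of model at a broadly similar compute budget.
chinchilla = approx_train_flops(70e9, 1.4e12)
gopher = approx_train_flops(280e9, 300e9)
print(chinchilla, gopher, chinchilla / gopher)
```

Because the exponents are so small, each constant-factor improvement in loss needs a multiplicative increase in resources. This is why spending a fixed budget on the right mix of parameters and tokens matters so much.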
Tradeoffs and Failure Modes
Distribution Shift
Models fail when the real-world data distribution differs from training data. A loan default model trained on 2015–2019 data struggled to predict defaults in 2020 because COVID created economic conditions outside its training distribution. There’s no complete algorithmic fix; in practice the main defenses are monitoring and retraining.
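One common monitoring heuristic is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against its distribution at training time. The sketch below is a minimal version: the equal-width binning, the `eps` floor for empty bins, and the function name are assumptions of this example, and PSI is only one of several possible drift statistics.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the training range
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

train = [float(i) for i in range(100)]
live_same = list(train)
live_shifted = [v + 30.0 for v in train]
print(psi(train, live_same))     # 0.0: identical distributions
print(psi(train, live_shifted))  # clearly positive: the feature has drifted
```

A score of zero means the two binned distributions match, and larger values indicate more drift. Any alert threshold is a convention that should be tuned per feature rather than treated as a guarantee.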
Spurious Correlations
Models learn any pattern that works during training, including ones that shouldn’t generalize. Researchers have found that chest X-ray models can learn to identify which hospital or scanner produced an image (different hospitals used different machines) and use that as a shortcut, because disease rates differed between sites. The model was right for the wrong reason.
Calibration
Among all the predictions a model makes with “70% confidence”, about 70% should turn out to be correct. Many models are miscalibrated, most often by being overconfident. This matters enormously in high-stakes domains. Temperature scaling, a post-hoc technique that divides the model’s logits by a single scalar fitted on held-out data, is a widely used fix.
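Calibration can be measured with the expected calibration error (ECE), which bins predictions by confidence and compares average confidence with accuracy in each bin. The sketch below also shows how dividing logits by a temperature softens the output probabilities. The function names and the equal-width bins are assumptions, and fitting the temperature on a validation set is left out.

```python
import math

def expected_calibration_error(confidences, correct, bins=10):
    """Weighted average gap between confidence and accuracy across confidence bins.

    confidences: predicted probability of the chosen class, in (0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Well calibrated: 70% confidence, 7 of 10 correct.
print(expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3))
# Overconfident: 90% confidence, only 5 of 10 correct.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
# A higher temperature lowers the top probability without changing the ranking.
print(softmax_with_temperature([2.0, 0.0], 1.0), softmax_with_temperature([2.0, 0.0], 2.0))
```

Temperature scaling leaves the predicted class unchanged, so accuracy is unaffected. Only the reported confidence moves.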
Interpretability vs. Performance
A single decision tree is interpretable: you can trace the decision path. Ensembles of hundreds of boosted trees are much harder to read, and deep neural networks are effectively black boxes. In regulated industries (healthcare, credit, hiring), this creates legal and ethical friction. LIME and SHAP are popular explanation methods, but they are approximate: they explain approximations of the model, not the model itself.
Real-World Scale
- Google Search uses ML for query understanding, result ranking, spam detection, and SafeSearch. RankBrain, a learned ranking component introduced in 2015 alongside the existing hand-tuned signals, was reportedly their biggest ranking change in years.
- Meta’s recommendation system processes ~100 million feature interactions per prediction to decide which post appears in your feed. The ad auction runs ML inference on billions of (user, ad) pairs per day.
- Tesla’s Autopilot uses a vision transformer to process 8 camera feeds simultaneously, outputting a bird’s-eye 3D representation of the car’s surroundings. It does this from cameras alone, with no lidar.
- AlphaFold2 achieved a median GDT_TS score of 92.4 on CASP14, a benchmark where scores of around 90 and above are considered comparable to experimental accuracy. Predicted structures have since been released for nearly every protein in UniProt (~200 million proteins), a resource that would have taken conventional methods centuries to produce.
Current State and Open Problems
What’s mostly solved: Image classification, speech recognition, machine translation, board games, protein structure prediction.
Active frontiers:
- Reasoning and planning: LLMs can mimic reasoning but often fail at multi-step logical tasks that require holding state.
- Sample efficiency: Humans learn to recognize a dog from ~10 examples. ML models need thousands or millions. Few-shot and meta-learning approaches are closing this gap.
- Robustness: Making models that fail gracefully rather than being confidently wrong.
- Multimodal grounding: Models that truly connect language to physical world experience, not just to other text.
Further Reading
- The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) — the rigorous classical foundation
- Attention Is All You Need (Vaswani et al., 2017) — the transformer paper
- Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the paper that shaped GPT-3/4 decisions
- A Recipe for Training Neural Networks (Andrej Karpathy, 2019) — practical debugging wisdom
- fast.ai — top-down practical ML education with modern results
See Also
- AI Hallucinations: ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why, and why it's not as simple as 'the AI lied.'
- Artificial Intelligence: What is AI really? Think of it as a dog that learned tricks: impressive, but it doesn't know why it's doing them.
- Bias Variance Tradeoff: The fundamental tension in machine learning between being wrong in the same way vs. being wrong in different ways, and why the simplest model isn't always best.
- Deep Learning: Why your phone can spot your face in a messy photo album, and why that trick comes from practice, not magic.
- Embeddings: How do computers know that 'dog' and 'puppy' mean almost the same thing? They don't read definitions. They turn words into secret map coordinates, and nearby coordinates mean nearby meanings.