Artificial Intelligence — Deep Dive
Overview
Modern AI — specifically large language models (LLMs) and multimodal foundation models — represents the convergence of three decades of research in statistical learning, neural network architecture, and distributed computing. This isn’t a survey of AI history. This is a technical breakdown of how today’s systems actually work, what makes them effective, what makes them fail, and where the hard problems remain.
The Transformer Architecture
The transformer, introduced in Vaswani et al.’s 2017 paper “Attention Is All You Need,” replaced recurrent neural networks (RNNs) as the dominant architecture for sequence modeling. The core innovation: self-attention.
Self-Attention Mechanism
In a traditional RNN, information flows sequentially — word 1 informs word 2, which informs word 3. Long-range dependencies (connecting a pronoun on page 3 to a name on page 1) degrade as the sequence grows. Self-attention computes relationships between every pair of tokens in the input simultaneously.
For a sequence of n tokens, self-attention:
- Projects each token into three vectors: Query (Q), Key (K), and Value (V)
- Computes attention scores:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
- The softmax(QK^T / √d_k) term creates a weight matrix where each token’s attention to every other token is normalized
The √d_k scaling factor prevents the dot products from growing too large as dimensionality increases, which would push softmax into regions with vanishing gradients.
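The three steps above can be sketched in a few lines of NumPy — a single-head illustration, not a production kernel; the tiny shapes are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q, K, V: (n_tokens, d_k) arrays of query/key/value projections.
    Returns an (n_tokens, d_k) array of attention outputs.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise similarity matrix
    weights = softmax(scores)         # each row is a distribution over tokens
    return weights @ V                # attention-weighted mix of value vectors

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each row of the softmax output sums to 1, which is exactly the "normalized attention to every other token" described above.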
Multi-Head Attention
Rather than computing a single attention function, transformers use multiple “heads” — parallel attention computations with different learned projections. GPT-3 uses 96 attention heads. Each head can learn different relationship types: one head might track syntactic dependencies (subject-verb agreement), another might track semantic similarity, another might focus on positional proximity.
The Compute Problem
Self-attention’s pairwise computation is O(n²) in sequence length. For a 4,096-token context window, that’s ~16.7 million attention computations per layer per head. For a 128K context window (as in GPT-4 Turbo), it’s ~16.4 billion. This quadratic cost is the fundamental bottleneck for long-context models.
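The arithmetic behind those counts, taking "128K" as 128,000 tokens:

```python
# n^2 attention-score entries per layer per head at each context size.
for n in (4_096, 128_000):
    print(f"{n:>7} tokens -> {n * n:,} pairwise scores")
```

Doubling the context quadruples the score matrix — hence the push toward the sub-quadratic alternatives below.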
Solutions in active development:
- Flash Attention (Dao et al., 2022) — restructures attention computation to minimize GPU memory reads/writes, achieving 2-4x speedup without approximation
- Sparse attention — only compute attention between nearby tokens plus a set of “global” tokens (used in BigBird, Longformer)
- Ring Attention (2023) — distributes long sequences across multiple devices, computing attention blockwise
- State Space Models (Mamba, 2023) — replace attention entirely with selective state spaces that scale linearly with sequence length
Training: The Economics and Engineering
Pre-Training
Modern LLMs are pre-trained on massive text corpora using next-token prediction: given all previous tokens, predict the next one. This deceptively simple objective, at sufficient scale, produces emergent capabilities — reasoning, translation, coding — that weren’t explicitly trained for.
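The objective is plain cross-entropy against the token that actually came next. A minimal NumPy sketch — the vocabulary size and uniform-logits example are illustrative:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of next-token prediction.

    logits:  (seq_len, vocab) unnormalized scores at each position
    targets: (seq_len,) index of the token that actually came next
    """
    # Numerically stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Log-probability the model assigned to each true next token.
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

# A model that spreads probability uniformly over a 50,000-token vocabulary
# has loss ln(50,000); training drives this down toward a couple of nats.
vocab = 50_000
logits = np.zeros((10, vocab))
targets = np.zeros(10, dtype=int)
print(round(next_token_loss(logits, targets), 2))  # 10.82 (= ln 50000)
```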
Training data scale:
| Model | Training Tokens | Parameters |
|---|---|---|
| GPT-3 (2020) | 300 billion | 175B |
| LLaMA 2 (2023) | 2 trillion | 70B |
| LLaMA 3.1 (2024) | 15 trillion | 405B |
| GPT-4 (2023) | Estimated 13T+ | Estimated 1.8T (MoE) |
Training compute costs have escalated dramatically. GPT-4’s training run reportedly cost over $100 million in compute alone. Meta’s LLaMA 3.1 405B required 30.8 million GPU-hours on NVIDIA H100s. By early 2026, frontier model training runs are estimated to exceed $500 million.
Scaling Laws
Kaplan et al. (2020) at OpenAI established that model performance (measured by loss) follows predictable power laws across three axes:
- Parameters (N): Loss ∝ N^(-0.076)
- Dataset size (D): Loss ∝ D^(-0.095)
- Compute (C): Loss ∝ C^(-0.050)
This means performance improves smoothly and predictably as you scale up — no sudden breakthroughs, no plateaus. The Chinchilla paper (Hoffmann et al., 2022) refined this: for compute-optimal training, parameters and data should scale roughly equally. A 70B-parameter model should train on ~1.4 trillion tokens. This shifted the industry from “bigger models” toward “more data for right-sized models.”
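The Chinchilla result is often quoted as a rule of thumb of roughly 20 training tokens per parameter; a one-liner reproduces the 70B figure:

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget per the Chinchilla
    rule of thumb (~20 training tokens per parameter)."""
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(70e9) / 1e12:.1f}T tokens")  # 1.4T for a 70B model
```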
Post-Training Alignment
Raw pre-trained models are powerful but uncontrolled — they’ll happily generate toxic content, refuse nothing, and follow no instructions. Post-training aligns the model to be helpful, harmless, and honest:
- Supervised Fine-Tuning (SFT): Train on human-written examples of ideal responses. Thousands of human annotators write high-quality answers to diverse prompts.
- Reinforcement Learning from Human Feedback (RLHF): Train a reward model on human preferences (which of two responses is better?), then use PPO (Proximal Policy Optimization) to fine-tune the model to maximize that reward. This is how ChatGPT went from “impressive autocomplete” to “useful assistant.”
- Constitutional AI (CAI): Anthropic’s approach — instead of extensive human labeling, the model critiques its own outputs against a set of principles and self-improves. Reduces reliance on human annotators for alignment.
- Direct Preference Optimization (DPO): A 2023 alternative to RLHF that skips the reward model entirely, directly optimizing the policy from preference pairs. Simpler, more stable, increasingly popular.
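The DPO objective reduces to a logistic loss on the policy-vs-reference log-probability margin. A minimal single-pair sketch — β and the toy log-probabilities are illustrative:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_*:     policy log-probability of the chosen (w) / rejected (l) response
    ref_logp_*: the same quantities under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Loss shrinks as the policy raises the chosen response's likelihood
# relative to the rejected one (both measured against the reference).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-11.0, -11.0, -11.0, -11.0))  # True
```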
Inference: Serving at Scale
Training happens once. Inference — generating responses for users — runs millions of times per day and dominates operational cost.
Key Inference Optimizations
Quantization: Reducing parameter precision from 32-bit floating point (FP32) to 8-bit integers (INT8) or 4-bit (INT4). A 70B-parameter model in FP32 needs 280GB of memory; in INT4, it fits in 35GB — runnable on two consumer GPUs. Quality loss is minimal for most tasks.
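The memory figures above are simple bits-per-parameter arithmetic:

```python
def weight_memory_gb(n_params, bits):
    """Memory for the weights alone (ignores KV cache and activations)."""
    return n_params * bits / 8 / 1e9

for bits in (32, 8, 4):
    print(f"70B @ {bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```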
KV-Cache: During autoregressive generation, storing the Key and Value matrices from previous tokens so they don’t need recomputation. Without KV-cache, generating a 1,000-token response would recompute all attention from scratch 1,000 times.
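The cache isn't free, though — it grows linearly with sequence length. A back-of-envelope size estimate, assuming an illustrative LLaMA-2-70B-like shape (80 layers, grouped-query attention with 8 KV heads of dimension 128) and FP16 entries:

```python
def kv_cache_gb(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Per-sequence KV-cache size: one K and one V vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value / 1e9

print(f"{kv_cache_gb(80, 8, 128, 4096):.2f} GB")  # 1.34 GB for one 4,096-token sequence
```

At long contexts and high batch sizes, this cache — not the weights — often dominates serving memory.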
Speculative Decoding: Use a small, fast “draft” model to generate candidate token sequences, then verify them in parallel with the large model. Since verification is cheaper than generation (it’s a single forward pass for multiple tokens), this can yield 2-3x speedup.
Mixture of Experts (MoE): Not all parameters activate for every token. GPT-4 reportedly uses a MoE architecture with ~1.8 trillion total parameters but only ~280 billion active per forward pass. This means inference cost scales with active parameters, not total parameters.
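A toy top-k router illustrates the idea — a sketch of switch-style routing under assumed shapes, not GPT-4's actual (undisclosed) routing:

```python
import numpy as np

def moe_route(token, experts, router, k=2):
    """Top-k expert routing for one token.

    router:  (d, n_experts) routing matrix
    experts: list of per-expert functions
    Only k experts run, so compute scales with k, not with n_experts.
    """
    logits = token @ router
    top = np.argsort(logits)[-k:]                            # top-k expert indices
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized gates
    return sum(g * experts[i](token) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))
out = moe_route(rng.standard_normal(d), experts, router, k=2)
print(out.shape)  # (16,)
```

Here 6 of the 8 experts never execute for this token — the source of the active-vs-total parameter gap.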
Cost of Inference
As of early 2026, API pricing reflects the real compute costs:
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 |
| Anthropic | Claude Sonnet | $3.00 | $15.00 |
| Google | Gemini 2.0 Pro | $1.25 | $10.00 |
| Meta | LLaMA 3.1 405B (self-hosted) | ~$1.00 | ~$4.00 |
Output tokens cost more because each requires a full forward pass, while input tokens can be processed in parallel.
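Per-request cost at these list prices is straightforward arithmetic; the example request size here is made up:

```python
def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars at per-1M-token prices, as in the table above."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A 2,000-token prompt with a 500-token reply at GPT-4o's listed prices:
print(f"${request_cost(2_000, 500, 2.50, 10.00):.4f}")  # $0.0100
```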
Where AI Fails: Technical Limitations
Hallucination
LLMs generate statistically plausible text — not verified facts. A model trained on legal documents will produce convincing-sounding case citations that don’t exist (this happened publicly when a lawyer used ChatGPT in a 2023 court filing and submitted six fabricated cases). The fundamental issue: the training objective is “what word comes next,” not “what’s true.”
Retrieval-Augmented Generation (RAG) mitigates this by grounding generation in retrieved documents, but doesn’t eliminate it — the model can still hallucinate in how it interprets or summarizes retrieved content.
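A minimal sketch of the RAG loop — retrieve, then ground the prompt in what was retrieved. Everything here is illustrative: real systems use learned dense embeddings and a vector index, not the toy bag-of-words scoring below:

```python
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words embedding; stands in for a learned embedding model.
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query, docs, vocab, k=1):
    """Return the k documents with highest cosine similarity to the query."""
    q = embed(query, vocab)
    def score(d):
        v = embed(d, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v) or 1.0
        return (q @ v) / denom
    return sorted(docs, key=score, reverse=True)[:k]

docs = ["the transformer uses self attention",
        "waymo operates autonomous taxis"]
vocab = sorted({w for d in docs for w in d.split()})
context = retrieve("how does attention work", docs, vocab)
# Ground the generation step in the retrieved text:
prompt = f"Answer using only this context:\n{context[0]}\n\nQ: how does attention work"
print(context[0])  # the transformer uses self attention
```

Note that the model can still misread or over-summarize `context[0]` — retrieval constrains the input, not the generation.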
Reasoning vs. Pattern Completion
When GPT-4 solves a math problem, is it reasoning? The evidence is mixed. On standard benchmarks (GSM8K, MATH), frontier models perform impressively. But Dziri et al. (2023) showed that performance degrades sharply on problems requiring compositional reasoning steps beyond what appeared in training data. The models appear to memorize solution patterns, not learn general reasoning procedures.
Chain-of-thought prompting improves performance by forcing the model to externalize intermediate steps, essentially providing it with “working memory” in the context window. But this is a workaround, not a solution.
The Evaluation Problem
How do you measure whether a model is “good”? Traditional benchmarks are being saturated — GPT-4 scores 86.4% on the MMLU benchmark, approaching human expert performance. But benchmark scores don’t capture real-world usefulness, safety, or reliability. And models can be inadvertently trained on benchmark data (contamination), inflating scores.
New evaluation approaches include:
- Arena-style evaluation (Chatbot Arena / LMSYS) — head-to-head human preference voting
- Task-specific evaluation — SWE-bench for coding (can the model actually fix real GitHub issues?), MedQA for medical knowledge
- Red-teaming — adversarial testing for safety failures
Real-World Deployments
GitHub Copilot
Launched in technical preview in 2021 and generally available in 2022, now used by over 1.8 million developers (as of 2024). Internal GitHub data shows it generates ~46% of code in files where it’s enabled. Originally built on OpenAI Codex (a GPT-3 descendant fine-tuned on code). Developers report 55% faster task completion on average, though the generated code requires review — it can introduce subtle bugs, security vulnerabilities, or outdated API usage.
Medical Imaging (Google Health)
Google’s dermatology AI (2021) matches board-certified dermatologists in identifying 26 skin conditions from photos. Med-PaLM 2 (2023) was the first AI to score at “expert level” on US Medical Licensing Exam questions. But deployment faces a harder problem than accuracy: regulatory approval, liability, and clinician trust. Most medical AI remains a “second opinion” tool, not a replacement.
Autonomous Vehicles
Waymo operates fully autonomous taxis in San Francisco, Phoenix, and Los Angeles — no safety driver. By late 2024, they were completing 150,000+ paid trips per week. The AI stack combines multiple models: perception (identifying objects from camera/lidar data), prediction (forecasting what other road users will do), and planning (deciding what the car should do). Waymo’s crash rate is 57% lower than the human baseline, per their 2024 safety report — though rare edge cases (unusual road configurations, adversarial behavior from other drivers) remain unsolved.
The Road to AGI: Open Problems
What’s Missing
Current AI systems lack several capabilities that human cognition handles effortlessly:
- Persistent memory and learning — LLMs don’t learn from conversations (without fine-tuning). Every interaction starts from scratch. Humans continuously update their understanding.
- Causal reasoning — Models learn correlations, not causation. They can tell you that umbrella sales and rain are correlated, but struggle with interventional questions: “If I hand out free umbrellas, will it rain?”
- Embodied understanding — Language models have no physical experience. They can describe how to ride a bicycle but have never balanced, pedaled, or fallen. Robotics AI (like Google’s RT-2) is beginning to bridge this gap, but we’re early.
- Efficient learning — A human child learns what a dog is from a handful of examples. GPT-4 needed trillions of tokens. Sample efficiency remains orders of magnitude worse than human learning.
Timeline Debate
Surveys of AI researchers show extreme disagreement. A 2023 survey of ~2,700 AI researchers found a median estimate of 2047 for “high-level machine intelligence” (50% probability), but individual estimates ranged from 2025 to “never.” The honest answer: nobody knows, and anyone who speaks with certainty about AGI timelines is selling something.
Further Reading
- Vaswani et al., “Attention Is All You Need” (2017) — the transformer paper
- Kaplan et al., “Scaling Laws for Neural Language Models” (2020)
- Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla, 2022)
- Dao et al., “FlashAttention” (2022)
- Anthropic, “Constitutional AI: Harmlessness from AI Feedback” (2022)
- Rafailov et al., “Direct Preference Optimization” (2023)
- Gu & Dao, “Mamba: Linear-Time Sequence Modeling” (2023)