Federated Learning — Core Concepts
The Central Problem With Standard ML Training
Training a machine learning model requires data — lots of it. The standard approach: centralize as much data as possible, train on it all. This has been remarkably effective but has significant costs:
- Privacy risk: Centralized data is a high-value target for breaches
- Regulatory barriers: GDPR, HIPAA, and other regulations restrict moving sensitive data
- Communication cost: Moving petabytes of data to central servers is expensive
- Trust barriers: Organizations (hospitals, banks, competing companies) won’t share data with each other even when a joint model would benefit everyone
Federated Learning, introduced by Google in 2016 (McMahan et al., “Communication-Efficient Learning of Deep Networks from Decentralized Data”), inverts the paradigm: instead of moving data to the model, move the model to the data.
How It Works: FedAvg
The canonical algorithm is Federated Averaging (FedAvg):
Round initialization: The central server holds a global model $w_t$.
Client selection: A subset $S_t$ of available clients is selected (e.g., 100 out of 1 million phones that are plugged in and on Wi-Fi).
Local training: Each selected client $k$ downloads the global model $w_t$ and trains locally on its own data for $E$ epochs with local learning rate $\eta$. Starting from $w_t^k = w_t$, it repeatedly applies the gradient step: $$w_t^k \leftarrow w_t^k - \eta \nabla \mathcal{L}_k(w_t^k)$$
Aggregation: Each client sends back only the updated weights $w_t^k$ (not the training data). The server aggregates: $$w_{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_t^k$$
Where $n_k$ is the number of data points on client $k$ and $n = \sum_{k \in S_t} n_k$. The weighted average favors clients with more data.
This completes one communication round. Training proceeds over many rounds — typically hundreds to thousands.
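The round described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the linear model, the squared-error loss, and the names `local_train` and `fedavg_round` are assumptions made for the example, and every client participates in every round (no sampling of $S_t$).

```python
from typing import List, Tuple

Vector = List[float]
Dataset = List[Tuple[Vector, float]]  # (features, target) pairs


def local_train(w: Vector, data: Dataset, epochs: int, lr: float) -> Vector:
    """Run `epochs` passes of per-example SGD on squared error, starting from w."""
    w = list(w)  # copy so the caller's global model is not mutated
    for _ in range(epochs):
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            # Gradient of 0.5 * err^2 with respect to w is err * x.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def fedavg_round(w_global: Vector, clients: List[Dataset],
                 epochs: int, lr: float) -> Vector:
    """One communication round: local training, then size-weighted averaging."""
    n_total = sum(len(data) for data in clients)
    w_next = [0.0] * len(w_global)
    for data in clients:
        w_k = local_train(w_global, data, epochs, lr)
        weight = len(data) / n_total  # n_k / n
        w_next = [acc + weight * wk for acc, wk in zip(w_next, w_k)]
    return w_next


# Toy usage: two clients whose data both follow y = 2 * x.
clients = [
    [([1.0], 2.0), ([2.0], 4.0)],
    [([3.0], 6.0)],
]
w = [0.0]
for _ in range(50):
    w = fedavg_round(w, clients, epochs=1, lr=0.05)
```

Only `w_k` crosses the client boundary in `fedavg_round`; the `(features, target)` pairs stay inside `local_train`, which mirrors the "move the model to the data" idea.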
The Non-IID Data Problem
Standard distributed machine learning assumes data is IID — independently and identically distributed across clients. Federated learning explicitly violates this: each device has data reflecting the behavior of one person or institution.
A phone keyboard dataset might have:
- Client A: exclusively texts in Spanish
- Client B: uses lots of medical terminology
- Client C: heavy emoji user with minimal text
Each client’s data is a biased sample of the overall population. This non-IID property can cause FedAvg to converge more slowly, or to a worse solution, than centralized training, and it can diverge entirely when the local learning rate is high.
Quantifying the problem: FedAvg’s convergence bounds degrade as the clients’ gradients move away from the global gradient. This gap is formally measured as the gradient divergence $\sum_k \frac{n_k}{n} \|\nabla \mathcal{L}_k(w) - \nabla \mathcal{L}(w)\|^2$.
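As a rough illustration, the divergence measure can be computed directly from per-client gradients. The sketch below assumes the global loss is the data-size-weighted average of the client losses, so the global gradient is the weighted average of client gradients; the function name `gradient_divergence` is made up for this example.

```python
from typing import List


def gradient_divergence(client_grads: List[List[float]],
                        client_sizes: List[int]) -> float:
    """Weighted mean squared L2 distance between client and global gradients."""
    n = sum(client_sizes)
    dim = len(client_grads[0])
    # Assumed: global gradient = sum_k (n_k / n) * client gradient k.
    global_grad = [
        sum((n_k / n) * g[i] for g, n_k in zip(client_grads, client_sizes))
        for i in range(dim)
    ]
    return sum(
        (n_k / n) * sum((g[i] - global_grad[i]) ** 2 for i in range(dim))
        for g, n_k in zip(client_grads, client_sizes)
    )


# IID-like case: every client agrees, so the divergence is zero.
same = gradient_divergence([[1.0, 2.0], [1.0, 2.0]], [10, 10])

# Non-IID case: clients pull in opposite directions.
opposed = gradient_divergence([[1.0, 0.0], [-1.0, 0.0]], [10, 10])
```

A value near zero means local updates point roughly where a centralized update would; larger values indicate stronger client drift.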
Practical mitigations:
- FedProx: Adds a proximal term to each client’s local objective that penalizes divergence from the global model. Prevents clients from updating too aggressively in their local direction.
- SCAFFOLD: Uses control variates to correct for client drift — each client tracks the difference between its local update direction and the global direction.
- Personalized FL: Instead of one global model, each client learns a personalized model that blends global and local learning.
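To make the FedProx idea concrete: the proximal term is commonly written as $\frac{\mu}{2}\|w - w_t\|^2$ added to the local loss, which contributes $\mu (w - w_t)$ to the local gradient. The sketch below shows a single local step under that form; `fedprox_step` and its parameters are illustrative names, not an API from any library.

```python
from typing import List


def fedprox_step(w: List[float], w_global: List[float],
                 grad: List[float], lr: float, mu: float) -> List[float]:
    """One local gradient step on: local_loss(w) + (mu / 2) * ||w - w_global||^2."""
    return [
        # grad is the gradient of the local loss at w;
        # mu * (w - w_global) is the gradient of the proximal penalty.
        wi - lr * (gi + mu * (wi - wgi))
        for wi, wgi, gi in zip(w, w_global, grad)
    ]


# With mu = 0 this is plain SGD.
plain = fedprox_step([1.0], [0.0], [2.0], lr=0.1, mu=0.0)

# With mu > 0 and a zero local gradient, the step pulls w back toward w_global.
pulled = fedprox_step([1.0], [0.0], [0.0], lr=0.1, mu=1.0)
```

Larger `mu` keeps local models closer to the global model, at the price of slower local progress.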
System Heterogeneity
Real federated learning deployments face massive system heterogeneity:
Device diversity: Clients range from high-end phones to 4-year-old budget devices. Local training on slow devices can’t complete in the same time window as fast devices.
Stragglers: Some clients disconnect mid-round (battery dies, network drops). Standard synchronous aggregation would wait for all clients, creating huge delays.
Partial participation: Google’s Gboard uses “cross-device federated learning” across millions of phones, but only a few thousand participate in any given round. The algorithm must work with this tiny, changing sample.
Solutions: Asynchronous aggregation (don’t wait for all clients), device tier filtering (only use devices above a minimum capability threshold), and careful client selection to balance participation.
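Two of these ideas, device tier filtering and not waiting for every client, can be sketched as follows. This is a simplified, deadline-based variant rather than true asynchronous aggregation: the server samples eligible devices, then aggregates only the reports that arrive in time. The `Client` fields (`capability`, `finish_time`) are hypothetical stand-ins for whatever device signals a real system would collect.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Client:
    client_id: int
    capability: float   # hypothetical device score (higher is faster)
    finish_time: float  # hypothetical seconds needed for local training


def select_clients(clients: List[Client], min_capability: float,
                   sample_size: int, rng: random.Random) -> List[Client]:
    """Tier filtering followed by uniform random sampling."""
    eligible = [c for c in clients if c.capability >= min_capability]
    return rng.sample(eligible, min(sample_size, len(eligible)))


def reports_before_deadline(selected: List[Client],
                            deadline: float) -> List[Client]:
    """Keep only clients that finish in time; stragglers are dropped this round."""
    return [c for c in selected if c.finish_time <= deadline]


rng = random.Random(0)
pool = [Client(i, capability=i / 10, finish_time=float(i)) for i in range(10)]
selected = select_clients(pool, min_capability=0.5, sample_size=3, rng=rng)
on_time = reports_before_deadline(selected, deadline=7.0)
```

Dropping slow or low-tier devices can bias the model toward users with newer hardware, which is one reason client selection has to balance participation.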
Privacy: What Federated Learning Does and Doesn’t Provide
The intuitive privacy guarantee — “your data never leaves your device” — is real but weaker than it sounds:
What FL protects against: Direct exposure of raw data. The server sees only model updates, never the training examples themselves, so recovering an example requires a deliberate reconstruction attack. This is better than sending raw data.
What FL doesn’t protect against: Sophisticated attacks on model updates. Phong et al. (2017) demonstrated gradient inversion attacks — reconstructing training examples from gradients. Zhu et al. (2019) showed high-fidelity image reconstruction from gradients. A malicious server can potentially extract private information from updates.
Differential Privacy (DP): The gold standard privacy addition to FL. Each client clips gradients to bounded magnitude, then adds calibrated Gaussian noise before sending:
$$\tilde{g}_k = \text{clip}(g_k, C) + \mathcal{N}(0, \sigma^2 C^2 I)$$
With suitably calibrated noise, this provides $(\epsilon, \delta)$-differential privacy: the probability of any outcome changes by at most a multiplicative factor of $e^\epsilon$, plus a small additive slack $\delta$, between runs with and without any individual’s data. The tradeoff: more privacy (lower $\epsilon$) requires more noise, which hurts model accuracy.
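The clip-then-noise step can be sketched as below. The code assumes L2-norm clipping, which is the usual reading of $\text{clip}(g_k, C)$, and adds independent Gaussian noise with standard deviation $\sigma C$ to each coordinate. It does not compute the resulting $\epsilon$; that requires a separate privacy accountant. The function names are illustrative.

```python
import math
import random
from typing import List


def clip_by_l2_norm(g: List[float], clip_norm: float) -> List[float]:
    """Scale g down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in g]


def privatize(g: List[float], clip_norm: float, sigma: float,
              rng: random.Random) -> List[float]:
    """Clip the update, then add N(0, (sigma * clip_norm)^2) noise per coordinate."""
    clipped = clip_by_l2_norm(g, clip_norm)
    return [x + rng.gauss(0.0, sigma * clip_norm) for x in clipped]


rng = random.Random(0)
clipped = clip_by_l2_norm([3.0, 4.0], clip_norm=1.0)  # norm 5 scaled to norm 1
noisy = privatize([3.0, 4.0], clip_norm=1.0, sigma=0.5, rng=rng)
```

Clipping bounds how much any single client can move the aggregate, which is what lets the noise scale be stated in terms of $C$.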
Secure Aggregation: Cryptographic protocols allow the server to learn only the aggregate of client updates, not individual updates. Used in combination with DP for stronger protection.
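One common construction behind secure aggregation is pairwise masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum. The toy sketch below shows only that cancellation. It assumes integer-quantized updates, uses a single seeded generator as a stand-in for pairwise shared secrets, and omits key agreement and dropout handling, so it is not secure as written.

```python
import random
from typing import List

MODULUS = 2 ** 32  # arithmetic is done modulo this value (assumed for the toy)


def masked_updates(updates: List[List[int]], seed: int) -> List[List[int]]:
    """Return each client's update with pairwise masks applied."""
    rng = random.Random(seed)  # stand-in for secrets shared between client pairs
    n, dim = len(updates), len(updates[0])
    masked = [[v % MODULUS for v in u] for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.randrange(MODULUS) for _ in range(dim)]
            # Client i adds the mask, client j subtracts the same mask.
            masked[i] = [(a + m) % MODULUS for a, m in zip(masked[i], mask)]
            masked[j] = [(a - m) % MODULUS for a, m in zip(masked[j], mask)]
    return masked


def aggregate(masked: List[List[int]]) -> List[int]:
    """Server-side sum; the pairwise masks cancel modulo MODULUS."""
    dim = len(masked[0])
    return [sum(u[d] for u in masked) % MODULUS for d in range(dim)]


updates = [[1, 2], [3, 4], [5, 6]]
total = aggregate(masked_updates(updates, seed=0))
```

The server recovers the same sum it would get from the raw updates, while each individual masked vector looks like random noise on its own.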
Real-World Deployments
Google Gboard (2017): Next-word prediction, query suggestions. First large-scale production federated learning deployment. The model on your Android keyboard was trained partially by your phone without your messages leaving it.
Apple (iOS 14+): On-device intelligence for Siri, keyboard, and health features uses both federated learning and purely on-device learning whose results never leave the device.
Healthcare: The FeTS (Federated Tumor Segmentation) project trained brain tumor segmentation models across 71 institutions on 6 continents, without any hospital sharing patient images with any other hospital. Published in Nature Communications, 2022.
Financial services: Banks can train fraud detection models collaboratively across institutions without sharing transaction records; each bank contributes without seeing competitors’ data.
One thing to remember: Federated learning makes privacy-preserving collaborative training possible, but the privacy is probabilistic, not absolute — differential privacy adds the mathematical guarantees, at a cost to model quality.