Federated Learning — Core Concepts

Google's privacy-preserving ML technique explained: FedAvg algorithm, communication rounds, non-IID data challenges, and why hospitals are using this for medical AI.

The Central Problem With Standard ML Training

Training a machine learning model requires data — lots of it. The standard approach: centralize as much data as possible, train on it all. This has been remarkably effective but has significant costs:

Privacy risk: Centralized data is a high-value target for breaches
Regulatory barriers: GDPR, HIPAA, and other regulations restrict moving sensitive data
Communication cost: Moving petabytes of data to central servers is expensive
Trust barriers: Organizations (hospitals, banks, competing companies) won’t share data with each other even when a joint model would benefit everyone

Federated Learning, introduced by Google in 2016 (McMahan et al., “Communication-Efficient Learning of Deep Networks from Decentralized Data”), inverts the paradigm: instead of moving data to the model, move the model to the data.

How It Works: FedAvg

The canonical algorithm is Federated Averaging (FedAvg):

Round initialization: The central server holds a global model $w_t$.

Client selection: A subset $S_t$ of available clients is selected (e.g., 100 out of 1 million phones that are plugged in and on Wi-Fi).

Local training: Each selected client $k$ downloads the global model $w_t$, initializes $w_t^k \leftarrow w_t$, and runs SGD on its own data for $E$ epochs with local learning rate $\eta$, repeating this step for every mini-batch: $$w_t^k \leftarrow w_t^k - \eta \nabla \mathcal{L}_k(w_t^k)$$

Aggregation: Each client sends back only the updated weights $w_t^k$ (not the training data). The server aggregates: $$w_{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_t^k$$

Where $n_k$ is the number of data points on client $k$ and $n = \sum_{k \in S_t} n_k$. The weighted average favors clients with more data.

This completes one communication round. Training proceeds over many rounds — typically hundreds to thousands.
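The round described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a production implementation: the model is a two-weight linear regressor, the function names (`predict`, `local_train`, `fedavg_round`) are hypothetical, and every client's labels come from the same noise-free rule $y = 2x_1 - x_2$, so the clients differ only in size.

```python
import random

def predict(w, x):
    """Linear model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def local_train(w_global, data, lr, epochs):
    """Client step: start from the global weights, run per-example SGD for `epochs` passes."""
    w = list(w_global)  # copy so the server's model is not mutated
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            # Gradient of the squared error (w.x - y)^2 with respect to w is 2*err*x.
            w = [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]
    return w

def fedavg_round(w_global, clients, num_selected, lr, epochs, rng):
    """Server step: sample clients, collect local weights, average them by data size."""
    selected = rng.sample(clients, num_selected)
    n = sum(len(data) for data in selected)
    w_next = [0.0] * len(w_global)
    for data in selected:
        w_k = local_train(w_global, data, lr, epochs)
        share = len(data) / n  # n_k / n
        w_next = [acc + share * wi for acc, wi in zip(w_next, w_k)]
    return w_next

# Toy setup (assumed for illustration): 5 clients of different sizes,
# all labelled by y = 2*x1 - x2 with no noise.
rng = random.Random(0)
clients = []
for size in (10, 20, 30, 40, 50):
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(size)]
    clients.append([(x, 2 * x[0] - x[1]) for x in xs])

w = [0.0, 0.0]
for _ in range(50):  # 50 communication rounds, 3 of 5 clients per round
    w = fedavg_round(w, clients, num_selected=3, lr=0.1, epochs=2, rng=rng)
```

Because all clients here share the same underlying rule, the global weights settle near `[2, -1]`. Only the weight lists cross the client/server boundary; the `(x, y)` pairs stay inside `local_train`.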

The Non-IID Data Problem

Standard distributed machine learning assumes data is IID — independently and identically distributed across clients. Federated learning explicitly violates this: each device has data reflecting the behavior of one person or institution.

A phone keyboard dataset might have:

Client A: exclusively texts in Spanish
Client B: uses lots of medical terminology
Client C: heavy emoji user with minimal text

Each client’s data is a biased sample of the overall population. This non-IID property can cause FedAvg to converge more slowly and to a worse solution than centralized training, or to diverge entirely with high local learning rates.

Quantifying the problem: convergence bounds for FedAvg worsen as the clients’ gradients move further from the global gradient, a gap commonly measured as the gradient divergence $\sum_k \frac{n_k}{n} \|\nabla \mathcal{L}_k(w) - \nabla \mathcal{L}(w)\|^2$.
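As a rough illustration, the divergence measure can be computed directly from per-client gradients. The sketch below assumes the global objective is the data-weighted average of the client objectives, so the global gradient is the weighted mean of the client gradients; the function name `gradient_divergence` is hypothetical.

```python
def gradient_divergence(client_grads, client_sizes):
    """Weighted mean squared L2 distance between each client gradient and the global gradient."""
    n = sum(client_sizes)
    dim = len(client_grads[0])
    # Assumption: global gradient = sum_k (n_k / n) * client gradient k.
    global_grad = [
        sum(n_k / n * g[i] for g, n_k in zip(client_grads, client_sizes))
        for i in range(dim)
    ]
    return sum(
        n_k / n * sum((gi - Gi) ** 2 for gi, Gi in zip(g, global_grad))
        for g, n_k in zip(client_grads, client_sizes)
    )

# Identical client gradients (IID-like case): no divergence.
iid_like = gradient_divergence([[1.0, 0.0], [1.0, 0.0]], [50, 50])

# Opposing client gradients (strongly non-IID case): large divergence.
non_iid = gradient_divergence([[1.0, 0.0], [-1.0, 0.0]], [50, 50])
```

In the second call the two equally sized clients pull in opposite directions, so the global gradient is zero while each client sits at squared distance 1 from it.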

Practical mitigations:

FedProx: Adds a proximal term to each client’s local objective that penalizes divergence from the global model. Prevents clients from updating too aggressively in their local direction.
SCAFFOLD: Uses control variates to correct for client drift — each client tracks the difference between its local update direction and the global direction.
Personalized FL: Instead of one global model, each client learns a personalized model that blends global and local learning.
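To make the FedProx idea concrete: the client minimizes $\mathcal{L}_k(w) + \frac{\mu}{2}\|w - w_t\|^2$ instead of $\mathcal{L}_k(w)$ alone, which adds a pull of strength $\mu$ back toward the global model in every gradient step. A minimal sketch of one such step follows; the function name `fedprox_step` and the plain-list representation are illustrative assumptions.

```python
def fedprox_step(w, w_global, grad, lr, mu):
    """One local SGD step on L_k(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the client's own loss L_k at `w`.
    The proximal term contributes mu * (w - w_global) to the total gradient.
    """
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, wgi, gi in zip(w, w_global, grad)
    ]

# With mu = 0 the step reduces to ordinary SGD.
plain = fedprox_step([1.0], [0.0], [0.5], lr=0.1, mu=0.0)

# With a zero local gradient, the proximal term alone pulls w toward w_global.
pulled = fedprox_step([1.0], [0.0], [0.0], lr=0.1, mu=1.0)
```

Larger values of `mu` keep local models closer to the global one, at the price of slower local progress.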

System Heterogeneity

Real federated learning deployments face massive system heterogeneity:

Device diversity: Clients range from high-end phones to 4-year-old budget devices. Local training on slow devices can’t complete in the same time window as fast devices.

Stragglers and dropouts: Some clients finish late, and others disconnect mid-round (battery dies, network drops). Strictly synchronous aggregation would wait for every client, creating huge delays.

Partial participation: Google’s Gboard uses “cross-device federated learning” across millions of phones, but only a few thousand participate in any given round. The algorithm must work with this tiny, changing sample.

Solutions: Asynchronous aggregation (don’t wait for all clients), device tier filtering (only use devices above a minimum capability threshold), and careful client selection to balance participation.
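One simple way to avoid waiting for every client is to aggregate whatever has arrived by a deadline and renormalize the weights over the clients that reported. The sketch below shows that variant only; it is not a full asynchronous protocol, and the tuple layout and the name `aggregate_reported` are assumptions made for illustration.

```python
def aggregate_reported(updates, deadline):
    """Average the updates that arrived on time, weighted by client data size.

    updates: list of (finish_time, n_k, weights) tuples, one per selected client.
    Returns the weighted average as a list, or None if nobody reported in time.
    """
    on_time = [(n_k, w) for finish_time, n_k, w in updates if finish_time <= deadline]
    if not on_time:
        return None
    n = sum(n_k for n_k, _ in on_time)  # renormalize over reporting clients only
    dim = len(on_time[0][1])
    return [sum(n_k / n * w[i] for n_k, w in on_time) for i in range(dim)]

updates = [
    (1.0, 10, [1.0]),    # fast client
    (2.0, 30, [3.0]),    # slower client, still on time
    (9.0, 60, [100.0]),  # straggler, misses the deadline
]
w_next = aggregate_reported(updates, deadline=5.0)
```

Here the straggler's update is ignored and the result is (10 × 1 + 30 × 3) / 40 = 2.5. Dropping slow clients systematically can bias the model toward users with faster devices, which is why client selection also has to balance participation.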

Privacy: What Federated Learning Does and Doesn’t Provide

The intuitive privacy guarantee — “your data never leaves your device” — is real but weaker than it sounds:

What FL protects against: Raw training data never leaves the client, so the server sees only model updates and cannot simply read individual training examples. This is better than sending raw data.

What FL doesn’t protect against: Sophisticated attacks on model updates. Phong et al. (2017) demonstrated gradient inversion attacks — reconstructing training examples from gradients. Zhu et al. (2019) showed high-fidelity image reconstruction from gradients. A malicious server can potentially extract private information from updates.

Differential Privacy (DP): The gold standard privacy addition to FL. Each client clips its update $g_k$ to an L2 norm of at most $C$, then adds calibrated Gaussian noise with noise multiplier $\sigma$ before sending:

$$\tilde{g}_k = \text{clip}(g_k, C) + \mathcal{N}(0, \sigma^2 C^2 I)$$

With suitably calibrated noise, this provides $(\epsilon, \delta)$-differential privacy: for any two datasets that differ in one individual’s data, the probability of any set of outcomes changes by at most a multiplicative factor of $e^\epsilon$, plus an additive slack of $\delta$. The tradeoff: more privacy (lower $\epsilon$) requires more noise, which hurts model accuracy.
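The clip-then-noise step can be sketched as follows. This is only the mechanical part: it assumes clipping by L2 norm and per-coordinate Gaussian noise with standard deviation $\sigma C$, and it does not compute the resulting $\epsilon$, which requires a separate privacy accountant. The names `clip` and `privatize` are hypothetical.

```python
import math
import random

def clip(g, C):
    """Scale g down so its L2 norm is at most C; leave it unchanged if already within bounds."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

def privatize(g, C, sigma, rng):
    """Clip g to norm C, then add independent N(0, (sigma * C)^2) noise to each coordinate."""
    return [x + rng.gauss(0.0, sigma * C) for x in clip(g, C)]

g = [3.0, 4.0]                      # L2 norm is 5
clipped = clip(g, C=1.0)            # rescaled to norm 1
noisy = privatize(g, C=1.0, sigma=0.5, rng=random.Random(0))
```

Clipping bounds how much any single client can move the aggregate, and the noise scale is tied to that same bound, which is why $C$ appears in both terms of the formula.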

Secure Aggregation: Cryptographic protocols allow the server to learn only the aggregate of client updates, not individual updates. Used in combination with DP for stronger protection.
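The core trick behind many secure aggregation schemes is pairwise masking: every pair of clients agrees on a random mask, one adds it and the other subtracts it, so the masks cancel in the sum. The toy sketch below shows only that cancellation. It assumes integer-encoded updates modulo $2^{32}$, uses a single seeded generator as a stand-in for pairwise shared secrets, and ignores key agreement and client dropouts, all of which a real protocol has to handle.

```python
import random

MOD = 2 ** 32  # assumed modulus for integer-encoded updates

def mask_updates(updates, seed=0):
    """Return masked copies of the updates; masks cancel when all copies are summed.

    updates: list of equal-length lists of ints in [0, MOD).
    """
    rng = random.Random(seed)  # stand-in for secrets shared between each client pair
    masked = [list(u) for u in updates]
    dim = len(updates[0])
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = [rng.randrange(MOD) for _ in range(dim)]
            masked[i] = [(a + m) % MOD for a, m in zip(masked[i], mask)]  # client i adds
            masked[j] = [(a - m) % MOD for a, m in zip(masked[j], mask)]  # client j subtracts
    return masked

def aggregate(masked):
    """Server side: sum the masked updates coordinate-wise, modulo MOD."""
    dim = len(masked[0])
    return [sum(u[d] for u in masked) % MOD for d in range(dim)]

updates = [[1, 2], [3, 4], [5, 6]]
masked = mask_updates(updates)
total = aggregate(masked)  # equals the true sum [9, 12]
```

Each masked update looks like random noise on its own, yet the server still recovers the exact sum.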

Real-World Deployments

Google Gboard (2017): Next-word prediction, query suggestions. First large-scale production federated learning deployment. The model on your Android keyboard was trained partially by your phone without your messages leaving it.

Apple (iOS 14+): On-device intelligence for Siri, keyboard, and health features combines federated learning with purely on-device learning, where nothing leaves the device.

Healthcare: The FeTS (Federated Tumor Segmentation) initiative trained brain tumor segmentation models across 71 institutions on 6 continents, without any hospital sharing patient images with any other hospital. Published in Nature Communications, 2022.

Financial services: Banks can train fraud detection models collaboratively across institutions without sharing transaction records; each bank contributes without seeing competitors’ data.

One thing to remember: Federated learning makes privacy-preserving collaborative training possible, but the privacy is probabilistic, not absolute — differential privacy adds the mathematical guarantees, at a cost to model quality.
