Federated Learning — Deep Dive

FedAvg convergence analysis, gradient inversion attacks, differential privacy accounting, personalized FL with MAML and pFedMe, and the emerging cross-silo vs. cross-device taxonomy.

FedAvg Convergence: Theory and Limitations

The original FedAvg paper provided empirical evidence but no convergence guarantees. Subsequent theoretical work characterized when and why it works.

Li et al. (2020) proved that FedAvg converges for non-convex objectives under the following assumptions:

The objective is $L$-smooth, i.e. its gradient is $L$-Lipschitz: $\|\nabla f(x) - \nabla f(y)\| \leq L\|x-y\|$
Gradients are bounded: $\|\nabla f_k(x)\|^2 \leq G^2$
Gradient divergence is bounded: $\frac{1}{K}\sum_k \|\nabla f_k(w) - \nabla f(w)\|^2 \leq \Gamma$

The convergence rate is:

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(w_t)\|^2 \leq O\left(\frac{1}{\sqrt{TKE}} + \frac{\Gamma E}{\sqrt{T}}\right)$$

Key implications:

The first term is the vanilla SGD convergence rate: it shrinks as the number of rounds $T$, the number of clients $K$, or the number of local steps $E$ grows
The second term is the data heterogeneity penalty: it scales with $\Gamma$ (how different the clients are) and with $E$ (the number of local steps)

The fundamental tradeoff: Increasing the number of local steps $E$ reduces the number of communication rounds (which are expensive) but increases client drift (which hurts convergence). The optimal $E$ balances these.

With IID data ($\Gamma = 0$), FedAvg matches centralized training asymptotically. With highly non-IID data ($\Gamma \gg 0$), increasing local steps $E$ actually hurts — you want small $E$ (more frequent communication).
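The drift effect can be seen on a toy problem. The sketch below is an illustration only, not taken from any of the cited papers: it assumes two hypothetical clients with 1-D quadratic objectives $f_k(w) = \frac{1}{2} a_k (w - b_k)^2$, exact (noise-free) gradients, and full participation, so only the heterogeneity term is visible. The names `fedavg`, `clients`, and `w_star` are made up for this example.

```python
def fedavg(clients, local_steps, rounds=200, lr=0.1, w0=0.0):
    """Run FedAvg with full participation on 1-D quadratic clients.

    Each client is a pair (a, b) with objective f_k(w) = 0.5 * a * (w - b)**2.
    """
    w = w0
    for _ in range(rounds):
        local_models = []
        for a, b in clients:
            w_k = w  # every client starts from the current global model
            for _ in range(local_steps):
                w_k -= lr * a * (w_k - b)  # exact local gradient step
            local_models.append(w_k)
        w = sum(local_models) / len(local_models)  # server averages the models
    return w


# Two heterogeneous clients: different curvature and different minimizers.
clients = [(1.0, 0.0), (4.0, 1.0)]
# Minimizer of the average objective: sum(a * b) / sum(a).
w_star = sum(a * b for a, b in clients) / sum(a for a, _ in clients)

for E in (1, 5, 20):
    w = fedavg(clients, local_steps=E)
    print(f"E={E:2d}  w={w:.4f}  |w - w*|={abs(w - w_star):.4f}")
```

With one local step the average of the local models is an exact gradient step on the global objective, so the iterates reach the global minimizer. With more local steps each client moves toward its own minimizer before averaging, and the fixed point of the procedure shifts away from the global minimizer. The number of rounds is held fixed here, so the sketch shows the drift penalty but not the communication savings.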

SCAFFOLD: Correcting Client Drift

FedProx adds a quadratic penalty to prevent drift, but SCAFFOLD (Karimireddy et al., 2020) addresses the root cause directly.

Each client $k$ maintains a control variate $c_k$ estimating its local gradient direction. The server maintains a global control variate $c$ estimating the global gradient direction.

Local update corrected for drift: $$g_k \leftarrow \nabla f_k(w) - c_k + c$$

The difference $c_k - c$ represents the client’s directional bias. By subtracting it, each client’s update is corrected toward the global gradient direction.

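After local training, clients update their control variates, where $w$ is the global model received at the start of the round, $w^k$ is client $k$'s model after $E$ local steps, and $\eta$ is the local learning rate: $$c_k^+ \leftarrow c_k - c + \frac{w - w^k}{E\eta}$$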

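SCAFFOLD achieves the same convergence rate as centralized SGD (no gradient divergence penalty), but it doubles the communication per round, since each client exchanges both a model update and a control variate update, and it requires clients to keep their control variate as state between rounds.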
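A minimal sketch of the two update rules above, under simplifying assumptions that are not part of the original algorithm description: two hypothetical 1-D quadratic clients $f_k(w) = \frac{1}{2} a_k (w - b_k)^2$, exact gradients, full participation, and a server step size of 1. The function name `scaffold` and the client values are illustrative.

```python
def scaffold(clients, local_steps, rounds=200, lr=0.1, w0=0.0):
    """SCAFFOLD on 1-D quadratic clients (a, b), f_k(w) = 0.5 * a * (w - b)**2."""
    w = w0
    c = 0.0                     # server control variate
    c_k = [0.0] * len(clients)  # per-client control variates (client-side state)
    for _ in range(rounds):
        new_models, new_c_k = [], []
        for (a, b), ck in zip(clients, c_k):
            y = w  # local model starts from the global model
            for _ in range(local_steps):
                grad = a * (y - b)
                y -= lr * (grad - ck + c)  # drift-corrected local step
            new_models.append(y)
            # Control variate update: c_k - c + (w - w^k) / (E * eta).
            new_c_k.append(ck - c + (w - y) / (local_steps * lr))
        w = sum(new_models) / len(clients)
        # With full participation, c stays equal to the mean of the c_k.
        c += sum(n - o for n, o in zip(new_c_k, c_k)) / len(clients)
        c_k = new_c_k
    return w


clients = [(1.0, 0.0), (4.0, 1.0)]
w_star = sum(a * b for a, b in clients) / sum(a for a, _ in clients)
w_scaffold = scaffold(clients, local_steps=20)
print(f"w={w_scaffold:.6f}  w*={w_star:.6f}")
```

On this toy problem the corrected updates reach the minimizer of the average objective even with 20 local steps per round. The per-client list `c_k` is the state that each client would have to store between rounds in a real deployment.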

Gradient Inversion Attacks

Zhu et al. (2019) “Deep Leakage from Gradients” demonstrated that training examples can be reconstructed from gradients with surprising fidelity. The attack: given gradients $\nabla W = \partial \mathcal{L}(x, y) / \partial W$, solve for $(x', y')$ that would produce the same gradients:

$$\min_{x', y'} \|\nabla W(x', y') - \nabla W\|^2$$

The objective is minimized with a gradient-based optimizer (L-BFGS in the original paper) over the dummy input $x'$ and label $y'$. For small images (e.g., 32×32), this reconstruction is nearly perfect. For higher resolutions, approximate reconstruction with identifiable content is feasible.

Factors affecting attack success:

Batch size: Single-sample gradients are the easiest to invert; larger batches are harder because the gradient averages information across samples. At batch size 16+, reconstruction quality degrades significantly.
Network depth: Deeper networks mix information more, making inversion harder
Gradient compression: Sparsified or quantized gradients reduce information available to the attacker
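The batch-size effect can be illustrated with a much simpler setting than the iterative attack. The sketch below is not the optimization-based method of Zhu et al.; it uses the closed-form special case of a linear model with a bias and squared loss, where a single-sample gradient reveals the input exactly because $\partial \mathcal{L}/\partial w = r\,x$ and $\partial \mathcal{L}/\partial b = r$ for residual $r$. All names and numbers are hypothetical.

```python
def grads(w, b, batch):
    """Mean gradient of 0.5 * (w.x + b - y)**2 over a batch of (x, y) pairs."""
    g_w = [0.0] * len(w)
    g_b = 0.0
    for x, y in batch:
        r = sum(wi * xi for wi, xi in zip(w, x)) + b - y  # residual
        for i, xi in enumerate(x):
            g_w[i] += r * xi / len(batch)
        g_b += r / len(batch)
    return g_w, g_b


def invert(g_w, g_b):
    """Recover x from a single-sample gradient: dL/dw = r * x and dL/db = r."""
    return [gi / g_b for gi in g_w]


w, b = [0.5, -1.0, 2.0], 0.1
secret = ([3.0, 1.0, 4.0], 2.0)
other = ([-2.0, 7.0, 1.0], -1.0)

single = invert(*grads(w, b, [secret]))         # equals secret[0] up to rounding
mixed = invert(*grads(w, b, [secret, other]))   # residual-weighted blend of both inputs
print(single, mixed)
```

With one sample the division recovers the private input; with two samples the same division returns a blend weighted by the residuals, which matches neither input. Deep networks do not admit this shortcut in general, which is why the attack described above resorts to optimization.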

Defenses:

Gradient clipping alone is insufficient
Differential privacy with appropriate noise is the theoretically sound defense
Gradient compression (top-k sparsification) partially mitigates attacks as a side effect

Differential Privacy in FL: $(\epsilon, \delta)$-DP Accounting

Each round of federated learning with DP involves:

Clip each client’s update $g_k$ to $\ell_2$ norm $C$: $\hat{g}_k = g_k / \max(1, \|g_k\|_2 / C)$
Aggregate: $\bar{g} = \frac{1}{|S_t|}\sum_k \hat{g}_k$
Noise: $\tilde{g} = \bar{g} + \mathcal{N}(0, \sigma^2 C^2 / |S_t|^2 \cdot I)$
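The three steps above can be sketched in a few lines. This is a minimal illustration, assuming updates are plain Python lists of floats and that the caller supplies a `random.Random` instance; the names `clip` and `dp_aggregate` are made up for this example, and a real deployment would need a cryptographically secure noise source.

```python
import math
import random


def clip(update, clip_norm):
    """Scale `update` so its l2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = max(1.0, norm / clip_norm)
    return [v / scale for v in update]


def dp_aggregate(updates, clip_norm, noise_multiplier, rng):
    """Clip each update, average, and add Gaussian noise to the average."""
    n = len(updates)
    dim = len(updates[0])
    clipped = [clip(u, clip_norm) for u in updates]
    mean = [sum(u[i] for u in clipped) / n for i in range(dim)]
    # Standard deviation sigma * C / |S_t|, i.e. variance sigma^2 C^2 / |S_t|^2.
    std = noise_multiplier * clip_norm / n
    return [m + rng.gauss(0.0, std) for m in mean]


rng = random.Random(0)
updates = [[3.0, 4.0], [0.3, 0.4]]  # first update has norm 5, second has norm 0.5
noisy = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
exact = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
print(noisy, exact)
```

Setting `noise_multiplier=0.0` disables the noise and returns the plain average of the clipped updates, which is convenient for checking the clipping step in isolation.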

The privacy cost per round is characterized by the subsampled Gaussian mechanism: with noise multiplier $\sigma$ and subsampling ratio $q = |S_t|/N$, where $N$ is the total number of clients, each round satisfies $(\epsilon_0, \delta_0)$-DP for some per-round budget.

Over $T$ rounds, the privacy costs compose. The Moments Accountant (Abadi et al., 2016) and Rényi Differential Privacy (Mironov, 2017) provide tighter composition bounds than naive composition, in which the per-round $\epsilon_0$ values simply add up:

Using RDP composition: $\epsilon(T) = O\left(\frac{q\sqrt{T \log(1/\delta)}}{\sigma}\right)$

The model utility-privacy tradeoff:

Higher $\sigma$ → more noise → stronger privacy (lower $\epsilon$) → worse model accuracy
Larger $|S_t|$ → the noise is divided among more participants → better accuracy at same privacy cost
More training rounds → higher cumulative privacy cost
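The direction of these tradeoffs can be checked numerically. The sketch below is a simplified accountant, assuming full participation ($q = 1$), so it does not capture the amplification from subsampling. It uses two standard facts from Mironov (2017): a Gaussian mechanism with noise multiplier $\sigma$ satisfies $(\alpha, \alpha / 2\sigma^2)$-RDP, and $(\alpha, \epsilon)$-RDP implies $(\epsilon + \log(1/\delta)/(\alpha - 1), \delta)$-DP. The function name and the grid of orders are illustrative choices.

```python
import math


def rdp_epsilon(noise_multiplier, rounds, delta, orders=None):
    """Epsilon after `rounds` Gaussian mechanisms, via RDP, assuming q = 1."""
    if orders is None:
        orders = [1 + x / 10.0 for x in range(1, 1000)]  # alpha in (1, 101)
    best = float("inf")
    for alpha in orders:
        # RDP composes additively across rounds.
        rdp = rounds * alpha / (2.0 * noise_multiplier ** 2)
        # Convert the RDP guarantee to (epsilon, delta)-DP.
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best


for sigma, T in [(10.0, 100), (20.0, 100), (10.0, 400)]:
    print(f"sigma={sigma:5.1f}  T={T:4d}  eps={rdp_epsilon(sigma, T, 1e-5):.3f}")
```

Doubling $\sigma$ roughly halves $\epsilon$, while quadrupling $T$ roughly doubles it, in line with the $\sqrt{T}/\sigma$ scaling above. Production systems generally rely on a maintained accountant that also handles subsampling rather than a hand-rolled one like this.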

DP-FedAvg: McMahan et al. (2018) showed that with appropriate $\sigma$ (~1.1 for $\epsilon \approx 8$) and a sufficiently large client population, FL with DP achieves accuracy within 1-2% of non-private FL on tasks like next-word prediction.

Personalized Federated Learning

A key insight: the single global model may not be optimal for any individual client. Personalization relaxes the one-model-fits-all constraint.

Local fine-tuning (simplest): Take the global model and fine-tune on each client’s local data for inference. Effective when clients have moderate data; may overfit with small local datasets.

MAML-based FL (Per-FedAvg): Treat the global model as a meta-initialization from which each client can reach a good personalized model in one or a few gradient steps. The server trains specifically for this, optimizing not the global loss but the loss after one local adaptation step with step size $\alpha$:

$$\min_w \sum_k F_k(w - \alpha \nabla F_k(w))$$
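To make the objective concrete, the sketch below computes its gradient for a hypothetical 1-D quadratic client $F_k(w) = \frac{1}{2} a (w - b)^2$, where the second derivative is simply $a$. By the chain rule, $\frac{d}{dw} F_k(w - \alpha F_k'(w)) = (1 - \alpha F_k''(w))\, F_k'(w - \alpha F_k'(w))$. This is a simplification for illustration: the published algorithm works with stochastic gradients and Hessian approximations, and the helper names here are made up.

```python
def loss(w, a, b):
    return 0.5 * a * (w - b) ** 2


def grad(w, a, b):
    return a * (w - b)


def adapted_loss(w, a, b, alpha):
    """F_k(w - alpha * F_k'(w)): the loss after one local adaptation step."""
    return loss(w - alpha * grad(w, a, b), a, b)


def meta_grad(w, a, b, alpha):
    """Derivative of adapted_loss with respect to w (chain rule, F_k'' = a)."""
    w_adapted = w - alpha * grad(w, a, b)
    return (1.0 - alpha * a) * grad(w_adapted, a, b)


def meta_step(w, clients, alpha, beta):
    """One server step on the sum of adapted losses, averaged over clients."""
    g = sum(meta_grad(w, a, b, alpha) for a, b in clients) / len(clients)
    return w - beta * g


clients = [(1.0, 0.0), (4.0, 1.0)]
w = 2.0
for _ in range(100):
    w = meta_step(w, clients, alpha=0.1, beta=0.5)
print(w)
```

The factor $(1 - \alpha a)$ is what distinguishes the meta-gradient from the ordinary gradient evaluated at the adapted point; dropping it gives a first-order approximation.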

pFedMe: Each client $k$ solves: $$\min_{\theta_k} \left\{ F_k(\theta_k) + \frac{\lambda}{2}\|\theta_k - w\|^2 \right\}$$

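Here $w$ is the global model. The $\lambda$ term prevents personalized models from diverging too far from the global model: a large $\lambda$ pulls $\theta_k$ toward $w$, while a small $\lambda$ leaves more room for personalization. The server aggregates the clients’ updates to the global model; clients maintain personalized models locally.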
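The role of $\lambda$ is easy to see on a toy client. The sketch below covers only the personalization subproblem, not the full algorithm with its server-side update, and assumes a hypothetical 1-D quadratic client $F_k(\theta) = \frac{1}{2} a (\theta - b)^2$, for which the minimizer has the closed form $(a b + \lambda w) / (a + \lambda)$. The function names, step size, and values are illustrative.

```python
def personalize(w, a, b, lam, lr=0.05, steps=500):
    """Approximately solve min_theta F_k(theta) + lam/2 * (theta - w)**2.

    Plain gradient descent; this step size is stable only while lr * (a + lam) < 2.
    """
    theta = w
    for _ in range(steps):
        g = a * (theta - b) + lam * (theta - w)  # gradient of the regularized loss
        theta -= lr * g
    return theta


def personalize_closed_form(w, a, b, lam):
    """Exact minimizer for the quadratic client."""
    return (a * b + lam * w) / (a + lam)


w_global, a, b = 0.8, 4.0, 1.0  # the client's own minimizer is b = 1.0
for lam in (0.1, 1.0, 10.0):
    theta = personalize(w_global, a, b, lam)
    print(f"lambda={lam:5.1f}  theta={theta:.4f}")
```

Small $\lambda$ leaves the personalized model near the client's own minimizer, and large $\lambda$ keeps it near the global model.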

Federated Split Learning: The model is split — early layers are client-specific (personalized), later layers are global (shared). This reduces communication (only the split point activations and gradients are communicated) while allowing personalization.

Cross-Silo vs. Cross-Device FL

The federated learning literature distinguishes two settings:

Cross-device FL:

Millions of clients (phones, IoT devices)
Each client participates rarely (once every few weeks or months)
High system heterogeneity, unreliable connectivity
Small local datasets
Example: Google Gboard, Apple on-device learning

Cross-silo FL:

Few clients (10–100 institutions: hospitals, banks, companies)
Clients participate in every round
More reliable, higher-bandwidth connections
Large local datasets
Example: Medical imaging collaboratives, financial fraud detection

The two settings have very different algorithm requirements. Cross-device FL needs robust partial participation handling; cross-silo FL can use synchronous protocols with more complex cryptographic guarantees (multi-party computation, homomorphic encryption).
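As a rough illustration of what partial participation handling means in the cross-device setting, the sketch below samples a cohort, tolerates dropouts, and abandons the round if too few clients report. It is a toy model under stated assumptions: each client is a hypothetical callable returning `(n_examples, local_model)` with a scalar model, and `min_reports` and `dropout_prob` are made-up parameters, not taken from any specific framework.

```python
import random


def run_round(global_w, clients, cohort_size, min_reports, dropout_prob, rng):
    """One round with client sampling and dropout tolerance.

    `clients` maps a client id to a function global_w -> (n_examples, local_w).
    Returns the new global model and the number of clients that reported.
    """
    cohort = rng.sample(sorted(clients), cohort_size)
    reports = []
    for cid in cohort:
        if rng.random() < dropout_prob:
            continue  # device went offline or timed out
        reports.append(clients[cid](global_w))
    if len(reports) < min_reports:
        return global_w, len(reports)  # abandon the round, keep the old model
    total = sum(n for n, _ in reports)
    # Average weighted by the number of local examples.
    new_w = sum(n * w for n, w in reports) / total
    return new_w, len(reports)


clients = {
    0: lambda w: (10, w + 1.0),
    1: lambda w: (30, w - 1.0),
}
rng = random.Random(0)
all_report = run_round(0.0, clients, cohort_size=2, min_reports=1, dropout_prob=0.0, rng=rng)
none_report = run_round(0.0, clients, cohort_size=2, min_reports=1, dropout_prob=1.0, rng=rng)
print(all_report, none_report)
```

In a cross-silo deployment the sampling and dropout logic is usually unnecessary, since every institution is expected to report in every round.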

PySyft (OpenMined) and TensorFlow Federated are the primary open-source frameworks. Flower (Adap, 2020) provides a framework-agnostic FL server supporting both settings.

One thing to remember: Federated learning’s practical challenges — non-IID data, client drift, gradient privacy — each have principled solutions, but deploying all of them simultaneously requires careful engineering tradeoffs between privacy, accuracy, and communication cost.
