Differential Privacy in Python — Deep Dive
Implementing basic mechanisms from scratch
Before using libraries, understanding the raw mechanics helps you reason about privacy guarantees.
import numpy as np

# Note: np.random is fine for learning. Production mechanisms need a
# cryptographically secure noise source and protection against
# floating-point attacks, which vetted libraries provide.

class LaplaceMechanism:
    """Add Laplace noise calibrated to sensitivity and epsilon (pure ε-DP)."""

    def __init__(self, epsilon: float, sensitivity: float = 1.0):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        self.epsilon = epsilon
        self.sensitivity = sensitivity
        self.scale = sensitivity / epsilon

    def release(self, true_value: float) -> float:
        noise = np.random.laplace(0, self.scale)
        return true_value + noise

    def release_count(self, true_count: int) -> float:
        """Counting queries have sensitivity 1: one person changes the count by at most 1."""
        noise = np.random.laplace(0, 1.0 / self.epsilon)
        return true_count + noise

class GaussianMechanism:
    """Add Gaussian noise for (epsilon, delta)-DP. Sensitivity is measured in the L2 norm."""

    def __init__(self, epsilon: float, delta: float, sensitivity: float = 1.0):
        self.epsilon = epsilon
        self.delta = delta
        self.sensitivity = sensitivity
        # Classical Gaussian calibration; the guarantee only holds for
        # 0 < epsilon < 1. The analytic Gaussian mechanism is a different,
        # tighter calibration and is not what this formula computes.
        self.sigma = (
            sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        )

    def release(self, true_value: float) -> float:
        noise = np.random.normal(0, self.sigma)
        return true_value + noise

# Example: private count
mechanism = LaplaceMechanism(epsilon=1.0)
true_count = 42857  # actual number of users who clicked
private_count = mechanism.release_count(true_count)
# private_count is typically within a few units of 42857: the noise has
# scale 1/ε = 1 (standard deviation ≈ 1.41), tiny relative to a large count
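To make "noise is tiny relative to large counts" concrete, the following minimal sketch computes how far a Laplace release is likely to land from the true value. The helper names `laplace_scale` and `laplace_error_bound` are hypothetical and not part of any library; the bound uses the Laplace tail probability P(|noise| > t) = exp(-t / b).

```python
import math


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of the Laplace noise for an epsilon-DP release."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_error_bound(
    sensitivity: float, epsilon: float, confidence: float = 0.95
) -> float:
    """Return t such that |noise| <= t with probability `confidence`.

    For Laplace(0, b), P(|noise| > t) = exp(-t / b),
    so t = b * ln(1 / (1 - confidence)).
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    b = laplace_scale(sensitivity, epsilon)
    return b * math.log(1.0 / (1.0 - confidence))


# Counting query (sensitivity 1) at a few illustrative epsilons
for eps in (0.1, 1.0, 10.0):
    bound = laplace_error_bound(1.0, eps)
    print(f"epsilon={eps}: 95% of releases fall within ±{bound:.2f}")
```

At ε=1 the 95% bound is ln(20), about 3, which is why a count in the tens of thousands barely moves. At ε=0.1 the same bound grows tenfold.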
Privacy budget accounting
A production system needs a budget tracker that prevents overspending:
from dataclasses import dataclass, field
from datetime import datetime, timezone

class PrivacyBudgetExhausted(Exception):
    """Raised when a query would exceed the remaining privacy budget."""

@dataclass
class PrivacyBudget:
    total_epsilon: float
    total_delta: float = 1e-5
    spent_epsilon: float = 0.0
    spent_delta: float = 0.0
    queries: list[dict] = field(default_factory=list)

    @property
    def remaining_epsilon(self) -> float:
        return self.total_epsilon - self.spent_epsilon

    @property
    def remaining_delta(self) -> float:
        return self.total_delta - self.spent_delta

    def can_afford(self, epsilon: float, delta: float = 0.0) -> bool:
        return (
            self.spent_epsilon + epsilon <= self.total_epsilon
            and self.spent_delta + delta <= self.total_delta
        )

    def spend(self, epsilon: float, delta: float = 0.0, description: str = "") -> None:
        if not self.can_afford(epsilon, delta):
            raise PrivacyBudgetExhausted(
                f"Requested ε={epsilon}, δ={delta} but only "
                f"ε={self.remaining_epsilon:.4f}, δ={self.remaining_delta:.2e} remaining"
            )
        self.spent_epsilon += epsilon
        self.spent_delta += delta
        self.queries.append({
            "epsilon": epsilon,
            "delta": delta,
            "description": description,
            # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
class PrivateQueryEngine:
    """Execute queries with automatic budget tracking."""

    def __init__(self, budget: PrivacyBudget):
        self.budget = budget

    def private_count(self, data, predicate, epsilon: float) -> float:
        # Spend first: an exhausted budget stops the query before anything is computed
        self.budget.spend(epsilon, description="count query")
        true_count = sum(1 for item in data if predicate(item))
        mechanism = LaplaceMechanism(epsilon=epsilon, sensitivity=1.0)
        return mechanism.release(true_count)

    def private_sum(
        self, data, value_fn, epsilon: float,
        lower_bound: float, upper_bound: float,
    ) -> float:
        # Clamp contributions to bound sensitivity
        clamped = [
            max(lower_bound, min(upper_bound, value_fn(item)))
            for item in data
        ]
        true_sum = sum(clamped)
        # Adding or removing one record changes the clamped sum by at most
        # max(|lower_bound|, |upper_bound|). This is the same add/remove
        # adjacency under which a count has sensitivity 1; upper - lower
        # would under-noise the sum whenever lower_bound > 0.
        sensitivity = max(abs(lower_bound), abs(upper_bound))
        self.budget.spend(epsilon, description=f"sum query (sensitivity={sensitivity})")
        mechanism = LaplaceMechanism(epsilon=epsilon, sensitivity=sensitivity)
        return mechanism.release(true_sum)

    def private_mean(
        self, data, value_fn, epsilon: float,
        lower_bound: float, upper_bound: float,
    ) -> float:
        """Split budget between count and sum, then divide."""
        eps_count = epsilon / 2
        eps_sum = epsilon / 2
        # Check the full cost up front so the mean is all-or-nothing;
        # each sub-query then records its own spend exactly once.
        if not self.budget.can_afford(epsilon):
            raise PrivacyBudgetExhausted(
                f"Mean query needs ε={epsilon} but only "
                f"ε={self.budget.remaining_epsilon:.4f} remaining"
            )
        noisy_count = self.private_count(data, lambda _: True, eps_count)
        noisy_sum = self.private_sum(
            data, value_fn, eps_sum, lower_bound, upper_bound
        )
        # Post-processing: guard against a zero or negative noisy count
        noisy_n = max(1.0, noisy_count)
        return noisy_sum / noisy_n
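The tracker above adds epsilons linearly, which is basic sequential composition. As a point of comparison, here is a minimal sketch of the advanced composition bound for k queries that each satisfy ε-DP. The function names are hypothetical, and `delta_prime` is the extra failure probability that the advanced bound adds to the overall δ.

```python
import math


def basic_composition(epsilon: float, k: int) -> float:
    """Total epsilon for k epsilon-DP queries under basic composition."""
    return k * epsilon


def advanced_composition(epsilon: float, k: int, delta_prime: float) -> float:
    """Total epsilon for k epsilon-DP queries under advanced composition.

    eps_total = eps * sqrt(2k ln(1/delta')) + k * eps * (e^eps - 1),
    and the overall guarantee gains an additive delta_prime.
    """
    if not 0 < delta_prime < 1:
        raise ValueError("delta_prime must be strictly between 0 and 1")
    return (
        epsilon * math.sqrt(2 * k * math.log(1 / delta_prime))
        + k * epsilon * (math.exp(epsilon) - 1)
    )


for k in (1, 10, 100, 1000):
    basic = basic_composition(0.1, k)
    advanced = advanced_composition(0.1, k, delta_prime=1e-5)
    # The accountant may always report the smaller of the two valid bounds
    print(f"k={k}: basic={basic:.2f}, advanced={advanced:.2f}, best={min(basic, advanced):.2f}")
```

For a handful of queries the linear sum is the tighter bound; the square-root term only wins once many small-ε queries are composed. A simple linear tracker is therefore a safe but conservative choice.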
Using OpenDP
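OpenDP is a Python library with a Rust core that provides differential privacy primitives. Many of them are backed by written proofs; components that have not finished vetting are only available after opting in to the "contrib" feature: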
# pip install opendp
import opendp.prelude as dp

dp.enable_features("contrib", "honest-but-curious")

# Build a measurement: clamp → sum → add Laplace noise
# The bounds must cover the plausible range of the data (salaries here);
# bounds of (0.0, 100.0) would clamp every salary to 100.
space = dp.space_of(list[float])
bounded_sum = (
    space >>
    dp.t.then_clamp(bounds=(0.0, 100_000.0)) >>
    dp.t.then_sum() >>
    dp.m.then_laplace(scale=100_000.0)  # scale ≈ sensitivity / ε, so ε is roughly 1
)
# Check the privacy guarantee
print(bounded_sum.map(1))  # ε cost when two datasets differ by one record
# Execute
salaries = [45000.0, 52000.0, 61000.0, 38000.0, 73000.0]
private_sum = bounded_sum(salaries)
OpenDP’s strength is composability: you chain transformations and measurements, and the framework computes the overall privacy cost from how each step can amplify the distance between neighboring datasets. It also verifies that your pipeline is valid (e.g., it rejects a sum over data that has not been bounded first).
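The idea of computing privacy cost by chaining steps can be sketched in plain Python. This is not OpenDP's API: the `Step` class, `chain_cost` function, and the bounds and scale are hypothetical values chosen for illustration. Each step carries a map from "how far apart two inputs can be" to "how far apart the outputs can be", and the final Laplace step turns that distance into ε.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One pipeline stage and its map from input distance to output distance."""
    name: str
    distance_map: Callable[[float], float]


def chain_cost(steps: list[Step], d_in: float) -> float:
    """Push an input distance through every stage, in order."""
    d = d_in
    for step in steps:
        d = step.distance_map(d)
    return d


# Illustrative values, not taken from any real pipeline
lower, upper, scale = 0.0, 100.0, 50.0

pipeline = [
    # Clamping row by row cannot increase the number of differing records
    Step("clamp", lambda d: d),
    # Each added or removed record moves the sum by at most max(|lower|, |upper|)
    Step("sum", lambda d: d * max(abs(lower), abs(upper))),
    # Laplace noise converts an L1 distance into an epsilon of distance / scale
    Step("laplace", lambda d: d / scale),
]

epsilon = chain_cost(pipeline, d_in=1)
print(f"epsilon for datasets differing in one record: {epsilon}")
```

Dropping the clamp step in a real framework would leave the sum with unbounded sensitivity, which is the kind of invalid pipeline the validity check is meant to catch.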
PipelineDP for large-scale aggregation
PipelineDP integrates with Apache Beam and Spark for differentially private analytics at scale:
# pip install pipeline-dp
import pipeline_dp

# Define the budget
budget_accountant = pipeline_dp.NaiveBudgetAccountant(
    total_epsilon=3.0, total_delta=1e-5
)
# Define the aggregation
params = pipeline_dp.AggregateParams(
    noise_kind=pipeline_dp.NoiseKind.LAPLACE,
    metrics=[
        pipeline_dp.Metrics.COUNT,
        pipeline_dp.Metrics.SUM,
        pipeline_dp.Metrics.MEAN,
    ],
    max_partitions_contributed=3,
    max_contributions_per_partition=1,
    min_value=0,
    max_value=1000,
    budget_weight=1.0,
)
# Create the engine for local computation
backend = pipeline_dp.LocalBackend()
dp_engine = pipeline_dp.DPEngine(budget_accountant, backend)
# Input: list of (privacy_unit, partition_key, value)
data = [
    ("user1", "category_a", 150),
    ("user1", "category_b", 200),
    ("user2", "category_a", 300),
    # ...
]
data_extractors = pipeline_dp.DataExtractors(
    privacy_id_extractor=lambda x: x[0],
    partition_extractor=lambda x: x[1],
    value_extractor=lambda x: x[2],
)
result = dp_engine.aggregate(data, params, data_extractors)
budget_accountant.compute_budgets()
# Result contains private count, sum, mean per partition
for partition, metrics in result:
    print(f"{partition}: {metrics}")
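PipelineDP handles contribution bounding automatically: if a user contributes to more partitions than max_partitions_contributed, some of those contributions are randomly dropped, and max_contributions_per_partition caps how many rows a user can add within a single partition. This bounds sensitivity without manual intervention. Because no public partitions are supplied here, PipelineDP also selects partitions privately, so partitions with only a few contributing users may be missing from the result.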
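The cross-partition bounding described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not PipelineDP's implementation: `bound_partitions` is a hypothetical helper, and the rows are made-up `(privacy_unit, partition_key, value)` tuples.

```python
import random
from collections import defaultdict


def bound_partitions(rows, max_partitions_contributed, seed=None):
    """Keep rows from at most `max_partitions_contributed` partitions per user.

    The partitions that survive are chosen uniformly at random for each user,
    so no user can influence more than that many partition-level results.
    """
    rng = random.Random(seed)
    partitions_by_user = defaultdict(set)
    for user, partition, _ in rows:
        partitions_by_user[user].add(partition)

    kept = {}
    for user, partitions in partitions_by_user.items():
        k = min(len(partitions), max_partitions_contributed)
        # Sort first so a fixed seed gives a reproducible sample
        kept[user] = set(rng.sample(sorted(partitions), k))

    return [row for row in rows if row[1] in kept[row[0]]]


rows = [
    ("user1", "a", 150), ("user1", "b", 200),
    ("user1", "c", 50), ("user1", "d", 75),
    ("user2", "a", 300),
]
bounded = bound_partitions(rows, max_partitions_contributed=3, seed=0)
print(bounded)  # user1 keeps three of four partitions, user2 is untouched
```

Random selection matters here: always keeping a user's first or largest partitions would make the choice depend on the data in a way that is harder to reason about.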
Differentially private machine learning
Training ML models on sensitive data risks memorizing individual records. DP-SGD (Differentially Private Stochastic Gradient Descent) adds noise to gradients during training:
# Using Opacus with PyTorch
# pip install opacus
import torch
from torch import nn, optim
from opacus import PrivacyEngine

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = optim.SGD(model.parameters(), lr=0.01)
data_loader = ...  # your DataLoader
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    target_epsilon=3.0,
    target_delta=1e-5,
    epochs=10,
    max_grad_norm=1.0,  # clip per-sample gradients
)
# Training loop: Opacus handles noise injection
for epoch in range(10):
    for batch in data_loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(batch[0]), batch[1])
        loss.backward()
        optimizer.step()
print(f"Final ε: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
Opacus clips per-sample gradients to bound sensitivity, then adds calibrated Gaussian noise. The privacy accountant tracks total epsilon spent across all training steps. Depending on the Opacus version and the accountant argument passed to PrivacyEngine, this is Rényi DP (RDP) accounting or the PRV accountant; both give tighter bounds than basic composition.
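The clip-then-noise step can be written out in NumPy to show what happens to one batch of gradients. This is a minimal sketch of the DP-SGD gradient computation under stated assumptions, not Opacus internals: `dp_sgd_gradient` and `noise_multiplier` are names chosen for this example, and the gradients are random placeholders.

```python
import numpy as np


def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Return a noisy average gradient from per-sample gradients.

    per_sample_grads has shape (batch_size, num_params).
    """
    # 1. Clip each sample's gradient to L2 norm <= max_grad_norm
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clip_factors = np.minimum(1.0, max_grad_norm / (norms + 1e-12))
    clipped = per_sample_grads * clip_factors

    # 2. Sum; one sample now changes the sum by at most max_grad_norm
    summed = clipped.sum(axis=0)

    # 3. Add Gaussian noise scaled to that sensitivity
    sigma = noise_multiplier * max_grad_norm
    noise = rng.normal(0.0, sigma, size=summed.shape)

    # 4. Average over the batch
    return (summed + noise) / len(per_sample_grads)


rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10)) * 5.0  # placeholder per-sample gradients
noisy_grad = dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng)
print(noisy_grad.shape)
```

Clipping before summing is what makes the noise scale independent of the data: without it, a single outlier gradient could dominate the update and no finite noise level would hide it.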
Tradeoffs and practical considerations
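Dataset size matters enormously. The noise added for a given ε depends on sensitivity, not on how many records you have, so relative error shrinks as the dataset grows. On 10 million records, a count or mean released at ε=1 is accurate to a tiny fraction of a percent. On 100 records, the same noise is a visible share of the answer, and once the budget is split across many queries or partitions the results can become nearly useless. Differential privacy works best on large-scale data.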
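A quick calculation shows how dataset size and ε interact for a counting query. The sketch below assumes sensitivity 1 and Laplace noise, whose standard deviation is sqrt(2)/ε; `relative_error_of_count` is a hypothetical helper name.

```python
import math


def relative_error_of_count(true_count: int, epsilon: float) -> float:
    """Standard deviation of the Laplace noise as a fraction of the count."""
    if true_count <= 0 or epsilon <= 0:
        raise ValueError("true_count and epsilon must be positive")
    noise_std = math.sqrt(2) / epsilon  # sensitivity 1
    return noise_std / true_count


for n in (100, 10_000, 10_000_000):
    for eps in (0.01, 1.0):
        err = relative_error_of_count(n, eps)
        print(f"n={n:>10,} epsilon={eps}: relative error ≈ {err:.2%}")
```

The small-dataset case degrades fastest when ε is small, which is what happens in practice when a fixed total budget is divided among many queries.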
Choosing epsilon is a policy decision, not a technical one. There’s no formula that outputs the “right” epsilon. It depends on the harm of disclosure, the value of the analysis, and organizational risk appetite. Document the choice and the reasoning.
Post-processing is free. Any computation that uses only differentially private output, without touching the raw data again, doesn’t consume additional privacy budget. You can filter, transform, visualize, or model private outputs without degrading the guarantee.
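A common use of post-processing is cleaning up a noisy count so it looks like a count. The sketch below is illustrative: `postprocess_count` is a hypothetical helper, and the true count and ε are made-up values. The key property is that the function sees only the noisy value.

```python
import numpy as np


def postprocess_count(noisy_count: float) -> int:
    """Round to an integer and clamp at zero.

    Uses only the already-released noisy value, so it costs no extra budget.
    """
    return max(0, int(round(float(noisy_count))))


rng = np.random.default_rng(7)
true_count = 3       # illustrative small count
epsilon = 0.5
noisy = true_count + rng.laplace(0.0, 1.0 / epsilon)  # the only step that spends budget
cleaned = postprocess_count(noisy)
print(f"noisy={noisy:.3f}, cleaned={cleaned}")
```

Clamping at zero does introduce a small upward bias for counts near zero, which is an accuracy trade-off rather than a privacy one.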
Sensitivity bounding (clamping) distorts data. When you clamp values to bound sensitivity, you lose information about outliers. If most salaries are $50K-$100K but one is $10M, clamping to [0, 200K] changes the true sum significantly. Choose bounds based on domain knowledge.
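The salary scenario above is easy to check with arithmetic. The figures below are invented to match the description (five salaries between $50K and $100K plus one $10M outlier), and `clamp` is a small helper defined for this sketch.

```python
def clamp(value: float, lower: float, upper: float) -> float:
    """Restrict value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


# Hypothetical data: five typical salaries and one extreme outlier
salaries = [50_000.0, 62_000.0, 75_000.0, 88_000.0, 100_000.0, 10_000_000.0]
lower, upper = 0.0, 200_000.0

true_sum = sum(salaries)
clamped_sum = sum(clamp(s, lower, upper) for s in salaries)
bias = true_sum - clamped_sum

# Laplace noise scale at epsilon = 1 with and without the tighter bound
epsilon = 1.0
scale_clamped = upper / epsilon
scale_unclamped = max(salaries) / epsilon

print(f"true sum:    {true_sum:,.0f}")
print(f"clamped sum: {clamped_sum:,.0f}")
print(f"bias from clamping: {bias:,.0f}")
print(f"noise scale: {scale_clamped:,.0f} (clamped) vs {scale_unclamped:,.0f} (bound at the outlier)")
```

The trade-off runs in both directions: tight bounds bias the sum, while bounds wide enough to include the outlier multiply the noise by 50 in this example. Setting the bounds by looking at the private data would itself leak information, which is why domain knowledge is the safer source.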
The one thing to remember: Production differential privacy requires three calibrated decisions — choosing epsilon (privacy strength), bounding sensitivity (clamping data contributions), and tracking budget (composition) — with Python libraries like OpenDP and PipelineDP handling the noise math correctly.