Differential Privacy in Python — Deep Dive

Implement differential privacy in Python with OpenDP and PipelineDP — Laplace and Gaussian mechanisms, budget accounting, private aggregations, and privacy-preserving machine learning

Implementing basic mechanisms from scratch

Before reaching for a library, it helps to understand the raw mechanics so you can reason about privacy guarantees.

import numpy as np
from typing import Callable

class LaplaceMechanism:
    """Add Laplace noise calibrated to sensitivity and epsilon."""
    
    def __init__(self, epsilon: float, sensitivity: float = 1.0):
        self.epsilon = epsilon
        self.sensitivity = sensitivity
        self.scale = sensitivity / epsilon
    
    def release(self, true_value: float) -> float:
        noise = np.random.laplace(0, self.scale)
        return true_value + noise
    
    def release_count(self, true_count: int) -> float:
        """Counting query: sensitivity is 1 when each person contributes at most one row."""
        noise = np.random.laplace(0, 1.0 / self.epsilon)
        return true_count + noise

class GaussianMechanism:
    """Add Gaussian noise for (epsilon, delta)-DP."""
    
    def __init__(self, epsilon: float, delta: float, sensitivity: float = 1.0):
        self.epsilon = epsilon
        self.delta = delta
        self.sensitivity = sensitivity
        # Calibrate sigma with the classical Gaussian mechanism bound,
        # which is only valid for 0 < epsilon < 1
        self.sigma = (
            sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        )
    
    def release(self, true_value: float) -> float:
        noise = np.random.normal(0, self.sigma)
        return true_value + noise


# Example: private count
mechanism = LaplaceMechanism(epsilon=1.0)
true_count = 42857  # actual number of users who clicked
private_count = mechanism.release_count(true_count)
# private_count ≈ 42857 ± ~1 (noise is tiny relative to large counts)
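
To see what the scale = sensitivity / epsilon calibration means in practice, the sketch below draws many Laplace samples and compares their mean absolute value with the scale, which is the expected absolute error of a single Laplace release. The helper name mean_abs_laplace_error, the sample size, and the fixed seed are illustrative assumptions rather than part of the classes above.

```python
import numpy as np


def mean_abs_laplace_error(epsilon: float, sensitivity: float = 1.0,
                           n_samples: int = 200_000, seed: int = 0) -> float:
    """Empirical mean |noise| for Laplace noise with scale = sensitivity / epsilon."""
    rng = np.random.default_rng(seed)  # seeded only so the sketch is repeatable
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=n_samples)
    return float(np.mean(np.abs(noise)))


for eps in (0.1, 1.0, 10.0):
    expected = 1.0 / eps  # E|Laplace(0, b)| = b, with sensitivity 1
    observed = mean_abs_laplace_error(eps)
    print(f"epsilon={eps}: expected |noise| = {expected:.2f}, observed = {observed:.2f}")
```

Dividing epsilon by ten multiplies the typical error by ten, which is the privacy/accuracy tradeoff in its simplest form.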

Privacy budget accounting

A production system needs a budget tracker that prevents overspending:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PrivacyBudget:
    total_epsilon: float
    total_delta: float = 1e-5
    spent_epsilon: float = 0.0
    spent_delta: float = 0.0
    queries: list[dict] = field(default_factory=list)
    
    @property
    def remaining_epsilon(self) -> float:
        return self.total_epsilon - self.spent_epsilon
    
    def can_afford(self, epsilon: float, delta: float = 0.0) -> bool:
        return (
            self.spent_epsilon + epsilon <= self.total_epsilon
            and self.spent_delta + delta <= self.total_delta
        )
    
    def spend(self, epsilon: float, delta: float = 0.0, description: str = "") -> None:
        if not self.can_afford(epsilon, delta):
            raise PrivacyBudgetExhausted(
                f"Requested ε={epsilon}, δ={delta} but only "
                f"ε={self.remaining_epsilon:.4f} remaining"
            )
        self.spent_epsilon += epsilon
        self.spent_delta += delta
        self.queries.append({
            "epsilon": epsilon,
            "delta": delta,
            "description": description,
            "timestamp": datetime.utcnow().isoformat(),
        })

class PrivacyBudgetExhausted(Exception):
    pass

class PrivateQueryEngine:
    """Execute queries with automatic budget tracking."""
    
    def __init__(self, budget: PrivacyBudget):
        self.budget = budget
    
    def private_count(self, data, predicate, epsilon: float) -> float:
        true_count = sum(1 for item in data if predicate(item))
        self.budget.spend(epsilon, description="count query")
        mechanism = LaplaceMechanism(epsilon=epsilon, sensitivity=1.0)
        return mechanism.release(true_count)
    
    def private_sum(
        self, data, value_fn, epsilon: float, 
        lower_bound: float, upper_bound: float,
    ) -> float:
        # Clamp contributions to bound sensitivity
        clamped = [
            max(lower_bound, min(upper_bound, value_fn(item)))
            for item in data
        ]
        true_sum = sum(clamped)
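        # Under add/remove-one-record neighbors (the same notion that gives a
        # count sensitivity 1), one clamped value moves the sum by at most
        # max(|lower_bound|, |upper_bound|).
        sensitivity = max(abs(lower_bound), abs(upper_bound))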
        self.budget.spend(epsilon, description=f"sum query (sensitivity={sensitivity})")
        mechanism = LaplaceMechanism(epsilon=epsilon, sensitivity=sensitivity)
        return mechanism.release(true_sum)
    
    def private_mean(
        self, data, value_fn, epsilon: float,
        lower_bound: float, upper_bound: float,
    ) -> float:
        """Split budget between a noisy count and a noisy sum, then divide."""
        # Check the whole cost up front so a failure cannot leave the budget
        # half spent or release only one of the two statistics.
        if not self.budget.can_afford(epsilon):
            raise PrivacyBudgetExhausted(
                f"Mean query needs ε={epsilon} but only "
                f"ε={self.budget.remaining_epsilon:.4f} remaining"
            )
        eps_count = epsilon / 2
        eps_sum = epsilon / 2

        # Each helper records its own spend, so no manual bookkeeping is needed.
        noisy_count = self.private_count(data, lambda _: True, eps_count)
        noisy_sum = self.private_sum(
            data, value_fn, eps_sum, lower_bound, upper_bound
        )
        # Dividing two noisy values is post-processing; guard against a tiny
        # or negative noisy count.
        noisy_n = max(1.0, noisy_count)
        return noisy_sum / noisy_n
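
The tracker above relies on basic sequential composition: the epsilons of successive releases add up, and so do the deltas. A minimal, self-contained sketch of that rule follows. The function names compose_basic and fits_budget are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Tuple


def compose_basic(costs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Basic sequential composition: sum the epsilons and sum the deltas."""
    total_eps = 0.0
    total_delta = 0.0
    for eps, delta in costs:
        if eps < 0 or delta < 0:
            raise ValueError("privacy costs must be non-negative")
        total_eps += eps
        total_delta += delta
    return total_eps, total_delta


def fits_budget(costs, budget_eps: float, budget_delta: float) -> bool:
    """True if the composed cost stays within the (epsilon, delta) budget."""
    total_eps, total_delta = compose_basic(costs)
    return total_eps <= budget_eps and total_delta <= budget_delta


planned = [(0.5, 0.0), (0.25, 0.0), (0.25, 0.0)]  # (epsilon, delta) per query
print(compose_basic(planned))                          # (1.0, 0.0)
print(fits_budget(planned, 1.0, 1e-5))                 # True
print(fits_budget(planned + [(0.5, 0.0)], 1.0, 1e-5))  # False
```

Planning the full list of queries before running any of them avoids discovering halfway through an analysis that the budget is gone.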

Using OpenDP

OpenDP is a Rust-backed Python library providing vetted differential privacy primitives with formal proofs:

# pip install opendp
import opendp.prelude as dp
dp.enable_features("contrib", "honest-but-curious")

# Build a measurement: clamp → sum → add Laplace noise.
# The bounds should cover the plausible range of the data (salaries here).
space = dp.space_of(list[float])
bounded_sum = (
    space >>
    dp.t.then_clamp(bounds=(0.0, 100_000.0)) >>
    dp.t.then_sum() >>
    dp.m.then_laplace(scale=100_000.0)
)

# Check the privacy guarantee
print(bounded_sum.map(1))  # ε when neighboring datasets differ by one record (about 1 here)

# Execute
salaries = [45000.0, 52000.0, 61000.0, 38000.0, 73000.0]
private_sum = bounded_sum(salaries)

OpenDP’s strength is composability — you chain transformations and measurements, and the framework automatically computes the overall privacy cost. It also verifies that your pipeline is valid (e.g., that you’ve bounded the data before summing).
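
The idea of chaining stages while the privacy cost is derived automatically can be illustrated with a toy model in plain Python. This is not OpenDP's API or implementation: the Stage class and the clamp, bounded_sum, and laplace helpers are hypothetical names, and the sketch only tracks how a distance of "d differing records" propagates through each stage into an epsilon.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Stage:
    """A function plus a map from input distance to output distance (or epsilon)."""
    function: Callable
    distance_map: Callable[[float], float]

    def __rshift__(self, other: "Stage") -> "Stage":
        # Chaining composes the functions and composes the distance maps.
        return Stage(
            function=lambda x: other.function(self.function(x)),
            distance_map=lambda d: other.distance_map(self.distance_map(d)),
        )


def clamp(lower: float, upper: float) -> Stage:
    # Row-wise clamp: datasets differing in d rows still differ in at most d rows.
    return Stage(lambda xs: [min(max(x, lower), upper) for x in xs], lambda d: d)


def bounded_sum(lower: float, upper: float) -> Stage:
    # Adding or removing one clamped row moves the sum by at most max(|lower|, |upper|).
    return Stage(sum, lambda d: d * max(abs(lower), abs(upper)))


def laplace(scale: float, rng: np.random.Generator) -> Stage:
    # Laplace noise turns an L1 sensitivity s into an epsilon of s / scale.
    return Stage(lambda x: x + rng.laplace(0.0, scale), lambda s: s / scale)


rng = np.random.default_rng(0)
pipeline = clamp(0.0, 100.0) >> bounded_sum(0.0, 100.0) >> laplace(scale=50.0, rng=rng)
print(pipeline.distance_map(1))                # 2.0: epsilon for one differing record
print(pipeline.function([20.0, 250.0, 70.0]))  # noisy version of the clamped sum, 190
```

A real framework additionally checks that each stage's output domain matches the next stage's input domain, which is how an unbounded sum gets rejected before any data is touched.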

PipelineDP for large-scale aggregation

PipelineDP integrates with Apache Beam and Spark for differentially private analytics at scale:

# pip install pipeline-dp
import pipeline_dp

# Define the budget
budget_accountant = pipeline_dp.NaiveBudgetAccountant(
    total_epsilon=3.0, total_delta=1e-5
)

# Define the aggregation
params = pipeline_dp.AggregateParams(
    noise_kind=pipeline_dp.NoiseKind.LAPLACE,
    metrics=[
        pipeline_dp.Metrics.COUNT,
        pipeline_dp.Metrics.SUM,
        pipeline_dp.Metrics.MEAN,
    ],
    max_partitions_contributed=3,
    max_contributions_per_partition=1,
    min_value=0,
    max_value=1000,
    budget_weight=1.0,
)

# Create the engine for local computation
backend = pipeline_dp.LocalBackend()
dp_engine = pipeline_dp.DPEngine(budget_accountant, backend)

# Input: list of (privacy_unit, partition_key, value)
data = [
    ("user1", "category_a", 150),
    ("user1", "category_b", 200),
    ("user2", "category_a", 300),
    # ...
]

data_extractors = pipeline_dp.DataExtractors(
    privacy_id_extractor=lambda x: x[0],
    partition_extractor=lambda x: x[1],
    value_extractor=lambda x: x[2],
)

result = dp_engine.aggregate(data, params, data_extractors)
budget_accountant.compute_budgets()

# Result contains private count, sum, mean per partition
for partition, metrics in result:
    print(f"{partition}: {metrics}")

PipelineDP handles contribution bounding automatically — if a user contributes to more partitions than max_partitions_contributed, some contributions are randomly dropped. This bounds sensitivity without manual intervention.
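
Contribution bounding can be sketched in plain Python as follows. This is an illustration of the idea only, not PipelineDP's implementation: the function bound_contributions and its parameter names are hypothetical, and uniform sampling without replacement is an assumption about how the dropped contributions are chosen.

```python
import random
from collections import defaultdict


def bound_contributions(rows, max_partitions: int, max_per_partition: int, seed=None):
    """Keep at most `max_partitions` partitions per user and at most
    `max_per_partition` rows per (user, partition) pair."""
    rng = random.Random(seed)
    by_user = defaultdict(lambda: defaultdict(list))
    for user, partition, value in rows:
        by_user[user][partition].append(value)

    bounded = []
    for user, partitions in by_user.items():
        keys = list(partitions)
        # Randomly choose which partitions this user is allowed to affect.
        kept = rng.sample(keys, min(max_partitions, len(keys)))
        for partition in kept:
            values = partitions[partition]
            # Randomly choose which rows survive inside each kept partition.
            for value in rng.sample(values, min(max_per_partition, len(values))):
                bounded.append((user, partition, value))
    return bounded


rows = [
    ("user1", "a", 150), ("user1", "b", 200), ("user1", "c", 50), ("user1", "a", 75),
    ("user2", "a", 300),
]
bounded = bound_contributions(rows, max_partitions=2, max_per_partition=1, seed=0)
print(bounded)
```

After bounding, a single user can change at most max_partitions partitions by at most max_per_partition rows each, which is what makes the per-partition sensitivity finite.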

Differentially private machine learning

Training ML models on sensitive data risks memorizing individual records. DP-SGD (Differentially Private Stochastic Gradient Descent) adds noise to gradients during training:

# Using Opacus with PyTorch
# pip install opacus
import torch
from torch import nn, optim
from opacus import PrivacyEngine

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

optimizer = optim.SGD(model.parameters(), lr=0.01)
data_loader = ...  # your DataLoader

privacy_engine = PrivacyEngine(accountant="rdp")  # request the Rényi DP accountant explicitly
model, optimizer, data_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    target_epsilon=3.0,
    target_delta=1e-5,
    epochs=10,
    max_grad_norm=1.0,  # clip per-sample gradients
)

# Training loop — Opacus handles noise injection
for epoch in range(10):
    for batch in data_loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(batch[0]), batch[1])
        loss.backward()
        optimizer.step()

print(f"Final ε: {privacy_engine.get_epsilon(delta=1e-5):.2f}")

Opacus clips per-sample gradients to bound sensitivity, then adds calibrated Gaussian noise. The privacy accountant tracks total epsilon spent across all training steps using Rényi DP accounting, which gives tighter bounds than basic composition.
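
The clip-then-noise step can be written out in NumPy to show what happens to the gradients in each batch. This is a simplified sketch of a single DP-SGD aggregation step, not Opacus's implementation. The function name dp_sgd_step and the noise_multiplier parameter name are assumptions made for illustration, and the privacy accounting is left out entirely.

```python
import numpy as np


def dp_sgd_step(per_sample_grads: np.ndarray, max_grad_norm: float,
                noise_multiplier: float, rng: np.random.Generator) -> np.ndarray:
    """Return a noisy average gradient from per-sample gradients of shape (batch, dim)."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale each row down so that its L2 norm is at most max_grad_norm.
    factors = np.minimum(1.0, max_grad_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * factors
    # The noise std is proportional to the clipping norm, which bounds how much
    # one example can change the summed gradient.
    noise = rng.normal(0.0, noise_multiplier * max_grad_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_sample_grads)


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]])  # L2 norms: 5.0, 0.5, 10.0
print(dp_sgd_step(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Only the first and third rows are rescaled here, because the second already has a norm below the clipping threshold.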

Tradeoffs and practical considerations

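Dataset size matters enormously. The noise scale depends only on sensitivity and ε, not on the number of records, so the same absolute noise that is negligible in an aggregate over 10 million records can swamp a statistic computed from 100 records, especially at a small ε or when the data is split into many small groups. Differential privacy works best on large-scale data.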
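
To make the effect of dataset size concrete, the sketch below computes the expected relative error of a Laplace-noised count. It uses the fact that the expected absolute value of Laplace noise equals its scale. The function name relative_count_error and the chosen values of n and ε are illustrative assumptions.

```python
def relative_count_error(n: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Expected absolute Laplace error divided by the true count n."""
    expected_abs_noise = sensitivity / epsilon  # E|Laplace(0, b)| = b
    return expected_abs_noise / n


for n in (100, 10_000, 10_000_000):
    for eps in (0.01, 1.0):
        err = relative_count_error(n, eps)
        print(f"n={n:>10,} eps={eps}: relative error ~ {err:.2%}")
```

The relative error scales as 1 / (ε · n), so a hundredfold drop in ε has to be paid for with a hundredfold increase in data to keep the same accuracy.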

Choosing epsilon is a policy decision, not a technical one. There’s no formula that outputs the “right” epsilon. It depends on the harm of disclosure, the value of the analysis, and organizational risk appetite. Document the choice and the reasoning.

Post-processing is free. Any computation that takes only the differentially private output as input, and does not touch the raw data again, consumes no additional privacy budget. You can filter, transform, visualize, or model private outputs without degrading the guarantee.
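
A small illustration of post-processing: a noisy count is rounded and clipped at zero before being reported. The helper postprocess_count, the seed, and the example values are hypothetical. The key point is that the cleanup step reads only the noisy value.

```python
import numpy as np


def postprocess_count(noisy_count: float) -> int:
    """Round to an integer and clip at zero; uses only the noisy value."""
    return max(0, int(round(float(noisy_count))))


rng = np.random.default_rng(0)
epsilon = 0.5
true_count = 3  # a small count, so the noisy value may come out negative

# The only step that reads the sensitive data, and the only one that costs budget.
noisy_count = true_count + rng.laplace(0.0, 1.0 / epsilon)

# Everything from here on is post-processing and costs no additional epsilon.
reported = postprocess_count(noisy_count)
print(noisy_count, reported)
```

If the cleanup step consulted true_count in any way, for example to decide whether to clip, it would no longer be post-processing and the guarantee would not carry over.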

Sensitivity bounding (clamping) distorts data. When you clamp values to bound sensitivity, you lose information about outliers. If most salaries are $50K-$100K but one is $10M, clamping to [0, 200K] changes the true sum significantly. Choose bounds based on domain knowledge.
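
The bias that clamping introduces is easy to quantify on a toy example. The salary figures below are made up for illustration and clamped_sum is a hypothetical helper. The loop shows how raising the upper bound shrinks the bias but raises the sensitivity, and with it the noise.

```python
def clamped_sum(values, lower: float, upper: float) -> float:
    """Sum of the values after clamping each one into [lower, upper]."""
    return sum(min(max(v, lower), upper) for v in values)


salaries = [50_000, 60_000, 75_000, 100_000, 10_000_000]  # made-up data, one outlier
true_sum = sum(salaries)

for upper in (200_000, 1_000_000, 10_000_000):
    s = clamped_sum(salaries, 0, upper)
    print(f"upper={upper:>10,}  clamped sum={s:>10,}  "
          f"bias={true_sum - s:>10,}  sensitivity={upper:,}")
```

With the bound at 200,000 the clamped sum is 485,000 against a true sum of 10,285,000, while a bound of 10,000,000 removes the bias at the price of noise scaled to a sensitivity fifty times larger.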

The one thing to remember: Production differential privacy requires three calibrated decisions — choosing epsilon (privacy strength), bounding sensitivity (clamping data contributions), and tracking budget (composition) — with Python libraries like OpenDP and PipelineDP handling the noise math correctly.
