Synthetic Data — Deep Dive

Model Collapse: Formal Analysis

Shumailov et al. (2023) formalized model collapse as a property of iterative training on generated data.

Setup: Let $p_0$ be the true data distribution. Train a model $p_{\theta_1}$ on samples from $p_0$. Generate synthetic data from $p_{\theta_1}$, train $p_{\theta_2}$ on this. Repeat for $n$ generations.

Early model collapse: The model begins to lose information about the tails of $p_0$, so rare but important patterns are the first to disappear. A main driver is finite sampling: the training set for each generation is a finite sample, so low-probability events are under-represented and the learned distribution shifts by the estimation error.

Late model collapse: After many generations, $p_{\theta_n}$ converges to a distribution that bears little resemblance to $p_0$, often with very small variance, and the model generates nearly identical outputs.

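Mathematical characterization: For a Gaussian model fit by maximum likelihood to $M$ samples per generation, after $n$ generations: $$\mathbb{E}[\mu_n] = \mu_0, \quad \mathbb{E}[\sigma_n^2] = \sigma_0^2 \cdot \left(1 - \frac{1}{M}\right)^n$$

For fixed $M$, as $n \rightarrow \infty$, $\mathbb{E}[\sigma_n^2] \rightarrow 0$: the variance collapses to zero. The mean is preserved only in expectation; in any single run it drifts like a random walk, and the model ends up concentrated near a single point.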
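The Gaussian case can be checked numerically. The following is a minimal sketch, not the setup of Shumailov et al.: it assumes each generation refits the mean and variance by maximum likelihood on a fixed number of fresh samples from the previous model. The function name and the parameters `samples_per_gen` and `seed` are illustrative choices.

```python
import random
import statistics

def simulate_collapse(generations, samples_per_gen, mu0=0.0, sigma0=1.0, seed=0):
    """Refit a Gaussian on its own samples; return the fitted variance per generation."""
    rng = random.Random(seed)
    mu, sigma = mu0, sigma0
    variances = [sigma0 ** 2]
    for _ in range(generations):
        # Each generation only sees a finite sample from the previous model.
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        mu = statistics.fmean(data)
        var = statistics.pvariance(data, mu)  # maximum-likelihood (biased) estimate
        sigma = var ** 0.5
        variances.append(var)
    return variances

variances = simulate_collapse(generations=500, samples_per_gen=20)
print(f"generation 0 variance:   {variances[0]:.4f}")
print(f"generation 500 variance: {variances[-1]:.3e}")
```

With a small sample per generation the fitted variance shrinks by many orders of magnitude; increasing `samples_per_gen` slows the decay but does not stop it.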

Mitigation analysis (cf. Gerstgrasser et al., 2024, “Is Model Collapse Inevitable?”): In the Gaussian setting, if a fraction $\alpha$ of real data is maintained at each training step, the variance stays bounded below instead of decaying to zero: $$\sigma_n^2 \gtrsim \alpha \sigma_0^2$$

Any positive fraction of real data prevents full collapse. The key policy implication: training on even 1–5% real data among synthetic data maintains diversity.
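To see the effect of keeping real data in the mix, here is a small simulation under simplifying assumptions: a one-dimensional Gaussian, the true distribution fixed at N(0, 1), and each generation trained on `samples_per_gen` points of which a fraction `real_fraction` are fresh real samples. All names and default values are illustrative, not taken from the cited papers.

```python
import random

def final_variance(real_fraction, generations=2000, samples_per_gen=100, seed=1):
    """Variance of the fitted Gaussian after repeated refits on mixed data."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation-0 model equals the true N(0, 1)
    n_real = round(real_fraction * samples_per_gen)
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(samples_per_gen - n_real)]
        data = real + synthetic
        # Maximum-likelihood refit on the mixed training set.
        mu = sum(data) / len(data)
        sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5
    return sigma ** 2

synthetic_only = final_variance(real_fraction=0.0)
with_real_data = final_variance(real_fraction=0.05)
print(f"0% real data: variance = {synthetic_only:.3e}")
print(f"5% real data: variance = {with_real_data:.3f}")
```

In this toy setting the synthetic-only run loses almost all of its variance, while the run with 5% real data stays within the same order of magnitude as the true variance.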

DP-SGD for Differentially Private Synthesis

Training generative models with DP-SGD (differentially private SGD) provides formal privacy guarantees for the synthetic data output.

Rényi DP (RDP) composition: For $T$ training steps, each satisfying $(\alpha, \epsilon_0(\alpha))$-RDP, where the order $\alpha > 1$ is unrelated to the real-data fraction above, the total cost under RDP is: $$\epsilon_{total}(\alpha) = T \cdot \epsilon_0(\alpha)$$

Converting to $(\epsilon, \delta)$-DP via RDP: $$\epsilon = \epsilon_{total}(\alpha) + \frac{\log(1/\delta)}{\alpha - 1}$$

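Illustrative example: for CTGAN trained on 100K healthcare records with 1000 training steps, noise multiplier $\sigma = 1.1$, batch size 256, and target $\delta = 10^{-5}$:

  • Per step: $(10, 10^{-5})$-RDP, i.e. $\epsilon_0(10) = 10^{-5}$
  • After 1000 steps: $(10, 0.01)$-RDP
  • Converted: $\epsilon = 0.01 + \frac{\log(10^{5})}{9} \approx 1.29$, i.e. $(1.29, 10^{-5})$-DP

This means: adding or removing any individual record changes the probability of any output of the synthesis process by at most a factor of $e^{1.29} \approx 3.6$, except with probability at most $10^{-5}$.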
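The composition and conversion formulas are simple enough to check by hand or in a few lines. The sketch below applies them to a per-step cost of $10^{-5}$ at order 10 over 1000 steps with $\delta = 10^{-5}$; the function name `rdp_to_dp` is illustrative and does not come from any privacy library.

```python
import math

def rdp_to_dp(rdp_epsilon_per_step, steps, order, delta):
    """Compose per-step RDP linearly, then convert to (epsilon, delta)-DP."""
    total_rdp = steps * rdp_epsilon_per_step                   # epsilon_total(alpha)
    return total_rdp + math.log(1.0 / delta) / (order - 1.0)   # conversion term

epsilon = rdp_to_dp(rdp_epsilon_per_step=1e-5, steps=1000, order=10, delta=1e-5)
print(f"epsilon = {epsilon:.3f}")  # 0.01 + ln(1e5) / 9
```

Note that the conversion term $\log(1/\delta)/(\alpha - 1)$ dominates here. In practice, privacy accountants evaluate many orders $\alpha$ and report the smallest resulting $\epsilon$.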

PATE-GAN (Jordon et al., 2019): An alternative to DP-SGD. Train multiple “teacher” discriminators on disjoint partitions of the real data (no data sharing between partitions), then use their noisy aggregate vote to label generated samples for a “student” discriminator, which in turn trains the generator. Privacy comes from the noisy label aggregation rather than from noisy gradient computation. The authors report better sample quality than DP-SGD-based GANs at the same privacy budget.
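The noisy aggregate vote is the core privacy mechanism. As a rough sketch of the idea (not the PATE-GAN reference implementation), the function below adds Laplace noise to the teachers' vote counts and returns the winning label; `noise_scale` and the example votes are assumptions made for illustration.

```python
import random

def noisy_teacher_vote(teacher_votes, noise_scale, rng):
    """Return 1 ("real") or 0 ("fake") from Laplace-noised teacher vote counts."""
    counts = [teacher_votes.count(0), teacher_votes.count(1)]
    noisy = []
    for count in counts:
        # The difference of two exponentials with mean noise_scale
        # is a Laplace(0, noise_scale) sample.
        laplace = rng.expovariate(1.0 / noise_scale) - rng.expovariate(1.0 / noise_scale)
        noisy.append(count + laplace)
    return 1 if noisy[1] > noisy[0] else 0

rng = random.Random(0)
votes = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # 10 hypothetical teachers, 8 vote "real"
label = noisy_teacher_vote(votes, noise_scale=1.0, rng=rng)
print("label passed to the student:", label)
```

A larger `noise_scale` gives stronger privacy per query but makes the labels passed to the student less reliable.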

Constitutional AI and Synthetic Data Pipelines

Anthropic’s Constitutional AI (Bai et al., 2022) uses a novel synthetic data pipeline:

Phase 1 — SL-CAI (Supervised Learning from Constitutional AI):

  1. Generate harmful responses using a “red team” prompt (asking for harmful content)
  2. Have the model critique its response against a constitutional principle
  3. Have the model revise based on the critique
  4. Fine-tune on the revised responses

The critique-revision pairs are synthetic training examples — generated entirely by AI, curated by constitutional principles. Each starting prompt generates multiple critique-revision pairs, scaling the dataset.
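The four steps above can be sketched as a short loop. This is an illustrative outline only: `generate` is a hypothetical callable wrapping whatever language model is used, the prompt templates are invented for this example, and `stub_model` merely stands in for a real model so the code runs.

```python
from typing import Callable, Dict, List

def critique_and_revise(prompt: str, principles: List[str],
                        generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Produce one critique-revision example per principle for a single prompt."""
    response = generate(f"Human: {prompt}\n\nAssistant:")  # step 1: initial response
    examples = []
    for principle in principles:
        critique = generate(  # step 2: critique against a principle
            f"{response}\n\nCritique the response above using this principle: {principle}")
        revision = generate(  # step 3: revise based on the critique
            f"{response}\n\nCritique: {critique}\n\nRewrite the response to address the critique.")
        examples.append({"prompt": prompt, "principle": principle,
                         "critique": critique, "revision": revision})
        response = revision  # later principles refine the latest revision
    return examples  # step 4 would fine-tune on (prompt, revision) pairs

def stub_model(text: str) -> str:
    """Placeholder for a real model call."""
    return f"[model output for a {len(text)}-character input]"

examples = critique_and_revise(
    "<red-team prompt>",
    ["Prefer the least harmful response.", "Prefer the most honest response."],
    stub_model,
)
print(len(examples), "synthetic examples from one prompt")
```

Chaining revisions across principles is one possible design; sampling a single principle per pass would also fit the description.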

Phase 2 — RL-CAI (reinforcement learning from AI feedback, RLAIF):

  1. For each prompt, generate multiple response options
  2. Have the model score pairs against constitutional principles
  3. Use these AI-generated preference labels to train a reward model
  4. Optimize the policy with PPO against the AI-trained reward model
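Steps 1 to 3 amount to building a preference dataset and fitting a reward model to it. The sketch below assumes a hypothetical `judge(prompt, a, b)` callable that returns 0 if response `a` better follows the principles and 1 otherwise, and uses the pairwise logistic (Bradley-Terry style) loss that is common for reward models; neither is claimed to match the paper's exact setup.

```python
import itertools
import math

def build_preference_pairs(prompt, responses, judge):
    """Turn AI judgments over response pairs into (chosen, rejected) records."""
    pairs = []
    for a, b in itertools.combinations(responses, 2):
        preferred = judge(prompt, a, b)  # hypothetical AI rater: 0 means a, 1 means b
        chosen, rejected = (a, b) if preferred == 0 else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))

def toy_judge(prompt, a, b):
    """Stand-in rater that prefers the shorter response (illustration only)."""
    return 0 if len(a) <= len(b) else 1

pairs = build_preference_pairs(
    "<prompt>", ["short", "a medium answer", "a much longer answer"], toy_judge)
print(len(pairs), "preference pairs")
print(f"loss when the reward model agrees:    {pairwise_loss(2.0, 0.0):.3f}")
print(f"loss when the reward model disagrees: {pairwise_loss(0.0, 2.0):.3f}")
```

The loss is small when the reward model scores the chosen response higher, which is what training on the AI-generated labels pushes toward.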

Bai et al. reported that RL-CAI models are comparable to RLHF models trained on human harmlessness labels, while requiring far less human labeling effort. The key: the AI rater’s constitutional principles provide consistent, scalable preference judgments.

LLM-as-Annotator at scale: Google’s FLAN (2022) built instruction-tuning data by recasting existing datasets with instruction templates, and the later instruction-tuning literature systematically used LLMs to generate, paraphrase, and annotate training data. GPT-4’s annotations were used to train models across many papers in 2023, creating a curious dynamic where smaller models are trained on GPT-4’s “distilled knowledge.”

Domain Randomization: Theory and Practice

Domain randomization is the simulation-based approach to the sim-to-real gap: randomize all simulator parameters widely during training, hoping the real world falls within the randomized distribution.

Formal framework: Let $\phi$ be a vector of environment parameters (friction, mass, texture, lighting). During training, sample $\phi \sim p(\phi)$ for each episode. The policy $\pi$ must be robust to all $\phi$ in the support of $p$.

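At deployment, the real environment has specific parameters $\phi_{real}$. If $\phi_{real}$ lies in the support of $p(\phi)$ and the policy performs well across that support, the policy can be expected to transfer. This is a working assumption rather than a guarantee, since the simulator may still omit effects present in the real world.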
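In code, the framework reduces to sampling a fresh parameter vector per episode and checking whether the real parameters fall inside the sampled ranges. The sketch below assumes independent uniform ranges; the parameter names and bounds are illustrative values, not measurements from any real system.

```python
import random
from dataclasses import dataclass

@dataclass
class EnvParams:
    friction: float
    mass: float
    lighting: float

# Assumed randomization ranges (the support of p(phi)), as scale factors.
RANGES = {"friction": (0.5, 1.5), "mass": (0.5, 1.5), "lighting": (0.2, 1.0)}

def sample_params(rng):
    """Draw phi ~ p(phi): one independent uniform sample per parameter."""
    return EnvParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})

def in_support(params):
    """True if every parameter lies inside its randomization range."""
    return all(lo <= getattr(params, name) <= hi for name, (lo, hi) in RANGES.items())

rng = random.Random(0)
episode_params = [sample_params(rng) for _ in range(3)]  # one phi per training episode
phi_real = EnvParams(friction=1.2, mass=0.9, lighting=0.6)  # hypothetical real-world values
print("phi_real covered by training distribution:", in_support(phi_real))
```

A check like `in_support` is only possible when the real parameters can be measured; often they cannot, which is why the ranges are chosen generously.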

OpenAI Dactyl: Randomized friction (0.5–1.5x), mass (0.5–1.5x), joint damping, tendon springiness, gains, joint positions (within bounds), surface normals, texture, color, geometry. Policies trained in the randomized simulator transferred to the physical Shadow Dexterous Hand robot.

The curse of dimensionality in randomization: With many parameters, uniform randomization is inefficient, because most combinations aren’t physically realistic. Automatic domain randomization (ADR, OpenAI 2019) starts with low randomization and gradually increases it as the policy improves:

  1. Start with tightly constrained parameters (narrow $p(\phi)$)
  2. Run current policy, measure performance
  3. If policy succeeds above threshold, expand parameter range slightly
  4. Repeat — the policy is always trained at the edge of its capability

ADR naturally produces a curriculum: harder environments are introduced only once the policy is ready for them.
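The range-expansion loop can be illustrated with a deliberately simplified toy. Here the "policy" is a single scalar `skill`, an episode succeeds when its sampled difficulty is within that skill, and "training" moves the skill toward the current range; all of this is an assumption made so the loop runs without a simulator, and it is not OpenAI's implementation.

```python
import random

def success_rate(skill, half_width, rng, episodes=50):
    """Toy rollouts: an episode succeeds if its sampled difficulty is within skill."""
    wins = sum(rng.uniform(0.0, half_width) <= skill for _ in range(episodes))
    return wins / episodes

def run_adr(iterations=200, threshold=0.8, step=0.02, learning_rate=0.3, seed=0):
    """Return the randomization half-width after each iteration."""
    rng = random.Random(seed)
    half_width = 0.05  # 1. start with a narrow p(phi)
    skill = 0.05       # toy scalar standing in for the policy
    widths = [half_width]
    for _ in range(iterations):
        skill += learning_rate * (half_width - skill)          # "train" on the current range
        if success_rate(skill, half_width, rng) >= threshold:  # 2. measure performance
            half_width += step                                 # 3. expand the range slightly
        widths.append(half_width)                              # 4. repeat
    return widths

widths = run_adr()
print(f"initial half-width: {widths[0]:.2f}, final half-width: {widths[-1]:.2f}")
```

The range only grows when the success rate clears the threshold, so the toy policy is always evaluated near the boundary of what it can handle.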

The Economics of Synthetic Data

Real data collection costs:

  • Labeled image data (ImageNet quality): ~$1–5 per image
  • Medical record annotation: ~$50–200 per record
  • Complex NLP annotation (NER, relation): ~$5–20 per document
  • Code review annotation: ~$20–50 per example

Synthetic data costs:

  • GPT-4 generation: ~$0.01–0.10 per example
  • Open-source LLM generation: ~$0.001–0.01 per example (compute cost)
  • Simulation data: ~$0.0001 per example (compute cost)

The economics strongly favor synthetic data at scale. However, quality-adjusted costs are more complex:

  • Synthetic data often requires more examples to match real data’s information density
  • Quality filtering (remove bad examples) adds 10–30% overhead
  • Human validation sampling (verify representative quality) adds 5–15%
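A back-of-the-envelope calculation makes the quality adjustment concrete. The sketch below assumes the overheads apply additively to generation cost and that some number of synthetic examples is needed to match one real example; the ratio of 5 is a made-up illustration, as are the function and parameter names.

```python
def quality_adjusted_cost(unit_cost, examples_per_real_equivalent,
                          filtering_overhead, validation_overhead):
    """Cost of enough synthetic examples to stand in for one real example."""
    overhead = 1.0 + filtering_overhead + validation_overhead
    return unit_cost * examples_per_real_equivalent * overhead

# Assumed inputs: 0.05 per generated example, 5 synthetic examples per real one,
# 20% filtering overhead, 10% human validation overhead.
synthetic_cost = quality_adjusted_cost(0.05, 5, 0.20, 0.10)
real_cost = 10.0  # mid-range figure for complex NLP annotation from the list above
print(f"synthetic, quality-adjusted: {synthetic_cost:.3f} per real-equivalent example")
print(f"real annotation:             {real_cost:.2f} per example")
```

Under these assumptions synthetic data remains much cheaper, but the gap is far smaller than the raw per-example prices suggest, and it closes quickly if many more synthetic examples are needed per real one.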

The practical rule: use real data for rare, high-stakes tasks where quality is paramount and quantity is limited. Use synthetic data for common patterns, scale augmentation, and privacy-sensitive domains.

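Gunasekar et al. (2023, “Textbooks Are All You Need”) reported that “textbook quality” data, including synthetic textbooks generated with GPT-3.5, let a small model compete with much larger models trained on far more raw web text. As a rough order-of-magnitude estimate, such data can be 10–50x more valuable per token than raw web text: the educational clarity and structure of synthetic data can outweigh its artificial origin.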

One thing to remember: The economic and practical advantages of synthetic data are real and large, but model collapse risk means it is most valuable as a complement to real data. The ideal training set is a carefully curated mix where synthetic data extends coverage of rare scenarios and provides educational clarity, while real data anchors the model to the genuine distribution.
