Synthetic Data — Core Concepts

GAN-based, simulation-based, and LLM-based synthetic data generation, differential privacy for synthetic datasets, the model collapse risk, and Microsoft Phi's synthetic data success.

Three Types of Synthetic Data

1. Simulation-Based Synthetic Data

The oldest and most reliable approach: build a physics or rule-based simulator and generate data from it.

Autonomous driving: Waymo, Tesla, and Cruise generate billions of miles of simulated driving data using game-engine-quality simulation environments. CARLA (open-source) and NVIDIA DRIVE Sim are widely used. Key advantage: you can simulate dangerous scenarios (pedestrians running into traffic, vehicles spinning out) that you’d never deliberately create in the real world.

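Robotics: OpenAI’s Dactyl system (2019) trained a robotic hand to manipulate and solve a Rubik’s Cube with a control policy learned entirely in simulation. The key technique was domain randomization: randomizing lighting, friction, object properties, and other simulator settings (automated in that work as Automatic Domain Randomization) so that the policy transfers to real-world hardware.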
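A minimal sketch of the domain randomization idea, assuming a simulator that accepts a parameter object per episode. The `SimParams` fields and the numeric ranges are hypothetical and chosen only for illustration; they are not taken from the Dactyl work.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    light_intensity: float  # arbitrary units
    friction: float         # coefficient of friction
    object_mass_kg: float


# Hypothetical ranges, chosen only for illustration.
RANGES = {
    "light_intensity": (0.3, 1.5),
    "friction": (0.5, 1.2),
    "object_mass_kg": (0.05, 0.15),
}


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomized simulator configuration, uniformly per parameter."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


# One fresh configuration per training episode, so the policy never
# sees exactly the same "world" twice.
rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(1000)]
```

The design intent is that the real world looks like just one more sample from the randomized range, so a policy that copes with all of them has a better chance of transferring.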

Medical imaging: Simulated MRI/CT scans are produced by combining computational phantoms (mathematical models of human anatomy) with a simulation of the imaging physics, such as radiation transport for CT. They are used to generate rare pathology cases and to test imaging algorithms without requiring patient data.

Limitations: The “sim-to-real gap” — simulation always differs from reality. Policies trained purely on simulation often fail in edge cases where simulator assumptions break down.

2. GAN-Based Tabular Synthesis

GANs (see generative-adversarial-networks) can generate realistic tabular data (spreadsheets) that matches the statistical distribution of real data.

CTGAN (Xu et al., 2019): Conditional Tabular GAN. Standard GANs struggle on tabular data for three reasons:

Mixed data types (continuous + categorical + binary)
Imbalanced categories (some categories rare)
Multi-modal distributions within columns

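CTGAN addresses each of these. It applies mode-specific normalization to continuous columns, so multi-modal values are encoded relative to the mode they belong to. It trains with a conditional vector: the generator is conditioned on a sampled category of a discrete column, and categories are sampled so that minority categories are seen and generated often enough. It also borrows the PacGAN idea of showing the discriminator several samples packed together, which reduces mode collapse.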
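The effect of conditioning on minority categories can be sketched in a few lines. The CTGAN paper describes sampling the conditioned category with probability proportional to the logarithm of its frequency; the `+ 1` inside the logarithm and the function names below are assumptions of this sketch, not the library's API.

```python
import math
import random
from collections import Counter


def log_frequency_probs(values):
    """Sampling probability per category, proportional to log(count + 1)."""
    counts = Counter(values)
    weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


def sample_condition(values, rng):
    """Pick the category the generator will be conditioned on for one sample."""
    probs = log_frequency_probs(values)
    cats = list(probs)
    return rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]


# A heavily imbalanced column: the rare category is 1% of the rows.
column = ["common"] * 990 + ["rare"] * 10
probs = log_frequency_probs(column)
condition = sample_condition(column, random.Random(0))
```

With raw frequencies the rare category would be picked about 1% of the time; with log-frequency weights it is picked far more often, which is what gives the generator enough practice on it.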

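TVAE: Uses a VAE (Variational Autoencoder) instead of a GAN for tabular synthesis. It was introduced in the same paper as CTGAN; training is more stable, and in the authors’ benchmarks it was competitive with CTGAN, doing better on some datasets and worse on others.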

SMOTE (Synthetic Minority Over-sampling Technique, Chawla et al., 2002): A classic, non-GAN technique for class imbalance. It generates synthetic minority-class examples by interpolating between an existing minority example and one of its nearest minority-class neighbors in feature space. Simple, fast, widely used.
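The interpolation step can be sketched in plain Python. This is a simplified illustration of the idea, not a replacement for a library implementation; the function name, the default `k`, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote(minority, n_new=6, k=2, rng=random.Random(0))
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE never produces values outside the range the minority class already covers.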

3. LLM-Based Synthetic Data

The most recent and increasingly dominant approach: use large language models to generate synthetic text, code, or structured data.

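For instruction tuning: “Self-Instruct” (Wang et al., 2022) uses GPT-3 to generate instruction-following examples. Starting from a small pool of 175 human-written seed tasks, the model is repeatedly prompted with a few sampled tasks and asked to write new instructions and matching input-output instances; low-quality and near-duplicate generations are filtered out before they join the pool. The paper reports about 52K instructions generated this way, which were used to bootstrap an instruction-tuned model without massive human annotation.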
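The bootstrapping loop can be sketched as follows. This is an illustration under stated assumptions: `fake_llm` is a hypothetical stand-in for a real model call, and word-level Jaccard similarity is swapped in for the ROUGE-L overlap filter that the paper uses.

```python
import random


def jaccard(a, b):
    """Word-level Jaccard similarity (a stand-in for the paper's ROUGE-L filter)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0


def bootstrap(seed_tasks, generate_fn, rounds, max_sim=0.7, n_examples=2, rng=None):
    """Grow a task pool: sample examples, generate candidates, keep the novel ones."""
    rng = rng or random.Random()
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = rng.sample(pool, min(n_examples, len(pool)))
        for candidate in generate_fn(examples):
            # Reject candidates that are too similar to anything already in the pool.
            if all(jaccard(candidate, task) < max_sim for task in pool):
                pool.append(candidate)
    return pool


def fake_llm(examples):
    """Hypothetical stand-in for an LLM call: one new task and one exact duplicate."""
    return [f"Rewrite this task for a beginner: {examples[0]}", examples[0]]


seeds = ["Translate the sentence into French", "Summarize the paragraph in one line"]
pool = bootstrap(seeds, fake_llm, rounds=5, rng=random.Random(0))
```

The similarity filter is what keeps the pool from filling up with paraphrases of the same few tasks.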

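Textbooks Are All You Need / Microsoft Phi-1 (2023): Phi-1 is a 1.3B-parameter code model trained on about 7 billion tokens: roughly 6 billion tokens of web code filtered for “textbook quality,” plus roughly 1 billion tokens of Python textbooks and exercises generated with GPT-3.5. Despite its size, it reached about 50% pass@1 on HumanEval, competitive with models roughly 10x larger. The insight: high-quality, structured educational content is more valuable than the same quantity of unfiltered internet text.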

Phi-2 and Phi-3 (2023–2024): Extended the recipe to “code exercises,” “synthetic textbooks” across subjects, and “filtered web data” (web pages selected, with LLM assistance, for educational value). Phi-3-mini (3.8B parameters) was reported to match or exceed much larger models such as Llama 2 70B on several benchmarks with roughly 18x fewer parameters (70 / 3.8 ≈ 18), a gain largely attributed to data quality.

Differential Privacy for Synthetic Data

The core privacy risk of synthetic data: the generative model may memorize specific training examples, so the synthetic data could leak individuals’ information.

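Differentially private synthesis (DP-CTGAN, PATE-GAN): Train the generative model with differential privacy guarantees (see federated-learning topic). Because differential privacy is preserved under post-processing, any synthetic data sampled from the trained model carries the same (ε, δ) guarantee: what an adversary can learn from it changes by at most a factor of e^ε (plus a small failure probability δ) depending on whether any one individual’s record was in the training set.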
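For reference, the standard definition of (ε, δ)-differential privacy can be written out as below. The symbols are notation introduced here, not taken from the text above: $M$ is the randomized training procedure, $D$ and $D'$ are datasets that differ in one individual's record, $S$ is any set of possible outputs, and $f$ is any function applied afterwards (such as sampling synthetic rows from the trained model).

```latex
% (epsilon, delta)-differential privacy for a randomized mechanism M
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all } S.

% Post-processing: applying any function f to the output keeps the guarantee
M \text{ is } (\varepsilon,\delta)\text{-DP}
\;\Longrightarrow\;
f \circ M \text{ is } (\varepsilon,\delta)\text{-DP}.
```

The second line is why a synthetic dataset drawn from a privately trained generator needs no further privacy accounting of its own.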

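DP-SGD in GAN training: Clip each example’s gradient to a fixed norm, then add calibrated Gaussian noise to the summed gradients before the update. In a GAN this is typically applied to the discriminator, the only network that sees real records; the generator learns only through the discriminator and so inherits the guarantee. Clipping and noise together bound how much any single example can influence the model.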
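A minimal sketch of one DP-SGD gradient computation, assuming per-example gradients are already available as plain lists of floats. The function and parameter names are illustrative and do not belong to any particular library; a real implementation would also track the cumulative privacy budget.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, v in enumerate(grad):
            summed[i] += v * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
no_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
private = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=random.Random(0))
```

With the noise switched off, the first gradient is scaled down to norm 1.0 while the second is left alone, which shows the clipping step in isolation. A larger `noise_multiplier` gives a smaller ε at the cost of noisier updates.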

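Trade-offs: DP training degrades synthetic data quality, and the loss grows as ε shrinks. At ε=1 (strong privacy), CTGAN-style models produce synthetic data with measurably worse statistical fidelity. At ε=10 (a much weaker guarantee), quality is often acceptable for downstream use, though the exact trade-off depends on the dataset and its size.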

Model Collapse: The Risk of Training on AI-Generated Data

Shumailov et al. (2023) “The Curse of Recursion” showed that iteratively training generative models on their own output leads to model collapse — quality degradation that accumulates across generations.

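The mechanism: Each generation is fit to a finite sample from the previous model, so rare events are under-sampled and the learned distribution drifts slightly from the true one. When that output becomes training data for the next generation, these errors compound. Tail behaviors disappear first; after enough generations the variance shrinks and the model converges toward a narrow distribution that has little in common with the original.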

Concretely: a language model generates synthetic text → fine-tune on it → generate more synthetic text → the resulting model’s outputs become less diverse, eventually collapsing to common patterns.
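The compounding effect can be reproduced with a toy model. This is an illustration only, not the paper's experiment: the "model" is a single Gaussian, and each generation is fit to a small sample drawn from the previous generation's fit. The sample size and generation count are arbitrary choices.

```python
import random
import statistics


def recursive_fit(n_samples, generations, rng):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit."""
    mu, sigma = 0.0, 1.0  # generation 0: the "true" distribution
    history = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)      # refit the mean
        sigma = statistics.pstdev(samples)  # refit the spread
        history.append(sigma)
    return history


history = recursive_fit(n_samples=10, generations=300, rng=random.Random(0))
print(f"std at generation 0:   {history[0]:.3f}")
print(f"std at generation 300: {history[-1]:.3g}")
```

Each refit loses a little of the spread on average, and nothing ever restores it, so the fitted standard deviation drifts toward zero. Mixing fresh real samples into every generation would reinject the lost variance.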

The “internet” problem: As more AI-generated content appears on the web, future LLMs trained on web scrapes will inadvertently train on synthetic data. If that synthetic data is degraded, this could affect future model quality.

Mitigations:

Always maintain some percentage of real data in training
Use only AI-generated data that’s been human-curated and validated
Filter generated data against quality metrics before training

Evaluating Synthetic Data Quality

Statistical fidelity: Do synthetic and real data have similar distributions?

Per-column statistics (mean, std, quartiles)
Cross-column correlations
Multivariate distribution similarity (for example MMD for tabular data, or FID for images)
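The first two checks in the list above can be sketched with the standard library. The column names and values are made up for illustration, and the report layout is an assumption of this sketch.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def fidelity_report(real, synth):
    """Compare per-column statistics and pairwise correlations.

    real, synth: dicts mapping column name -> list of floats (same columns).
    """
    per_column = {
        col: {
            "mean_gap": abs(statistics.fmean(real[col]) - statistics.fmean(synth[col])),
            "std_gap": abs(statistics.pstdev(real[col]) - statistics.pstdev(synth[col])),
        }
        for col in real
    }
    cols = list(real)
    corr_gaps = {
        (a, b): abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
    }
    return per_column, corr_gaps


# Made-up data: identical marginals, opposite relationship between columns.
real = {"age": [20, 30, 40, 50], "income": [20, 30, 40, 50]}
synth = {"age": [20, 30, 40, 50], "income": [50, 40, 30, 20]}
per_column, corr_gaps = fidelity_report(real, synth)
```

In this example every per-column gap is zero while the correlation gap is as large as it can be, which is why per-column statistics alone are not enough.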

Machine learning efficacy: Train a model on synthetic data and evaluate it on held-out real data (“train on synthetic, test on real”). Compare against the same model trained on real data; the gap in scores quantifies the quality loss.
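A small sketch of this comparison, using a nearest-centroid classifier so that it runs without external libraries. The classifier choice, the function names, and the toy data are all assumptions made for illustration; in practice the same model family intended for production would be used.

```python
import math


def fit_centroids(X, y):
    """Nearest-centroid 'training': the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def accuracy(centroids, X, y):
    """Fraction of points whose nearest centroid carries the correct label."""
    correct = 0
    for x, label in zip(X, y):
        pred = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        correct += pred == label
    return correct / len(y)


def efficacy_gap(real_train, synth_train, real_test):
    """Return (real-trained score, synthetic-trained score, gap) on real test data."""
    trtr = accuracy(fit_centroids(*real_train), *real_test)
    tstr = accuracy(fit_centroids(*synth_train), *real_test)
    return trtr, tstr, trtr - tstr


# Toy data: the synthetic set places class 1 too close to class 0.
real_train = ([(0, 0), (0, 1), (5, 5), (5, 6)], [0, 0, 1, 1])
synth_train = ([(0, 0), (0, 1), (1, 1), (1, 2)], [0, 0, 1, 1])
real_test = ([(0, 0.5), (0.8, 1.2), (5, 5.5), (4, 5)], [0, 0, 1, 1])
trtr, tstr, gap = efficacy_gap(real_train, synth_train, real_test)
```

Both models are scored on the same real test set, so the difference reflects only what the synthetic training data failed to capture.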

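Privacy metrics: Nearest neighbor distance ratio (NNDR). For each synthetic record, divide the distance to its closest real record by the distance to its second-closest real record. Ratios near 0 mean the synthetic record sits almost on top of one specific real record, which puts privacy at risk; ratios near 1 mean it is not tied to any single individual.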
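A minimal sketch of the NNDR computation for numeric records, assuming Euclidean distance and at least two real records. The toy points are made up, and real tabular data would need scaling and encoding first.

```python
import math


def nndr(synthetic_record, real_records):
    """Distance to the nearest real record divided by distance to the second nearest."""
    distances = sorted(math.dist(synthetic_record, r) for r in real_records)
    nearest, second = distances[0], distances[1]
    return nearest / second if second > 0 else 1.0


real = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
near_copy = nndr((0.1, 0.0), real)  # almost identical to the first real record
in_between = nndr((5.0, 5.0), real)  # equally far from all real records
```

Here `near_copy` comes out close to 0 and `in_between` close to 1. A common way to use the metric is to look at the low end of the NNDR distribution over all synthetic records rather than at the average.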

One thing to remember: Synthetic data is not “fake data that doesn’t matter.” At its best it captures real statistical structure and enables AI training at scales and in domains where real data collection is impossible. It still requires careful quality validation and grounding in real data to avoid generational drift.
