Synthetic Data — Core Concepts
Three Types of Synthetic Data
1. Simulation-Based Synthetic Data
The oldest and most reliable approach: build a physics or rule-based simulator and generate data from it.
Autonomous driving: Waymo, Tesla, and Cruise generate billions of miles of simulated driving data using game-engine-quality simulation environments. CARLA (open-source) and NVIDIA DRIVE Sim are widely used. Key advantage: you can simulate dangerous scenarios (pedestrians running into traffic, vehicles spinning out) that you’d never deliberately create in the real world.
Robotics: OpenAI’s Dactyl (2019) trained a robotic hand to solve a Rubik’s cube entirely in simulation using domain randomization — randomizing lighting, friction, and object properties to make the policy transfer to real-world hardware.
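A minimal sketch of domain randomization, assuming a simulator that accepts a parameter object at the start of each episode. The `SimParams` class and the numeric ranges are hypothetical and chosen only for illustration; they are not the values used for Dactyl.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    light_intensity: float   # arbitrary units
    friction: float          # surface friction coefficient
    object_mass_kg: float

# Hypothetical ranges, for illustration only.
RANGES = {
    "light_intensity": (0.3, 1.5),
    "friction": (0.5, 1.2),
    "object_mass_kg": (0.05, 0.20),
}

def sample_params(rng: random.Random) -> SimParams:
    """Draw one randomized simulator configuration."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})

# One fresh configuration per training episode.
rng = random.Random(0)
episodes = [sample_params(rng) for _ in range(5)]
```

Because the policy never sees the same physical constants twice, it cannot rely on any single simulator setting, which is what helps it tolerate the real world's unknown values.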
Medical imaging: Simulated MRI/CT scans are produced by combining computational phantoms (mathematical models of human anatomy) with a simulation of the imaging physics, such as radiation transport for CT. Used to generate rare pathology cases and test imaging algorithms without requiring patient data.
Limitations: The “sim-to-real gap” — simulation always differs from reality. Policies trained purely on simulation often fail in edge cases where simulator assumptions break down.
2. GAN-Based Tabular Synthesis
GANs (see generative-adversarial-networks) can generate realistic tabular data (spreadsheets) that matches the statistical distribution of real data.
CTGAN (Xu et al., 2019): Conditional Tabular GAN. Standard GANs fail on tabular data because:
- Mixed data types (continuous + categorical + binary)
- Imbalanced categories (some categories rare)
- Multi-modal distributions within columns
CTGAN addresses these with three techniques: mode-specific normalization for continuous columns, a conditional vector with training-by-sampling (conditioning on each category so that minority categories are generated), and PacGAN-style packing (the discriminator judges several samples at once) to reduce mode collapse.
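Mode-specific normalization can be sketched as follows. This assumes the Gaussian mixture for a column has already been fitted; the `MODES` values are hypothetical, and the function name is not taken from any library. Each value becomes a one-hot mode indicator plus a scalar giving its position within that mode.

```python
import math
import random

# Hypothetical, already-fitted mixture for one continuous column:
# (weight, mean, std) per mode. CTGAN fits these with a variational
# Gaussian mixture; here they are simply assumed.
MODES = [(0.7, 0.0, 1.0), (0.3, 100.0, 5.0)]

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def mode_specific_normalize(x, modes, rng):
    """Return (one_hot_mode, alpha) for a single value x."""
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in modes]
    # Sample the mode in proportion to its posterior probability.
    k = rng.choices(range(len(modes)), weights=densities)[0]
    _, mean, std = modes[k]
    alpha = (x - mean) / (4 * std)  # scalar position within the chosen mode
    one_hot = [1 if i == k else 0 for i in range(len(modes))]
    return one_hot, alpha

one_hot, alpha = mode_specific_normalize(101.0, MODES, random.Random(0))
```

Normalizing within a mode keeps the scalar in a narrow range even when the column as a whole has several well-separated peaks.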
TVAE: Uses a VAE (Variational Autoencoder) instead of GAN for tabular synthesis. More stable training; slightly different quality profile.
SMOTE (Synthetic Minority Over-sampling Technique, Chawla et al., 2002): A classic, non-GAN technique for class imbalance. It generates synthetic minority-class examples by interpolating between an existing minority example and one of its nearest minority neighbors in feature space. Simple, fast, widely used.
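The interpolation step can be sketched in a few lines. This is a simplified illustration using plain Python lists and Euclidean distance, not a replacement for a tested library implementation; the function name and the sample points are made up for the example.

```python
import math
import random

def smote(minority, n_new, k=3, rng=None):
    """Generate n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=10, rng=random.Random(0))
```

Every synthetic point lies on a line segment between two real minority points, so the method never extrapolates beyond the region the minority class already occupies.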
3. LLM-Based Synthetic Data
The most recent and increasingly dominant approach: use large language models to generate synthetic text, code, or structured data.
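For instruction tuning: “Self-Instruct” (Wang et al., 2022) uses GPT-3 to generate instruction-following examples. Starting from a small pool of 175 human-written seed tasks, the model is prompted with a few sampled tasks to write new instructions and matching inputs and outputs; near-duplicates and low-quality generations are filtered out and the rest are added back to the pool. This bootstraps instruction-tuned models without massive human annotation.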
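A rough sketch of such a bootstrapping loop is shown below. The `generate` callable stands in for an LLM call and is hypothetical, as is the stub used to make the example runnable. The word-overlap measure is a crude substitute for the ROUGE-L similarity filter described in the paper; only the 0.7 threshold is borrowed from it.

```python
import random

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for ROUGE-L."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def bootstrap(seed_tasks, generate, rounds, max_overlap=0.7, rng=None):
    """Grow a task pool: sample in-context examples, ask the model for a
    new instruction, keep it only if it is not a near-duplicate."""
    rng = rng or random.Random()
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = rng.sample(pool, k=min(3, len(pool)))
        candidate = generate(examples)  # hypothetical LLM call
        if all(word_overlap(candidate, t) < max_overlap for t in pool):
            pool.append(candidate)
    return pool

def make_stub():
    """Stub generator so the sketch runs without a model."""
    outputs = iter([
        "Translate the sentence into French",   # new task, kept
        "Summarize the following paragraph",    # duplicate of a seed, filtered
        "List three synonyms for a word",       # new task, kept
    ])
    return lambda examples: next(outputs)

seeds = ["Summarize the following paragraph", "Write a haiku about autumn"]
pool = bootstrap(seeds, make_stub(), rounds=3, rng=random.Random(0))
```

The filter is what keeps the pool diverse: without it, the model's most likely outputs would be added repeatedly.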
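Textbooks Are All You Need / Microsoft Phi-1 (2023): Trained a 1.3B parameter model on about 7 billion tokens of “textbook-quality” data: roughly 6 billion tokens of filtered web code plus about 1 billion tokens of Python textbook content generated with GPT-3.5, followed by fine-tuning on a small set of synthetic exercises. The resulting model matched models around 10x its size on coding benchmarks. The insight: high-quality, structured educational content is more valuable than the same quantity of internet text.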
Phi-2 and Phi-3 (2023–2024): Extended to “code exercises,” “synthetic textbooks” across subjects, and “filtered web data” (GPT-4 filters and rewrites web pages to an educational standard). Phi-3-mini (3.8B parameters) was reported to match or outperform Llama 2 70B on several benchmarks despite having roughly 18x fewer parameters, a gain largely attributed to synthetic data quality.
Differential Privacy for Synthetic Data
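The core risk of synthetic data: the generative model may memorize specific training examples, meaning the synthetic data could leak individuals’ information.
Differentially private synthesis (DP-CTGAN, PATE-GAN): Train the generative model with differential privacy guarantees (see federated-learning topic). If training is $(ε, δ)$-differentially private, adding or removing any one individual's record changes the probability of any trained model, and therefore of any synthetic dataset sampled from it, by at most a factor of $e^{ε}$, plus a small slack $δ$. The guarantee bounds what an adversary can learn about any individual; it does not say that the synthetic data is close to the real data.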
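For reference, the standard definition can be written out. The symbols $M$ (the randomized training-and-sampling procedure), $D$ and $D'$ (two datasets differing in one individual's record), and $S$ (any set of possible outputs) are notation introduced here, not taken from the text above. $M$ is $(ε, δ)$-differentially private if, for all such $D$, $D'$, and $S$:

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

Any computation applied to the output of $M$, including sampling synthetic records from a trained generator, satisfies the same bound (the post-processing property).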
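DP-SGD in GAN training: Clip each example's gradient to a fixed norm, then add calibrated Gaussian noise to the summed gradients before each update. Clipping bounds any one record's influence and the noise masks it, which limits how much the model can overfit to individual examples. In a GAN only the discriminator sees real data, so it is typically the discriminator's updates that are privatized; the generator inherits the guarantee because it learns only through the discriminator.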
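The core DP-SGD step can be sketched as below, assuming per-example gradients are already available as plain lists. The function and parameter names are illustrative, and the sketch omits privacy accounting (tracking the cumulative ε over many steps), which a real implementation needs.

```python
import math
import random

def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to clip_norm, sum, add Gaussian noise
    with std noise_multiplier * clip_norm, then average over the batch."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(v * v for v in g))
        # Scale down only gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, v in enumerate(g):
            total[i] += v * scale
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]   # norms 5.0 and 0.5
update = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1,
                             rng=random.Random(0))
```

The noise scale is tied to `clip_norm` because that bound is the most any single example can contribute to the sum.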
Trade-offs: DP training degrades synthetic data quality, and smaller ε means stronger privacy but more noise. At ε=1 (strong privacy), CTGAN produces synthetic data with measurably worse statistical fidelity. At ε=10 (moderate privacy), quality is often acceptable, though the exact trade-off depends on the dataset size and the model.
Model Collapse: The Risk of Training on AI-Generated Data
Shumailov et al. (2023) “The Curse of Recursion” showed that iteratively training generative models on their own output leads to model collapse — quality degradation that accumulates across generations.
The mechanism: Each generation is fit to a finite sample from the previous model, so rare events are under-sampled and the fitted distribution drifts slightly from the true one. When that output becomes training data for the next generation, these errors compound. The tails of the true distribution disappear first, and after enough generations the model converges toward a narrow distribution with much lower variance.
Concretely: a language model generates synthetic text → fine-tune on it → generate more synthetic text → the resulting model’s outputs become less diverse, eventually collapsing to common patterns.
The “internet” problem: As more AI-generated content appears on the web, future LLMs trained on web scrapes will inadvertently train on synthetic data. If that synthetic data is degraded, this could affect future model quality.
Mitigations:
- Always maintain some percentage of real data in training
- Use only AI-generated data that’s been human-curated and validated
- Filter generated data against quality metrics before training
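A toy simulation can illustrate both the collapse and the first mitigation. It assumes the "model" is just a one-dimensional Gaussian refit by maximum likelihood each generation, which is far simpler than a language model; the function name, sample sizes, and the 50% real-data fraction are arbitrary choices for the sketch.

```python
import random
import statistics

def run_generations(n_generations, n_samples, real_fraction, rng):
    """Repeatedly fit a Gaussian to data and resample from the fit.
    real_fraction of each new training set is fresh real data from N(0, 1)."""
    mean, std = 0.0, 1.0          # generation 0 matches the true distribution
    history = [std]
    n_real = int(real_fraction * n_samples)
    for _ in range(n_generations):
        synthetic = [rng.gauss(mean, std) for _ in range(n_samples - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)   # maximum-likelihood estimate
        history.append(std)
    return history

# Fitted standard deviation per generation, with and without real data.
pure = run_generations(200, 20, real_fraction=0.0, rng=random.Random(0))
mixed = run_generations(200, 20, real_fraction=0.5, rng=random.Random(0))
```

With purely synthetic training data the fitted standard deviation shrinks toward zero over generations, while the run that keeps mixing in real samples stays anchored near the true spread.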
Evaluating Synthetic Data Quality
Statistical fidelity: Do synthetic and real data have similar distributions?
- Per-column statistics (mean, std, quartiles)
- Cross-column correlations
- Multivariate distribution similarity (for example maximum mean discrepancy, MMD, or Fréchet Inception Distance, FID, for image data)
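Two of these checks can be sketched with the standard library alone. The sketch assumes all columns are numeric and rows are plain lists; the RBF kernel and its bandwidth of 1.0 are arbitrary choices, and the function names are illustrative.

```python
import math
import statistics

def column_gaps(real, synth):
    """Absolute difference in (mean, std) for each column."""
    gaps = []
    for r_col, s_col in zip(zip(*real), zip(*synth)):
        gaps.append((abs(statistics.fmean(r_col) - statistics.fmean(s_col)),
                     abs(statistics.pstdev(r_col) - statistics.pstdev(s_col))))
    return gaps

def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two rows."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * bandwidth ** 2))

def mmd_squared(real, synth, bandwidth=1.0):
    """Biased estimate of squared MMD with an RBF kernel."""
    def mean_kernel(a, b):
        return sum(rbf(x, y, bandwidth) for x in a for y in b) / (len(a) * len(b))
    return mean_kernel(real, real) + mean_kernel(synth, synth) - 2 * mean_kernel(real, synth)

real = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
shifted = [[x + 3.0, y + 3.0] for x, y in real]   # deliberately poor "synthetic" data
gaps = column_gaps(real, shifted)
score = mmd_squared(real, shifted)
```

Per-column gaps catch marginal errors, while MMD also reacts when the joint structure differs even though each column looks fine on its own.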
Machine learning efficacy: Train a model on synthetic data, evaluate on real data. Compare to model trained on real data. The gap quantifies quality loss.
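This comparison can be sketched with a deliberately simple classifier. The nearest-centroid model below is a stand-in chosen so the example needs no external libraries; in practice you would use whatever model the real task calls for. All function names are illustrative.

```python
import math
import statistics

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids

def accuracy(centroids, X, y):
    """Fraction of rows whose nearest centroid carries the true label."""
    correct = 0
    for x, label in zip(X, y):
        pred = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        correct += pred == label
    return correct / len(y)

def efficacy_gap(real_train, synth_train, real_test):
    """Accuracy on real test data: trained-on-real minus trained-on-synthetic."""
    acc_real = accuracy(fit_centroids(*real_train), *real_test)
    acc_synth = accuracy(fit_centroids(*synth_train), *real_test)
    return acc_real - acc_synth
```

A gap near zero suggests the synthetic data preserves the signal the task needs; a large positive gap means something useful was lost in synthesis.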
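Privacy metrics: Distance to closest record (DCR) and nearest neighbor distance ratio (NNDR). For each synthetic record, DCR is the distance to its nearest real record, and NNDR divides that distance by the distance to the second-nearest real record. Synthetic records that sit very close to a single real record (DCR and NNDR near 0) suggest memorization and put privacy at risk.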
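A minimal sketch of a nearest-neighbor privacy check, assuming numeric rows, Euclidean distance, and at least two real records. The function name is illustrative, and returning 0.0 when the two nearest real records both coincide with the synthetic one is a convention chosen here.

```python
import math

def privacy_distances(synth, real):
    """For each synthetic record return (d1, ratio):
    d1    = distance to the closest real record,
    ratio = d1 divided by the distance to the second-closest real record."""
    results = []
    for s in synth:
        d1, d2 = sorted(math.dist(s, r) for r in real)[:2]
        # d2 == 0 means the record copies duplicated real rows: flag it as 0.0.
        ratio = d1 / d2 if d2 > 0 else 0.0
        results.append((d1, ratio))
    return results

real = [[0.0, 0.0], [4.0, 0.0], [10.0, 10.0]]
synth = [[0.0, 0.0], [2.0, 0.0]]   # first row is an exact copy of a real record
distances = privacy_distances(synth, real)
```

A ratio near 1 means the synthetic record is about equally far from several real records, which is what you would expect from a generator that has learned the distribution without copying any one row.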
One thing to remember: Synthetic data is not “fake data that doesn’t matter” — at its best it captures real statistical structure and enables AI training at scales and in domains where real data collection is impossible, but it always requires careful quality validation and maintaining grounding in real data to avoid generational drift.