LLMs and Synthetic Data: Training on Machine-Generated Text
How synthetic data is reshaping LLM training—from generation strategies and quality filtering to the risks of model collapse and best practices for mixing real and synthetic corpora.
Why Synthetic Data Matters for LLMs
The internet isn’t infinite—or at least, the useful internet isn’t. As language models have scaled from billions to trillions of parameters, the demand for high-quality training data has outstripped the supply of naturally occurring text. Synthetic data—text generated by models themselves—has emerged as a practical solution to this data bottleneck.
But training on machine-generated text introduces a strange recursion: models learning from their own outputs, or from the outputs of their predecessors. Understanding when this works, when it fails, and how to do it well has become one of the most consequential areas in modern AI research.
What Counts as Synthetic Data?
Synthetic data for LLM training spans a wide spectrum:
- Instruction-response pairs generated by a strong model (the approach popularized by Self-Instruct and Alpaca)
- Paraphrases and augmented variants of existing text
- Translated or back-translated text for multilingual coverage
- Distillation outputs where a large model generates training targets for a smaller one
- Fully synthetic corpora generated from structured prompts or templates
- Code execution traces where models generate code, run it, and use the results as training signal
The key distinction isn’t whether data is “real” vs. “fake”—it’s whether the data provides genuine learning signal that the model couldn’t get from existing sources.
The Case for Synthetic Data
Scaling Beyond Natural Data
Chinchilla scaling laws showed that models need roughly 20 tokens of training data per parameter for compute-optimal training. A 1-trillion-parameter model needs 20 trillion tokens. High-quality English text on the internet is estimated at 5–10 trillion tokens. The math doesn’t work without augmentation.
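The arithmetic above can be sketched directly. Note that the 20-tokens-per-parameter ratio and the 10-trillion-token supply figure are the rough estimates from the text, not precise values:

```python
# Back-of-envelope Chinchilla-style token budget.
TOKENS_PER_PARAM = 20  # rough compute-optimal ratio from the Chinchilla paper

def optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

one_trillion = 1e12
needed = optimal_tokens(one_trillion)   # 2e13 = 20 trillion tokens
web_supply = 10e12                      # optimistic estimate of high-quality web text
shortfall = needed - web_supply         # at least 10 trillion tokens short
print(f"needed: {needed:.0e}, shortfall: {shortfall:.0e}")
```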
Targeted Skill Development
Natural data is imbalanced. There’s vastly more casual web text than, say, step-by-step mathematical reasoning or nuanced ethical deliberation. Synthetic data lets you generate exactly the kind of examples you need:
- Chain-of-thought reasoning traces for math and logic
- Multi-turn dialogues for conversational ability
- Edge cases and adversarial examples for robustness
- Domain-specific technical content for specialized applications
Privacy and Compliance
In domains like healthcare and finance, real data carries privacy constraints. Synthetic data can preserve statistical properties while eliminating personally identifiable information—though this requires careful validation.
Cost Efficiency
Generating synthetic data from an existing model is dramatically cheaper than human annotation. A single API call might cost fractions of a cent; a human expert annotator might cost $50/hour.
The Risks: Model Collapse and Quality Degradation
What Is Model Collapse?
When models train on outputs from previous model generations, errors and biases compound. Research from Shumailov et al. (2023) demonstrated that iterative training on model-generated data causes the output distribution to narrow and degrade—a phenomenon called model collapse.
The intuition is straightforward: each model generation slightly distorts the data distribution. Over successive iterations, the tails of the distribution get clipped. Rare but important patterns vanish. The model converges toward a blander, less diverse version of language.
Quality Filtering Is Non-Negotiable
Raw synthetic data is noisy. Models hallucinate, repeat themselves, and produce text that’s grammatically correct but factually wrong or stylistically flat. Effective synthetic data pipelines require aggressive filtering:
- Deduplication at both exact and near-duplicate levels
- Quality scoring using perplexity, coherence metrics, or a separate classifier
- Factual verification through retrieval-augmented checking or execution-based validation
- Diversity metrics to ensure the synthetic corpus doesn’t collapse into repetitive patterns
- Human spot-checks on random samples to catch systematic failures
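The first two filters above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the 0.8 near-duplicate threshold and the repetition heuristic are illustrative assumptions, and the pairwise shingle comparison is quadratic (at scale you would use MinHash/LSH):

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-gram shingles for near-duplicate detection."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def repetition_ratio(text: str) -> float:
    """Fraction of tokens that are repeats; high values flag degenerate loops."""
    toks = text.split()
    return 1 - len(set(toks)) / len(toks) if toks else 1.0

def filter_corpus(samples, near_dup_threshold=0.8, max_repetition=0.5):
    kept, seen_hashes, seen_shingles = [], set(), []
    for s in samples:
        h = hashlib.sha256(s.encode()).hexdigest()
        if h in seen_hashes:                       # exact duplicate
            continue
        if repetition_ratio(s) > max_repetition:   # degenerate repetition
            continue
        sh = shingles(s)
        if any(jaccard(sh, prev) > near_dup_threshold for prev in seen_shingles):
            continue                               # near duplicate
        seen_hashes.add(h)
        seen_shingles.append(sh)
        kept.append(s)
    return kept
```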
The Homogeneity Problem
Models trained heavily on synthetic data risk developing a recognizable “synthetic voice”—fluent but generic, correct but uninteresting. This is especially problematic for creative applications where diversity and surprise are valuable.
Generation Strategies That Work
Seed-and-Expand
Start with a small set of high-quality human-written examples. Use a strong model to generate variations, extensions, and related examples. This anchors the synthetic data to real-world quality while scaling the volume.
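The loop might look like this. The `call_model` function is a placeholder for a real LLM API call, and the seed examples and prompt template are invented for illustration:

```python
import random

SEED_EXAMPLES = [  # small human-written seed set (illustrative)
    "Explain why the sky is blue in two sentences.",
    "Write a Python function that reverses a string.",
]

EXPANSION_PROMPT = (
    "Here are example instructions:\n{examples}\n"
    "Write one new instruction in the same style, on a different topic."
)

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM API call -- this stub is an assumption."""
    return "Summarize the causes of the 1929 stock market crash."

def seed_and_expand(seeds, rounds=3, k=2, rng=random):
    """Grow a pool of instructions by prompting with samples from the pool itself."""
    pool = list(seeds)
    for _ in range(rounds):
        sampled = rng.sample(pool, min(k, len(pool)))
        prompt = EXPANSION_PROMPT.format(examples="\n".join(sampled))
        candidate = call_model(prompt).strip()
        if candidate and candidate not in pool:  # crude dedup before accepting
            pool.append(candidate)
    return pool
```

In a real pipeline the accepted candidates would also pass through the quality filters described above before being added back to the pool.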
Self-Play and Debate
Have models take opposing positions or critique each other’s outputs. This generates data that captures reasoning under disagreement—a type of text that’s rare in natural corpora but valuable for developing nuanced judgment.
Execution-Verified Generation
For code and math, generate candidate solutions, execute them against test cases, and keep only the ones that pass. This provides a hard quality signal that doesn’t depend on another model’s judgment.
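A minimal sketch of the keep-only-what-passes loop, assuming each candidate defines a `solve` function (the function name and test-case format are conventions invented here; a production version would run candidates in a sandbox, not bare `exec`):

```python
def passes_tests(solution_src: str, test_cases) -> bool:
    """Execute a candidate solution defining `solve` and check it against test cases."""
    namespace = {}
    try:
        exec(solution_src, namespace)  # NOTE: unsandboxed; illustration only
        solve = namespace["solve"]
        return all(solve(inp) == expected for inp, expected in test_cases)
    except Exception:
        return False                   # crashes and syntax errors count as failures

def execution_filter(candidates, test_cases):
    """Keep only candidates that pass every test case."""
    return [c for c in candidates if passes_tests(c, test_cases)]
```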
Constitutional AI and RLAIF
Use a model to generate responses, then use the same (or another) model to critique and rank those responses against a set of principles. The resulting preference data drives reinforcement learning from AI feedback (RLAIF), which has proven effective for alignment without requiring massive human annotation.
Curriculum-Based Generation
Generate synthetic data of increasing difficulty, matching the model’s current capability level. Start with simple examples, verify the model can learn them, then generate harder ones. This mirrors effective human pedagogy.
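The control loop can be sketched as follows. Both `generate_at` and `learner_passes` are assumed hooks into your own generation and evaluation pipeline, and the 0.8 mastery threshold is an illustrative choice:

```python
def curriculum_generate(generate_at, learner_passes, levels, threshold=0.8, batch=20):
    """Generate data level by level, advancing only once the model masters the current one.

    generate_at(level, n) -> list of examples at that difficulty level.
    learner_passes(example) -> bool, whether the current model handles the example.
    """
    accepted = []
    for level in levels:
        examples = generate_at(level, batch)
        pass_rate = sum(learner_passes(e) for e in examples) / len(examples)
        accepted.extend(examples)
        if pass_rate < threshold:  # model hasn't mastered this level; stop advancing
            break
    return accepted
```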
Mixing Real and Synthetic Data
The consensus in practice is clear: don’t go all-synthetic. The best results come from carefully controlled mixtures:
- Foundation training should be predominantly real data, with synthetic data filling specific gaps
- Fine-tuning can tolerate higher synthetic ratios (50–90% synthetic is common for instruction tuning)
- Alignment training increasingly relies on synthetic preference data, but anchored to human-validated principles
A practical mixing strategy:
- Train the base model on mostly real data
- Generate synthetic data using the base model (or a stronger model)
- Filter aggressively
- Mix filtered synthetic data with held-out real data for fine-tuning
- Validate on purely real benchmarks
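Step 4 above, building the fine-tuning mix at a target synthetic fraction, can be sketched like this (the 30% default ratio is illustrative, not a recommendation):

```python
import random

def mix_corpora(real, synthetic, synthetic_ratio=0.3, rng=None):
    """Combine all real samples with enough synthetic samples to hit the target ratio."""
    rng = rng or random.Random(0)
    # Solve n_synth / (len(real) + n_synth) = synthetic_ratio for n_synth.
    n_synth = int(len(real) * synthetic_ratio / (1 - synthetic_ratio))
    n_synth = min(n_synth, len(synthetic))  # can't draw more than we have
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed
```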
Quality Assurance for Synthetic Pipelines
Benchmark Contamination
A subtle risk: if your synthetic data generator has seen your evaluation benchmarks during its own training, it may generate text that leaks benchmark answers into your training set. This inflates scores without improving real capability. Mitigation requires careful decontamination and the use of held-out, novel evaluation sets.
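A common decontamination heuristic is n-gram overlap: drop any training sample that shares a long-enough token n-gram with an evaluation set. A minimal sketch (the 8-gram default is a typical choice, not a standard):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Token n-grams of a text, lowercased and whitespace-tokenized."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_samples, benchmark_texts, n=8):
    """Drop training samples that share any n-gram with an evaluation benchmark."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [s for s in train_samples if not (ngrams(s, n) & bench)]
```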
Distribution Monitoring
Track statistical properties of your synthetic data over time. If the vocabulary diversity drops, sentence length variance decreases, or topic coverage narrows, your pipeline may be drifting toward collapse.
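A sketch of such monitoring: compute a few diversity statistics per pipeline generation and flag shrinkage. The type-token ratio and length-variance checks, and the 0.9 tolerance, are illustrative choices:

```python
import statistics

def corpus_stats(samples):
    """Diversity statistics worth tracking across pipeline generations."""
    toks = [t for s in samples for t in s.split()]
    lengths = [len(s.split()) for s in samples]
    return {
        "type_token_ratio": len(set(toks)) / len(toks) if toks else 0.0,
        "mean_len": statistics.mean(lengths),
        "len_stdev": statistics.pstdev(lengths),
    }

def drifting_toward_collapse(prev, curr, tolerance=0.9):
    """Flag if vocabulary diversity or length variance has shrunk past the tolerance."""
    return (curr["type_token_ratio"] < tolerance * prev["type_token_ratio"]
            or curr["len_stdev"] < tolerance * prev["len_stdev"])
```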
A/B Testing in Production
The ultimate test of synthetic data quality is downstream performance. Run controlled experiments: train identical models with and without specific synthetic data components, and measure real-world task performance—not just loss on a held-out set.
The State of the Art in 2026
Synthetic data has moved from a research curiosity to an industrial necessity. Key developments:
- Phi-series models from Microsoft demonstrated that small models trained on carefully curated synthetic data can match much larger models trained on raw web data
- Constitutional AI approaches have made RLAIF standard practice for alignment
- Synthetic data marketplaces have emerged, with companies specializing in generating domain-specific training data
- Regulatory scrutiny is increasing, with frameworks such as the EU AI Act imposing transparency obligations around the data used to train models, including synthetic data
The field is converging on a pragmatic view: synthetic data is a tool, not a shortcut. Used well, it unlocks capabilities that natural data alone can’t provide. Used carelessly, it degrades model quality in ways that are hard to detect and harder to fix.
Practical Recommendations
- Always maintain a real-data anchor. Never let synthetic data be your only training signal.
- Filter harder than you think necessary. The cost of filtering is low; the cost of training on bad data is high.
- Monitor for collapse. Track diversity metrics across generations.
- Use execution-based verification wherever possible (code, math, structured outputs).
- Invest in evaluation. Your evaluation suite needs to be stronger than your synthetic data pipeline, or you won’t know when things go wrong.
- Document your pipeline. For reproducibility and compliance, record what models generated your synthetic data, what filters you applied, and what mixing ratios you used.
Synthetic data isn’t a silver bullet—but it’s an essential tool in the modern LLM toolkit. The teams that use it most effectively are the ones that treat it with the same rigor they’d apply to any other engineering system: measure everything, trust nothing by default, and iterate based on evidence.