LLMs and Synthetic Data: Training on Machine-Generated Text
How synthetic data is reshaping LLM training—from generation strategies and quality filtering to the risks of model collapse and best practices for mixing real and synthetic corpora.
Why Synthetic Data Matters for LLMs
The internet isn’t infinite—or at least, the useful internet isn’t. As language models have scaled from billions to trillions of parameters, the demand for high-quality training data has outstripped the supply of naturally occurring text. Synthetic data—text generated by models themselves—has emerged as a practical solution to this data bottleneck.
But training on machine-generated text introduces a strange recursion: models learning from their own outputs, or from the outputs of their predecessors. Understanding when this works, when it fails, and how to do it well has become one of the most consequential areas in modern AI research.
What Counts as Synthetic Data?
Synthetic data for LLM training spans a wide spectrum:
- Instruction-response pairs generated by a strong model (the approach popularized by Self-Instruct and Alpaca)
- Paraphrases and augmented variants of existing text
- Translated or back-translated text for multilingual coverage
- Distillation outputs where a large model generates training targets for a smaller one
- Fully synthetic corpora generated from structured prompts or templates
- Code execution traces where models generate code, run it, and use the results as training signal
The key distinction isn’t whether data is “real” vs. “fake”—it’s whether the data provides genuine learning signal that the model couldn’t get from existing sources.
The Case for Synthetic Data
Scaling Beyond Natural Data
Chinchilla scaling laws showed that models need roughly 20 tokens of training data per parameter for compute-optimal training. A 1-trillion-parameter model needs 20 trillion tokens. High-quality English text on the internet is estimated at 5–10 trillion tokens. The math doesn’t work without augmentation.
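The arithmetic above can be sketched directly. Note that the 20-tokens-per-parameter ratio and the 10-trillion-token supply figure are the rough estimates from the text, not precise values:

```python
# Back-of-envelope Chinchilla-style token budget.
TOKENS_PER_PARAM = 20  # rough compute-optimal ratio from the Chinchilla paper

def optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

one_trillion = 1e12
needed = optimal_tokens(one_trillion)   # 2e13 = 20 trillion tokens
web_supply = 10e12                      # optimistic estimate of high-quality web text
shortfall = needed - web_supply         # at least 10 trillion tokens short
print(f"needed: {needed:.0e}, shortfall: {shortfall:.0e}")
```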
Targeted Skill Development
Natural data is imbalanced. There’s vastly more casual web text than, say, step-by-step mathematical reasoning or nuanced ethical deliberation. Synthetic data lets you generate exactly the kind of examples you need:
- Chain-of-thought reasoning traces for math and logic
- Multi-turn dialogues for conversational ability
- Edge cases and adversarial examples for robustness
- Domain-specific technical content for specialized applications
Privacy and Compliance
In domains like healthcare and finance, real data carries privacy constraints. Synthetic data can preserve statistical properties while eliminating personally identifiable information—though this requires careful validation.
Cost Efficiency
Generating synthetic data from an existing model is dramatically cheaper than human annotation. A single API call might cost fractions of a cent; a human expert annotator might cost $50/hour.
The Risks: Model Collapse and Quality Degradation
What Is Model Collapse?
When models train on outputs from previous model generations, errors and biases compound. Research from Shumailov et al. (2023) demonstrated that iterative training on model-generated data causes the output distribution to narrow and degrade—a phenomenon called model collapse.
The intuition is straightforward: each model generation slightly distorts the data distribution. Over successive iterations, the tails of the distribution get clipped. Rare but important patterns vanish. The model converges toward a blander, less diverse version of language.
Quality Filtering Is Non-Negotiable
Raw synthetic data is noisy. Models hallucinate, repeat themselves, and produce text that’s grammatically correct but factually wrong or stylistically flat. Effective synthetic data pipelines require aggressive filtering:
- Deduplication at both exact and near-duplicate levels
- Quality scoring using perplexity, coherence metrics, or a separate classifier
- Factual verification through retrieval-augmented checking or execution-based validation
- Diversity metrics to ensure the synthetic corpus doesn’t collapse into repetitive patterns
- Human spot-checks on random samples to catch systematic failures
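The first two filters above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the 0.8 near-duplicate threshold and the repetition heuristic are illustrative assumptions, and the pairwise shingle comparison is quadratic (at scale you would use MinHash/LSH):

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-gram shingles for near-duplicate detection."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def repetition_ratio(text: str) -> float:
    """Fraction of tokens that are repeats; high values flag degenerate loops."""
    toks = text.split()
    return 1 - len(set(toks)) / len(toks) if toks else 1.0

def filter_corpus(samples, near_dup_threshold=0.8, max_repetition=0.5):
    kept, seen_hashes, seen_shingles = [], set(), []
    for s in samples:
        h = hashlib.sha256(s.encode()).hexdigest()
        if h in seen_hashes:                       # exact duplicate
            continue
        if repetition_ratio(s) > max_repetition:   # degenerate repetition
            continue
        sh = shingles(s)
        if any(jaccard(sh, prev) > near_dup_threshold for prev in seen_shingles):
            continue                               # near duplicate
        seen_hashes.add(h)
        seen_shingles.append(sh)
        kept.append(s)
    return kept
```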
The Homogeneity Problem
Models trained heavily on synthetic data risk developing a recognizable “synthetic voice”—fluent but generic, correct but uninteresting. This is especially problematic for creative applications where diversity and surprise are valuable.
Generation Strategies That Work
Seed-and-Expand
Start with a small set of high-quality human-written examples. Use a strong model to generate variations, extensions, and related examples. This anchors the synthetic data to real-world quality while scaling the volume.
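The loop might look like this. The `call_model` function is a placeholder for a real LLM API call, and the seed examples and prompt template are invented for illustration:

```python
import random

SEED_EXAMPLES = [  # small human-written seed set (illustrative)
    "Explain why the sky is blue in two sentences.",
    "Write a Python function that reverses a string.",
]

EXPANSION_PROMPT = (
    "Here are example instructions:\n{examples}\n"
    "Write one new instruction in the same style, on a different topic."
)

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM API call -- this stub is an assumption."""
    return "Summarize the causes of the 1929 stock market crash."

def seed_and_expand(seeds, rounds=3, k=2, rng=random):
    """Grow a pool of instructions by prompting with samples from the pool itself."""
    pool = list(seeds)
    for _ in range(rounds):
        sampled = rng.sample(pool, min(k, len(pool)))
        prompt = EXPANSION_PROMPT.format(examples="\n".join(sampled))
        candidate = call_model(prompt).strip()
        if candidate and candidate not in pool:  # crude dedup before accepting
            pool.append(candidate)
    return pool
```

In a real pipeline the accepted candidates would also pass through the quality filters described above before being added back to the pool.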
Self-Play and Debate
Have models take opposing positions or critique each other’s outputs. This generates data that captures reasoning under disagreement—a type of text that’s rare in natural corpora but valuable for developing nuanced judgment.
Execution-Verified Generation
For code and math, generate candidate solutions, execute them against test cases, and keep only the ones that pass. This provides a hard quality signal that doesn’t depend on another model’s judgment.
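A minimal sketch of the keep-only-what-passes loop, assuming each candidate defines a `solve` function (the function name and test-case format are conventions invented here; a production version would run candidates in a sandbox, not bare `exec`):

```python
def passes_tests(solution_src: str, test_cases) -> bool:
    """Execute a candidate solution defining `solve` and check it against test cases."""
    namespace = {}
    try:
        exec(solution_src, namespace)  # NOTE: unsandboxed; illustration only
        solve = namespace["solve"]
        return all(solve(inp) == expected for inp, expected in test_cases)
    except Exception:
        return False                   # crashes and syntax errors count as failures

def execution_filter(candidates, test_cases):
    """Keep only candidates that pass every test case."""
    return [c for c in candidates if passes_tests(c, test_cases)]
```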
Constitutional AI and RLAIF
Use a model to generate responses, then use the same (or another) model to critique and rank those responses against a set of principles. The resulting preference data drives reinforcement learning from AI feedback (RLAIF), which has proven effective for alignment without requiring massive human annotation.
Curriculum-Based Generation
Generate synthetic data of increasing difficulty, matching the model’s current capability level. Start with simple examples, verify the model can learn them, then generate harder ones. This mirrors effective human pedagogy.
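The control loop can be sketched as follows. Both `generate_at` and `learner_passes` are assumed hooks into your own generation and evaluation pipeline, and the 0.8 mastery threshold is an illustrative choice:

```python
def curriculum_generate(generate_at, learner_passes, levels, threshold=0.8, batch=20):
    """Generate data level by level, advancing only once the model masters the current one.

    generate_at(level, n) -> list of examples at that difficulty level.
    learner_passes(example) -> bool, whether the current model handles the example.
    """
    accepted = []
    for level in levels:
        examples = generate_at(level, batch)
        pass_rate = sum(learner_passes(e) for e in examples) / len(examples)
        accepted.extend(examples)
        if pass_rate < threshold:  # model hasn't mastered this level; stop advancing
            break
    return accepted
```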
Mixing Real and Synthetic Data
The consensus in practice is clear: don’t go all-synthetic. The best results come from carefully controlled mixtures:
- Foundation training should be predominantly real data, with synthetic data filling specific gaps
- Fine-tuning can tolerate higher synthetic ratios (50–90% synthetic is common for instruction tuning)
- Alignment training increasingly relies on synthetic preference data, but anchored to human-validated principles
A practical mixing strategy:
- Train the base model on mostly real data
- Generate synthetic data using the base model (or a stronger model)
- Filter aggressively
- Mix filtered synthetic data with held-out real data for fine-tuning
- Validate on purely real benchmarks
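Step 4 above, building the fine-tuning mix at a target synthetic fraction, can be sketched like this (the 30% default ratio is illustrative, not a recommendation):

```python
import random

def mix_corpora(real, synthetic, synthetic_ratio=0.3, rng=None):
    """Combine all real samples with enough synthetic samples to hit the target ratio."""
    rng = rng or random.Random(0)
    # Solve n_synth / (len(real) + n_synth) = synthetic_ratio for n_synth.
    n_synth = int(len(real) * synthetic_ratio / (1 - synthetic_ratio))
    n_synth = min(n_synth, len(synthetic))  # can't draw more than we have
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed
```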
Quality Assurance for Synthetic Pipelines
Benchmark Contamination
A subtle risk: if your synthetic data generator has seen your evaluation benchmarks during its own training, it may generate text that leaks benchmark answers into your training set. This inflates scores without improving real capability. Mitigation requires careful decontamination and the use of held-out, novel evaluation sets.
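A common decontamination heuristic is n-gram overlap: drop any training sample that shares a long-enough token n-gram with an evaluation set. A minimal sketch (the 8-gram default is a typical choice, not a standard):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Token n-grams of a text, lowercased and whitespace-tokenized."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_samples, benchmark_texts, n=8):
    """Drop training samples that share any n-gram with an evaluation benchmark."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [s for s in train_samples if not (ngrams(s, n) & bench)]
```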
Distribution Monitoring
Track statistical properties of your synthetic data over time. If the vocabulary diversity drops, sentence length variance decreases, or topic coverage narrows, your pipeline may be drifting toward collapse.
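A sketch of such monitoring: compute a few diversity statistics per pipeline generation and flag shrinkage. The type-token ratio and length-variance checks, and the 0.9 tolerance, are illustrative choices:

```python
import statistics

def corpus_stats(samples):
    """Diversity statistics worth tracking across pipeline generations."""
    toks = [t for s in samples for t in s.split()]
    lengths = [len(s.split()) for s in samples]
    return {
        "type_token_ratio": len(set(toks)) / len(toks) if toks else 0.0,
        "mean_len": statistics.mean(lengths),
        "len_stdev": statistics.pstdev(lengths),
    }

def drifting_toward_collapse(prev, curr, tolerance=0.9):
    """Flag if vocabulary diversity or length variance has shrunk past the tolerance."""
    return (curr["type_token_ratio"] < tolerance * prev["type_token_ratio"]
            or curr["len_stdev"] < tolerance * prev["len_stdev"])
```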
A/B Testing in Production
The ultimate test of synthetic data quality is downstream performance. Run controlled experiments: train identical models with and without specific synthetic data components, and measure real-world task performance—not just loss on a held-out set.
The State of the Art in 2026
Synthetic data has moved from a research curiosity to an industrial necessity. Key developments:
- Phi-series models from Microsoft demonstrated that small models trained on carefully curated synthetic data can match much larger models trained on raw web data
- Constitutional AI approaches have made RLAIF standard practice for alignment
- Synthetic data marketplaces have emerged, with companies specializing in generating domain-specific training data
- Regulatory scrutiny is increasing, with frameworks such as the EU AI Act imposing transparency obligations around the data used to train models, including synthetic data
The field is converging on a pragmatic view: synthetic data is a tool, not a shortcut. Used well, it unlocks capabilities that natural data alone can’t provide. Used carelessly, it degrades model quality in ways that are hard to detect and harder to fix.
Practical Recommendations
- Always maintain a real-data anchor. Never let synthetic data be your only training signal.
- Filter harder than you think necessary. The cost of filtering is low; the cost of training on bad data is high.
- Monitor for collapse. Track diversity metrics across generations.
- Use execution-based verification wherever possible (code, math, structured outputs).
- Invest in evaluation. Your evaluation suite needs to be stronger than your synthetic data pipeline, or you won’t know when things go wrong.
- Document your pipeline. For reproducibility and compliance, record what models generated your synthetic data, what filters you applied, and what mixing ratios you used.
Synthetic data isn’t a silver bullet—but it’s an essential tool in the modern LLM toolkit. The teams that use it most effectively are the ones that treat it with the same rigor they’d apply to any other engineering system: measure everything, trust nothing by default, and iterate based on evidence.