Model Distillation and Compression: Making LLMs Smaller Without Losing Capability
A deep dive into knowledge distillation, pruning, and compression techniques that shrink large language models while preserving most of their capability — with practical guidance on when to use each approach.
View all llms depths →Depth ladder for this topic:
Model Distillation and Compression: Making LLMs Smaller Without Losing Capability
The biggest language models keep getting bigger. GPT-scale systems now push past trillions of parameters, and the trend shows no sign of reversing. But here’s the practical reality: most applications don’t need a trillion-parameter model. They need something fast, cheap, and good enough.
That’s where distillation and compression come in. These techniques let you take the knowledge learned by a massive “teacher” model and pack it into a smaller “student” model that runs on less hardware, responds faster, and costs a fraction to serve.
Why Compression Matters
Running a 70B-parameter model requires multiple high-end GPUs just for inference. For many production workloads — chatbots, code completion, document summarization — the difference between a 70B model and a well-distilled 7B model is barely noticeable to end users. But the cost difference is enormous.
Compression isn’t just about saving money. It’s about accessibility. A compressed model that runs on a single consumer GPU opens up use cases that were previously locked behind expensive cloud infrastructure. Edge deployment, on-device inference, and offline capabilities all depend on smaller models.
Knowledge Distillation: The Core Technique
Knowledge distillation, introduced by Hinton et al. in 2015, is conceptually simple: train a small model to mimic the behavior of a large model.
The key insight is that the teacher model’s output probabilities contain more information than hard labels. When a teacher assigns 70% probability to “cat,” 20% to “dog,” and 10% to “fox,” those soft probabilities encode relationships between classes that binary labels can’t capture.
How It Works
- Train or obtain a teacher model — your large, capable model
- Define a student architecture — smaller, faster, but same input/output format
- Generate soft targets — run training data through the teacher and capture its output distributions
- Train the student — minimize a combination of the standard loss (against ground truth) and a distillation loss (against teacher outputs)
The distillation loss typically uses KL divergence between the teacher and student distributions, scaled by a temperature parameter T:
L_distill = T² × KL(softmax(z_teacher/T) || softmax(z_student/T))
Higher temperatures produce softer probability distributions, which transfer more information about inter-class relationships.
Temperature Selection
Temperature is the most important hyperparameter in distillation. Too low (T=1) and you’re essentially training on hard labels. Too high (T>20) and you flatten the distribution into noise. Most practitioners find sweet spots between T=3 and T=10, depending on the task.
A practical approach: start at T=4, train for a few epochs, evaluate, then sweep T in {2, 4, 6, 8, 10} to find what works for your specific teacher-student pair.
Modern LLM Distillation Approaches
Classical distillation assumed you could access the teacher’s full probability distribution. With proprietary LLMs behind APIs, that’s often impossible. Modern distillation has evolved to handle this constraint.
Response-Based Distillation
The simplest approach: generate a dataset of (prompt, response) pairs from the teacher, then fine-tune the student on those pairs. This is how many open-source models have been bootstrapped from proprietary ones.
Pros: Works with black-box APIs. No access to logits needed. Cons: You lose the soft probability information. Quality depends heavily on the diversity and size of your prompt set.
Chain-of-Thought Distillation
A refinement of response-based distillation where you specifically capture the teacher’s reasoning chains. The student learns not just what to answer but how to think through problems.
This has proven especially effective for reasoning tasks. A 7B model trained on CoT traces from a 70B teacher can dramatically outperform a 7B model trained on answer-only data.
Feature-Based Distillation
When you have access to the teacher’s internals, you can match intermediate representations — not just final outputs. The student learns to produce similar hidden states at corresponding layers.
This requires architectural compatibility between teacher and student. Projection layers can bridge dimension mismatches, but the approach works best when architectures are similar.
Progressive Distillation
Rather than jumping from a 70B teacher to a 7B student, progressive distillation uses intermediate steps: 70B → 30B → 13B → 7B. Each step is smaller, and each student becomes the teacher for the next round.
This often produces better final models than single-step distillation, especially when the size gap between teacher and student is large.
Pruning: Removing What Doesn’t Matter
Pruning removes parameters from a trained model based on importance criteria. The insight: most neural networks are heavily over-parameterized, and many weights contribute almost nothing to outputs.
Unstructured Pruning
Set individual weights to zero based on magnitude or gradient information. A model pruned to 50% sparsity has half its weights zeroed out.
The problem: Unstructured sparsity doesn’t translate to speed improvements without specialized sparse computation libraries. A 50% sparse matrix stored in dense format takes exactly the same time to multiply.
Structured Pruning
Remove entire neurons, attention heads, or layers. This produces genuinely smaller, faster models without special hardware support.
Modern structured pruning for LLMs typically targets:
- Attention heads: Many heads are redundant. Removing 30-50% often has minimal impact on quality.
- Feed-forward dimensions: The intermediate dimension in FFN layers can often be reduced significantly.
- Entire layers: For very deep models, removing middle layers (with careful reconnection) can work surprisingly well.
The Lottery Ticket Hypothesis
Frankle and Carlin’s lottery ticket hypothesis suggests that within a large network, there exists a smaller subnetwork that — if trained from the right initialization — would match the full network’s performance. Pruning attempts to find these “winning tickets.”
In practice, iterative magnitude pruning (train → prune → retrain → prune) gets closest to finding these subnetworks.
Combining Techniques: The Compression Pipeline
In production, you rarely use just one technique. A typical compression pipeline looks like:
- Distill from a large teacher to a smaller student architecture
- Prune the student to remove redundant components
- Quantize the pruned model to lower precision (INT8 or INT4)
- Fine-tune on task-specific data to recover any lost quality
Each step trades a small amount of quality for significant efficiency gains. The compound effect is dramatic — you might go from a 70B FP16 model requiring 140GB of VRAM to a 7B INT4 model fitting in 4GB.
Practical Distillation: A Walkthrough
Let’s say you want to distill a capable 70B model into a 7B model for a customer support chatbot.
Step 1: Curate Your Training Data
Collect 50K-100K representative conversations. Run them through your teacher model to generate high-quality responses. Include diverse scenarios: simple FAQ, complex troubleshooting, edge cases, refusals.
Step 2: Generate Chain-of-Thought Traces
For complex queries, prompt the teacher to show its reasoning. This gives the student richer training signal.
Step 3: Fine-Tune the Student
Use standard supervised fine-tuning on the teacher-generated dataset. Key settings:
- Learning rate: 1e-5 to 5e-5
- Epochs: 3-5 (watch for overfitting)
- Batch size: as large as your hardware allows
Step 4: Evaluate Thoroughly
Compare student and teacher on held-out test sets. Pay attention to:
- Accuracy on simple tasks — should be nearly identical
- Quality on complex tasks — expect some degradation
- Hallucination rate — sometimes students hallucinate more
- Latency — the whole point; measure actual serving speed
Step 5: Iterate
If quality gaps are too large, try:
- More training data (especially for weak areas)
- Progressive distillation through an intermediate size
- Task-specific fine-tuning after distillation
- Mixing teacher-generated and human-labeled data
The Distillation Wars of 2026
The landscape has shifted dramatically. What started as academic research is now a competitive strategy. Major labs release frontier models, and within weeks, distilled open-source alternatives appear that match 80-90% of performance at 10% of the cost.
This creates an interesting dynamic: the value of training frontier models increasingly comes from their use as teachers, not just as products. Labs that invest billions in training a cutting-edge model see distilled competitors almost immediately.
Some responses: licensing restrictions on distillation, output detection watermarks, and API terms that prohibit training on outputs. Whether these hold up legally and practically remains an open question.
When Distillation Isn’t Enough
Distillation has limits. Some capabilities seem to require scale:
- Long-tail knowledge — Rare facts that small models can’t memorize
- Complex multi-step reasoning — Where chain-of-thought helps but doesn’t fully close the gap
- Instruction following nuance — Subtle aspects of alignment that resist compression
For these cases, consider hybrid architectures: a small model handles most requests, with routing to a larger model for complex ones. This gives you the economics of a small model with the capability ceiling of a large one.
Key Takeaways
- Distillation transfers knowledge from large models to small ones through soft probability matching or response mimicry
- Temperature is the critical hyperparameter in classical distillation — start at T=4 and sweep
- CoT distillation dramatically improves reasoning in student models
- Pruning removes redundant parameters; structured pruning gives real speedups
- Combine techniques (distill → prune → quantize) for maximum compression
- Progressive distillation through intermediate sizes often beats single-step compression
- Evaluate carefully — some capabilities degrade non-linearly with compression
The goal isn’t to make models as small as possible. It’s to make them as small as needed for your specific use case. A well-compressed model that serves your users quickly and cheaply is more valuable than a frontier model that’s too expensive to deploy.
Simplify
← Constitutional AI and Alignment: Teaching LLMs Values
Go deeper
Long-Context LLMs: How Models Actually Use 128K+ Token Windows →
Related reads
Stay ahead of the AI curve
Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.