The Bias-Variance Tradeoff: Why ML Models Fail in Two Opposite Ways
The bias-variance tradeoff is the central tension in machine learning. Understanding it explains why models overfit, underfit, and how to find the sweet spot.
If you’ve spent time building machine learning models, you’ve almost certainly encountered a model that:
- Performed brilliantly on training data but collapsed on new data
- Learned so little that it barely outperformed guessing
Both failures have a shared diagnosis: the bias-variance tradeoff. It’s one of the most important conceptual frameworks in ML — and the reason why “just train longer” or “just use a bigger model” aren’t always the answer.
What we’re actually trying to do
When we train a model, we’re trying to approximate an unknown function — call it f — that maps inputs to correct outputs. We never have access to f directly; we only have a sample of data points generated by f plus noise.
The expected error of our model on new, unseen data (measured as squared error) can be decomposed into three components:
Total Error = Bias² + Variance + Irreducible Noise
Let’s break each of these down.
Bias: systematic error from wrong assumptions
Bias measures how far off the model’s predictions are, on average, from the true values — regardless of which training set you use.
A high-bias model has systematically wrong assumptions about the relationship between inputs and outputs. It may:
- Use a linear model to fit an inherently nonlinear relationship
- Use too few features to capture the relevant patterns
- Apply too much regularization, preventing it from fitting the data
High bias = underfitting. The model is too simple to learn the underlying pattern.
Intuition: Imagine trying to fit a straight line through data that follows a U-shape. No matter how much data you add, the line can never capture the curve. You’re systematically wrong — and you’d be systematically wrong even with infinite data.
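This intuition is easy to verify numerically. A minimal sketch, where the U-shaped target, noise level, and seed are illustrative assumptions: even with 2,000 points, the straight line's training error never approaches the noise floor, while a model of the right shape does.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 2000)
y = x**2 + rng.normal(0, 0.5, 2000)   # U-shaped signal plus noise

# Best-fit straight line: systematically wrong, no matter how much data
line_coefs = np.polyfit(x, y, 1)
line_mse = np.mean((np.polyval(line_coefs, x) - y) ** 2)

# A quadratic model matches the true shape and reaches the noise floor
quad_coefs = np.polyfit(x, y, 2)
quad_mse = np.mean((np.polyval(quad_coefs, x) - y) ** 2)
```

With this setup the line's MSE stays several times above the irreducible noise variance (0.25), while the quadratic's sits right at it. That gap is bias: it would persist with any amount of data.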
Variance: sensitivity to training data fluctuations
Variance measures how much the model’s predictions change when you train on different samples from the same distribution.
A high-variance model learns the training data too specifically — including its noise, outliers, and sample-specific quirks. It:
- Memorizes patterns that don’t generalize
- Fits the noise rather than the signal
- Performs dramatically differently across different training splits
High variance = overfitting. The model is too complex relative to the available data.
Intuition: Imagine fitting a 10th-degree polynomial through 12 data points. It will hit every training point perfectly. But on new data, it will oscillate wildly. It learned the training set, not the underlying function.
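The polynomial intuition can also be sketched in a few lines (the target function, noise level, and seed are assumptions for illustration): the 10th-degree fit nails the 12 training points, then falls apart on fresh samples from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Draw n noisy observations of a smooth underlying function."""
    x = np.sort(rng.uniform(-1, 1, n))
    return x, np.sin(3 * x) + rng.normal(0, 0.1, n)

x_train, y_train = sample(12)
x_test, y_test = sample(500)

coefs = np.polyfit(x_train, y_train, 10)   # 10th-degree fit to 12 points
train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
# Near-zero training error, much larger test error: classic overfitting
```

The model has 11 coefficients for 12 points, so it nearly interpolates the training set; between and beyond those points it oscillates, which is what the test error measures.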
The tradeoff
Here’s the core tension: the techniques that reduce bias tend to increase variance, and vice versa.
Model complexity is the main lever:
| Action | Effect on Bias | Effect on Variance |
|---|---|---|
| Increase model complexity (more layers, more features) | Decreases bias | Increases variance |
| Decrease model complexity | Increases bias | Decreases variance |
| Add more training data | Little effect | Decreases variance |
| Add regularization (L1/L2, dropout) | Increases bias slightly | Decreases variance |
| Feature engineering (more relevant features) | Decreases bias | May increase variance |
| Early stopping (in neural nets) | Increases bias slightly | Decreases variance |
There is no free lunch. You’re always trading one off against the other. The goal is to find the point of minimum total error — where bias² + variance is minimized.
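The decomposition itself can be estimated empirically by retraining the same model on many fresh samples and looking at its predictions at one point. A sketch, where the "true" function, noise level, and polynomial degrees are all assumptions chosen to make the pattern visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)   # the unknown true function (assumed)

def predict_once(degree, x0, n=30, noise=0.3):
    """Fit one polynomial to a fresh noisy sample; predict at x0."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, noise, n)
    return np.polyval(np.polyfit(x, y, degree), x0)

def bias_variance(degree, x0=0.25, trials=300):
    preds = np.array([predict_once(degree, x0) for _ in range(trials)])
    bias_sq = (preds.mean() - f(x0)) ** 2   # squared systematic error
    variance = preds.var()                  # spread across training sets
    return bias_sq, variance

b_lo, v_lo = bias_variance(degree=1)    # simple model: high bias, low variance
b_hi, v_hi = bias_variance(degree=12)   # flexible model: low bias, high variance
```

With these settings the simple model's bias² dominates its error while the flexible model's variance dominates; the exact numbers depend on the seed and noise level, but the ordering is the tradeoff in action.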
How to diagnose which problem you have
You can’t measure bias and variance directly in practice (that would require retraining your model on many independent training sets drawn from the same distribution). But you can diagnose which problem you’re facing from the gap between your training and validation performance.
Symptom: High training error AND high validation error → High bias (underfitting) → Model hasn’t learned the training data well. It’s too simple. → Fix: More complex model, more features, less regularization
Symptom: Low training error AND high validation error → High variance (overfitting) → Model learned the training set but doesn’t generalize. → Fix: More data, regularization, dropout, simpler model, early stopping
Symptom: Both errors are low but validation creeps up over time → Overfitting emerging as training continues → Fix: Early stopping, regularization, learning rate scheduling
Symptom: Both errors are low and similar → Good generalization — you’re likely in a good region
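These rules of thumb collapse into a tiny helper. A sketch only: the thresholds below are arbitrary assumptions and should be tuned to your task's error scale.

```python
def diagnose(train_err, val_err, high=0.15, gap=0.05):
    """Map a train/validation error pair to a likely failure mode.

    `high` and `gap` are illustrative thresholds, not universal constants.
    """
    if train_err > high:
        return "high bias (underfitting): try a more complex model"
    if val_err - train_err > gap:
        return "high variance (overfitting): try regularization or more data"
    return "good region: errors are low and close together"

# e.g. diagnose(0.30, 0.32) flags underfitting;
#      diagnose(0.02, 0.20) flags overfitting
```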
Regularization: the primary tool for managing variance
Regularization is any technique that constrains a model to prevent overfitting. The two most common in classical ML:
L2 regularization (Ridge): Adds a penalty proportional to the sum of squared weights to the loss function. Pushes weights toward zero without zeroing them out. Makes the model prefer simpler, smoother fits.
L1 regularization (Lasso): Adds a penalty proportional to the sum of absolute weight values. Can drive weights exactly to zero, performing implicit feature selection.
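The difference between the two shows up directly in the fitted weights. A sketch using scikit-learn on synthetic data (the data-generating weights, noise level, and penalty strengths are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only 3 informative features
y = X @ true_w + rng.normal(0, 0.5, n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks every weight toward zero but keeps them all nonzero;
# Lasso drives the uninformative weights exactly to zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
```

At these settings Lasso zeroes out most or all of the seven noise features (implicit feature selection), while Ridge leaves small nonzero weights on all ten.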
In deep learning, regularization takes additional forms:
- Dropout: Randomly zeros out neuron activations during training, forcing the network to learn redundant representations
- Batch normalization: Normalizes activations across a mini-batch, which has a regularizing effect
- Weight decay: Equivalent to L2 regularization on the weights
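Dropout in particular is simple enough to sketch in a few lines of NumPy. This is the standard "inverted" formulation (the drop probability and shapes here are arbitrary), not any specific framework's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop and rescale
    the survivors so the expected activation is unchanged. No-op at inference."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones((1000, 64))
out = dropout(a, p_drop=0.5)
# roughly half the entries are zeroed; the rest are scaled to 2.0,
# so the mean activation stays near 1.0
```

The rescaling is what lets you drop the mask entirely at inference time: the network sees activations with the same expected magnitude in both modes.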
Cross-validation: the essential diagnostic tool
You can’t repeatedly evaluate on the test set without biasing your final evaluation, so k-fold cross-validation is used to estimate generalization performance from the training data alone, and far more reliably than a single train/validation split.
The process:
- Split your training data into k folds (typically 5 or 10)
- Train on k-1 folds, validate on the held-out fold
- Repeat k times, each time holding out a different fold
- Average the validation scores
This gives you a robust estimate of both mean performance (related to bias) and variance across folds (related to variance). High variance across folds is a sign that your model is too sensitive to the specific data it sees.
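In scikit-learn the whole procedure is one call. A sketch on synthetic data (the model choice, data, and fold count are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.3, 300)

# 5-fold CV: five validation scores, one per held-out fold
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")

mean_score = scores.mean()   # estimate of generalization performance
fold_spread = scores.std()   # large spread => model is data-sensitive
```

`mean_score` is what you compare across candidate models; `fold_spread` is the early-warning signal for a variance problem.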
Learning curves: visualizing the tradeoff
Learning curves plot train and validation error as a function of training set size. They’re one of the most informative diagnostic tools you have.
High-bias model pattern:
- Training error is high and doesn’t decrease much with more data
- Validation error converges toward training error at a high value
- Gap between curves is small — both are bad
- Adding data won’t help; you need a more complex model
High-variance model pattern:
- Training error is very low
- Validation error is much higher (large gap)
- As you add more data, validation error decreases (gap narrows)
- More data helps; so does regularization
If training error converges to validation error at a high level → underfitting. If they converge at a low level → you’ve found a good model. If the gap stays large → overfitting that requires regularization or more data.
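The high-bias pattern above is easy to reproduce. A sketch using scikit-learn's `learning_curve` with a deliberately too-simple model (the data and settings are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, 500)   # nonlinear target

# A linear model on quadratic data: the classic high-bias pattern
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
    scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)   # one value per training-set size
val_mse = -val_scores.mean(axis=1)
# Both curves converge at a high error: more data won't fix high bias
```

Plotting `train_mse` and `val_mse` against `sizes` gives the learning curve; here the two converge well above the noise floor, which is the "add data won't help" signature.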
The double descent phenomenon
A relatively recent finding complicates the traditional bias-variance picture: double descent.
In classical statistics, the bias-variance tradeoff predicts a U-shaped test error curve — error decreases as model complexity increases (lower bias) but then increases again (higher variance) as you overfit.
Modern deep learning shows a different pattern: test error can decrease again after the interpolation threshold — after the model is large enough to fit the training data exactly. This “modern regime” of highly overparameterized models (like large neural nets) doesn’t follow the traditional curve.
The explanation involves implicit regularization — gradient descent on overparameterized models finds “simple” solutions in a way that reduces variance even without explicit regularization. This phenomenon is still an active research area, but it’s why very large neural networks can interpolate training data and still generalize well.
This doesn’t eliminate the bias-variance tradeoff — it complicates it. For classical ML (linear models, SVMs, gradient boosting), the tradeoff works as described. For large neural nets, the picture is more nuanced.
The practical upshot
You probably won’t compute bias and variance explicitly in your projects. What you will do is:
- Measure train and validation error — always, on every experiment
- Diagnose the failure mode — underfitting vs. overfitting
- Apply the right fix — complexity vs. regularization
- Use cross-validation to get robust estimates before touching the test set
- Plot learning curves when you’re unsure what’s happening
The bias-variance tradeoff is the conceptual foundation for all of this. Once you internalize it, model debugging becomes much more systematic: you’re not guessing — you’re following the signal.