Batch Normalization: Why It Works and When It Doesn’t

Batch normalization (BatchNorm) is one of the most influential techniques in deep learning. Introduced in 2015, it made training deep networks dramatically easier. Almost overnight, practitioners could train networks that previously wouldn’t converge.

But the why behind BatchNorm is murkier than most people realize. The original explanation was wrong (or at least incomplete). And BatchNorm has real limitations that have driven the development of alternatives.

What BatchNorm Does

The mechanics are straightforward. For each mini-batch during training:

Compute batch statistics — mean (μ) and variance (σ²) of activations across the batch dimension
Normalize — subtract mean, divide by standard deviation
Scale and shift — apply learned parameters γ (scale) and β (shift)

def batch_norm(x, gamma, beta, eps=1e-5):
    # x shape: (batch_size, features)
    mean = x.mean(dim=0)
    var = x.var(dim=0)
    x_norm = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_norm + beta

The γ and β parameters are learned during training. They let the network undo the normalization if that’s optimal — the network has at least as much representational power as without BatchNorm.

At inference time, you don’t have a batch to compute statistics from. Instead, you use running averages of mean and variance accumulated during training.

The Original Explanation (And Why It’s Wrong)

The 2015 paper argued BatchNorm works by reducing internal covariate shift — the idea that as earlier layers change during training, the input distribution to later layers shifts, making them chase a moving target.

By normalizing each layer’s inputs, you stabilize these distributions, letting each layer learn independently.

It’s an intuitive story. It’s also not well-supported by evidence.

A 2018 paper (“How Does Batch Normalization Help Optimization?”) showed:

Adding random noise after BatchNorm (deliberately re-introducing covariate shift) barely hurts performance
Networks without internal covariate shift don’t necessarily train faster
BatchNorm’s main effect is smoothing the loss landscape

What Actually Happens

The current understanding is that BatchNorm primarily:

Smooths the Optimization Landscape

BatchNorm makes the loss function smoother — fewer sharp peaks, gentler gradients, more predictable optimization. This means:

Gradient descent takes more reliable steps
You can use larger learning rates without diverging
The optimizer is less likely to get stuck in sharp local minima

Provides Implicit Regularization

Each sample sees different batch statistics depending on which other samples are in its mini-batch. This noise acts like a mild regularizer — similar to dropout but through a different mechanism.

This is why BatchNorm sometimes reduces the need for dropout, and why it works differently with different batch sizes.

Enables Higher Learning Rates

This is perhaps the most practical benefit. Without BatchNorm, you often need small learning rates to avoid divergence. With it, you can safely use 10x or even 100x larger learning rates, dramatically speeding up training.

When BatchNorm Fails

Small Batch Sizes

BatchNorm estimates population statistics from the batch. With batch size 2, those estimates are terrible. Rule of thumb:

Batch size ≥ 32: BatchNorm works well
Batch size 8–31: Performance degrades
Batch size < 8: Use something else

This is a real constraint for tasks requiring large models where batch size is limited by GPU memory.

Sequence Models (RNNs, Transformers)

BatchNorm normalizes across the batch dimension. In sequence models, this means normalizing across different positions in different sequences — mixing statistics from token 1 of sequence A with token 50 of sequence B. This is semantically wrong.

That’s why transformers use Layer Normalization instead:

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across features (not batch)
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True)
    x_norm = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_norm + beta

LayerNorm normalizes each sample independently — no batch dependency.

Training-Inference Mismatch

Running statistics accumulated during training may not match the data seen at inference. This causes subtle bugs:

Fine-tuning on a small dataset → running statistics from pre-training dominate
Domain shift → training statistics don’t apply
Small datasets → running statistics are noisy

Distributed Training

When training across multiple GPUs, should BatchNorm compute statistics per-GPU or across all GPUs? Per-GPU means each GPU sees different batch statistics. Synchronized BatchNorm fixes this but adds communication overhead.

The Alternatives

Method	Normalizes Across	Batch Dependent?	Best For
BatchNorm	Batch	Yes	CNNs, large batches
LayerNorm	Features	No	Transformers, RNNs
GroupNorm	Feature groups	No	CNNs, small batches
InstanceNorm	Spatial (per sample)	No	Style transfer
RMSNorm	Features (no centering)	No	LLMs (simpler LayerNorm)

GroupNorm splits channels into groups and normalizes within each group. Works with any batch size. Slightly worse than BatchNorm with large batches, much better with small ones.

RMSNorm (Root Mean Square Normalization) skips the centering step of LayerNorm — just divides by the RMS. Used in LLaMA, Gemma, and other recent LLMs because it’s simpler and empirically just as effective.

Practical Guidelines

CNNs with batch size ≥ 32 → BatchNorm (still the default)
Transformers → LayerNorm or RMSNorm
Small batch sizes → GroupNorm
Style transfer / image generation → InstanceNorm
If you’re unsure → LayerNorm (works everywhere, no batch dependency)

When debugging a model that won’t train:

Check that BatchNorm is in training mode during training and eval mode during inference
Verify running statistics are being updated (not frozen from a pretrained checkpoint)
Consider whether your effective batch size is too small

The Takeaway

BatchNorm is a hack that works phenomenally well for reasons we didn’t fully understand when we started using it. The original explanation was elegant but incomplete. The real picture — loss landscape smoothing and implicit regularization — is less tidy but more accurate.

Use BatchNorm where it works (CNNs, large batches). Use its relatives where it doesn’t. And remember that the “right” normalization technique is the one that makes your specific model train well — theory provides guidance, but empirical results decide.

Batch Normalization: Why It Works and When It Doesn't