Residual Connections: The Simple Idea That Made Deep Learning Deep
Why residual connections work, how they solve the degradation problem, their mathematical properties, and their role in everything from ResNets to transformers.
View all deep learning depths βDepth ladder for this topic:
Residual Connections: The Simple Idea That Made Deep Learning Deep
Before 2015, making neural networks deeper often made them worse. Not because of overfitting β deeper networks had higher training error. Adding layers to a network that could already represent the solution somehow made it harder to find that solution.
Residual connections fixed this. The idea is almost embarrassingly simple: instead of learning a function H(x), learn the residual F(x) = H(x) - x, and compute the output as F(x) + x.
One addition operation. It changed everything.
The Degradation Problem
Consider stacking layers in a deep network. In theory, a 56-layer network should perform at least as well as a 20-layer network β the extra 36 layers could just learn identity mappings (pass input through unchanged).
In practice, the 56-layer network performed worse on both training and test data. The optimizer couldnβt find solutions as good as the shallower network, even though better solutions existed.
This isnβt overfitting. The deeper network underfits β it has higher training loss. The problem is optimization, not capacity.
The Residual Solution
A residual block computes:
y = F(x) + x
Where F(x) is whatever the block learns (convolutions, normalization, activations) and x is the input passed directly through.
class ResidualBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn1 = nn.BatchNorm2d(channels)
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn2 = nn.BatchNorm2d(channels)
def forward(self, x):
residual = x
out = F.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
out = out + residual # The skip connection
out = F.relu(out)
return out
The key insight: if the optimal transformation is close to identity, itβs easier to learn F(x) β 0 (push small residuals toward zero) than to learn H(x) β x (learn the identity function from scratch through multiple nonlinear layers).
Why It Works: Three Perspectives
1. Gradient Flow
During backpropagation, the gradient through a residual block is:
βy/βx = βF(x)/βx + 1
That β+1β is crucial. It means the gradient always has a component of at least 1 flowing directly through the skip connection. Even if the layersβ gradients vanish (βF/βx β 0), information still flows.
In a network with L residual blocks, the gradient from the loss to any intermediate layer passes through multiple additive paths, not a single multiplicative chain. This dramatically reduces the vanishing gradient problem.
2. Ensemble Interpretation
A remarkable observation: a network with N residual blocks behaves like an ensemble of 2^N paths of different lengths. Each residual block creates a fork β the signal can go through the block or skip it.
Experiments showed that:
- Removing a single block barely hurts performance (many other paths compensate)
- Most of the gradient flows through short paths (3-5 blocks)
- The network is resilient to individual block failures
This is fundamentally different from a plain network, where removing any layer destroys the function.
3. Loss Landscape Smoothing
Residual connections smooth the loss landscape, similar to batch normalization. The skip connections prevent the loss surface from having the sharp, irregular geometry that makes optimization difficult in deep plain networks.
Smoother landscapes mean:
- Gradient descent takes more reliable steps
- The optimizer converges faster
- Solutions generalize better (smooth minima vs. sharp minima)
Variants
Pre-activation ResNet
The original ResNet applies activation after the addition: ReLU(F(x) + x). Pre-activation ResNet applies batch norm and ReLU before the convolutions:
def forward(self, x):
residual = x
out = self.conv1(F.relu(self.bn1(x)))
out = self.conv2(F.relu(self.bn2(out)))
return out + residual
This gives cleaner gradient flow (the skip connection is a pure identity, not passing through ReLU) and slightly better performance.
Bottleneck Blocks
For deeper networks, a 1Γ1 β 3Γ3 β 1Γ1 convolution pattern reduces computation:
class Bottleneck(nn.Module):
def __init__(self, in_channels, bottleneck_channels):
super().__init__()
self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1) # Reduce
self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1) # Process
self.conv3 = nn.Conv2d(bottleneck_channels, in_channels, 1) # Expand
def forward(self, x):
out = F.relu(self.bn1(self.conv1(x)))
out = F.relu(self.bn2(self.conv2(out)))
out = self.bn3(self.conv3(out))
return F.relu(out + x)
Dense Connections (DenseNet)
Instead of adding the input to the output, DenseNet concatenates all previous layersβ outputs. Each layer receives feature maps from all preceding layers. More feature reuse, but higher memory cost.
Transformer Residual Connections
Every transformer block uses residual connections:
x = x + MultiHeadAttention(LayerNorm(x))
x = x + FeedForward(LayerNorm(x))
Without skip connections, transformers canβt be trained deep. The entire LLM revolution rests on residual connections enabling 100+ layer transformers.
Practical Implications
When Dimensions Donβt Match
Skip connections require input and output to have the same shape. When dimensions change (e.g., downsampling), use a projection:
if x.shape != out.shape:
x = self.projection(x) # 1x1 conv to match dimensions
return out + x
Scaling the Residual
Some architectures scale the residual before adding:
return out + alpha * x # alpha < 1 for very deep networks
This helps with initialization stability in extremely deep networks. GPT-style models scale residual connections by 1/βN where N is the number of layers.
Initialization
With residual connections, proper initialization matters less (the skip connection provides a reasonable starting point). But improper initialization of F(x) can still cause issues. Initialize the last layer of each residual block to zero so that F(x) starts at zero and the block starts as an identity:
nn.init.zeros_(self.conv2.weight)
Where Residual Connections Appear
- ResNet (2015) β the original. Enabled training of 152-layer CNNs.
- Transformers (2017) β every attention and FFN block
- U-Net β skip connections between encoder and decoder
- Diffusion models β U-Net backbone with skip connections
- LLMs β GPT, LLaMA, Gemma, Claude β all use residual connections in every layer
Itβs not an exaggeration to say that residual connections are the most important architectural innovation in deep learning. Without them, we canβt train the deep networks that power modern AI.
The lesson is humbling: sometimes the biggest breakthroughs are the simplest ideas. Just add x.
Simplify
β Deep Learning Regularization Techniques: A Practical Guide
Go deeper
From RNNs to Transformers: The Architecture Shift That Changed AI β
Related reads
Stay ahead of the AI curve
Weekly insights on AI β explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.