Activation Functions: Why Neural Networks Need Nonlinearity

Here’s a surprising fact: a neural network with 100 layers but no activation functions is mathematically equivalent to a single-layer network. All those layers of matrix multiplication collapse into one. Activation functions are what prevent this collapse — they’re the ingredient that makes deep learning deep.

The Core Idea

A neural network layer computes: output = activation(weights × input + bias)

Without the activation function, each layer is a linear transformation. Stacking linear transformations gives you… another linear transformation. The activation function introduces nonlinearity, which means each additional layer can represent increasingly complex patterns.

Think of it this way: linear functions can only draw straight lines (or hyperplanes). Real-world data — images, language, sound — has complex, curved decision boundaries. Activation functions let neural networks bend those boundaries.

The Important Activation Functions

ReLU (Rectified Linear Unit)

f(x) = max(0, x)

The most widely used activation function in deep learning. It’s dead simple: negative values become zero, positive values pass through unchanged.

Why it works: Fast to compute, doesn’t saturate for positive values (so gradients flow well), and introduces sufficient nonlinearity. Networks with ReLU train faster than those with sigmoid or tanh.

The problem: “Dying ReLU” — if a neuron’s output is consistently negative, its gradient is always zero and it stops learning entirely. In practice, 10–20% of neurons can die during training.

Leaky ReLU and Variants

f(x) = x if x > 0, else α × x  (where α is small, like 0.01)

Fixes the dying ReLU problem by allowing a small gradient for negative values. PReLU makes α learnable. ELU uses an exponential curve for negative values, producing smoother gradients.

In practice, Leaky ReLU or its variants are a safe default choice for most architectures.

GELU (Gaussian Error Linear Unit)

f(x) = x × Φ(x)  (where Φ is the standard normal CDF)

The activation function behind transformers. Used in BERT, GPT, and most modern language models. Unlike ReLU’s hard cutoff at zero, GELU smoothly curves near zero, which appears to help optimization in attention-based architectures.

If you’re building anything transformer-based, GELU is the standard choice.

SiLU / Swish

f(x) = x × σ(x)  (where σ is the sigmoid function)

Discovered through automated search by Google. Similar to GELU in shape but slightly different in practice. Used in many vision models (EfficientNet, some diffusion models). Self-gated — the input modulates itself.

Sigmoid

f(x) = 1 / (1 + e^(-x))

Maps any value to the range (0, 1). Was the original activation function for neural networks. Rarely used in hidden layers today because of the vanishing gradient problem — for large positive or negative inputs, the gradient is nearly zero, making deep networks hard to train.

Still used in output layers for binary classification (produces a probability) and in gates within LSTM and GRU cells.

Softmax

f(x_i) = e^(x_i) / Σ(e^(x_j))

Not an element-wise function — it operates on a vector and produces a probability distribution. Used almost exclusively as the final activation for multi-class classification. Ensures outputs sum to 1 and are all positive.

How to Choose

Architecture	Recommended
Transformer / attention	GELU
CNN (classification)	ReLU or Leaky ReLU
CNN (generation/diffusion)	SiLU/Swish
RNN/LSTM gates	Sigmoid + Tanh
Output (binary)	Sigmoid
Output (multi-class)	Softmax
Output (regression)	None (linear)
Default / unsure	Leaky ReLU

Impact on Training

Activation functions affect training dynamics more than most people realize:

Gradient flow — Saturating functions (sigmoid, tanh) compress gradients for large inputs. This is why deep networks with sigmoid activations were nearly impossible to train before ReLU was popularized. Modern activations are designed to maintain gradient magnitude across layers.

Initialization sensitivity — The choice of activation function determines the optimal weight initialization. ReLU networks use He initialization (variance = 2/n). Sigmoid/tanh networks use Xavier/Glorot initialization (variance = 1/n). Using the wrong initialization with a given activation can make training fail from the start.

Training speed — ReLU and its variants compute faster than sigmoid or GELU because they avoid exponentials. For large models, this computational savings adds up. GELU is slightly more expensive but the quality improvement for transformers justifies the cost.

The Practical Reality

For most practitioners, activation function choice is a solved problem:

Use whatever the reference architecture uses (GELU for transformers, ReLU for CNNs)
If training shows dying neurons, switch to Leaky ReLU
Don’t overthink it — the difference between modern activation functions is usually less than 1% on final performance

The activation function is important to understand because it’s fundamental to how neural networks work. But it’s rarely the thing you need to optimize. Your time is better spent on data quality, architecture design, and training procedures.