Attention Mechanisms in Deep Learning: A Visual Guide

Attention is arguably the single most important mechanism in modern deep learning. It’s what allows transformers to process sequences efficiently, capture long-range dependencies, and scale to billions of parameters. This guide explains how it works, building from intuition to implementation.

The Core Intuition

When you read a sentence like “The cat sat on the mat because it was tired,” you instantly know that “it” refers to “the cat.” You’re attending to the relevant earlier word to resolve the reference.

Attention in neural networks does something similar: for each element in a sequence, it computes a relevance score against every other element and uses those scores to create a weighted combination. Elements that are more relevant get more weight.

Before attention, recurrent networks (RNNs, LSTMs) processed sequences one step at a time, passing information through a hidden state. This bottleneck meant distant information got diluted or lost. Attention lets every position directly look at every other position, solving the long-range dependency problem.

Queries, Keys, and Values

The attention mechanism uses three concepts borrowed from information retrieval:

Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I provide?

For each position in the sequence, the model creates a query vector, a key vector, and a value vector by multiplying the input embedding by learned weight matrices.

Q = X · W_Q
K = X · W_K
V = X · W_V

The attention computation then:

Compares each query against all keys (dot product)
Scales the result (divide by √d_k)
Applies softmax to get weights that sum to 1
Multiplies weights by values to get the output

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

Step by Step Example

Consider three words: “The cat sleeps”

Each word has an embedding vector. Let’s say they’re 4-dimensional (real models use 768 or more):

The  → [0.1, 0.3, 0.5, 0.2]
cat  → [0.8, 0.1, 0.4, 0.9]
sleeps → [0.3, 0.7, 0.2, 0.6]

Step 1: Create Q, K, V

Each embedding gets multiplied by learned weight matrices to produce query, key, and value vectors. These matrices are the learnable parameters of the attention layer.

Step 2: Compute attention scores

For the word “sleeps” as the query, compute dot products with all keys:

score(sleeps, The)  = q_sleeps · k_The  = 2.1
score(sleeps, cat)  = q_sleeps · k_cat  = 5.7
score(sleeps, sleeps) = q_sleeps · k_sleeps = 3.3

Step 3: Scale and softmax

Divide by √d_k and apply softmax:

weights = softmax([2.1, 5.7, 3.3] / √4)
       = softmax([1.05, 2.85, 1.65])
       ≈ [0.10, 0.65, 0.25]

“sleeps” attends most strongly to “cat” (weight 0.65), which makes sense — the subject is the most relevant context for the verb.

Step 4: Weighted sum of values

output_sleeps = 0.10 · v_The + 0.65 · v_cat + 0.25 · v_sleeps

The output for “sleeps” is now a blend of information from all positions, weighted by relevance.

Why Scale by √d_k?

Without scaling, dot products between high-dimensional vectors can be very large. Large values pushed through softmax produce extremely peaked distributions — almost one-hot, where only one position gets all the attention. Dividing by √d_k keeps values in a range where softmax produces useful, distributed weights.

Self-Attention vs Cross-Attention

Self-attention: Q, K, and V all come from the same sequence. Every position attends to every other position in the same input. This is what transformer encoder layers use.

Cross-attention: Q comes from one sequence, K and V from another. Used in encoder-decoder architectures where the decoder attends to the encoder’s output. Also used in multimodal models where text attends to image features.

Multi-Head Attention

A single attention head captures one type of relationship. Multi-head attention runs several attention computations in parallel, each with its own Q, K, V weight matrices:

head_i = Attention(X · W_Q_i, X · W_K_i, X · W_V_i)
MultiHead = Concat(head_1, ..., head_h) · W_O

Different heads learn to attend to different things:

One head might capture syntactic relationships (subject-verb)
Another might capture semantic similarity
Another might track positional proximity
Another might handle coreference (pronoun resolution)

GPT-4 class models typically use 32-128 attention heads per layer, each operating on a subset of the embedding dimensions.

Attention Patterns

When you visualize attention weights as heatmaps, recognizable patterns emerge:

Diagonal pattern: Each token attends to itself or nearby positions. Common in lower layers that process local features.

Vertical stripes: Many tokens attend to the same position. Often seen for important structural tokens like [CLS] or sentence-initial words.

Block diagonal: Groups of tokens attend within clusters. Common in structured text like lists or code.

Long-range connections: Specific tokens attend to distant tokens. This is where attention shines — capturing dependencies that RNNs struggle with.

The Quadratic Problem

Standard attention computes pairwise scores for every position against every other position. For a sequence of length n, that’s n² comparisons. A 4,096-token sequence requires ~16 million attention computations per layer.

This quadratic scaling is why context windows were historically limited. It’s also why there’s been intense research into efficient attention variants:

Sparse attention (BigBird, Longformer): Only compute attention for a subset of position pairs — local windows plus selected global positions.

Linear attention (Performer, RWKV): Approximate the softmax attention with kernel functions that can be computed in linear time.

Flash Attention: Doesn’t change the computation, but restructures memory access patterns to dramatically speed up standard attention on GPUs.

Ring Attention: Distributes attention computation across multiple devices for very long sequences.

Grouped Query Attention (GQA)

Standard multi-head attention assigns separate K and V matrices to each head. GQA groups heads together, sharing K and V across groups. This reduces memory usage (especially during inference) with minimal quality loss.

Llama 2 70B and most large models since 2023 use GQA. It’s a practical optimization that lets you run more attention heads without proportionally increasing memory requirements.

Attention in Practice

Why Attention Works So Well

Parallelizable — unlike RNNs, all positions can be computed simultaneously
Direct connections — every position can directly access every other, no information bottleneck
Learnable relevance — the model learns what’s relevant, not the programmer
Composable — stacking layers creates hierarchical attention patterns

Where Attention Struggles

Fixed context window — can only attend to what’s in the current window
No inherent position sense — needs positional encodings to understand order
Quadratic cost — limits practical context lengths without approximations
Attention sinks — models sometimes waste attention on irrelevant tokens (like the first token) due to softmax mechanics

Beyond Standard Attention

Research continues to push attention in new directions:

Mixture of Attention — different layers use different attention patterns
State space models (Mamba) — replace attention with recurrent mechanisms that scale linearly while maintaining quality
Differential attention — computing the difference between two attention patterns to reduce noise
Native long-context — architectures designed from scratch for very long sequences

Attention isn’t the final word in sequence modeling, but it’s the current foundation. Understanding it well gives you the vocabulary to understand what comes next.