Image AI: Understanding Vision Transformers (ViTs)
A technical deep dive into Vision Transformers—how they work, why they overtook CNNs, key architectural variants, and practical considerations for deploying ViTs in production.
View all image ai depths →Depth ladder for this topic:
From CNNs to Transformers
For nearly a decade, convolutional neural networks (CNNs) dominated computer vision. AlexNet, VGG, ResNet, EfficientNet—each generation improved accuracy while building on the same core idea: learnable filters that slide across images, exploiting spatial locality and translation invariance.
Then, in 2020, Google Brain published “An Image is Worth 16x16 Words.” The Vision Transformer (ViT) applied the transformer architecture—originally designed for language—directly to images. The result challenged the assumption that convolutions were essential for visual understanding.
By 2026, transformers have become the default architecture for most vision tasks, though the story is more nuanced than “transformers replaced CNNs.”
How Vision Transformers Work
The Core Idea
ViT treats an image as a sequence of patches, just as a language transformer treats text as a sequence of tokens:
- Split the image into fixed-size patches (typically 16×16 or 14×14 pixels)
- Flatten each patch into a vector
- Project each flattened patch through a linear embedding layer
- Add positional embeddings to retain spatial information
- Prepend a learnable [CLS] token (used for classification)
- Process the sequence through standard transformer encoder layers (multi-head self-attention + MLP)
- Classify using the [CLS] token’s output representation
Patch Embedding
For a 224×224 image with 16×16 patches, you get 196 patches (14×14 grid). Each patch is a 16×16×3 = 768-dimensional vector (for RGB). A linear projection maps this to the model’s hidden dimension.
# Simplified patch embedding
patches = image.reshape(B, num_patches, patch_size*patch_size*channels)
embeddings = linear_projection(patches) # (B, 196, hidden_dim)
Why Patches Instead of Pixels?
Processing individual pixels as tokens would create sequences of 50,176 elements for a 224×224 image. Self-attention has O(n²) complexity—this is computationally prohibitive. Patches reduce the sequence length to a manageable size (196 for 16×16 patches).
Positional Embeddings
Unlike CNNs, transformers have no built-in notion of spatial position. Positional embeddings are added to patch embeddings to encode where each patch came from. Options:
- Learned absolute positions: A learnable embedding for each patch position. Simple and effective.
- 2D-aware positions: Separate row and column embeddings, combined additively
- Relative position biases: Encode relative distances between patches in the attention computation (used in Swin Transformer)
- Rotary Position Embeddings (RoPE): Adapted from NLP, increasingly used in vision transformers
Self-Attention Over Patches
Each patch attends to every other patch. This is the key advantage over CNNs: the receptive field is global from the first layer. A patch in the top-left corner can attend to a patch in the bottom-right corner without relying on stacked layers to propagate information.
The attention pattern learns task-relevant relationships. For object classification, early layers attend to local neighborhoods (mimicking convolutions), while deeper layers attend to semantically related patches across the image.
Key Architectural Variants
DeiT (Data-efficient Image Transformers)
The original ViT required massive datasets (ImageNet-21K or JFT-300M) to train well. DeiT from Facebook/Meta introduced:
- Knowledge distillation: A teacher CNN (RegNet) trains the ViT student
- Distillation token: An additional learnable token that learns from the teacher’s output
- Strong augmentation: RandAugment, Mixup, CutMix, random erasing
Result: ViT competitive with CNNs when trained on ImageNet-1K alone.
Swin Transformer
Addresses ViT’s scalability problem for high-resolution inputs and dense prediction tasks:
- Hierarchical feature maps: Like CNNs, builds features at multiple resolutions by merging patches at each stage
- Shifted window attention: Self-attention computed within local windows, with windows shifting between layers to enable cross-window connections
- Linear complexity: O(n) instead of O(n²) with respect to image size
Swin became the backbone of choice for object detection, segmentation, and other dense prediction tasks.
BEiT and MAE (Self-Supervised Pre-Training)
- BEiT: Masks patches and trains the model to predict discrete visual tokens (from a dVAE codebook). The BERT of vision.
- MAE (Masked Autoencoder): Masks a large fraction (75%) of patches and trains the model to reconstruct pixel values. Remarkably effective and computationally efficient—masked patches are dropped entirely during encoding.
These self-supervised methods reduced dependence on labeled data and became the standard pre-training approach for vision transformers.
DINOv2
Self-supervised learning without labels, producing universal visual features. DINOv2 models trained on curated unlabeled data produce features that transfer well across tasks without fine-tuning. The features are so good that a simple linear classifier on top of frozen DINOv2 features matches fine-tuned models on many benchmarks.
EVA and InternImage
Scaled vision transformers to billions of parameters, pre-trained on web-scale data. These models achieve state-of-the-art results across vision benchmarks and serve as powerful feature extractors for downstream tasks.
ViTs vs. CNNs: The Real Trade-offs
Where ViTs Win
- Large-scale pre-training: ViTs scale better with more data and compute
- Global context: Self-attention captures long-range dependencies that CNNs need deep stacking to achieve
- Transfer learning: Pre-trained ViTs transfer better across diverse tasks
- Multimodal integration: Transformer architecture unifies vision and language (CLIP, LLaVA, etc.)
- Flexibility: Same architecture for classification, detection, segmentation, generation
Where CNNs Still Compete
- Data efficiency: CNNs learn useful features from less data due to built-in inductive biases (locality, translation invariance)
- Inference speed: Convolutions are highly optimized on hardware; attention has higher constant factors
- Mobile and edge: Lightweight CNNs (MobileNet, EfficientNet) still dominate on resource-constrained devices
- Real-time applications: When latency budgets are tight, efficient CNNs often win
The Hybrid Approach
Many state-of-the-art architectures combine convolutions and attention:
- CoAtNet: Convolutional stem with transformer body
- ConvNeXt: A CNN designed with transformer-era principles (larger kernels, fewer normalization tricks) that matches ViT performance
- EfficientFormer: Designed for mobile deployment, mixing conv and attention blocks
Training Vision Transformers
Data Requirements
ViTs are data-hungry. Without sufficient data, they overfit:
- Small datasets (<10K images): Use pre-trained models, fine-tune carefully
- Medium datasets (10K-1M): Pre-train with self-supervised methods (MAE, DINO) on unlabeled data, then fine-tune
- Large datasets (>1M): Train from scratch with strong augmentation
Augmentation Is Critical
Without strong data augmentation, ViTs overfit even on ImageNet. Standard recipe:
- RandAugment (magnitude 9-15)
- Mixup (α=0.8) and CutMix (α=1.0)
- Random erasing (probability 0.25)
- Color jitter and random horizontal flip
Optimization
Standard recipe:
- Optimizer: AdamW (β₁=0.9, β₂=0.999)
- Learning rate: 1e-3 for pre-training, 1e-4 to 1e-5 for fine-tuning
- Schedule: Linear warmup (5-20 epochs) + cosine decay
- Weight decay: 0.05 for pre-training, 0.01 for fine-tuning
- Gradient clipping: Max norm 1.0
Resolution and Patch Size
- Larger patches (32×32): Fewer tokens, faster training, lower accuracy
- Smaller patches (8×8, 14×14): More tokens, slower training, higher accuracy
- Fine-tuning at higher resolution: Train at 224×224, fine-tune at 384×384 or 512×512 for improved accuracy. Interpolate positional embeddings to handle the new sequence length.
Deploying Vision Transformers
Inference Optimization
- Quantization: INT8 quantization reduces model size and speeds up inference with minimal accuracy loss. FP16 is standard for GPU inference.
- Token pruning: Remove uninformative tokens during inference. Many patches (background, uniform regions) contribute little to the prediction.
- Knowledge distillation: Train a smaller ViT or CNN student from a large ViT teacher
- Flash Attention: Optimized attention kernels reduce memory usage and latency
- ONNX/TensorRT: Export and optimize for production serving
Cost Considerations
ViTs are more expensive to serve than equivalent CNNs for the same accuracy level. For high-throughput, cost-sensitive applications (processing millions of images daily), this matters. Profile your specific workload before committing to an architecture.
Batch Processing vs. Real-Time
- Batch processing (image search indexing, content moderation): Use the largest, most accurate model you can afford. Throughput matters more than latency.
- Real-time (video analysis, AR/VR, autonomous vehicles): Latency is critical. Consider smaller models, token pruning, or hybrid architectures.
The Vision Transformer Ecosystem in 2026
The landscape has matured significantly:
- Foundation models (DINOv2, EVA, InternViT) provide powerful pre-trained features that transfer across tasks
- Multimodal models (CLIP, SigLIP) connect vision and language through shared transformer architectures
- Video transformers (VideoMAE, TimeSformer) extend the patch-based approach to temporal sequences
- 3D vision transformers process point clouds and voxels for robotics and spatial computing
ViTs didn’t kill CNNs—they expanded the frontier of what’s possible in computer vision. The best practitioners in 2026 understand both architectures and choose based on their specific constraints: data, compute, latency, and deployment environment.
Simplify
← How Diffusion Models Work: The Science Behind AI Image Generation
Go deeper
Image AI — Frontier Research and Unresolved Problems →
Related reads
Stay ahead of the AI curve
Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.