🟣 Technical 11 min read

Image AI: Understanding Vision Transformers (ViTs)

A technical deep dive into Vision Transformers—how they work, why they overtook CNNs, key architectural variants, and practical considerations for deploying ViTs in production.

View all image ai depths →

From CNNs to Transformers

For nearly a decade, convolutional neural networks (CNNs) dominated computer vision. AlexNet, VGG, ResNet, EfficientNet—each generation improved accuracy while building on the same core idea: learnable filters that slide across images, exploiting spatial locality and translation invariance.

Then, in 2020, Google Brain published “An Image is Worth 16x16 Words.” The Vision Transformer (ViT) applied the transformer architecture—originally designed for language—directly to images. The result challenged the assumption that convolutions were essential for visual understanding.

By 2026, transformers have become the default architecture for most vision tasks, though the story is more nuanced than “transformers replaced CNNs.”

How Vision Transformers Work

The Core Idea

ViT treats an image as a sequence of patches, just as a language transformer treats text as a sequence of tokens:

  1. Split the image into fixed-size patches (typically 16×16 or 14×14 pixels)
  2. Flatten each patch into a vector
  3. Project each flattened patch through a linear embedding layer
  4. Add positional embeddings to retain spatial information
  5. Prepend a learnable [CLS] token (used for classification)
  6. Process the sequence through standard transformer encoder layers (multi-head self-attention + MLP)
  7. Classify using the [CLS] token’s output representation

Patch Embedding

For a 224×224 image with 16×16 patches, you get 196 patches (14×14 grid). Each patch is a 16×16×3 = 768-dimensional vector (for RGB). A linear projection maps this to the model’s hidden dimension.

# Simplified patch embedding
patches = image.reshape(B, num_patches, patch_size*patch_size*channels)
embeddings = linear_projection(patches)  # (B, 196, hidden_dim)

Why Patches Instead of Pixels?

Processing individual pixels as tokens would create sequences of 50,176 elements for a 224×224 image. Self-attention has O(n²) complexity—this is computationally prohibitive. Patches reduce the sequence length to a manageable size (196 for 16×16 patches).

Positional Embeddings

Unlike CNNs, transformers have no built-in notion of spatial position. Positional embeddings are added to patch embeddings to encode where each patch came from. Options:

  • Learned absolute positions: A learnable embedding for each patch position. Simple and effective.
  • 2D-aware positions: Separate row and column embeddings, combined additively
  • Relative position biases: Encode relative distances between patches in the attention computation (used in Swin Transformer)
  • Rotary Position Embeddings (RoPE): Adapted from NLP, increasingly used in vision transformers

Self-Attention Over Patches

Each patch attends to every other patch. This is the key advantage over CNNs: the receptive field is global from the first layer. A patch in the top-left corner can attend to a patch in the bottom-right corner without relying on stacked layers to propagate information.

The attention pattern learns task-relevant relationships. For object classification, early layers attend to local neighborhoods (mimicking convolutions), while deeper layers attend to semantically related patches across the image.

Key Architectural Variants

DeiT (Data-efficient Image Transformers)

The original ViT required massive datasets (ImageNet-21K or JFT-300M) to train well. DeiT from Facebook/Meta introduced:

  • Knowledge distillation: A teacher CNN (RegNet) trains the ViT student
  • Distillation token: An additional learnable token that learns from the teacher’s output
  • Strong augmentation: RandAugment, Mixup, CutMix, random erasing

Result: ViT competitive with CNNs when trained on ImageNet-1K alone.

Swin Transformer

Addresses ViT’s scalability problem for high-resolution inputs and dense prediction tasks:

  • Hierarchical feature maps: Like CNNs, builds features at multiple resolutions by merging patches at each stage
  • Shifted window attention: Self-attention computed within local windows, with windows shifting between layers to enable cross-window connections
  • Linear complexity: O(n) instead of O(n²) with respect to image size

Swin became the backbone of choice for object detection, segmentation, and other dense prediction tasks.

BEiT and MAE (Self-Supervised Pre-Training)

  • BEiT: Masks patches and trains the model to predict discrete visual tokens (from a dVAE codebook). The BERT of vision.
  • MAE (Masked Autoencoder): Masks a large fraction (75%) of patches and trains the model to reconstruct pixel values. Remarkably effective and computationally efficient—masked patches are dropped entirely during encoding.

These self-supervised methods reduced dependence on labeled data and became the standard pre-training approach for vision transformers.

DINOv2

Self-supervised learning without labels, producing universal visual features. DINOv2 models trained on curated unlabeled data produce features that transfer well across tasks without fine-tuning. The features are so good that a simple linear classifier on top of frozen DINOv2 features matches fine-tuned models on many benchmarks.

EVA and InternImage

Scaled vision transformers to billions of parameters, pre-trained on web-scale data. These models achieve state-of-the-art results across vision benchmarks and serve as powerful feature extractors for downstream tasks.

ViTs vs. CNNs: The Real Trade-offs

Where ViTs Win

  • Large-scale pre-training: ViTs scale better with more data and compute
  • Global context: Self-attention captures long-range dependencies that CNNs need deep stacking to achieve
  • Transfer learning: Pre-trained ViTs transfer better across diverse tasks
  • Multimodal integration: Transformer architecture unifies vision and language (CLIP, LLaVA, etc.)
  • Flexibility: Same architecture for classification, detection, segmentation, generation

Where CNNs Still Compete

  • Data efficiency: CNNs learn useful features from less data due to built-in inductive biases (locality, translation invariance)
  • Inference speed: Convolutions are highly optimized on hardware; attention has higher constant factors
  • Mobile and edge: Lightweight CNNs (MobileNet, EfficientNet) still dominate on resource-constrained devices
  • Real-time applications: When latency budgets are tight, efficient CNNs often win

The Hybrid Approach

Many state-of-the-art architectures combine convolutions and attention:

  • CoAtNet: Convolutional stem with transformer body
  • ConvNeXt: A CNN designed with transformer-era principles (larger kernels, fewer normalization tricks) that matches ViT performance
  • EfficientFormer: Designed for mobile deployment, mixing conv and attention blocks

Training Vision Transformers

Data Requirements

ViTs are data-hungry. Without sufficient data, they overfit:

  • Small datasets (<10K images): Use pre-trained models, fine-tune carefully
  • Medium datasets (10K-1M): Pre-train with self-supervised methods (MAE, DINO) on unlabeled data, then fine-tune
  • Large datasets (>1M): Train from scratch with strong augmentation

Augmentation Is Critical

Without strong data augmentation, ViTs overfit even on ImageNet. Standard recipe:

  • RandAugment (magnitude 9-15)
  • Mixup (α=0.8) and CutMix (α=1.0)
  • Random erasing (probability 0.25)
  • Color jitter and random horizontal flip

Optimization

Standard recipe:

  • Optimizer: AdamW (β₁=0.9, β₂=0.999)
  • Learning rate: 1e-3 for pre-training, 1e-4 to 1e-5 for fine-tuning
  • Schedule: Linear warmup (5-20 epochs) + cosine decay
  • Weight decay: 0.05 for pre-training, 0.01 for fine-tuning
  • Gradient clipping: Max norm 1.0

Resolution and Patch Size

  • Larger patches (32×32): Fewer tokens, faster training, lower accuracy
  • Smaller patches (8×8, 14×14): More tokens, slower training, higher accuracy
  • Fine-tuning at higher resolution: Train at 224×224, fine-tune at 384×384 or 512×512 for improved accuracy. Interpolate positional embeddings to handle the new sequence length.

Deploying Vision Transformers

Inference Optimization

  • Quantization: INT8 quantization reduces model size and speeds up inference with minimal accuracy loss. FP16 is standard for GPU inference.
  • Token pruning: Remove uninformative tokens during inference. Many patches (background, uniform regions) contribute little to the prediction.
  • Knowledge distillation: Train a smaller ViT or CNN student from a large ViT teacher
  • Flash Attention: Optimized attention kernels reduce memory usage and latency
  • ONNX/TensorRT: Export and optimize for production serving

Cost Considerations

ViTs are more expensive to serve than equivalent CNNs for the same accuracy level. For high-throughput, cost-sensitive applications (processing millions of images daily), this matters. Profile your specific workload before committing to an architecture.

Batch Processing vs. Real-Time

  • Batch processing (image search indexing, content moderation): Use the largest, most accurate model you can afford. Throughput matters more than latency.
  • Real-time (video analysis, AR/VR, autonomous vehicles): Latency is critical. Consider smaller models, token pruning, or hybrid architectures.

The Vision Transformer Ecosystem in 2026

The landscape has matured significantly:

  • Foundation models (DINOv2, EVA, InternViT) provide powerful pre-trained features that transfer across tasks
  • Multimodal models (CLIP, SigLIP) connect vision and language through shared transformer architectures
  • Video transformers (VideoMAE, TimeSformer) extend the patch-based approach to temporal sequences
  • 3D vision transformers process point clouds and voxels for robotics and spatial computing

ViTs didn’t kill CNNs—they expanded the frontier of what’s possible in computer vision. The best practitioners in 2026 understand both architectures and choose based on their specific constraints: data, compute, latency, and deployment environment.

Simplify

← How Diffusion Models Work: The Science Behind AI Image Generation

Go deeper

Image AI — Frontier Research and Unresolved Problems →

Related reads

image-aivision-transformersvitcomputer-visionarchitecture

Stay ahead of the AI curve

Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.

No spam. Unsubscribe anytime. We respect your inbox.