Dimensionality Reduction: PCA, t-SNE, UMAP, and When to Use Each

High-dimensional data is everywhere in AI. Word embeddings have 768+ dimensions. Image features span thousands. Genomic data can have millions. Humans can visualize three dimensions, maybe four with color. Dimensionality reduction bridges this gap.

But it’s not just about visualization. Reducing dimensions can improve model performance (by removing noise), speed up computation, and reveal structure that’s invisible in the original space.

The catch: every dimensionality reduction technique makes trade-offs. Understanding those trade-offs is the difference between insight and misleading pretty pictures.

The Curse of Dimensionality

Before diving into techniques, understand why high dimensions are problematic.

In high-dimensional spaces, intuitions break down:

Distances converge. As dimensions increase, the ratio of the nearest to farthest neighbor approaches 1. Everything is roughly equally far from everything else.
Volume concentrates on the surface. In a 100-dimensional unit sphere, virtually all the volume is in a thin shell near the surface.
Data becomes sparse. To maintain the same density as 10 points in 1D, you need 10^100 points in 100D.

These aren’t abstract concerns. They directly impact k-nearest-neighbor algorithms, clustering, and density estimation.

PCA: Principal Component Analysis

PCA is the workhorse of dimensionality reduction. It finds the directions of maximum variance in your data and projects onto them.

How It Works

Center the data (subtract the mean)
Compute the covariance matrix
Find eigenvalues and eigenvectors of the covariance matrix
Sort eigenvectors by eigenvalue (largest = most variance)
Project data onto the top-k eigenvectors

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# How much variance is explained?
print(pca.explained_variance_ratio_)
# e.g., [0.72, 0.15] → first two components explain 87%

Strengths

Linear and deterministic — same input always gives same output
Fast — O(min(n²p, np²)) where n=samples, p=features
Interpretable — each component is a linear combination of original features
Preserves global structure — maximizes variance, which often preserves distances
Explained variance — tells you exactly how much information each component captures

Weaknesses

Only captures linear relationships — misses curves, manifolds, nonlinear structure
Sensitive to scale — always standardize features first
Assumes variance = importance — not always true (noise can have high variance)

When to Use PCA

Preprocessing before modeling (reduce noise, speed up training)
Quick exploration of high-dimensional data
When you need reproducibility — no random initialization
When global structure matters more than local neighborhoods
Feature compression where interpretability matters

Variants Worth Knowing

Kernel PCA: Applies the kernel trick for nonlinear dimensionality reduction. Choose RBF, polynomial, or custom kernels.
Incremental PCA: Processes data in mini-batches for large datasets that don’t fit in memory.
Sparse PCA: Produces components with few non-zero coefficients for interpretability.

t-SNE: t-Distributed Stochastic Neighbor Embedding

t-SNE is designed specifically for visualization. It preserves local structure — nearby points in high dimensions stay nearby in low dimensions.

How It Works

Compute pairwise similarities in high-dimensional space (using Gaussian kernels)
Define pairwise similarities in low-dimensional space (using t-distributions)
Minimize the KL divergence between these two distributions via gradient descent

The t-distribution in the low-dimensional space has heavier tails than a Gaussian, which helps prevent the “crowding problem” — where points in the center of a cluster get crushed together.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

The Perplexity Parameter

Perplexity roughly controls the number of effective neighbors each point considers. It’s the most important hyperparameter:

Low perplexity (5-10): Focuses on very local structure. May fragment clusters.
Medium perplexity (30-50): Good default. Balances local and semi-global structure.
High perplexity (100+): More global structure, but can wash out fine detail.

Critical rule: Always try multiple perplexity values. A single t-SNE plot can be misleading.

Common Misinterpretations

t-SNE is one of the most commonly misused tools in data science. Key warnings:

Cluster sizes are meaningless. t-SNE can expand dense clusters and compress sparse ones. Don’t interpret cluster size as density.
Distances between clusters are meaningless. Two clusters being far apart in t-SNE doesn’t mean they’re far apart in the original space.
Random seeds matter. Different initializations can produce qualitatively different plots. Run multiple times.
Don’t use t-SNE features for modeling. The embedding is optimized for visualization, not for preserving properties useful for downstream tasks.
It’s non-parametric. You can’t transform new points without rerunning on the full dataset.

When to Use t-SNE

Visualizing clusters in high-dimensional data
Exploring embeddings (word2vec, BERT features, image features)
Quality checking clusters found by other algorithms
When you need beautiful, publication-ready scatter plots of high-dimensional data

UMAP: Uniform Manifold Approximation and Projection

UMAP is the newer alternative to t-SNE. It’s faster, preserves more global structure, and can be used for more than just visualization.

How It Works (Simplified)

UMAP constructs a topological representation of the high-dimensional data using fuzzy simplicial sets (a generalization of graphs), then optimizes a low-dimensional representation to have the most similar topological structure.

In practice:

Build a weighted k-nearest-neighbor graph
Construct fuzzy simplicial sets from this graph
Optimize the low-dimensional layout to minimize cross-entropy between the high-dimensional and low-dimensional fuzzy sets

import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
X_embedded = reducer.fit_transform(X)

Key Parameters

n_neighbors: Similar to perplexity. Controls local vs. global structure. Low values (5-15) emphasize local structure; high values (50-200) capture more global patterns.
min_dist: Controls how tightly points cluster. Low values (0.0-0.1) create tight clusters; high values (0.5-1.0) spread points more evenly.
metric: Distance metric in the original space. Euclidean is default, but cosine, Manhattan, and custom metrics are supported.

UMAP vs. t-SNE

Aspect	t-SNE	UMAP
Speed	Slow (O(n²) or O(n log n) with BH)	Fast (10-100x faster)
Global structure	Poor	Better
Scalability	Struggles above 100K points	Handles millions
Reproducibility	Variable	More consistent
New data points	Must recompute everything	Can transform new points
Non-2D embeddings	Theoretically possible, rarely used	Works well in any dimension
Downstream use	Visualization only	Can use for clustering, classification

When to Use UMAP

Large datasets where t-SNE is too slow
When global structure matters — UMAP preserves it better
As a preprocessing step before clustering (UMAP + HDBSCAN is a powerful combination)
When you need to embed new points without recomputing
Non-visualization dimensionality reduction (reducing to 10-50 dims for downstream models)

Decision Framework

Here’s how to choose:

Use PCA when:

You want a quick, interpretable overview
Preprocessing for another algorithm
Data relationships are roughly linear
You need explained variance statistics
Reproducibility is critical

Use t-SNE when:

You’re making publication-quality visualizations
Dataset is small-medium (<50K points)
You only care about local neighborhood structure
You want the most visually distinct clusters

Use UMAP when:

Dataset is large (50K+ points)
You want both local and global structure
You might use the reduced dimensions for downstream tasks
You need to transform new data points
Speed matters

The Practical Sequence

For exploratory analysis, use all three:

PCA first — quick sanity check, see explained variance, identify outliers
UMAP — faster, preserves more structure, good for initial cluster exploration
t-SNE — if you need the most visually crisp cluster separation for a specific subset

Advanced Techniques

Parametric UMAP

Train a neural network to learn the UMAP mapping. Benefits: GPU-accelerated, handles new points natively, can be incorporated into end-to-end models.

Supervised Dimensionality Reduction

Both UMAP and some PCA variants accept label information:

reducer = umap.UMAP(n_components=2)
X_embedded = reducer.fit_transform(X, y=labels)  # semi-supervised

This biases the embedding to separate classes, which can dramatically improve visualization and downstream clustering.

TriMAP and PaCMAP

Newer methods (2020-2023) that explicitly balance local and global structure preservation. Worth trying if UMAP’s global structure preservation isn’t sufficient for your use case.

Common Pitfalls

Not standardizing before PCA. Features on different scales will dominate the first components. Always StandardScaler first.
Over-interpreting t-SNE. Repeat it: cluster sizes and distances between clusters are meaningless in t-SNE.
Using default parameters. All three methods benefit from parameter tuning. Defaults are reasonable but rarely optimal.
Forgetting the Johnson-Lindenstrauss lemma. For distance-preserving reduction, random projections can be surprisingly effective and much faster.
Visualizing without context. A beautiful 2D plot is useless without coloring by meaningful labels or features.

Key Takeaways

PCA is fast, linear, interpretable, and preserves global variance
t-SNE creates beautiful visualizations with clear cluster separation but doesn’t preserve global structure
UMAP is the modern default — fast, preserves both local and global structure, works beyond visualization
Always try multiple parameter settings and don’t over-interpret any single plot
PCA + UMAP is a powerful combination: PCA to 50-100 dims first, then UMAP to 2D
The best technique depends on your goal: exploration, visualization, preprocessing, or downstream modeling

Dimensionality Reduction: PCA, t-SNE, UMAP, and When to Use Each

Dimensionality Reduction: PCA, t-SNE, UMAP, and When to Use Each

The Curse of Dimensionality

PCA: Principal Component Analysis

How It Works

Strengths

Weaknesses

When to Use PCA

Variants Worth Knowing

t-SNE: t-Distributed Stochastic Neighbor Embedding

How It Works

The Perplexity Parameter

Common Misinterpretations

When to Use t-SNE

UMAP: Uniform Manifold Approximation and Projection

How It Works (Simplified)

Key Parameters

UMAP vs. t-SNE

When to Use UMAP

Decision Framework

Use PCA when:

Use t-SNE when:

Use UMAP when:

The Practical Sequence

Advanced Techniques

Parametric UMAP

Supervised Dimensionality Reduction

TriMAP and PaCMAP

Common Pitfalls

Key Takeaways

Related reads

Stay ahead of the AI curve