LLM Quantization Methods Explained

Running large language models at full precision is expensive. A 70B-parameter model in FP16 needs roughly 140GB of GPU memory for the weights alone, which means multiple high-end GPUs just for inference. Quantization compresses model weights to lower-precision formats, sharply reducing memory and compute requirements while preserving most of the model’s capability.

This guide covers the major quantization approaches, when to use each, and the real-world tradeoffs you’ll encounter.

What Quantization Actually Does

Neural network weights are typically stored as 16-bit or 32-bit floating point numbers. Quantization maps these values to lower-precision representations — usually 8-bit, 4-bit, or even 2-bit integers.

The core idea is straightforward: most weight values cluster in a narrow range. You can represent them with fewer bits if you choose the right mapping. The challenge is doing this without destroying the model’s ability to generate coherent, accurate outputs.
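
One common way to write this mapping is uniform affine quantization. The symbols below (scale s, zero point z, bit width b) are notation introduced here for illustration, not taken from any particular library:

```latex
s = \frac{x_{\max} - x_{\min}}{2^{b} - 1},
\qquad
z = \operatorname{round}\!\left(-\frac{x_{\min}}{s}\right),
\qquad
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\; 0,\; 2^{b} - 1\right),
\qquad
\hat{x} = s\,(q - z)
```

For values inside the range, the rounding error is at most about s/2. Fewer bits means a larger s, which is why the choice of range and grouping matters more as precision drops.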

Precision Formats at a Glance

FP32 (32-bit float): Full precision. Baseline for training. ~4 bytes per parameter.
FP16 / BF16 (16-bit): Standard for inference. ~2 bytes per parameter. Minimal quality loss.
INT8 (8-bit integer): First major compression step. ~1 byte per parameter.
INT4 (4-bit integer): Aggressive compression. ~0.5 bytes per parameter.
INT2-3 (2-3 bit): Experimental. Significant quality tradeoffs.
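
As a rough sanity check on these figures, weight memory is parameter count times bytes per parameter. The helper below is a back-of-the-envelope sketch (the name `weight_memory_gb` is only illustrative). It counts weights only and ignores the KV cache, activations, and any per-block scale overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Weight-only footprint of a 70B-parameter model at each precision.
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {name}: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

Real deployments need headroom on top of this number, so treat it as a lower bound rather than a sizing guarantee.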

Post-Training Quantization (PTQ)

PTQ applies quantization after the model has been trained. No additional training is needed — you take an existing model and convert it. This is the most common approach for deploying open-weight models.

Round-to-Nearest (RTN)

The simplest method. Map each weight to the nearest quantized value. Works reasonably well for 8-bit but degrades significantly at 4-bit for larger models. Not recommended as a primary approach for production use.
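
A minimal sketch of round-to-nearest, assuming symmetric quantization with a single absmax scale for the whole list. Real implementations usually quantize per channel or per group, and the toy weights here are made up.

```python
def rtn_quantize(weights, bits):
    """Symmetric round-to-nearest with one absmax scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0   # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    codes, scale = rtn_quantize(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


weights = [0.42, -0.17, 0.08, -0.91, 0.33, 0.05, -0.26, 0.64]
print("8-bit error:", mean_abs_error(weights, 8))
print("4-bit error:", mean_abs_error(weights, 4))
```

With only 15 usable levels at 4-bit, the grid spacing is far coarser than at 8-bit, which is where the quality gap comes from.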

GPTQ

GPTQ uses a small calibration dataset to minimize the quantization error layer by layer. It solves an optimization problem: given the quantization constraint, find the rounding decisions that minimize output error.
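
To make that objective concrete, the toy below brute-forces every round-up/round-down choice for one tiny weight row and scores each candidate by output error on a few calibration inputs. This is not the GPTQ algorithm, which relies on approximate second-order information to make the search tractable at scale. It only illustrates what is being minimized, and all names and numbers are hypothetical.

```python
import itertools
import math


def output_error(w, w_q, calib_inputs):
    """Sum of squared differences between original and quantized outputs."""
    total = 0.0
    for x in calib_inputs:
        y = sum(wi * xi for wi, xi in zip(w, x))
        y_q = sum(qi * xi for qi, xi in zip(w_q, x))
        total += (y - y_q) ** 2
    return total


def best_rounding(w, step, calib_inputs):
    """Try every round-down/round-up combination on a grid with spacing `step`."""
    choices = [(math.floor(wi / step) * step, math.ceil(wi / step) * step) for wi in w]
    return min(itertools.product(*choices),
               key=lambda cand: output_error(w, cand, calib_inputs))


w = [0.30, -0.90, 0.14, 0.55]
step = 0.25
calib = [[1.0, 0.5, -0.3, 0.8], [0.9, 0.4, -0.2, 1.1], [1.2, 0.6, -0.4, 0.7]]

rtn = [round(wi / step) * step for wi in w]       # independent nearest rounding
best = best_rounding(w, step, calib)              # rounding chosen by output error
print("RTN output error:          ", output_error(w, rtn, calib))
print("Best-rounding output error:", output_error(w, best, calib))
```

Because the nearest-rounding result is one of the candidates, the searched rounding can never do worse on the calibration inputs. The catch is that the result is only as good as the calibration data it was scored on.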

Strengths:

Well-established with broad tooling support
Good quality at 4-bit for most models
One-time cost: quantize once, use forever

Weaknesses:

Requires a calibration dataset (usually 128-256 samples)
Quantization process can take hours for large models
Quality depends on calibration data choice

AWQ (Activation-Aware Weight Quantization)

AWQ observes that not all weights are equally important. It identifies “salient” weight channels (those multiplied by large-magnitude activations) and protects them during quantization by scaling them up before quantizing, then scaling the corresponding activations down to compensate.

# Conceptual AWQ flow
# 1. Run calibration data through the model
# 2. Identify channels with large activations
# 3. Apply per-channel scaling to protect salient weights
# 4. Quantize the scaled weights
# 5. Adjust activation scaling to compensate
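
The steps above can be sketched on a single weight row. This is a simplified illustration, not the reference AWQ implementation: it assumes a fixed exponent `alpha = 0.5` (AWQ searches for the scaling on calibration data), uses one absmax scale per row instead of grouped quantization, and takes the activation magnitudes from one made-up input vector.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest with one absmax scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax
    return [round(w / scale) * scale for w in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def awq_style_output(row, x, act_magnitude, alpha=0.5):
    """Scale weights up per input channel, quantize, then scale activations down."""
    # Assumes every activation magnitude is positive, so no scale is zero.
    s = [m ** alpha for m in act_magnitude]          # steps 2-3: per-channel scales
    scaled_row = [w * si for w, si in zip(row, s)]   # salient channels grow the most
    q_row = quantize_row(scaled_row)                 # step 4: quantize scaled weights
    x_comp = [xi / si for xi, si in zip(x, s)]       # step 5: compensate on activations
    return dot(q_row, x_comp)


row = [0.30, -0.90, 0.14, 0.55]       # one output row of a weight matrix (toy values)
x = [8.0, 0.5, 0.4, 0.6]              # channel 0 carries a large activation
act_magnitude = [abs(v) for v in x]   # step 1: stand-in for calibration statistics

exact = dot(row, x)
plain_err = abs(dot(quantize_row(row), x) - exact)
awq_err = abs(awq_style_output(row, x, act_magnitude) - exact)
print("plain RTN output error:", plain_err)
print("AWQ-style output error:", awq_err)
```

Before quantization the scaling is an exact identity, since multiplying a weight by s and dividing its activation by s leaves the product unchanged. The benefit comes entirely from where the rounding error lands.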

AWQ often matches or outperforms GPTQ at 4-bit in published comparisons, especially on reasoning-heavy benchmarks, though the gap depends on the model and the evaluation.

GGUF / llama.cpp Quantization

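The GGUF format (used by llama.cpp and its ecosystem) offers a range of quantization types that were designed with CPU inference in mind and also run on GPUs. These use block quantization: weights are split into small blocks, each stored with its own scale, and the mixed-precision presets assign different quantization types to different tensors.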

Common GGUF quantization types:

Q8_0: 8-bit, near-lossless. Good baseline.
Q5_K_M: 5-bit with k-quant optimization. Excellent quality-to-size ratio.
Q4_K_M: 4-bit k-quant. The sweet spot for most users.
Q3_K_M: 3-bit. Noticeable quality drop but usable.
Q2_K: 2-bit. Research territory. Significant degradation.

The “K” variants use importance-based mixed precision within the quantization scheme: tensors that are more sensitive to quantization get more bits.
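
A minimal sketch of block quantization in the spirit of Q8_0, assuming blocks of 32 weights that each store 8-bit codes plus one 16-bit scale. The k-quant types use a more elaborate layout that this toy does not attempt to reproduce, and the function names are illustrative.

```python
def quantize_blocks(weights, block_size=32, bits=8):
    """Split weights into fixed-size blocks; each block gets its own absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) / qmax or 1.0   # avoid dividing by zero
        codes = [round(w / scale) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blocks(blocks):
    """Rebuild approximate weights from (scale, codes) pairs."""
    return [code * scale for scale, codes in blocks for code in codes]


def bits_per_weight(block_size=32, bits=8, scale_bits=16):
    """Effective storage cost once the per-block scale is included."""
    return (block_size * bits + scale_bits) / block_size


weights = [((i * 37) % 101 - 50) / 100 for i in range(64)]   # deterministic toy data
restored = dequantize_blocks(quantize_blocks(weights))
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
print("bits per weight:", bits_per_weight())
```

The per-block scale is why file sizes come out slightly larger than the nominal bit width suggests: under these assumptions an "8-bit" block costs 8.5 bits per weight.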

Quantization-Aware Training (QAT)

QAT simulates quantization during training, allowing the model to adapt its weights to the lower precision. This generally produces better results than PTQ at the same bit width, but requires access to training infrastructure and data.
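
The toy below shows one common way to simulate quantization during training: quantize the weight in the forward pass, then update a full-precision copy as if the rounding were not there (a straight-through estimator). The choice of estimator, the single-weight model, and every name here are assumptions for illustration. Real QAT runs inside a training framework with automatic differentiation.

```python
def fake_quantize(w, step=0.25):
    """Snap a weight to the nearest grid point, as the deployed model would."""
    return round(w / step) * step


def train_with_fake_quant(target=0.6, lr=0.1, steps=50):
    """Fit y = w * x to y = target * x (with x = 1) while the forward pass sees quantized w."""
    w = 0.0                                  # full-precision "latent" weight
    x = 1.0
    for _ in range(steps):
        w_q = fake_quantize(w)               # forward pass uses the quantized weight
        error = w_q * x - target * x
        # Straight-through estimator: treat d(w_q)/d(w) as 1, so the gradient of
        # error**2 with respect to w is 2 * error * x.
        grad = 2 * error * x
        w -= lr * grad                       # update the latent weight
    return w, fake_quantize(w)


latent, deployed = train_with_fake_quant()
print("latent weight:", latent, "deployed weight:", deployed)
```

The latent weight settles near the boundary between the two grid points that bracket the target, so the deployed value is one of those two neighbors rather than something far off the grid.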

When QAT Makes Sense

You’re the model provider or have significant compute budget
You need the absolute best quality at a given bit width
The model will be deployed at massive scale (amortizing training cost)
You’re targeting very low precision (2-3 bit)

Google (with Gemma) and Meta (with Llama) have both released QAT variants alongside their base models, recognizing that many users will run quantized versions.

Choosing the Right Method

For Local / Consumer Hardware

If you’re running models on a desktop or laptop:

Model Size	RAM Available	Recommended
7-8B	8GB	Q4_K_M GGUF
7-8B	16GB	Q5_K_M or Q8_0 GGUF
13-14B	16GB	Q4_K_M GGUF
70B	64GB	Q4_K_M GGUF

For Server / GPU Deployment

If you’re serving models on GPU infrastructure:

Single GPU (24GB): AWQ 4-bit for models up to ~30B parameters
Single GPU (48-80GB): FP16 for smaller models, AWQ/GPTQ 4-bit for 70B+
Multi-GPU: Consider whether the cost savings from quantization justify the quality tradeoff

Quality vs. Compression Tradeoffs

As a general rule:

8-bit: <1% quality loss on most benchmarks. Safe default.
4-bit (good method): 1-3% quality loss. Acceptable for most applications.
4-bit (naive method): 3-8% quality loss. Noticeable in complex reasoning.
3-bit: 5-15% quality loss. Use only when memory-constrained.
2-bit: 15-30%+ quality loss. Not recommended for production.

These numbers vary significantly by model architecture, quantization method, and evaluation task.

Practical Tips

Measure Quality for Your Use Case

Benchmark numbers don’t tell the whole story. A model that scores well on MMLU might perform poorly on your specific task after quantization. Always evaluate on representative examples from your actual workload.

Perplexity Is Your Friend

Perplexity on a held-out text corpus is a quick, useful proxy for quantization quality. Compute it before and after quantization on the same corpus with the same settings. If perplexity increases by more than 0.5-1.0 points, you might be losing meaningful capability.

# Example with llama.cpp
./llama-perplexity -m model-q4_k_m.gguf -f wiki.test.raw
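
If your tooling exposes per-token log-probabilities instead, perplexity is the exponential of the average negative log-likelihood. A minimal sketch, assuming natural-log probabilities (the sample values below are hypothetical, not measurements):

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities from a baseline and a quantized model
# scoring the same text.
baseline_logprobs = [math.log(p) for p in (0.50, 0.25, 0.40, 0.10)]
quantized_logprobs = [math.log(p) for p in (0.45, 0.25, 0.35, 0.10)]

delta = perplexity(quantized_logprobs) - perplexity(baseline_logprobs)
print(f"perplexity increase: {delta:.3f}")
```

Only compare numbers computed on identical text with identical tokenization and context length. Perplexity values from different setups are not comparable.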

Watch for Outlier Sensitivity

Some models have weight or activation outliers that cause disproportionate quantization error. If you see unexpected quality drops, try methods that handle outliers explicitly (like AWQ or SmoothQuant).
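
A small sketch of why outliers hurt, assuming 4-bit symmetric quantization with one absmax scale shared by the whole group. The weights are made up. The point is that a single large value stretches the scale and pushes the ordinary weights toward zero.

```python
def absmax_quantize(weights, bits=4):
    """Symmetric round-to-nearest with a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, restored):
    """Average absolute error over the entries of `original`."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


typical = [0.11, -0.20, 0.16, 0.04]
with_outlier = typical + [4.0]           # one large value shares the same scale

err_clean = mean_abs_error(typical, absmax_quantize(typical))
# Compare only the typical weights; zip stops before the outlier itself.
err_outlier = mean_abs_error(typical, absmax_quantize(with_outlier))
print("error without outlier:", err_clean)
print("error with outlier:   ", err_outlier)
```

Smaller groups, per-channel scales, and outlier-aware methods all attack the same problem: they keep one extreme value from dictating the grid for everything around it.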

Consider the Full Stack

Quantization doesn’t exist in isolation. Your inference engine matters:

llama.cpp: Excellent GGUF support, CPU and GPU
vLLM: Good AWQ/GPTQ support, optimized GPU serving
TensorRT-LLM: NVIDIA-optimized, supports multiple formats
ExLlamaV2: Specialized in GPTQ/EXL2, very fast

The State of Quantization in 2026

The field has matured significantly. Key trends:

Model providers ship quantized variants. You rarely need to quantize yourself for popular models.
4-bit is the practical floor for general-purpose use. Below 4-bit, quality losses become application-specific.
Speculative decoding + quantization is an increasingly powerful combination — use a small quantized model for draft tokens and a larger model for verification.
Hardware is catching up. New GPU architectures include native support for low-precision arithmetic, making quantized inference even faster.

Bottom Line

Quantization is no longer optional knowledge for anyone deploying LLMs. The difference between running a model in FP16 and 4-bit AWQ can mean the difference between needing 4 GPUs and needing 1: roughly a 4x reduction in hardware with minimal quality impact.

Start with 4-bit AWQ or Q4_K_M GGUF depending on your deployment target. Measure quality on your specific use case. Only go lower if you must, and only go higher if quality demands it.
