# Quantization Reference Guide (GGUF / llama.cpp)

> **Note:** A comprehensive overview of quantization schemes for llama.cpp / GGUF models, as used in tools such as Draw Things, Ollama, LM Studio, and KoboldCpp.

Quantization compresses model weights to reduce memory and disk footprint while attempting to preserve as much model quality as possible.
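All of the formats below share the same basic recipe: weights are split into small blocks, and each block stores one or two scale factors plus low-bit integers. The sketch below is a toy illustration of that layout in Python, not llama.cpp's actual kernels (which use fixed block sizes such as 32 or 256 weights, packed nibbles, and fp16 scales):

```python
import numpy as np

def quantize_block(weights: np.ndarray, bits: int = 4):
    """Toy symmetric block quantization: one scale + low-bit ints per block."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    amax = float(np.abs(weights).max())
    scale = amax / qmax if amax > 0 else 1.0        # map block range to int range
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return scale, q

def dequantize_block(scale, q):
    return scale * q.astype(np.float32)

# Round-trip one 32-weight block and check the reconstruction error.
w = np.random.randn(32).astype(np.float32)
scale, q = quantize_block(w, bits=4)
print("max abs error:", np.abs(w - dequantize_block(scale, q)).max())
```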
## Full-Precision / Minimal Quantization

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| BF16 | 16 | Very large | Slow | Perfect | Practically lossless. Used for training and GPU inference. Requires large VRAM (e.g. 30+ GB for 13B models). |
## 2-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q2_K | 2 | Tiny | Fast | Low | Very compact, but reasoning and coherence degrade. |
| Q2_K_L | 2 | Tiny+ | Faster | Low | Slight variant with alternate grouping/scaling. |
| Q2_K_XL | 2 | Tiny+ | Medium | Fair | Improved scaling; best among 2-bit quantizers. |
| IQ2_M | 2 | Tiny | Fast | Fair | "Improved Quantization" variant; better outlier accuracy. |
| IQ2_XXS | 2 | Smallest | Fastest | Poor | Ultra-compact, experimental; quality greatly reduced. |
## 3-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q3_K_S | 3 | Small | Fast | Decent | Compact; moderate reasoning degradation. |
| Q3_K_M | 3 | Medium | Medium | Good | Balanced between accuracy and efficiency. |
| Q3_K_XL | 3 | Medium+ | Slower | High | Near 4-bit quality; strong compression/accuracy ratio. |
| IQ3_XXS | 3 | Tiny | Very fast | Fair | Improved scaling; better consistency for low-RAM systems. |
## 4-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q4_0 | 4 | Small | Fast | Basic | First-gen 4-bit, uniform quantization; low precision. |
| Q4_1 | 4 | Small | Fast | Better | Adds a per-block offset alongside the scale; higher quality than Q4_0. |
| Q4_K_S | 4 | Compact | Fast | Good | Modern "K" type, optimized for small memory. |
| Q4_K_M | 4 | Medium | Medium | Very good | **Best overall 4-bit choice**; strong balance. |
| Q4_K_XL | 4 | Medium+ | Slower | Excellent | Almost 5-bit fidelity; highly coherent and expressive. |
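To produce one of these files yourself, llama.cpp ships a `llama-quantize` tool that takes a full-precision GGUF, an output path, and a type name such as `Q4_K_M`. A minimal sketch of driving it from Python; the file paths here are placeholders:

```python
import subprocess

# Placeholder paths; llama-quantize is built as part of llama.cpp.
# Positional arguments: input GGUF, output GGUF, quantization type.
subprocess.run(
    ["./llama-quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```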
## 5-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q5_K_S | 5 | Medium | Fast | High | Excellent balance; near-lossless for many models. |
| Q5_K_M | 5 | Medium+ | Medium | Excellent | High quality; hard to distinguish from 8-bit. |
| Q5_K_XL | 5 | Larger | Slower | Superb | Top-tier sub-8-bit fidelity; great for reasoning and text. |
## 6-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q6_K | 6 | Medium-large | Slower | Excellent | High accuracy; near full precision; great for production. |
| Q6_K_XL | 6 | Large | Slower | Superb | Virtually indistinguishable from BF16 output. |
## 8-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q8_0 | 8 | Large | Moderate | High | Classic 8-bit; nearly full precision; stable. |
| Q8_K_XL | 8 | Large+ | Slower | Best | Enhanced dynamic range; almost identical to FP16. |
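The size labels above translate into simple arithmetic: Q8_0, for instance, stores a 2-byte fp16 scale plus 32 int8 values per 32-weight block, i.e. 8.5 bits per weight. A quick back-of-the-envelope check in Python:

```python
# Q8_0 block: 2-byte fp16 scale + 32 int8 weights = 34 bytes per 32 weights.
bits_per_weight = (2 + 32) * 8 / 32        # 8.5 bpw
params = 7e9                               # e.g. a 7B-parameter model
size_gb = params * bits_per_weight / 8 / 1e9
print(f"Q8_0 = {bits_per_weight} bpw -> ~{size_gb:.1f} GB for a 7B model")
```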
## Special Improved Quantization (IQ) Variants

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| IQ4_NL | 4 | Compact | Fast | High | Non-linear scaling; preserves semantics better. |
| IQ4_XS | 4 | Smaller | Faster | Fair | Extra-small, fast variant; modest quality tradeoff. |
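The "NL" in IQ4_NL stands for non-linear: rather than evenly spaced levels, each 4-bit code indexes a fixed non-uniform codebook that concentrates levels where weight values cluster. A toy illustration follows; the codebook values below are invented for the example and are not ggml's actual IQ4_NL table:

```python
import numpy as np

# Hypothetical non-uniform codebook: levels are denser near zero, where
# most weights live. ggml's real IQ4_NL table differs; this is illustrative.
CODEBOOK = np.array([-1.0, -0.72, -0.5, -0.33, -0.2, -0.11, -0.05, -0.01,
                      0.01,  0.05,  0.11,  0.2,  0.33,  0.5,  0.72,  1.0],
                    dtype=np.float32)

def quantize_nl(weights: np.ndarray):
    """Map each scaled weight to the nearest codebook entry (4-bit index)."""
    scale = float(np.abs(weights).max()) or 1.0
    idx = np.abs(weights[:, None] / scale - CODEBOOK[None, :]).argmin(axis=1)
    return scale, idx.astype(np.uint8)     # one 4-bit index per weight

def dequantize_nl(scale, idx):
    return scale * CODEBOOK[idx]

w = np.random.randn(32).astype(np.float32) * 0.1
scale, idx = quantize_nl(w)
print("max abs error:", np.abs(w - dequantize_nl(scale, idx)).max())
```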
## Summary by Family

| Family | Typical Use Case | Comment |
|---|---|---|
| IQ2 / IQ3 / IQ4 | Mobile / ultra-low-memory devices | Extreme compression; use for previews only. |
| Q2_K / Q3_K | Entry-level CPUs / small GPUs | Compact but with noticeable logic loss. |
| Q4_K_M | Balanced default | Best all-around quantization; strong quality. |
| Q5_K_M / Q5_K_XL | High-quality inference | Excellent reasoning and coherence. |
| Q6_K / Q6_K_XL | Production-grade quality | Near-lossless results, stable performance. |
| Q8_0 / Q8_K_XL | Full-quality inference | Practically lossless; large memory use. |
| BF16 | Training / unquantized inference | Maximum precision; very large footprint. |
## Practical Recommendations

| RAM / VRAM | Recommended Quantization | Example Use |
|---|---|---|
| ≤ 8 GB | Q3_K_S / IQ4_XS | Mobile or minimal laptop inference. |
| 8–12 GB | Q4_K_M | Ideal for 7B–13B models; strong balance. |
| 12–16 GB | Q5_K_M / Q4_K_XL | Balanced for reasoning, creativity, code. |
| 16–24 GB | Q5_K_XL / Q6_K | High-end local inference and complex models. |
| ≥ 24 GB | Q6_K_XL / Q8_0 / BF16 | Benchmarking and near-lossless local runs. |
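As a rough rule of thumb, the table above can be folded into a small helper that checks whether a model at a given quantization fits in memory. The bits-per-weight figures below are approximations, and the overhead allowance is an assumption:

```python
# Approximate bits per weight for common GGUF types (rule-of-thumb values;
# exact figures vary slightly by model architecture and metadata).
BPW = {"Q3_K_S": 3.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def fits(params_b: float, quant: str, mem_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if `params_b` billion parameters at `quant` fit in `mem_gb`.

    `overhead_gb` is a rough allowance for the KV cache and runtime buffers.
    """
    weights_gb = params_b * BPW[quant] / 8
    return weights_gb + overhead_gb <= mem_gb

# Example: can a 13B model at Q4_K_M run in 12 GB?
print(fits(13, "Q4_K_M", 12))   # 13 * 4.8 / 8 = 7.8 GB of weights -> True
```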