
Quantization Reference Guide (GGUF / llama.cpp)

> [!NOTE]
> Comprehensive overview of quantization schemes for llama.cpp / GGUF models, as used in tools such as Draw Things, Ollama, LM Studio, and KoboldCpp.

Quantization compresses model weights to reduce memory and disk footprint, while attempting to preserve as much model quality as possible.
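
To make that tradeoff concrete, here is a minimal sketch (plain NumPy) of symmetric per-block 4-bit quantization in the spirit of Q4_0. It is an illustration of the idea only, not the exact GGUF bit layout, which packs two 4-bit codes per byte and stores the per-block scale in fp16.

```python
import numpy as np

def quantize_block_q4(weights, block_size=32):
    """Simplified, Q4_0-like symmetric 4-bit block quantization.

    Each block of `block_size` weights is stored as one float scale plus
    one 4-bit code per weight, instead of one full float per weight.
    Assumes len(weights) is divisible by block_size.
    """
    weights = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest magnitude maps into -8..7.
    scales = np.abs(weights).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                       # avoid division by zero
    codes = np.clip(np.round(weights / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_block_q4(codes, scales):
    """Reconstruct approximate weights from codes and per-block scales."""
    return (codes * scales).astype(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=4096).astype(np.float32)
    codes, scales = quantize_block_q4(w)
    w_hat = dequantize_block_q4(codes, scales).reshape(-1)
    # 32 bits/weight shrink to 4 bits/weight plus one scale per 32-weight block.
    print("mean abs error:", np.abs(w - w_hat).mean())
```

The K-quants (Q4_K, Q5_K, ...) refine this with super-blocks and a second level of quantized scales, but the bargain is the same: fewer bits per weight in exchange for some reconstruction error.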

Full-Precision / Minimal Quantization

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| BF16 | 16 | Very large | Slow | 🌕 Perfect | Practically lossless. Used for training and GPU inference. Requires large VRAM (e.g. 30+ GB for 13B models). |

2-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q2_K | 2 | Tiny | Fast | 😬 Low | Very compact, but reasoning and coherence degrade. |
| Q2_K_L | 2 | Tiny+ | Faster | 😬 Low | Slight variant with alternate grouping/scaling. |
| Q2_K_XL | 2 | Tiny+ | Medium | 🙂 Fair | Improved scaling; best among 2-bit quantizers. |
| IQ2_M | 2 | Tiny | Fast | 🙂 Fair | “Improved Quantization” variant; better outlier accuracy. |
| IQ2_XXS | 2 | Smallest | Fastest | 😅 Poor | Ultra-compact, experimental; quality highly reduced. |

3-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q3_K_S | 3 | Small | Fast | 🙂 Decent | Compact; moderate reasoning degradation. |
| Q3_K_M | 3 | Medium | Medium | 😃 Good | Balanced between accuracy and efficiency. |
| Q3_K_XL | 3 | Medium+ | Slower | 🤩 High | Near 4-bit quality; strong compression/accuracy ratio. |
| IQ3_XXS | 3 | Tiny | Very fast | 🙂 Fair | Improved scaling; better consistency for low-RAM systems. |

4-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q4_0 | 4 | Small | Fast | 😐 Basic | First-generation 4-bit format; a single scale per block, no offset; lowest precision of the family. |
| Q4_1 | 4 | Small | Fast | 🙂 Better | Adds a per-block minimum (offset) on top of the scale; higher quality than Q4_0 (see the sketch after this table). |
| Q4_K_S | 4 | Compact | Fast | 🙂 Good | Modern “K” type, optimized for small memory. |
| Q4_K_M | 4 | Medium | Medium | 😃 Very good | Best overall 4-bit choice; strong balance. |
| Q4_K_XL | 4 | Medium+ | Slower | 🤩 Excellent | Almost 5-bit fidelity; highly coherent and expressive. |
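
The practical difference between the two legacy 4-bit formats can be shown with a schematic comparison: a Q4_0-style block keeps only a scale, while a Q4_1-style block also keeps the block minimum, which pays off when a block's values are not centered on zero. This is a simplified sketch, not the actual packed layout.

```python
import numpy as np

def q4_0_style(block):
    """Scale only (symmetric): 4-bit codes centered on zero."""
    scale = np.abs(block).max() / 7.0
    if scale == 0:
        scale = 1.0
    codes = np.clip(np.round(block / scale), -8, 7)
    return codes * scale                      # dequantized reconstruction

def q4_1_style(block):
    """Scale + per-block minimum: 4-bit codes span the block's actual range."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15.0
    if scale == 0:
        scale = 1.0
    codes = np.clip(np.round((block - lo) / scale), 0, 15)
    return codes * scale + lo                 # dequantized reconstruction

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # A block whose values sit away from zero -- the case the offset helps with.
    block = rng.normal(loc=0.05, scale=0.01, size=32).astype(np.float32)
    for name, fn in (("Q4_0-style", q4_0_style), ("Q4_1-style", q4_1_style)):
        err = np.abs(block - fn(block)).mean()
        print(f"{name}: mean abs error = {err:.6f}")
```

On such off-center blocks the Q4_1-style reconstruction error comes out noticeably lower, which is the whole point of storing the extra offset.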

5-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q5_K_S | 5 | Medium | Fast | 😃 High | Excellent balance; near-lossless for many models. |
| Q5_K_M | 5 | Medium+ | Medium | 🤩 Excellent | High quality; hard to distinguish from 8-bit. |
| Q5_K_XL | 5 | Larger | Slower | 🌕 Superb | Top-tier sub-8-bit fidelity; great for reasoning and text. |

6-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q6_K | 6 | Medium-large | Slower | 🌕 Excellent | High accuracy; near full precision; great for production. |
| Q6_K_XL | 6 | Large | Slower | 🌕🌕 Superb | Virtually indistinguishable from BF16 output. |

8-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q8_0 | 8 | Large | Moderate | 🌕 High | Classic 8-bit; nearly full precision; stable. |
| Q8_K_XL | 8 | Large+ | Slower | 🌕🌕 Best | Enhanced dynamic range; almost identical to FP16. |

Special Improved Quantization (IQ) Variants

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| IQ4_NL | 4 | Compact | Fast | 😃 High | Non-linear scaling; preserves semantics better (illustrated below). |
| IQ4_XS | 4 | Smaller | Faster | 🙂 Fair | Extra-small, fast variant; modest quality tradeoff. |
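
The "NL" (non-linear) idea is that the sixteen 4-bit codes map to unevenly spaced levels, spending more resolution where weights actually cluster instead of spreading it uniformly. A toy sketch of lookup-table quantization follows; the level table is an illustrative placeholder, not the real IQ4_NL table.

```python
import numpy as np

# Illustrative, hand-picked non-linear levels (NOT the actual IQ4_NL table):
# denser near zero, sparser in the tails, mirroring typical weight distributions.
LEVELS = np.array([-1.0, -0.7, -0.5, -0.35, -0.25, -0.16, -0.08, -0.02,
                    0.02,  0.08,  0.16,  0.25,  0.35,  0.5,   0.7,   1.0],
                  dtype=np.float32)

def quantize_nl(block):
    """Map each weight to the index of the nearest table entry (a 4-bit code)."""
    scale = np.abs(block).max()
    if scale == 0:
        scale = 1.0
    normalized = block / scale                      # bring the block into [-1, 1]
    codes = np.abs(normalized[:, None] - LEVELS[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize_nl(codes, scale):
    return LEVELS[codes] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    block = rng.normal(scale=0.02, size=32).astype(np.float32)
    codes, scale = quantize_nl(block)
    print("mean abs error:", np.abs(block - dequantize_nl(codes, scale)).mean())
```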

Quick Reference Summary

| Family | Typical Use Case | Comment |
|---|---|---|
| IQ2 / IQ3 / IQ4 | Mobile / ultra-low-memory devices | Extreme compression; use for previews only. |
| Q2_K / Q3_K | Entry-level CPUs / small GPUs | Compact, but with noticeable logic loss. |
| Q4_K_M | Balanced default | Best all-around quantization; strong quality. |
| Q5_K_M / Q5_K_XL | High-quality inference | Excellent reasoning and coherence. |
| Q6_K / Q6_K_XL | Production-grade quality | Near-lossless results, stable performance. |
| Q8_0 / Q8_K_XL | Full-quality inference | Practically lossless; large memory use. |
| BF16 | Training / unquantized inference | Maximum precision; very large footprint. |
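
Whichever level you pick, it is baked into the GGUF file when the model is converted; at inference time you simply load that file. A minimal sketch, assuming the llama-cpp-python bindings are installed and a hypothetical local file named model-Q4_K_M.gguf (tools like Ollama or LM Studio do the equivalent behind the scenes):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The quantization level is a property of the file, not a runtime flag:
# switching from Q4_K_M to Q5_K_M just means pointing at a different .gguf.
llm = Llama(
    model_path="model-Q4_K_M.gguf",  # hypothetical path to a quantized model
    n_ctx=4096,                      # context window
    n_gpu_layers=-1,                 # offload all layers to GPU if available
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```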

Practical Recommendations

| RAM / VRAM | Recommended Quantization | Example Use |
|---|---|---|
| ≤ 8 GB | Q3_K_S / IQ4_XS | Mobile or minimal laptop inference. |
| 8–12 GB | Q4_K_M | Ideal for 7B–13B models; strong balance. |
| 12–16 GB | Q5_K_M / Q4_K_XL | Balanced for reasoning, creativity, code. |
| 16–24 GB | Q5_K_XL / Q6_K | High-end local inference and complex models. |
| ≥ 24 GB | Q6_K_XL / Q8_0 / BF16 | Benchmarking and near-lossless local runs. |
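
A back-of-the-envelope way to map a model onto this table is parameters × bits per weight. The bits-per-weight figures in the sketch below are rough approximations (real GGUF files add per-block scales and metadata), and runtime memory also needs headroom for the KV cache and activations:

```python
# Approximate effective bits per weight (illustrative round numbers;
# actual GGUF files vary slightly because of per-block scales and metadata).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "BF16": 16.0,
}

def estimated_size_gb(params_billion: float, quant: str) -> float:
    """Rough weight-only footprint in GB for a model of the given size."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

if __name__ == "__main__":
    for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "BF16"):
        print(f"13B @ {quant}: ~{estimated_size_gb(13, quant):.1f} GB")
```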