
Quantization Reference Guide (GGUF / llama.cpp)

> [!NOTE]
> Comprehensive overview of quantization schemes for llama.cpp / GGUF models, as used in tools such as Draw Things, Ollama, LM Studio, and KoboldCpp.

Quantization compresses model weights to reduce memory and disk footprint, while attempting to preserve as much model quality as possible.
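
To make that tradeoff concrete, here is a minimal sketch (plain NumPy) of symmetric per-block 4-bit quantization in the spirit of Q4_0. It is an illustration of the idea only, not the exact GGUF bit layout, which packs two 4-bit codes per byte and stores the per-block scale in fp16.

```python
import numpy as np

def quantize_block_q4(weights, block_size=32):
    """Simplified, Q4_0-like symmetric 4-bit block quantization.

    Each block of `block_size` weights is stored as one float scale plus
    one 4-bit code per weight, instead of one full float per weight.
    Assumes len(weights) is divisible by block_size.
    """
    weights = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest magnitude maps into -8..7.
    scales = np.abs(weights).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                       # avoid division by zero
    codes = np.clip(np.round(weights / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_block_q4(codes, scales):
    """Reconstruct approximate weights from codes and per-block scales."""
    return (codes * scales).astype(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=4096).astype(np.float32)
    codes, scales = quantize_block_q4(w)
    w_hat = dequantize_block_q4(codes, scales).reshape(-1)
    # 32 bits/weight shrink to 4 bits/weight plus one scale per 32-weight block.
    print("mean abs error:", np.abs(w - w_hat).mean())
```

The K-quants (Q4_K, Q5_K, ...) refine this with super-blocks and a second level of quantized scales, but the bargain is the same: fewer bits per weight in exchange for some reconstruction error.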

Full-Precision / Minimal Quantization

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| BF16 | 16 | Very large | Slow | 🌕 Perfect | Practically lossless. Used for training and GPU inference. Requires large VRAM (e.g. 30+ GB for 13B models). |

2-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q2_K | 2 | Tiny | Fast | 😬 Low | Very compact, but reasoning and coherence degrade. |
| Q2_K_L | 2 | Tiny+ | Faster | 😬 Low | Slight variant with alternate grouping/scaling. |
| Q2_K_XL | 2 | Tiny+ | Medium | 🙂 Fair | Improved scaling; best among 2-bit quantizers. |
| IQ2_M | 2 | Tiny | Fast | 🙂 Fair | “Improved Quantization” variant; better outlier accuracy. |
| IQ2_XXS | 2 | Smallest | Fastest | 😅 Poor | Ultra-compact, experimental; quality highly reduced. |

3-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q3_K_S | 3 | Small | Fast | 🙂 Decent | Compact; moderate reasoning degradation. |
| Q3_K_M | 3 | Medium | Medium | 😃 Good | Balanced between accuracy and efficiency. |
| Q3_K_XL | 3 | Medium+ | Slower | 🤩 High | Near 4-bit quality; strong compression/accuracy ratio. |
| IQ3_XXS | 3 | Tiny | Very fast | 🙂 Fair | Improved scaling; better consistency for low-RAM systems. |

4-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q4_0 | 4 | Small | Fast | 😐 Basic | First-generation 4-bit format; a single scale per block, no offset; lowest precision of the family. |
| Q4_1 | 4 | Small | Fast | 🙂 Better | Adds a per-block minimum (offset) on top of the scale; higher quality than Q4_0 (see the sketch after this table). |
| Q4_K_S | 4 | Compact | Fast | 🙂 Good | Modern “K” type, optimized for small memory. |
| Q4_K_M | 4 | Medium | Medium | 😃 Very good | Best overall 4-bit choice; strong balance. |
| Q4_K_XL | 4 | Medium+ | Slower | 🤩 Excellent | Almost 5-bit fidelity; highly coherent and expressive. |
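
The practical difference between the two legacy 4-bit formats can be shown with a schematic comparison: a Q4_0-style block keeps only a scale, while a Q4_1-style block also keeps the block minimum, which pays off when a block's values are not centered on zero. This is a simplified sketch, not the actual packed layout.

```python
import numpy as np

def q4_0_style(block):
    """Scale only (symmetric): 4-bit codes centered on zero."""
    scale = np.abs(block).max() / 7.0
    if scale == 0:
        scale = 1.0
    codes = np.clip(np.round(block / scale), -8, 7)
    return codes * scale                      # dequantized reconstruction

def q4_1_style(block):
    """Scale + per-block minimum: 4-bit codes span the block's actual range."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15.0
    if scale == 0:
        scale = 1.0
    codes = np.clip(np.round((block - lo) / scale), 0, 15)
    return codes * scale + lo                 # dequantized reconstruction

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # A block whose values sit away from zero -- the case the offset helps with.
    block = rng.normal(loc=0.05, scale=0.01, size=32).astype(np.float32)
    for name, fn in (("Q4_0-style", q4_0_style), ("Q4_1-style", q4_1_style)):
        err = np.abs(block - fn(block)).mean()
        print(f"{name}: mean abs error = {err:.6f}")
```

On such off-center blocks the Q4_1-style reconstruction error comes out noticeably lower, which is the whole point of storing the extra offset.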

5-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q5_K_S | 5 | Medium | Fast | 😃 High | Excellent balance; near-lossless for many models. |
| Q5_K_M | 5 | Medium+ | Medium | 🤩 Excellent | High quality; hard to distinguish from 8-bit. |
| Q5_K_XL | 5 | Larger | Slower | 🌕 Superb | Top-tier sub-8-bit fidelity; great for reasoning and text. |

6-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q6_K | 6 | Medium-large | Slower | 🌕 Excellent | High accuracy; near full precision; great for production. |
| Q6_K_XL | 6 | Large | Slower | 🌕🌕 Superb | Virtually indistinguishable from BF16 output. |

8-bit Quantization Family

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| Q8_0 | 8 | Large | Moderate | 🌕 High | Classic 8-bit; nearly full precision; stable. |
| Q8_K_XL | 8 | Large+ | Slower | 🌕🌕 Best | Enhanced dynamic range; almost identical to FP16. |

Special Improved Quantization (IQ) Variants

| Quantization | Bits | Size | Speed | Quality | Description |
|---|---|---|---|---|---|
| IQ4_NL | 4 | Compact | Fast | 😃 High | Non-linear scaling; preserves semantics better (illustrated below). |
| IQ4_XS | 4 | Smaller | Faster | 🙂 Fair | Extra-small, fast variant; modest quality tradeoff. |
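
The "NL" (non-linear) idea is that the sixteen 4-bit codes map to unevenly spaced levels, spending more resolution where weights actually cluster instead of spreading it uniformly. A toy sketch of lookup-table quantization follows; the level table is an illustrative placeholder, not the real IQ4_NL table.

```python
import numpy as np

# Illustrative, hand-picked non-linear levels (NOT the actual IQ4_NL table):
# denser near zero, sparser in the tails, mirroring typical weight distributions.
LEVELS = np.array([-1.0, -0.7, -0.5, -0.35, -0.25, -0.16, -0.08, -0.02,
                    0.02,  0.08,  0.16,  0.25,  0.35,  0.5,   0.7,   1.0],
                  dtype=np.float32)

def quantize_nl(block):
    """Map each weight to the index of the nearest table entry (a 4-bit code)."""
    scale = np.abs(block).max()
    if scale == 0:
        scale = 1.0
    normalized = block / scale                      # bring the block into [-1, 1]
    codes = np.abs(normalized[:, None] - LEVELS[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize_nl(codes, scale):
    return LEVELS[codes] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    block = rng.normal(scale=0.02, size=32).astype(np.float32)
    codes, scale = quantize_nl(block)
    print("mean abs error:", np.abs(block - dequantize_nl(codes, scale)).mean())
```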

Quick Reference Summary

| Family | Typical Use Case | Comment |
|---|---|---|
| IQ2 / IQ3 / IQ4 | Mobile / ultra-low-memory devices | Extreme compression; use for previews only. |
| Q2_K / Q3_K | Entry-level CPUs / small GPUs | Compact, but with noticeable logic loss. |
| Q4_K_M | Balanced default | Best all-around quantization; strong quality. |
| Q5_K_M / Q5_K_XL | High-quality inference | Excellent reasoning and coherence. |
| Q6_K / Q6_K_XL | Production-grade quality | Near-lossless results, stable performance. |
| Q8_0 / Q8_K_XL | Full-quality inference | Practically lossless; large memory use. |
| BF16 | Training / unquantized inference | Maximum precision; very large footprint. |
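
Whichever level you pick, it is baked into the GGUF file when the model is converted; at inference time you simply load that file. A minimal sketch, assuming the llama-cpp-python bindings are installed and a hypothetical local file named model-Q4_K_M.gguf (tools like Ollama or LM Studio do the equivalent behind the scenes):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The quantization level is a property of the file, not a runtime flag:
# switching from Q4_K_M to Q5_K_M just means pointing at a different .gguf.
llm = Llama(
    model_path="model-Q4_K_M.gguf",  # hypothetical path to a quantized model
    n_ctx=4096,                      # context window
    n_gpu_layers=-1,                 # offload all layers to GPU if available
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```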

Practical Recommendations

| RAM / VRAM | Recommended Quantization | Example Use |
|---|---|---|
| ≤ 8 GB | Q3_K_S / IQ4_XS | Mobile or minimal laptop inference. |
| 8–12 GB | Q4_K_M | Ideal for 7B–13B models; strong balance. |
| 12–16 GB | Q5_K_M / Q4_K_XL | Balanced for reasoning, creativity, code. |
| 16–24 GB | Q5_K_XL / Q6_K | High-end local inference and complex models. |
| ≥ 24 GB | Q6_K_XL / Q8_0 / BF16 | Benchmarking and near-lossless local runs. |
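
A back-of-the-envelope way to map a model onto this table is parameters × bits per weight. The bits-per-weight figures in the sketch below are rough approximations (real GGUF files add per-block scales and metadata), and runtime memory also needs headroom for the KV cache and activations:

```python
# Approximate effective bits per weight (illustrative round numbers;
# actual GGUF files vary slightly because of per-block scales and metadata).
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "BF16": 16.0,
}

def estimated_size_gb(params_billion: float, quant: str) -> float:
    """Rough weight-only footprint in GB for a model of the given size."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

if __name__ == "__main__":
    for quant in ("Q4_K_M", "Q5_K_M", "Q8_0", "BF16"):
        print(f"13B @ {quant}: ~{estimated_size_gb(13, quant):.1f} GB")
```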