The following instructions refer to gemma-3-1b-it-qat-q4_0 and Ubuntu Linux 24.04.3 LTS but can be applied to other
Gemma 3 IT QAT checkpoints as well.
To quantize the original safetensors snapshot, we need llama.cpp and a few helper tools that ship with it. To make sure we are on the latest version, we first clone and build llama.cpp from source.
sudo apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
python3 \
python3-venv \
python3-pip \
ca-certificates \
libxcb-cursor0
git clone https://github.com/ggml-org/llama.cpp.git
cmake -S ./llama.cpp -B ./llama.cpp/build
cmake --build ./llama.cpp/build --config Release
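A quick way to check that the build produced the tools used later is to invoke one of them; the quantization binary, for example, should simply print its usage text:
./llama.cpp/build/bin/llama-quantize --help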
Next, we create a Python venv to install the required dependencies for the Python scripts provided by llama.cpp, which are needed for the initial conversion from safetensors to GGUF and the subsequent adjustment of the GGUF metadata.
python3 -m venv .gemma3-quantization-venv
source .gemma3-quantization-venv/bin/activate
python -m pip install -U pip wheel
python -m pip install -r ./llama.cpp/requirements.txt
python -m pip install PySide6 # necessary for the graphical llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py
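To make sure the conversion dependencies and the GUI toolkit actually landed in the venv, a small import check is enough. This is a sketch; it assumes the gguf package is pulled in by llama.cpp's requirements, otherwise install it with python -m pip install gguf:
python -c "import gguf, PySide6; print('venv OK')"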
The following assumes that access to Hugging Face models is possible using Git and an SSH key stored with Hugging Face. Ultimately, it does not matter how the model data is downloaded, whether via the website, via Git, or using the Hugging Face CLI tools. The only important part is that afterward all files from the specified Hugging Face repository are located locally in the directory gemma-3-1b-it-qat-q4_0-unquantized.
git clone git@hf.co:google/gemma-3-1b-it-qat-q4_0-unquantized
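If Git-over-SSH access to Hugging Face is not set up, the same snapshot can be fetched with the Hugging Face CLI instead. A sketch, assuming you are logged in with a token that has access to the gated Gemma repositories:
python -m pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download google/gemma-3-1b-it-qat-q4_0-unquantized \
--local-dir gemma-3-1b-it-qat-q4_0-unquantized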
First, make sure you have activated the Python virtual environment that was created at the beginning and that contains the necessary Python dependencies to run the Python scripts shipped with llama.cpp.
After that, we can perform the initial conversion of the original model, which is in safetensors format, to GGUF. At this stage, no quantization happens yet; we only convert safetensors to GGUF while keeping the original weights.
source .gemma3-quantization-venv/bin/activate
python ./llama.cpp/convert_hf_to_gguf.py ./gemma-3-1b-it-qat-q4_0-unquantized \
--outfile gemma-3-1b-it-qat-q4_0-unquantized.gguf
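Before quantizing, it can be helpful to look at the metadata of the freshly converted GGUF, since the later steps will adjust some of it. The dump script that ships alongside the GUI editor used below should work for this; the path is assumed to mirror the editor's location:
python ./llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors gemma-3-1b-it-qat-q4_0-unquantized.gguf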
Now the GGUF created from the original weights can be quantized to Q4_0. We explicitly choose Q4_0 here because it is the target quantization of the QAT weights. We deliberately avoid using a so-called imatrix (importance matrix), because it cannot be ruled out that such an imatrix, together with the QAT weights, could potentially have negative effects.
./llama.cpp/build/bin/llama-quantize gemma-3-1b-it-qat-q4_0-unquantized.gguf \
gemma-3-1b-it-qat-q4_0.gguf \
Q4_0
Right after converting the original model, some metadata is not yet optimal, is missing, or is incorrectly set in the resulting GGUF. To ensure optimal model performance, the model’s metadata should therefore be adjusted accordingly.
Immediately after conversion, the model metadata contains the id of the eos (end of stream) token (<eos>):
tokenizer.ggml.eos_token_id = 1
However, the instruction-tuned model we are using here has an additional token <end_of_turn> (106) that is currently missing from the metadata and that is especially important for preventing loops or similar issues during a conversation.
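The id of <end_of_turn> can be double-checked directly against the downloaded snapshot, for example with a quick tokenizer lookup. This is a sketch; it assumes the transformers package was installed as part of llama.cpp's requirements. It should report 1 for <eos> and 106 for <end_of_turn>:
source .gemma3-quantization-venv/bin/activate
python -c "from transformers import AutoTokenizer; \
tok = AutoTokenizer.from_pretrained('./gemma-3-1b-it-qat-q4_0-unquantized'); \
print('<eos>:', tok.convert_tokens_to_ids('<eos>')); \
print('<end_of_turn>:', tok.convert_tokens_to_ids('<end_of_turn>'))"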
There are now two ways to fix this problem:
1. Overwrite tokenizer.ggml.eos_token_id with the index of <end_of_turn> (106)
2. Add an additional property tokenizer.ggml.eot_token_id with the index of <end_of_turn> (106)
Variant 1) is used, for example, when maximum compatibility with older llama.cpp versions is required. The technically clean
solution, however, is variant 2), because both EOS and EOT (end-of-turn) are correctly defined.
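If you prefer to stay on the command line, variant 1) can be scripted with the small in-place metadata setter that ships in the same scripts directory as the editor. This is an assumption about the current repo layout and argument order; check the script's --help output if it differs. Note that it can only overwrite existing fields, so it does not help for variant 2):
source .gemma3-quantization-venv/bin/activate
python ./llama.cpp/gguf-py/gguf/scripts/gguf_set_metadata.py gemma-3-1b-it-qat-q4_0.gguf tokenizer.ggml.eos_token_id 106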
We can use the gguf_editor_gui.py shipped with the llama.cpp repository to change the model metadata.
source .gemma3-quantization-venv/bin/activate
python ./llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py gemma-3-1b-it-qat-q4_0.gguf
The editor is a simple Qt GUI. After the model is opened, there is a button labeled Add Metadata in the lower part. After clicking this button, you can insert a metadata field with the following values:
- Key: tokenizer.ggml.eot_token_id
- Type: UINT32
- Value: 106
Afterward, the model must be saved using the Save as button. At this point, you should choose a new filename, because trying
to overwrite the existing model leads to a crash and to corruption of the edited model.
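To confirm that the new field ended up in the saved file, it can be read back with a short Python snippet. This is a rough sketch: it assumes gguf-py's GGUFReader layout, where the last entry of field.data indexes the part holding a scalar value, and it uses gemma-3-1b-it-qat-q4_0-edited.gguf as a placeholder for whatever filename you chose in Save as:
source .gemma3-quantization-venv/bin/activate
python - <<'EOF'
# Read the eos/eot ids back from the edited GGUF (sketch; filename is a placeholder).
from gguf import GGUFReader

reader = GGUFReader("gemma-3-1b-it-qat-q4_0-edited.gguf")
for key in ("tokenizer.ggml.eos_token_id", "tokenizer.ggml.eot_token_id"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: not set")
    else:
        # For scalar fields, the last entry of field.data points at the value part.
        print(f"{key} = {int(field.parts[field.data[-1]][0])}")
EOF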
Directly after conversion, the tokens <start_of_image> (255999) and <end_of_image> (256000) each have the type
USER_DEFINED in the tokenizer, which is counterintuitive. If you compare this, for example, with the Unsloth
quantizations of the non-QAT Gemma 3 versions, those tokens are defined as CONTROL there, which actually makes much more
sense.
To change the token type, you can again use the gguf_editor_gui.py shipped with the llama.cpp repository (if you saved the previous edit under a new filename, open that file here).
source .gemma3-quantization-venv/bin/activate
python ./llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py gemma-3-1b-it-qat-q4_0.gguf
In gguf_editor_gui, search for the entry tokenizer.ggml.tokens and click the button on the right labeled Edit tokenizer.
In the window that opens, filter for <start_of_image> and click the value in the Type column. In the pop-up that then
opens, change the value to CONTROL (3).
Then do the same for the token <end_of_image>.
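With the metadata adjusted, a short smoke test with llama-cli shows whether the model loads and stops cleanly at <end_of_turn>; depending on the llama.cpp version this either completes the prompt or drops into an interactive chat. The filename below is a placeholder for whatever you chose in the final Save as:
./llama.cpp/build/bin/llama-cli -m gemma-3-1b-it-qat-q4_0-final.gguf \
-p "Why is the sky blue?" -n 256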