How to quantize and fix Google Gemma 3 IT QAT GGUF

The following instructions refer to gemma-3-1b-it-qat-q4_0 and Ubuntu Linux 24.04.3 LTS but can be applied to other Gemma 3 IT QAT checkpoints as well.

0. Prerequisites

To quantize the original safetensors snapshot, we need llama.cpp and a few helper tools that ship with it. To make sure we are on the latest version, we first clone and build llama.cpp from source.

sudo apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git \
        python3 \
        python3-venv \
        python3-pip \
        ca-certificates \
        libxcb-cursor0

git clone https://github.com/ggml-org/llama.cpp.git

cmake -S ./llama.cpp -B ./llama.cpp/build
cmake --build ./llama.cpp/build --config Release
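A quick sanity check that the build produced the binaries we need (assuming the default build layout under build/bin):

./llama.cpp/build/bin/llama-quantize --help
./llama.cpp/build/bin/llama-cli --version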

Next, we create a Python venv and install the dependencies required by the Python scripts shipped with llama.cpp. These scripts are needed for the initial conversion from safetensors to GGUF and for the subsequent adjustment of the GGUF metadata.

python3 -m venv .gemma3-quantization-venv
source .gemma3-quantization-venv/bin/activate
python -m pip install -U pip wheel

python -m pip install -r ./llama.cpp/requirements.txt
python -m pip install PySide6 # required by the graphical editor llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py
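As a quick, optional sanity check, you can confirm that the conversion dependencies and the Qt bindings are importable inside the venv:

python -c "import gguf, PySide6; print('venv ok')"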

1. Download google/gemma-3-1b-it-qat-q4_0-unquantized

The following assumes that you can access Hugging Face over Git with an SSH key registered in your Hugging Face account. Note that the large model files are stored via Git LFS, so git-lfs must be installed and set up (git lfs install) for the clone to fetch the actual weights. Ultimately, it does not matter how the model data is downloaded, whether via the website, via Git, or using the Hugging Face CLI tools. The only important part is that afterward all files from the specified Hugging Face repository are located locally in the directory gemma-3-1b-it-qat-q4_0-unquantized.

git clone git@hf.co:google/gemma-3-1b-it-qat-q4_0-unquantized
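Alternatively, the same files can be fetched with the Hugging Face CLI, which needs neither an SSH key nor git-lfs (this assumes huggingface_hub is installed, e.g. via pip install huggingface_hub, and that you are logged in via huggingface-cli login, since access to the Gemma repositories must be granted first):

huggingface-cli download google/gemma-3-1b-it-qat-q4_0-unquantized \
  --local-dir gemma-3-1b-it-qat-q4_0-unquantized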

2. Convert safetensors to GGUF

First, make sure the Python virtual environment created at the beginning is activated, since it contains the dependencies needed to run the Python scripts shipped with llama.cpp.

After that, we can perform the initial conversion of the original model, which is in safetensors format, to GGUF. At this stage, no quantization happens yet; we only convert safetensors to GGUF while keeping the original weights.

source .gemma3-quantization-venv/bin/activate

python ./llama.cpp/convert_hf_to_gguf.py ./gemma-3-1b-it-qat-q4_0-unquantized \
  --outfile gemma-3-1b-it-qat-q4_0-unquantized.gguf
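To confirm that the result is a valid GGUF and that the weights are still unquantized, you can inspect its metadata with the gguf_dump.py script shipped with llama.cpp (flag names may differ between checkouts; check --help):

python ./llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors \
  gemma-3-1b-it-qat-q4_0-unquantized.gguf | grep file_type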

3. Quantize GGUF to Q4_0 (without imatrix)

Now the GGUF created from the original weights can be quantized to Q4_0. We explicitly choose Q4_0 here because it is the target quantization of the QAT weights. We deliberately avoid a so-called imatrix (importance matrix), because it cannot be ruled out that an imatrix applied on top of quantization-aware-trained weights has negative effects.

./llama.cpp/build/bin/llama-quantize gemma-3-1b-it-qat-q4_0-unquantized.gguf \
  gemma-3-1b-it-qat-q4_0.gguf \
  Q4_0
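A quick smoke test confirms that the quantized model loads and produces text (the prompt is arbitrary):

./llama.cpp/build/bin/llama-cli -m gemma-3-1b-it-qat-q4_0.gguf -p "Hello" -n 32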

4. Update model metadata

Directly after conversion, some metadata in the resulting GGUF is suboptimal, missing, or incorrectly set. To ensure optimal model behavior, the metadata should therefore be adjusted accordingly.

Use <end_of_turn> token

Immediately after conversion, the model metadata contains only the id of the eos (end of sequence) token <eos>:

tokenizer.ggml.eos_token_id = 1
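You can verify this yourself by dumping the metadata of the quantized GGUF (inside the venv; the --no-tensors flag, if available in your checkout, skips the tensor listing):

python ./llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors \
  gemma-3-1b-it-qat-q4_0.gguf | grep token_id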

However, for the instruction-tuned model used here there is an additional token, <end_of_turn> (106), that is currently missing from the metadata. It is especially important for the instruction model, because without it the model can run into loops or similar issues during a conversation.

There are now two ways to fix this problem:

  1. Overwrite tokenizer.ggml.eos_token_id with the index of <end_of_turn> (106)
  2. Add an additional property tokenizer.ggml.eot_token_id with the index of <end_of_turn> (106)

Variant 1) is used, for example, when maximum compatibility with older llama.cpp versions is required. The technically clean solution, however, is variant 2), because both EOS and EOT (end-of-turn) are correctly defined.
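If you prefer a scriptable route over the GUI used below, recent llama.cpp checkouts also ship gguf_new_metadata.py, which writes a new GGUF with adjusted special-token metadata. Check its --help first, since the available flags may differ between versions; assuming --special-token-by-id is supported, variant 2) becomes:

source .gemma3-quantization-venv/bin/activate
python ./llama.cpp/gguf-py/gguf/scripts/gguf_new_metadata.py \
  gemma-3-1b-it-qat-q4_0.gguf \
  gemma-3-1b-it-qat-q4_0-fixed.gguf \
  --special-token-by-id eot 106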

We can use the gguf_editor_gui.py shipped with the llama.cpp repository to change the model metadata.

source .gemma3-quantization-venv/bin/activate
python ./llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py gemma-3-1b-it-qat-q4_0.gguf

The editor is a simple Qt GUI. After the model has been opened, there is a button labeled Add Metadata in the lower part of the window. Clicking it lets you insert a metadata field with the following values:

  • Key: tokenizer.ggml.eot_token_id
  • Type: UINT32
  • Value: 106

Afterward, the model must be saved using the Save as button. Be sure to choose a new filename at this point, because trying to overwrite the existing model crashes the editor and corrupts the edited model.
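After saving, you can verify that the new key is present (replace the filename with whatever you chose in the Save as dialog):

python ./llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors \
  gemma-3-1b-it-qat-q4_0-fixed.gguf | grep eot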

Set token type for <start_of_image> and <end_of_image> to CONTROL

Directly after conversion, the tokens <start_of_image> (255999) and <end_of_image> (256000) each have the type USER_DEFINED in the tokenizer, which is counterintuitive. Compare this, for example, with the Unsloth quantizations of the non-QAT Gemma 3 versions, where these tokens are defined as CONTROL, which makes much more sense.

To change the token type, you can again use the gguf_editor_gui.py shipped with the llama.cpp repository.

source .gemma3-quantization-venv/bin/activate
python ./llama.cpp/gguf-py/gguf/scripts/gguf_editor_gui.py gemma-3-1b-it-qat-q4_0.gguf

In gguf_editor_gui, search for the entry tokenizer.ggml.tokens and click the button on the right labeled Edit tokenizer. In the window that opens, filter for <start_of_image> and click the value in the Type column. In the pop-up that then opens, change the value to CONTROL (3).

Then do the same for the token <end_of_image>.
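As a final check, you can chat with the fixed model. With the eot token in place, the model should now end its turns cleanly instead of running on (again, substitute the filename you chose when saving; on recent llama.cpp builds, -cnv enables interactive conversation mode):

./llama.cpp/build/bin/llama-cli -m gemma-3-1b-it-qat-q4_0-fixed.gguf -cnv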
