Switching from Gemma 4 to Gemma 4 QAT
While improving perplexity
2026-06-06
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:IQ4_XS -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor -n 512 -d 8192,32768,65536,131072
ai@ai:~$ llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL -n 512 -d 8192,32768,65536,131072
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31699 MiB):
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 827.57 ± 4.16 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 20.89 ± 0.00 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 552.09 ± 1.33 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 18.94 ± 0.00 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 382.28 ± 0.64 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 17.25 ± 0.00 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | pp512 @ d131072 | 236.46 ± 0.21 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tg512 @ d131072 | 13.95 ± 0.00 |
llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor -n 512 -d 8192,32768,65536,131072
| model | size | params | backend | ngl | sm | test | t/s |
|---|---|---|---|---|---|---|---|
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | pp512 @ d8192 | 973.03 ± 18.59 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | tg512 @ d8192 | 36.12 ± 0.06 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | pp512 @ d32768 | 769.23 ± 12.14 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | tg512 @ d32768 | 32.98 ± 0.04 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | pp512 @ d65536 | 603.22 ± 7.29 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | tg512 @ d65536 | 30.44 ± 0.04 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | pp512 @ d131072 | 420.75 ± 3.54 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | tg512 @ d131072 | 25.11 ± 0.03 |
ai@ai:~/llama.cpp$ ./build/bin/llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor -n 512 -d 32768
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 31699 MiB):
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15849 MiB
| model | size | params | backend | ngl | sm | test | t/s |
|---|---|---|---|---|---|---|---|
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | pp512 @ d32768 | 765.46 ± 12.06 |
| gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | tensor | tg512 @ d32768 | 33.18 ± 0.00 |
build: d73cd0767 (9585)
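The first two tables above (default split versus `--split-mode tensor`) are easier to compare as ratios. The following is a minimal sketch that only re-uses the numbers already printed by llama-bench; the names `RESULTS` and `speedups` are illustrative and not part of any tool.

```python
# Throughput in tokens/s, copied from the two llama-bench tables above.
# Key: (test, context depth). Value: (default split, --split-mode tensor).
RESULTS = {
    ("pp512", 8192): (827.57, 973.03),
    ("tg512", 8192): (20.89, 36.12),
    ("pp512", 32768): (552.09, 769.23),
    ("tg512", 32768): (18.94, 32.98),
    ("pp512", 65536): (382.28, 603.22),
    ("tg512", 65536): (17.25, 30.44),
    ("pp512", 131072): (236.46, 420.75),
    ("tg512", 131072): (13.95, 25.11),
}


def speedups(results):
    """Return tensor-split throughput divided by default-split throughput."""
    return {key: tensor / default for key, (default, tensor) in results.items()}


if __name__ == "__main__":
    for (test, depth), ratio in speedups(RESULTS).items():
        print(f"{test} @ d{depth}: {ratio:.2f}x")
```

On these numbers, token generation comes out roughly 1.7x to 1.8x faster with tensor split at every depth, while the prompt-processing gain grows with context depth.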
Perplexity test
llama-perplexity -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL -f _tests/wiki.test.raw -s 1024 -c 8192 -b 1024 -ub 1024 -ngl all
0.00.931.928 I common_init_result: fitting params to device memory ...
0.00.931.937 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.03.880.221 W load: override 'tokenizer.ggml.add_bos_token' to 'true' for Gemma4
0.03.944.207 W load: control-looking token: 212 '' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.944.760 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.03.973.296 W load: special_eog_ids contains '<|tool_response>', removing '' token from EOG list
0.08.242.752 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.08.296.367 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.08.378.919 I
0.08.379.010 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CUDA : ARCHS = 1200 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | REPACK = 1 |
0.08.379.021 I perplexity: tokenizing the input ..
0.09.259.797 I perplexity: tokenization took 880.771 ms
0.09.259.943 I perplexity: calculating perplexity over 36 chunks, n_ctx=8192, batch_size=1024, n_seq=1
0.18.500.650 I perplexity: 9.24 seconds per pass - ETA 5.53 minutes
[1]2414.0873,[2]744.1431,[3]1208.4171,[4]1098.5517,[5]1136.4832,[6]1087.0484,[7]1362.3209,[8]1247.2345,[9]1245.1584,[10]1079.0721,[11]1232.4814,[12]1300.3726,[13]1304.6480,[14]1237.1090,[15]1219.7609,[16]1274.1374,[17]1271.9768,[18]1297.0858,[19]1304.4827,[20]1367.1843,[21]1315.0807,[22]1312.6935,[23]1405.5037,[24]1460.9886,[25]1499.5799,[26]1551.1594,[27]1583.2841,[28]1590.3732,[29]1624.3423,[30]1636.1713,[31]1663.1279,[32]1652.8445,[33]1644.3609,[34]1612.1382,[35]1664.2287,[36]1716.7822,
5.26.687.123 I Final estimate: PPL = 1716.7822 +/- 26.47241
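For readers unfamiliar with the metric: perplexity is conventionally defined as the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows that textbook definition only. It is not taken from the llama-perplexity source, whose chunking and context handling may differ in detail, and the function name `perplexity` is illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    actual next token in the evaluated text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Hypothetical example: a model that gives every token probability 1/4
    # is exactly as uncertain as a uniform choice among 4 options.
    print(perplexity([math.log(0.25)] * 10))
```

Under this definition a model that always assigns probability 1 to the correct token scores a perplexity of 1, which is the floor.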
llama-perplexity -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -f _tests/wiki.test.raw -s 1024 -c 8192 -b 1024 -ub 1024 -ngl all
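The log above ends with a `Final estimate: PPL = ... +/- ...` line, which is the only value needed per run when comparing models. A small sketch for pulling it out of saved log text follows; the helper name `parse_final_estimate` is hypothetical, and the pattern assumes the line format shown in the log above.

```python
import re

# Matches e.g. "Final estimate: PPL = 1716.7822 +/- 26.47241"
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_final_estimate(log_text):
    """Return (ppl, error) from llama-perplexity log text, or None if absent."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    sample = "5.26.687.123 I Final estimate: PPL = 1716.7822 +/- 26.47241"
    print(parse_final_estimate(sample))
```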
| model | size MiB | wikipedia | frontend (js/css/html) | golang | c# |
|---|---|---|---|---|---|
| gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL | 16486 | 1716.78 | 1.98 | 639.18 | 11.78 |
| gemma-4-31B-it-GGUF:Q4_0 | 16535 | 11981.15 | 3.29 | 170.12 | 32.51 |
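The QAT build (UD-Q4_K_XL) is slightly smaller than the previous Q4_0 variant (16486 vs 16535 MiB), yet it yields much lower perplexity on three of the four test sets I tried: wikipedia, frontend and C#. Golang is the exception, where Q4_0 scores lower (170.12 vs 639.18).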
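To put the table in relative terms, the sketch below divides the Q4_0 perplexity by the QAT perplexity for each test set, so a ratio above 1 means the QAT build scored lower (better). All numbers are copied from the table; the names `PPL`, `SIZE_MIB` and `ppl_ratios` are illustrative.

```python
# Perplexity per test set, copied from the table: (QAT UD-Q4_K_XL, Q4_0).
PPL = {
    "wikipedia": (1716.78, 11981.15),
    "frontend (js/css/html)": (1.98, 3.29),
    "golang": (639.18, 170.12),
    "c#": (11.78, 32.51),
}

# Model sizes in MiB, copied from the same table.
SIZE_MIB = {"qat": 16486, "q4_0": 16535}


def ppl_ratios(table):
    """Return Q4_0 perplexity divided by QAT perplexity per test set."""
    return {name: q4_0 / qat for name, (qat, q4_0) in table.items()}


if __name__ == "__main__":
    for name, ratio in ppl_ratios(PPL).items():
        verdict = "QAT lower" if ratio > 1 else "Q4_0 lower"
        print(f"{name}: {ratio:.2f}x ({verdict})")
    print(f"size difference: {SIZE_MIB['q4_0'] - SIZE_MIB['qat']} MiB")
```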
