- /
-
- Blog /
- Ollama vs Llama.cpp
Ollama vs Llama.cpp
How I improved token/s by 10%
2026-06-05
Why
Previously I ran llamacpp just to prove things were working, then later switched to ollama for simplicity, but now I want to return to llamacpp. It should be more performant, but I want to check for myself. I want to run the Gemma4 31B model, but with llamacpp there are so many more options to choose from with various quantizations that I want to benchmark which one will be the go-to one for my setups. One for dual GPU setup and dedicated to AI, but powered only on demand when I need it. And one with a single GPU but running 24/7. So having something for random ad hoc tasks and one for more targeted and intentional API calls.
Benchmark
Running similarly sized variants of gemma4:31b. On these two base platforms, my model decision is here if you want to skip it.
Rig AI
Dedicated for AI but not running constantly:
- Operating System: Debian 13
- Hardware: Gigabyte Z390i + i5-9400 + 64GB RAM
- GPUs: 2x GeForce RTX 5060Ti 16GB (together 29.8GiB)
- connected with PCIe bifurcation to one 16x slot.
- CUDA compute capability: 12.0
Will be running:
- Model
gemma-4-31B-it-GGUF:Q4_1 - Context 90000
- Listen on all interfaces
- Disabled multimodal, as I'm not interested in voice or video
- Jinja to support tools calling
- With the following command:
/opt/bin/llama-server -m /home/ai/.cache/huggingface/hub/models--unsloth--gemma-4-31B-it-qat-GGUF/snapshots/0ad8071a52116c0ec83e13c0232b58cd5ab0681f/gemma-4-31B-it-GGUF:Q4_1 -c 90000 --jinja --port 8080 -ngl all --temp 1.0 --top-k 64 --top-p 0.95 --host 0.0.0.0 --no-mmproj
Rig NAS
A virtual machine running inside the Proxmox NAS setup, with PCIe passthrough for the GPU:
- Operating System: Windows
- Hardware: Gigabyte B560i + Ryzen-9600X + 96GB RAM
- GPU: GeForce RTX 4060Ti 16GB (14.9GiB)
- CUDA compute capability: 8.9
Will be running with the following command:
llama-server -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_M --host 0.0.0.0 --no-mmproj
The idea is to not load the model straight away but only on demand and offload it as soon as possible to not hog the shared machine and GPU with this task. Also, a much more quantized model had to be used to fit into the VRAM.
Why Gemma4
It runs well with Cline, thinking/reasoning, chain, tools invocation and editing files works well enough. Doesn't go spastic as much as others in my experience. While the 26b is faster and smaller, it is a mix of experts and not always as good in Cline (can end up failing on a file edit and looping with errors much more easily). And then the 12b could be used on the NAS for more casual requests when the dual GPU rig is not running.
Summary and decision
Both engines are optimized for environments with single user and maximum single request running at any time. And to make it fair, I matched llamacpp to ollama's parameters.
| Model | Engine | Quantization | Size (GB) | Context | t/s | Split mode |
|---|---|---|---|---|---|---|
| gemma4:31b | Ollama | Q4_K_M | 19 | 32k | 19.96 | default |
| gemma-4-31B-it-GGUF | llamacpp | Q4_K_M | 18.3 | 32k | 22.0 | default |
The size discrepancy is likely due to the fact that Ollama's model bundles in a single file the multimodal model, while with llamacpp it's a separate file (which we can later explicitly tell not to load).
Commands used:
OLLAMA_MAX_LOADED_MODELS=1 && OLLAMA_NUM_PARALLEL=1 && ollama run --verbose gemma4:31b "write simple C++ webserver"llama-cli -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 --reasoning off --temp 1.0 --top-k 64 --top-p 0.9 -c 2048 -p "write simple c++ hello world"
Going to switch to llamacpp because:
- When using same quantization, same context llamacpp is ~10% faster.
- The new llamacpp webui is much nicer and responsive than the ollama's GUI.
- Ollama is unnecessary wrapper, I want direct access to fine tune and tweaking the llamacpp.
- Embracing me more to
- Ollama is trying to push more and more the cloud models and I do not like it. Even if it might not be as big a deal, as their Ollama models are just extra bonus, we can on both use the very same hugging face models anyway. But at the same time then the one appeal of Ollama is going away. Having a decent amount, but not an overwhelming amount of model choices compared to hugging face as the hugging face choices might look intimidating, just one model has many quantization selections:
So llamacpp won and I am going to switch but now I want to figure out which exact model quantization I should use. The section below shows what runs I did on ollama and llamacpp. And also want to experiment a tiny bit with single GPU model sizes and compare how much overhead there is running it on two GPUs.
You can skip the details below if not interested.
Runs
Ollama - splitting the load on two GPUs
Going to figure out the details of the ollama model: ollama show gemma4:31b
1 Model
2 architecture gemma4
3 parameters 31.3B
4 context length 262144
5 embedding length 5376
6 quantization Q4_K_M
7 requires 0.20.0
8
9 Capabilities
10 completion
11 vision
12 tools
13 thinking
14
15 Parameters
16 temperature 1
17 top_k 64
18 top_p 0.95
19
20 License
21 Apache License
22 Version 2.0, January 2004
23 ...
Quantization is Q4_K_M, temperature 1, top_k 64 and top_p 0.95
I ran multiple runs, they fluctuate tiny bit, but average is under 20 tokens/s.
ollama run --verbose gemma4:31b "write simple C++ webserver"
1 total duration: 1m55.384730898s
2 load duration: 12.326818119s
3 prompt eval count: 22 token(s)
4 prompt eval duration: 69.191131ms
5 prompt eval rate: 317.96 tokens/s
6 eval count: 2032 token(s)
7 eval duration: 1m41.752473912s
8 eval rate: 19.97 tokens/s
1 total duration: 1m35.045527887s
2 load duration: 297.417321ms
3 prompt eval count: 22 token(s)
4 prompt eval duration: 63.148366ms
5 prompt eval rate: 348.39 tokens/s
6 eval count: 1874 token(s)
7 eval duration: 1m33.667665968s
8 eval rate: 19.99 tokens/s
1 total duration: 1m55.730248723s
2 load duration: 294.014649ms
3 prompt eval count: 22 token(s)
4 prompt eval duration: 63.19743ms
5 prompt eval rate: 348.12 tokens/s
6 eval count: 2275 token(s)
7 eval duration: 1m54.110752464s
8 eval rate: 19.94 tokens/s
And the context-window is 32k:
ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:31b 6316f0629137 28 GB 100% GPU 32768 4 hours from now
Llamacpp
The GeForce RTX 5060 Ti uses the Blackwell architecture and has a CUDA Compute Capability of 12.0 and built llamacpp with 120 on.
Checking GPU support:
1 ai@ai:~$ llama-server --list-devices
2 Available devices:
3 CUDA0: NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)
4 CUDA1: NVIDIA GeForce RTX 5060 Ti (15849 MiB, 15712 MiB free)
2-bit and 3-bit dual GPU
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_M -hf unsloth/gemma-4-31B-it-GGUF:Q3_K_M -n 512 -d 8192,32768,65536,131072,190000
Bit higher context windows:
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_M -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS -n 512 -d 230000
Maximum context window:
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS -n 512 -d 262144
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 798.88 ± 2.13 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 31.25 ± 0.00 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 538.31 ± 2.36 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 27.21 ± 0.00 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 376.02 ± 0.49 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 23.77 ± 0.00 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | pp512 @ d131072 | 234.13 ± 0.19 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | tg512 @ d131072 | 17.90 ± 0.00 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | pp512 @ d190000 | 174.62 ± 0.11 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | tg512 @ d190000 | 14.97 ± 0.00 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | pp512 @ d230000 | 149.02 ± 0.09 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | tg512 @ d230000 | 13.38 ± 0.00 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | pp512 @ d262144 | 132.02 ± 0.10 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | tg512 @ d262144 | 12.44 ± 0.00 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 660.15 ± 1.79 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 27.65 ± 0.00 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 472.82 ± 0.81 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 24.43 ± 0.00 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 342.64 ± 0.46 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 21.62 ± 0.00 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | pp512 @ d131072 | 220.65 ± 0.13 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | tg512 @ d131072 | 16.65 ± 0.00 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | pp512 @ d190000 | 166.96 ± 0.11 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | tg512 @ d190000 | 14.09 ± 0.00 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | pp512 @ d230000 | 141.98 ± 0.14 |
| gemma4 31B IQ2_M - 2.7 bpw | 10.00 GiB | 30.70 B | CUDA | -1 | tg512 @ d230000 | 12.69 ± 0.00 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 684.59 ± 2.77 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 22.12 ± 0.00 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 485.47 ± 0.95 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 19.93 ± 0.00 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 348.94 ± 0.52 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 18.07 ± 0.00 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | pp512 @ d131072 | 223.23 ± 0.20 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | tg512 @ d131072 | 14.48 ± 0.00 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | pp512 @ d190000 | 166.84 ± 0.10 |
| gemma4 31B Q3_K - Medium | 13.71 GiB | 30.70 B | CUDA | -1 | tg512 @ d190000 | 12.48 ± 0.00 |
2-bit on single GPU
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS --split-mode none --main-gpu 1 -n 512 -d 8192,32768,65536
| model | size | params | backend | ngl | main_gpu | sm | test | t/s |
|---|---|---|---|---|---|---|---|---|
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | none | pp512 @ d8192 | 791.74 ± 9.66 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | none | tg512 @ d8192 | 30.55 ± 0.09 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | none | pp512 @ d32768 | 537.26 ± 2.05 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | none | tg512 @ d32768 | 26.82 ± 0.03 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | none | pp512 @ d65536 | 375.05 ± 1.08 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | none | tg512 @ d65536 | 23.69 ± 0.01 |
Similar to default dual GPU setup, overhead from layer setup is minimal and running single GPU will let it trigger thermals sooner and not let it turbo as much as two GPUs with half load each.
2-bit tensor dual GPU
Instead of dividing the model sequentially (where GPU 1 sits around waiting for GPU 0), --split-mode tensor splits the individual matrix math blocks of every single layer across both GPUs simultaneously.
It requires a lot of communication between the cards. If GPUs are connected via a slow motherboard slot, the overhead can very have negative effects on the performance.
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:UD-IQ2_XXS --split-mode tensor --main-gpu 1 -n 512 -d 8192,32768,65536
| model | size | params | backend | ngl | main_gpu | sm | test | t/s |
|---|---|---|---|---|---|---|---|---|
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | tensor | pp512 @ d8192 | 953.70 ± 18.05 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | tensor | tg512 @ d8192 | 50.23 ± 0.06 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | tensor | pp512 @ d32768 | 757.52 ± 11.50 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | tensor | tg512 @ d32768 | 44.35 ± 0.07 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | tensor | pp512 @ d65536 | 595.87 ± 7.08 |
| gemma4 31B IQ2_XXS - 2.0625 bpw | 7.93 GiB | 30.70 B | CUDA | -1 | 1 | tensor | tg512 @ d65536 | 40.01 ± 0.06 |
Results are promising, using tensor almost doubles the performance, of course the extreme PCIe overhead will not allow it to be exactly double the performance.
4-bit on dual GPU
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:IQ4_XS -hf unsloth/gemma-4-31B-it-GGUF:Q4_0 -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 -n 512 -d 8192,32768,65536,131072
and to test bigger context window:
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:IQ4_XS -n 512 -d 163840
llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL -n 512 -d 8192,32768,65536,131072
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 841.03 ± 4.60 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 21.78 ± 0.00 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 558.09 ± 1.25 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 19.67 ± 0.00 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 385.18 ± 0.66 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 17.85 ± 0.00 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | pp512 @ d131072 | 237.50 ± 0.22 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | tg512 @ d131072 | 14.34 ± 0.00 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | pp512 @ d163840 | 197.36 ± 0.18 |
| gemma4 31B IQ4_XS - 4.25 bpw | 15.23 GiB | 30.70 B | CUDA | -1 | tg512 @ d163840 | 13.29 ± 0.00 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 824.93 ± 4.40 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 20.84 ± 0.00 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 551.16 ± 1.25 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 18.90 ± 0.00 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 382.09 ± 0.65 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 17.22 ± 0.00 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | pp512 @ d131072 | 236.39 ± 0.22 |
| gemma4 31B Q4_0 | 16.13 GiB | 30.70 B | CUDA | -1 | tg512 @ d131072 | 13.93 ± 0.00 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 780.59 ± 3.65 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 19.16 ± 0.00 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 531.21 ± 1.25 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 17.51 ± 0.00 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 372.39 ± 0.61 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 16.05 ± 0.00 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | pp512 @ d131072 | 230.52 ± 0.26 |
| gemma4 31B Q4_1 | 17.79 GiB | 30.70 B | CUDA | -1 | tg512 @ d131072 | 13.14 ± 0.00 |
5-bit on dual GPU
Model unsloth/gemma-4-31B-it-GGUF:Q5_K_XL might be too big for context window bigger than 64k. While medium and small variants can have ~90k context window.
So I run just medium and small variants:
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:Q5_K_M -hf unsloth/gemma-4-31B-it-GGUF:Q5_K_S -n 256,512 -d 8192,32768,65536,95000
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 731.06 ± 3.07 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg256 @ d8192 | 17.11 ± 0.00 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 17.09 ± 0.00 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 507.55 ± 1.17 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg256 @ d32768 | 15.82 ± 0.00 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 15.81 ± 0.00 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 360.52 ± 0.56 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg256 @ d65536 | 14.58 ± 0.00 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 14.57 ± 0.00 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | pp512 @ d95000 | 282.92 ± 0.45 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg256 @ d95000 | 13.26 ± 0.00 |
| gemma4 31B Q5_K - Medium | 20.16 GiB | 30.70 B | CUDA | -1 | tg512 @ d95000 | 13.26 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | pp512 @ d8192 | 753.69 ± 1.99 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg256 @ d8192 | 17.45 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg512 @ d8192 | 17.43 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | pp512 @ d32768 | 518.44 ± 1.11 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg256 @ d32768 | 16.11 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg512 @ d32768 | 16.10 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | pp512 @ d65536 | 366.11 ± 0.57 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg256 @ d65536 | 14.82 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg512 @ d65536 | 14.82 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | pp512 @ d95000 | 288.83 ± 0.37 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg256 @ d95000 | 13.47 ± 0.00 |
| gemma4 31B Q5_K - Small | 19.66 GiB | 30.70 B | CUDA | -1 | tg512 @ d95000 | 13.47 ± 0.00 |
Irrelevant
As mentioned above these are now not as important thanks to Gemma 4 Quantization-Aware-Training variants, where now the Q4 variant provides better perplexity than before and instead of speculating if I should run Q5 or Q4_0 or Q4_1, now the answer is simpler, the qat_Q4 should be better than these. More in the follow-up article. But still will leave it in place in case somebody will find it interesting, like the performance scaling down depending on the context size, etc.

