- /
-
- Blog /
- Compressing KV cache
Compressing KV cache
And expanding the context window
2026-06-11
I was running out of memory when going withe gemm4-31b:q4 just tiny bit over 128k context window. So decided to use q8 quantization on the KV cache. And now I still max out ram but with upto 210k context window:
llama-bench -hf unsloth/gemma-4-31B-it-qat-GGUF:UD-Q4_K_XL --split-mode tensor --flash-attn on -ctk q8_0 -ctv q8_0 -n 512 -d 194560,215040
1 | model | size | params | backend | ngl | type_k | type_v | sm | fa | test | t/s |
2 | ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | --: | --------------: | -------------------: |
3 | gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | q8_0 | q8_0 | tensor | 1 | pp512 @ d194560 | 316.88 ± 2.48 |
4 | gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | q8_0 | q8_0 | tensor | 1 | tg512 @ d194560 | 12.54 ± 0.00 |
5 | gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | q8_0 | q8_0 | tensor | 1 | pp512 @ d215040 | 294.93 ± 2.08 |
6 | gemma4 31B Q4_0 | 16.09 GiB | 30.70 B | CUDA | -1 | q8_0 | q8_0 | tensor | 1 | tg512 @ d215040 | 11.73 ± 0.00 |
