Comment by therealsmith 2 days ago

Am I missing something? As far as I can see, this patch does nothing except add new options that replicate the functionality of the existing `--cache-type-k` and `--cache-type-v` options.

Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0` is a well-known optimization for saving VRAM.

It's also well known that keys are more sensitive to quantization than values, e.g. https://arxiv.org/abs/2502.15075
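
For a rough sense of the VRAM saving from the cache types discussed here, the sketch below estimates KV cache size for a few K/V type combinations. The block sizes are ggml's standard ones (32 elements per block; `q8_0` is 34 bytes per block, `q4_0` is 18). The model dimensions are hypothetical values picked for illustration, not taken from the patch or from any particular model, and `kv_cache_bytes` is a made-up helper, not part of llama.cpp.

```python
# Bytes per block of 32 elements for each ggml cache type.
# f16:  32 * 2 bytes
# q8_0: 2-byte fp16 scale + 32 int8 values
# q4_0: 2-byte fp16 scale + 32 4-bit values (16 bytes)
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(type_k, type_v,
                   n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768):
    """Estimate total KV cache size in bytes.

    The default dimensions are hypothetical (illustration only).
    """
    # Number of elements stored in K (and, separately, in V).
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    blocks = elems // BLOCK_SIZE
    return blocks * (BYTES_PER_BLOCK[type_k] + BYTES_PER_BLOCK[type_v])


if __name__ == "__main__":
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0")]:
        gib = kv_cache_bytes(type_k, type_v) / 2**30
        print(f"K={type_k:<5} V={type_v:<5} ~{gib:.1f} GiB")
```

With these assumed dimensions, `q8_0` for both K and V comes to a little over half the size of the f16 cache, and keeping K at `q8_0` while dropping V to `q4_0` comes to roughly 40% of it.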

Aurornis 2 days ago

> Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0`

I think you meant `--cache-type-v q4_0`

I would also like an explanation of what's different in this patch compared to the standard command-line arguments.
