Comment by therealsmith 2 days ago
Am I missing something? As far as I can see, this patch does nothing except add new options that replicate the functionality of the existing `--cache-type-k` and `--cache-type-v` options.
Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0` is a well-known optimization to save VRAM.
It's also well known that keys are more sensitive to quantization than values. See e.g. https://arxiv.org/abs/2502.15075
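To put rough numbers on the VRAM saving, here is a minimal back-of-the-envelope sketch. It assumes the ggml block layouts for `q8_0` (34 bytes per 32 values) and `q4_0` (18 bytes per 32 values); the model shape (32 layers, 8 KV heads, head dimension 128, 8192-token context) and the helper name `kv_cache_bytes` are hypothetical, picked only for illustration, and the estimate ignores any padding or extra buffers a real runtime allocates.

```python
# Approximate bytes per cached element for each cache type.
# q8_0 and q4_0 store 32 values per block plus one fp16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + 2-byte scale
    "q4_0": 18 / 32,  # 32 4-bit values + 2-byte scale
}


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, type_k, type_v):
    """Estimate KV cache size; K and V hold the same number of elements."""
    elements = n_layers * n_ctx * n_kv_heads * head_dim
    return elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])


if __name__ == "__main__":
    # Hypothetical model shape, not taken from any specific model.
    shape = dict(n_layers=32, n_ctx=8192, n_kv_heads=8, head_dim=128)
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0")]:
        size = kv_cache_bytes(**shape, type_k=type_k, type_v=type_v)
        print(f"K={type_k:<5} V={type_v:<5} {size / 2**20:7.1f} MiB")
```

For this made-up shape the estimate is 1024 MiB at f16/f16, 544 MiB at q8_0/q8_0, and 416 MiB with q8_0 keys and q4_0 values, which is why quantizing the values harder than the keys is an attractive trade-off.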
> Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0`
I think you meant `--cache-type-v q4_0`.
I would also like an explanation of what’s different in this patch compared to the existing command-line arguments.