Comment by ondra
Is this any different from using --cache-type-k and --cache-type-v?
I'm guessing it's a bit different, since MLX/MPS doesn't have native 4-bit support (or even 8-bit, if I remember correctly?). It didn't even launch with bf16 support. So I think the lowest you could go with the old type_k/v flags on Apple GPUs was 16-bit f16/bf16, but I'm not a llama.cpp internals expert, so I may be wrong.
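For context, this is roughly how those flags are passed to the llama.cpp CLI (just a sketch; the binary name, model path, and the exact set of supported cache types depend on your build and backend):

```bash
# Illustrative only: check your build's --help for the cache types it
# actually supports (f16 / q8_0 / q4_0 / ...).
./llama-cli -m model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -p "Hello"
```

(IIRC a quantized V cache also needs flash attention enabled on recent builds, so you may need to pass the flash-attention flag as well; double-check against the docs.)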
No, it appears to be an LLM-generated attempt to gain GitHub stars.
See my other comment for a sampling of the other oddities in the repo.