Comment by nomel 2 days ago

> This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

The point seems to be that this reduces the memory footprint. That makes it possible to run a longer context than you could before within the same limited memory, or to use the freed memory for something else, like an IDE.
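(To put rough numbers on the scaling claim, here is a back-of-envelope KV-cache size calculation. The model shape below, 32 layers, 8 KV heads, head dimension 128, is a hypothetical llama-style configuration chosen for illustration, not one taken from the thread.)

    # Back-of-envelope KV-cache sizing; the model shape is an illustrative assumption.
    # Ignores quantization scale metadata and other runtime overheads.
    n_layers, n_kv_heads, head_dim = 32, 8, 128

    def kv_cache_bytes(seq_len, k_bits, v_bits):
        # Elements in the K tensor per token (the V tensor has the same count).
        elems = n_layers * n_kv_heads * head_dim
        return seq_len * elems * (k_bits + v_bits) / 8

    for ctx in (4_096, 16_384, 65_536):
        fp16 = kv_cache_bytes(ctx, 16, 16)
        k8v4 = kv_cache_bytes(ctx, 8, 4)
        print(f"{ctx:>6} tokens: fp16 {fp16 / 2**30:.2f} GiB -> K8V4 {k8v4 / 2**30:.2f} GiB")

The per-token cost is fixed, so the absolute savings grow linearly with context length, which is why the benefit compounds at long contexts.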

smcleod 2 days ago

Yeah, I get that; that's what we use k/v cache quantisation for at the moment, and it has a lower impact on PPL than this, unless I'm missing something?

  • dipampaul17 2 days ago

    You're right to question the perplexity impact - 0.86% isn't negligible. Our extended testing shows this impact remains fairly consistent across context lengths up to 16K, which was our test limit.

    We haven't benchmarked at 64-128K contexts yet, but theoretically the relative perplexity impact should remain stable. The absolute impact could potentially compound with very long contexts, though.

    The key difference from standard KV quantization is the asymmetric approach. Most implementations use K8V8 (8-bit for both keys and values), which has a 0.03% perplexity impact but only 47% memory savings. K8V4 pushes this to 59% savings in exchange for the 0.86% quality hit.
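    (For concreteness, here is a minimal NumPy sketch of that asymmetric idea. The per-row absmax scheme, tensor shapes, and error metric are illustrative assumptions, not the project's actual kernels; the point is only that keys keep 8 bits while values drop to 4.)

        import numpy as np

        def quantize_absmax(x, bits):
            # Symmetric per-row absmax quantization (a generic scheme, used here only for illustration).
            qmax = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
            scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
            scale = np.where(scale == 0, 1.0, scale)
            q = np.round(x / scale).astype(np.int8)  # 4-bit codes kept in an int8 array for simplicity
            return q, scale

        def dequantize(q, scale):
            return q.astype(np.float32) * scale

        # Toy single-layer cache: (seq_len, n_kv_heads * head_dim)
        rng = np.random.default_rng(0)
        K = rng.standard_normal((1024, 1024)).astype(np.float32)
        V = rng.standard_normal((1024, 1024)).astype(np.float32)

        # Asymmetric precision: keys keep 8 bits, values drop to 4 bits.
        qK, sK = quantize_absmax(K, bits=8)
        qV, sV = quantize_absmax(V, bits=4)

        print("mean abs error K (8-bit):", np.abs(K - dequantize(qK, sK)).mean())
        print("mean abs error V (4-bit):", np.abs(V - dequantize(qV, sV)).mean())

        # Ideal payload: (8 + 4) / (16 + 16) = 37.5% of fp16, i.e. 62.5% saved;
        # scale metadata and packing overhead pull that down toward the quoted 59%.

    (The usual rationale, not stated in the thread, is that keys feed the attention logits directly, so they tend to be more sensitive to quantization error than values.)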

    For reference, the quality impact is still well below the typical 5% threshold where differences become noticeable in generated text. It's a reasonable tradeoff for the additional memory savings, especially at long contexts.

    @smcleod - We're using the same underlying quantization methods, just applying them asymmetrically between keys and values. If your existing approach already uses lower precision for values than keys, you're likely getting similar benefits.
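    (For a concrete example of what "lower precision for values than keys" looks like in an existing runtime: llama.cpp exposes separate --cache-type-k and --cache-type-v options, so --cache-type-k q8_0 --cache-type-v q4_0 yields this kind of asymmetric layout; as far as I know, the quantized V cache there also requires flash attention to be enabled. Whether that is the stack being discussed here is an assumption.)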