Comment by nomel
> This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
The point seems to be that this reduces the memory footprint. That makes it possible to run a longer context within the same limited memory, if you couldn't before. Or you can use the freed memory for something else, like an IDE.
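Rough back-of-the-envelope on how the cache scales with context, assuming a Llama-2-7B-style config (32 layers, 32 KV heads, head dim 128; all numbers are illustrative):

```python
# KV cache sizing sketch; config values are assumptions, not from the article.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128

def kv_cache_bytes(seq_len: int, bytes_per_elem: float) -> float:
    # 2x for the K and V tensors, one entry per token per layer.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len * bytes_per_elem

for ctx in (4096, 8192, 32768):
    fp16 = kv_cache_bytes(ctx, 2.0)   # f16 cache
    q8 = kv_cache_bytes(ctx, 1.0)     # ~8-bit quantised cache (ignoring scale overhead)
    print(f"{ctx:>6} tokens: f16 {fp16 / 2**30:.1f} GiB, q8 {q8 / 2**30:.1f} GiB")
```

The linear dependence on `seq_len` is why the savings compound as context grows.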
Yeah, I get that. That's what we use K/V cache quantisation for now, which has a lower impact on PPL than this, unless I'm missing something?
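For reference, a minimal sketch of what 8-bit K/V cache quantisation does (the per-row symmetric scheme and names here are illustrative; real implementations such as llama.cpp's q8_0 cache types quantise in small blocks with stored scales):

```python
import numpy as np

def quantise_kv(x: np.ndarray):
    """Symmetric per-row int8 quantisation: x ≈ scale * q."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantise_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

# One layer's K cache: (seq_len, n_kv_heads * head_dim), f32 for illustration.
k = np.random.randn(4096, 32 * 128).astype(np.float32)
q, s = quantise_kv(k)
err = np.abs(k - dequantise_kv(q, s)).mean()
print(f"f32 {k.nbytes / 2**20:.0f} MiB -> int8 {q.nbytes / 2**20:.0f} MiB, mean abs err {err:.4f}")
```

The PPL hit comes entirely from that rounding error, which is why cache quantisation tends to be cheap accuracy-wise relative to the memory it frees.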