Comment by zmmmmm
Amazing!
Curious, what happens to performance? I assume you still pay the same performance price for longer context, even if you can now fit it in memory.
I think this is true. I've found I get roughly the same iteration speed for prompt processing whether the cache is FP16, Q8, or Q4.
It doesn't make sense to me, though. I haven't looked into how it works internally, but I would have expected it to pack the quantized vectors and then run 4-8-bit SIMD on all of them at once. It really seems like it's not packing them.
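One plausible explanation (this is a rough sketch, not llama.cpp's actual kernel layout): the quantized cache is typically dequantized back to full precision before the attention matmul, so you save memory but still pay the same number of multiply-adds. The block size, scale format, and function names below are illustrative assumptions:

```python
import numpy as np

def quantize_q4(x, block=32):
    # Hypothetical Q4-style layout: one fp16 scale per block of 32 values,
    # plus 4-bit signed integers in [-8, 7]. Real formats differ in detail.
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_q4(q, scale):
    # Expand back to fp32 *before* any matmul: the dot products below still
    # run at full precision, so the FLOP count is the same as an FP16 cache.
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 64)).astype(np.float32)  # 4 cached keys, head dim 64
query = rng.standard_normal(64).astype(np.float32)

packed = [quantize_q4(k) for k in keys]
scores_quant = np.array([dequantize_q4(q, s) @ query for q, s in packed])
scores_full = keys @ query
# scores_quant tracks scores_full closely, but both paths do identical
# fp32 multiply-adds -- only the stored bytes shrink.
```

Packing the 4-bit values and doing integer SIMD across them directly would change this, but that requires quantizing the query/activations too, which is a different trade-off.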