Comment by fennecbutt 2 days ago

No, only more compute or fancy model architecture tweaks will get you more t/s.

However, if you're using a discrete GPU, reducing KV memory lets you load more layers onto the GPU and therefore get more performance, but only if you're already struggling to fit your model into VRAM.
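
To put rough numbers on that, here's a quick Python sketch. Everything in it is illustrative: the 7B-class dimensions, the ~4 GiB quantized model size, and the ~59% KV saving (that last one taken from KVSplit's reported results), so treat it as back-of-envelope only.

    # Back-of-envelope: per-token KV cache size, and how VRAM freed by a
    # smaller cache converts into extra layers offloaded to the GPU.
    # Model dimensions are illustrative (7B-class), not from a real config.

    def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
        # K and V each hold n_kv_heads * head_dim elements per layer
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

    ctx = 8192
    fp16_cache = kv_bytes_per_token(32, 32, 128, 2.0) * ctx
    print(f"FP16 KV cache @ {ctx} tokens: {fp16_cache / 2**30:.1f} GiB")

    freed = fp16_cache * 0.59 / 2**30   # assume the reported ~59% KV saving
    layer_gib = 4.0 / 32                # ~4 GiB quantized model / 32 layers
    print(f"Freed: ~{freed:.1f} GiB ≈ {freed / layer_gib:.0f} extra GPU layers")

Whether those extra layers actually materialize depends on the runtime, but it shows why the win only appears if you were VRAM-bound in the first place.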

dipampaul17 2 days ago

For 30-40B parameter models, you'll see two types of performance impacts:

First, there's a direct throughput improvement: our benchmarks show a 14.5% speed increase with K8V4 (8-bit keys, 4-bit values) versus FP16. This comes from better memory bandwidth utilization when processing the KV cache.
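
For intuition on why fewer KV bytes show up as tokens/sec: single-token decode is largely memory-bandwidth-bound, so throughput is roughly bandwidth divided by bytes read per token. Here's a hedged Python sketch; every number in it is an illustrative stand-in, not one of our measurements.

    # Bandwidth-bound decode estimate: the tokens/s ceiling is roughly
    # memory_bandwidth / bytes_read_per_token (weights + KV cache).
    bandwidth_gib = 200.0    # hypothetical unified-memory bandwidth, GiB/s
    weights_gib = 18.0       # ~34B model at ~4-bit quantization
    kv_fp16_gib = 3.2        # KV cache scanned per token at a long context
    kv_k8v4_gib = kv_fp16_gib * (1 - 0.59)   # ~59% smaller cache

    for label, kv in (("FP16 KV", kv_fp16_gib), ("K8V4 KV", kv_k8v4_gib)):
        print(f"{label}: ~{bandwidth_gib / (weights_gib + kv):.1f} t/s ceiling")

The shift is modest because the weights still dominate the reads, which is why the measured gain is on the order of 15% rather than 59%.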

However, this won't make a 30B model suddenly feel as responsive as a 7B model. The fundamental compute bottleneck remains: larger models need more matrix multiplications per token regardless of how efficiently you store the KV cache.
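
The standard back-of-envelope here is ~2 FLOPs per parameter per generated token, so the compute gap is fixed by parameter count alone:

    # Matmul work per generated token scales with parameter count
    # (~2 FLOPs per parameter is the usual estimate), so no KV cache
    # format can close the 7B-vs-30B compute gap.
    for params_b in (7, 30):
        print(f"{params_b}B model: ~{2 * params_b} GFLOPs per token")
    # -> the 30B model does ~4.3x the compute of the 7B either way.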

Where you might notice a bigger difference is in handling longer inputs. With 59% less memory used for the KV cache, your system can dedicate more resources to computation rather than memory management, which can reduce stuttering when processing long documents.
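
If you want to sanity-check the 59% figure: assuming the standard llama.cpp block layouts (q8_0 packs 32 elements into 34 bytes, q4_0 into 18 bytes), it falls out directly:

    # Deriving the ~59%: q8_0 stores 32 elements in 34 bytes (int8 values
    # plus an fp16 scale per block); q4_0 stores 32 elements in 18 bytes.
    fp16_pair = 2.0 + 2.0            # bytes per (K, V) element pair in FP16
    k8v4_pair = 34 / 32 + 18 / 32    # bytes per pair with 8-bit K, 4-bit V

    print(f"KV memory saving: {1 - k8v4_pair / fp16_pair:.0%}")
    print(f"Context stretch at a fixed budget: {fp16_pair / k8v4_pair:.1f}x")

So the same KV budget covers roughly 2.5x the context, which is exactly where the segmentation point below comes in.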

The most noticeable improvement comes if you're currently hitting memory limits that force you to segment long inputs. Being able to process everything in one pass eliminates those artificial breaks.

@fennecbutt is spot-on that the core token generation speed is primarily determined by compute capability and model architecture. KVSplit complements those factors by optimizing memory usage, not by fundamentally changing the computation path.