Comment by kmacdough 2 months ago

> Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window

It reduces the memory footprint of a particular model. You can do what you like with that. Extending the context window post-training isn't trivial, so unless you know what you're doing, you'd be better off finding a model trained on a larger context window.

There are many uses for local models, such as working offline or keeping data private and secure. Most folks, though, are using them to experiment with tweaking models.
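To make "reduces the memory footprint" concrete, here is a rough sketch of how KV cache size grows with context length. The model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical 7B-class configuration chosen for illustration, not a figure from this thread, and the formula ignores allocator overhead. It only shows how much memory a longer window would need; it says nothing about whether the model behaves well beyond the context length it was trained on.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   bits_per_key=16.0, bits_per_value=16.0):
    """Approximate KV cache size in bytes for a single sequence."""
    # One key vector and one value vector are stored per layer, per KV head, per token.
    elems_per_token = n_layers * n_kv_heads * head_dim
    total_bits = elems_per_token * n_ctx * (bits_per_key + bits_per_value)
    return total_bits / 8

# Hypothetical 7B-class shape (an assumption for illustration only).
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n_ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(n_ctx=n_ctx, **shape) / 2**30
    print(f"{n_ctx:>5} tokens: {gib:.2f} GiB at FP16")
```

The cache grows linearly with `n_ctx`, so lowering the bits per key or value lets the same memory budget hold proportionally more tokens.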

nico 2 months ago

Will that make the model run/feel faster?

I can run models with 30-40b parameters on my computer, but they feel a lot slower than the 1-7b ones

So would this make the 30-40b parameter models run faster? Or at least “feel” faster?

fennecbutt 2 months ago

No, only more compute or fancy model architecture tweaks will get you more t/s.
However, if you're using a discrete GPU, reducing KV memory lets you load more layers onto the GPU and therefore get more performance, but only if you're already struggling to fit your model into VRAM.

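The layer-offload point can be sketched with a simplified budget calculation. It assumes every offloaded layer costs its weights plus its own slice of the KV cache, and that nothing else competes for VRAM. All sizes below are hypothetical round numbers in MiB, not measurements.

```python
def layers_on_gpu(vram_budget_mib, n_layers, weight_mib_per_layer, kv_mib_per_layer):
    """How many whole layers fit in the VRAM budget under a simplified cost model."""
    per_layer = weight_mib_per_layer + kv_mib_per_layer
    return min(n_layers, vram_budget_mib // per_layer)

# Hypothetical 60-layer model on a 16 GiB card (illustrative numbers only).
budget, n_layers, weights = 16 * 1024, 60, 300

fp16_layers = layers_on_gpu(budget, n_layers, weights, kv_mib_per_layer=64)
quant_layers = layers_on_gpu(budget, n_layers, weights, kv_mib_per_layer=26)
print(f"FP16 KV cache:      {fp16_layers} layers on GPU")
print(f"Quantized KV cache: {quant_layers} layers on GPU")
```

If the whole model already fits, the `min` clamps the result to `n_layers` and a smaller KV cache changes nothing here, which matches the "only if you're already struggling to fit" caveat.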
- dipampaul17 2 months ago
  
  For 30-40B parameter models, you'll see two types of performance impacts:
  First, there's a direct throughput improvement – our benchmarks show a 14.5% speed increase with K8V4 versus FP16. This comes from better memory bandwidth utilization when processing the KV cache.
  However, this won't make a 30B model suddenly feel as responsive as a 7B model. The fundamental computation bottleneck remains – larger models need more matrix multiplications regardless of how efficiently you store the KV cache.
  Where you might notice a bigger difference is in handling longer inputs. With 59% less memory used for the KV cache, your system is under less memory pressure, which can reduce stuttering while processing long documents.
  The most noticeable improvement would be if you're currently hitting memory limits that force you to segment long inputs. Being able to process everything in one pass eliminates those artificial breaks.
  @fennecbutt is spot-on that the core token generation speed is primarily determined by compute capability and model architecture. KVSplit complements those factors by optimizing memory usage, not by fundamentally changing the computation path.
  
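As a sanity check on the 59% figure, here is a back-of-the-envelope calculation. It assumes K8V4 stores keys and values in block formats like ggml's `q8_0` and `q4_0`, where each block of 32 values carries one 16-bit scale. That layout is an assumption about how KVSplit stores the cache, not something stated in the thread.

```python
BLOCK = 32          # values per quantization block (assumed ggml-style layout)
SCALE_BITS = 16     # one FP16 scale per block

# Effective bits per stored value, including the per-block scale.
Q8_0_BITS = (BLOCK * 8 + SCALE_BITS) / BLOCK   # 8.5
Q4_0_BITS = (BLOCK * 4 + SCALE_BITS) / BLOCK   # 4.5
FP16_BITS = 16.0

def kv_savings(key_bits, value_bits, baseline_bits=FP16_BITS):
    """Fraction of KV cache memory saved relative to a same-precision K/V baseline."""
    return 1.0 - (key_bits + value_bits) / (2 * baseline_bits)

print(f"K8V4 with block scales: {kv_savings(Q8_0_BITS, Q4_0_BITS):.1%}")
print(f"K8V4 ignoring scales:   {kv_savings(8, 4):.1%}")
```

Under these assumptions the per-block scales account for the gap between the naive 62.5% and a figure close to the quoted 59%.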