Comment by nico 2 days ago

Great work. This seems very interesting, but I need something slightly more high-level to relate to it.

Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window? Or a 128k model (like gemma3) with a 256k+ context window?

What’s the ideal use case for local models?

Thank you

dipampaul17 2 days ago

With the K8V4 configuration providing 59% memory savings, you can effectively run contexts 2.4× longer on the same hardware. A model with a 2048 token context can now handle about 5000 tokens, while an 8K context model can reach approximately 19.5K tokens.
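
As a rough sketch of where those numbers come from (assuming the q8_0 and q4_0 block formats cost roughly 8.5 and 4.5 bits per element once block scales are counted, and using a hypothetical small GQA model of 22 layers × 4 KV heads × 64 head dim purely for illustration):

    # Rough KV-cache arithmetic (illustrative, not exact).
    # Assumes q8_0 ~ 8.5 bits/element and q4_0 ~ 4.5 bits/element
    # (block-quantized formats carry per-block scale overhead).
    BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

    def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, k_type, v_type):
        """Bytes of KV cache needed per token of context."""
        kv_dim = n_kv_heads * head_dim            # elements per layer for K (and V)
        k = n_layers * kv_dim * BITS[k_type] / 8  # key cache
        v = n_layers * kv_dim * BITS[v_type] / 8  # value cache
        return k + v

    # Hypothetical small GQA model: 22 layers, 4 KV heads x 64 head dim
    fp16 = kv_bytes_per_token(22, 4, 64, "f16", "f16")
    k8v4 = kv_bytes_per_token(22, 4, 64, "q8_0", "q4_0")

    print(f"savings: {1 - k8v4 / fp16:.0%}")   # ~59%
    print(f"stretch: {fp16 / k8v4:.2f}x")      # ~2.46x, i.e. 2048 -> ~5000 tokens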

In practical terms, this means processing entire books at once on a MacBook, analyzing large codebases without splitting files, or maintaining comprehensive conversation history in chat applications.

The memory savings scale linearly with context length: the longer your context window, the more absolute memory you save. On my M4 MacBook with an 8K context, I reduced the KV cache from 176 MB to 72 MB. At 128K context, that same percentage saving would free up gigabytes.
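
To make the linear scaling concrete, here's the same per-token cost multiplied out over context length (same hypothetical 22-layer, 4-KV-head, 64-dim config as in the sketch above; a larger model scales all of these numbers up):

    # Per-token KV cost for the hypothetical config above.
    FP16_PER_TOK = 2 * 22 * 4 * 64 * 2            # K + V at 2 bytes/elem = 22528 B
    K8V4_PER_TOK = 22 * 4 * 64 * (8.5 + 4.5) / 8  # q8_0 keys + q4_0 values = 9152 B

    for ctx in (8_192, 32_768, 131_072):
        fp16_mb = ctx * FP16_PER_TOK / 2**20
        k8v4_mb = ctx * K8V4_PER_TOK / 2**20
        print(f"{ctx:>7} tokens: {fp16_mb:5.0f} MB -> {k8v4_mb:5.0f} MB "
              f"(saves {fp16_mb - k8v4_mb:5.0f} MB)")
    # 8K:   ~176 MB -> ~72 MB
    # 128K: ~2.8 GB -> ~1.1 GB, saving well over a gigabyte on this small config alone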

This optimization is most valuable when you're context-window limited rather than model-parameter limited. If you're hitting OOM errors due to long inputs rather than large model weights, KVSplit directly addresses your bottleneck.

kmacdough 2 days ago

> Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window

It reduces the memory footprint of a particular model. You can do what you like with that. Extending the context window post-training isn't trivial, so unless you know what you're doing, you'd be better off finding a model trained on a larger context window.

There are many uses for local models, like working offline or privacy/security. Most folks, though, are using them to experiment with tweaking models.

  • nico 2 days ago

    Will that make the model run/feel faster?

    I can run models with 30-40b parameters on my computer, but they feel a lot slower than the 1-7b ones

    So would this make the 30-40b parameter models run faster? Or at least “feel” faster?

    • fennecbutt 2 days ago

      No, only more compute or fancy model architecture tweaks will get you more t/s.

      However, if you're using a discrete GPU, reducing KV memory lets you load more layers onto the GPU and therefore get more performance, but only if you're already struggling to fit your model into VRAM.
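
      A toy way to see that trade-off, with completely made-up sizes (real per-layer weight sizes and KV footprints depend on the model and the quant used):

        # Toy VRAM budget: a smaller KV cache leaves room to offload more layers.
        # All sizes below are made-up round numbers for illustration only.
        VRAM_MB = 8192       # hypothetical discrete GPU
        LAYER_MB = 180       # weights per transformer layer after quantization
        OVERHEAD_MB = 512    # scratch buffers, context, etc.

        def layers_that_fit(kv_cache_mb):
            return (VRAM_MB - OVERHEAD_MB - kv_cache_mb) // LAYER_MB

        print(layers_that_fit(1408))  # FP16 KV at long context -> 34 layers on GPU
        print(layers_that_fit(572))   # same cache in K8V4 (~59% smaller) -> 39 layers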

      • dipampaul17 2 days ago

        For 30-40B parameter models, you'll see two types of performance impacts:

        First, there's a direct throughput improvement – our benchmarks show a 14.5% speed increase with K8V4 versus FP16. This comes from better memory bandwidth utilization when processing the KV cache.

        However, this won't make a 30B model suddenly feel as responsive as a 7B model. The fundamental computation bottleneck remains – larger models need more matrix multiplications regardless of how efficiently you store the KV cache.
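
        As a crude illustration of that bottleneck (using the usual ~2 FLOPs-per-parameter rule of thumb for a single decode step, which ignores bandwidth, batching, and architecture details):

          # Back-of-envelope decode cost: ~2 FLOPs per parameter per generated token.
          # The KV cache format doesn't change this term at all.
          def flops_per_token(n_params):
              return 2 * n_params

          for n_billion in (7, 30):
              gflops = flops_per_token(n_billion * 1e9) / 1e9
              print(f"{n_billion}B model: ~{gflops:.0f} GFLOPs per generated token")
          # -> the 30B model does ~4x the arithmetic of the 7B, however the KV
          #    cache is stored, which is why it still feels slower.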

        Where you might notice a bigger difference is in handling longer inputs. With 59% less memory used for the KV cache, your system can dedicate more resources to computation rather than memory management, which can reduce stuttering when processing long documents.

        The most noticeable improvement would be if you're currently hitting memory limits that force you to segment long inputs. Being able to process everything in one pass eliminates those artificial breaks.

        @fennecbutt is spot-on that the core token generation speed is primarily determined by compute capability and model architecture. KVSplit complements those factors by optimizing memory usage, not by fundamentally changing the computation path.