Comment by iamnotagenius a day ago
Interesting, but not exactly practical for a local LLM user, as 4-bit quants are how LLMs are run locally.
Yes, it could be stacked on quants. It might be that quantized activations are already more "dense" and so can't be compressed as much (the way bf16 goes from 16 to ~11 bits), but it's certainly possible.
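If anyone wants to sanity-check the ~11-bit number, here's a quick back-of-the-envelope I did (not the paper's code, just assuming roughly normally distributed weights): the sign and mantissa bits of bf16 are nearly incompressible, but the exponent byte only carries a couple of bits of entropy.

  # Rough sketch (mine, not the paper's code) of why bf16 weights compress
  # to ~11 bits losslessly: for typical trained weights the exponent byte
  # carries only a few bits of entropy. Assumes roughly normal weights.
  import numpy as np

  def entropy_bits(symbols):
      # Shannon entropy of the empirical symbol distribution, in bits
      _, counts = np.unique(symbols, return_counts=True)
      p = counts / counts.sum()
      return float(-(p * np.log2(p)).sum())

  rng = np.random.default_rng(0)
  w = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)

  # bfloat16 is the top 16 bits of float32: 1 sign, 8 exponent, 7 mantissa
  bits = w.view(np.uint32) >> 16
  sign = (bits >> 15) & 0x1
  exponent = (bits >> 7) & 0xFF
  mantissa = bits & 0x7F

  print(f"sign:     {entropy_bits(sign):.2f} bits of 1")   # ~1, incompressible
  print(f"exponent: {entropy_bits(exponent):.2f} bits of 8")  # ~2-3
  print(f"mantissa: {entropy_bits(mantissa):.2f} bits of 7")  # ~7, incompressible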
I read it similarly: this is a specific attribute of bfloat16, so the quants folks tend to run on local hardware don't have the same inefficiency to exploit.
True, but their research did include running locally on a 5080.
The big takeaway, in my opinion, is that their LUT technique could also be applied to lossy quants. Maybe you get 5-bit accuracy in the size of 4-bit?
I don't know, but maybe? Also, their two-stage design might make current quantized kernel designs better.
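To put a rough number on the "5-bit accuracy in the size of 4-bit" idea (again just my own sketch, assuming roughly normal weights and a plain uniform quantizer, nothing from the paper): the outer 5-bit codes are much rarer than the inner ones, so an entropy coder gets them down to about 4 bits per weight on average.

  # Sketch: entropy of plain uniform 5-bit quantization of normal weights.
  # Quantile-style quantizers (e.g. NF4) already equalize the code
  # distribution, so they would see much less benefit from this.
  import numpy as np

  rng = np.random.default_rng(0)
  w = rng.normal(0.0, 1.0, size=1_000_000)

  # Symmetric 5-bit (32-level) uniform quantization over +/- 4 sigma
  levels = 32
  codes = np.clip(np.round((w + 4.0) / 8.0 * (levels - 1)), 0, levels - 1).astype(int)

  counts = np.bincount(codes, minlength=levels)
  p = counts[counts > 0] / counts.sum()
  print(-(p * np.log2(p)).sum())  # ~4 bits per weight instead of 5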