Comment by moffkalast a day ago

Not as big a win when Q8 quantization is already considered overkill and cuts it down to 50% (with a flat 2x speed boost and no additional compute overhead, mind you), and the more common Q4KM is more like 30%. Definitely interesting if it can be layered on top of existing quantization, but K-quants already use different precision levels for different layers depending on their general perplexity impact, which is similar to the entropy metric they use here, e.g. Q6 uses a mix of 4-bit and 8-bit weights. And that's not even considering calibrated imatrix, which does something conceptually similar to an FFT to compress even further.
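
For a rough sketch of the block-quantization idea (loosely in the spirit of llama.cpp's Q8_0, i.e. one absmax scale per small block of weights; the real storage format differs in details), a toy numpy version shows why it's cheap but inherently lossy:

    import numpy as np

    def quantize_q8_blocks(weights, block_size=32):
        """Toy block-wise 8-bit quantization: one float scale per block."""
        w = weights.reshape(-1, block_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # absmax scale per block
        scale = np.where(scale == 0, 1.0, scale)
        q = np.round(w / scale).astype(np.int8)
        return q, scale

    def dequantize_q8_blocks(q, scale, shape):
        return (q.astype(np.float32) * scale).reshape(shape)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(1024, 1024)).astype(np.float32)
    q, s = quantize_q8_blocks(w)
    w_hat = dequantize_q8_blocks(q, s, w.shape)
    print("mean abs error:", np.abs(w - w_hat).mean())   # small, but not zero: lossy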

janalsncm a day ago

Quantization is not lossless.

  • danielmarkbruce a day ago

    Nobody really cares if it meets a strict definition of lossless.

    • moffkalast a day ago

      And when you consider that the usual final step in the pipeline is that a sampler goes ham on the probabilities and just picks some random nonsense, the tolerance for lossy compression is fairly high.

      In fact, there's this funny occurrence where Q4 models occasionally perform better than their fp16 counterparts on benchmarks run with top_k=1, since the outputs are slightly more random and they can blunder past the local maximum into a more correct solution instead of deterministically getting stuck in it.
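
      To illustrate the mechanism with toy numbers (not from any real model): when two candidate tokens are nearly tied, a tiny perturbation of the logits, roughly what quantization error amounts to, can flip the greedy top_k=1 pick:

          import numpy as np

          rng = np.random.default_rng(42)
          logits = np.array([3.01, 3.00, 0.5, -1.2])   # two near-tied candidate tokens

          trials, flips = 10_000, 0
          for _ in range(trials):
              noisy = logits + rng.normal(scale=0.05, size=logits.shape)  # stand-in for quantization noise
              flips += int(np.argmax(noisy) != np.argmax(logits))

          print(f"greedy pick changed in {flips / trials:.1%} of trials")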

      • Der_Einzige 17 hours ago

        We got an oral at ICLR for calling out how shit samplers like top_p and top_k are. Use min_p!
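
        For reference, min_p keeps only tokens whose probability is at least min_p times the top token's probability, so the cutoff scales with the model's confidence. A minimal numpy sketch (the min_p and temperature values here are just examples):

            import numpy as np

            def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=np.random.default_rng()):
                """Keep tokens with prob >= min_p * max prob, renormalize, then sample."""
                scaled = logits / temperature
                probs = np.exp(scaled - scaled.max())
                probs /= probs.sum()
                keep = probs >= min_p * probs.max()      # cutoff scales with model confidence
                probs = np.where(keep, probs, 0.0)
                probs /= probs.sum()
                return rng.choice(len(probs), p=probs)

            logits = np.array([5.0, 4.8, 2.0, -3.0, -7.0])
            print(min_p_sample(logits))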

        • moffkalast 12 hours ago

          True yep, I wish more people benchmarked models with more representative sampler settings and then took the average of 5 or 10 responses.

    • BoorishBears a day ago

      I do? I spend a ton of time post-training models for creative tasks.

      The effects of model quantization are usually qualified in terms of performance on benchmaxxed tasks with strong logit probabilities, temp 0, and a "right" answer the model has to pick. Or, even worse, they'll be measured on metrics that don't map to anything except themselves, like perplexity (https://arxiv.org/pdf/2407.09141).
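
      For anyone unfamiliar: perplexity is just the exponential of the average negative log-likelihood the model assigns to the reference tokens, which is why it mostly compares a model against itself rather than against task quality. A toy sketch with made-up numbers:

          import numpy as np

          def perplexity(token_log_probs):
              """exp(mean negative log-likelihood) over the reference tokens."""
              return float(np.exp(-np.mean(token_log_probs)))

          # log-probs the model assigned to each ground-truth token (made-up values)
          log_probs = np.log([0.4, 0.1, 0.7, 0.05, 0.3])
          print(perplexity(log_probs))   # lower is "better", but it isn't a task metric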

      I agree Q8 is strong, but I also think the effects of quantization are consistently underappreciated. People are often talking about how these models perform while fundamentally using 10+ variants of a single model, each with a distinct performance profile.

      Even knowing the bits per weight used isn't enough to know how exactly a given quant method is affecting the model: https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs

      • imtringued 10 hours ago

        If you'd trained your own models, you'd be aware of quantization-aware training.
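
        For context, QAT simulates the quantization in the forward pass and routes gradients around the rounding with a straight-through estimator, so the weights learn to tolerate it. A minimal PyTorch sketch (bit width and scaling scheme are simplified examples, not any particular recipe):

            import torch

            def fake_quant(w, bits=8):
                """Simulate int quantization in the forward pass (straight-through estimator)."""
                qmax = 2 ** (bits - 1) - 1
                scale = w.abs().max().clamp(min=1e-8) / qmax
                q = torch.round(w / scale).clamp(-qmax, qmax) * scale
                return w + (q - w).detach()   # forward: quantized values; backward: identity

            class QATLinear(torch.nn.Linear):
                def forward(self, x):
                    return torch.nn.functional.linear(x, fake_quant(self.weight), self.bias)

            layer = QATLinear(16, 16)
            layer(torch.randn(4, 16)).sum().backward()
            print(layer.weight.grad.shape)   # gradients reach the full-precision weights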

      • danielmarkbruce a day ago

        "Nobody really cares if it meets a strict definition of lossless" != "quantization can be done haphazardly."

    • kridsdale3 a day ago

      That's not true. People do care if there are measurable performance differences.

      • danielmarkbruce a day ago

        "strict" means something. People, including yourself, only care if there is a practical difference in performance. "this is lossless and that isn't lossless" is a completely useless statement in this realm. In many domains lossy compression is either not tolerated, not legal or not practical.

      • kadushka a day ago

        If you get any accuracy degradation with a full 8 bits of precision, you're doing it wrong.

        • omneity a day ago

          Or your model wasn't trained that well (the weights are too spiky).
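
          A quick toy example of the "spiky weights" problem: with a single per-tensor absmax scale, one outlier stretches the int8 grid and every other weight loses precision (per-block scales, as in the sketch near the top of the thread, are one mitigation):

              import numpy as np

              def int8_roundtrip(w):
                  scale = np.abs(w).max() / 127.0    # one absmax scale for the whole tensor
                  return np.round(w / scale).clip(-127, 127) * scale

              rng = np.random.default_rng(0)
              w = rng.normal(scale=0.02, size=10_000)
              print("well-behaved:", np.abs(w - int8_roundtrip(w)).mean())

              w_spiky = w.copy()
              w_spiky[0] = 5.0                       # one outlier weight ("spike")
              print("spiky:       ", np.abs(w_spiky - int8_roundtrip(w_spiky)).mean())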