Comment by jhj a day ago

This is just a consequence of the fact that bfloat16 has a very high dynamic range, most of which goes unused. People like hyperparameters that look like 0.01, not 10^10, even though the same fractional precision is available at each exponent. If you multiplied everything in a network - hyperparameters, initialized weights, training data, etc. - by 10^6, things would still work more or less the same, since the upper part of the range is hardly used (with the possible exception of a small number of special functions).

The typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the 16-bit budget is used in practice). The sign and mantissa bits tend to be incompressible noise.
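
As a rough sanity check on numbers like these, here is a small NumPy sketch that measures the empirical entropy of the sign, exponent, and mantissa fields of bfloat16 values. The normal(0, 0.02) initializer is just a stand-in for real trained weights, so treat the exact figures as illustrative:

```python
import numpy as np

# Stand-in for trained weights; real checkpoints will differ somewhat.
w = np.random.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)  # top 16 bits of fp32 = bfloat16

def entropy_bits(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

sign     = (bf16 >> 15) & 0x1
exponent = (bf16 >> 7) & 0xFF
mantissa = bf16 & 0x7F

print("whole value:", entropy_bits(bf16))      # roughly 10-11 of the 16 bits
print("sign:       ", entropy_bits(sign))      # ~1 bit, incompressible
print("exponent:   ", entropy_bits(exponent))  # ~3 bits, the compressible part
print("mantissa:   ", entropy_bits(mantissa))  # ~7 bits, close to noise
```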

This has been exploited several times before in both classical HPC and AI: lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu). We used dietgpu to speed up training on a large GPU cluster by about 10% in overall wall-clock time by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights from backup, etc.); since it is lossless, the cluster still computes exactly the same thing as before.

Also, rANS is more efficient than Huffman coding and easier to implement on SIMD-like instruction sets. It would also reduce DFloat11's latency/throughput penalties (since we have to decompress before we do the arithmetic).
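
For anyone who wants to see what the coder looks like, here is a minimal single-state, byte-wise rANS sketch in plain Python, modeled on the ryg_rans formulation. It is purely illustrative: a production coder would interleave several states and vectorize them, and the 4-symbol alphabet below is a toy assumption, not anything from DFloat11.

```python
PROB_BITS = 12
PROB_SCALE = 1 << PROB_BITS   # symbol frequencies must sum to this
RANS_L = 1 << 23              # lower bound of the normalization interval


def build_tables(freqs):
    """Cumulative frequencies plus a slot -> symbol lookup for decoding."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    assert cum[-1] == PROB_SCALE
    slot_to_sym = [0] * PROB_SCALE
    for sym in range(len(freqs)):
        for slot in range(cum[sym], cum[sym + 1]):
            slot_to_sym[slot] = sym
    return cum, slot_to_sym


def rans_encode(symbols, freqs):
    cum, _ = build_tables(freqs)
    state = RANS_L
    renorm = bytearray()                      # renorm bytes, in emission order
    for s in reversed(symbols):               # rANS encodes in reverse
        f, c = freqs[s], cum[s]
        x_max = ((RANS_L >> PROB_BITS) << 8) * f
        while state >= x_max:                 # renormalize: shift bytes out
            renorm.append(state & 0xFF)
            state >>= 8
        state = ((state // f) << PROB_BITS) + (state % f) + c
    # final 32-bit state first, then renorm bytes in reverse emission order,
    # so the decoder can read the stream front to back
    return state.to_bytes(4, "little") + bytes(reversed(renorm))


def rans_decode(stream, n, freqs):
    cum, slot_to_sym = build_tables(freqs)
    state = int.from_bytes(stream[:4], "little")
    pos, out, mask = 4, [], PROB_SCALE - 1
    for _ in range(n):
        slot = state & mask
        s = slot_to_sym[slot]
        out.append(s)
        state = freqs[s] * (state >> PROB_BITS) + slot - cum[s]
        while state < RANS_L and pos < len(stream):   # pull bytes back in
            state = (state << 8) | stream[pos]
            pos += 1
    return out


if __name__ == "__main__":
    import random
    random.seed(0)
    freqs = [2048, 1024, 768, 256]            # toy skewed distribution
    msg = random.choices(range(4), weights=freqs, k=100_000)
    blob = rans_encode(msg, freqs)
    assert rans_decode(blob, len(msg), freqs) == msg
    print(f"{len(msg) * 2 / 8:.0f} raw bytes -> {len(blob)} compressed bytes")
```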

iandanforth 21 hours ago

For those who don't bother to click through profiles: Jeff really knows what he's talking about. Much of Meta/FAIR and the wider community benefits from his code.

  • VladVladikoff 19 hours ago

    I really love HN for this reason. Full of some of the brightest minds on the internet. Often the comments have very interesting information, instead of stupid knee jerk reactions to post titles.

vessenes 18 hours ago

Thanks Jeff -- can you point me to something written up about rANS? All I find online is turbulence modeling solutions; I presume that's not what you're referring to.

As we know, quantization is a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of any other, better lossless compression of BF16 weights out there?

The reason I ask is that this DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume it's a gap in my understanding, and I'd like to understand.

  • zorgmonkey 18 hours ago

    I don't know of any great write-ups unfortunately, but the rANS you're looking for is range asymmetric numeral systems.

bjornsing 14 hours ago

> If you multiplied everything in a network - hyperparameters, initialized weights, training data, etc. - by 10^6, things would still work more or less the same, since the upper part of the range is hardly used (with the possible exception of a small number of special functions)

I doubt that very much. The thing is that inputs are multiplied by weights and added together in a neural network layer, and then the output becomes the input of the next layer, in a cycle that can repeat a hundred times or more. By the time you get to the final output layer, that 10^6 factor has been applied so many times that it has snowballed into something like a 10^600 factor.
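
A quick numeric illustration of the snowballing, using a toy ReLU MLP with no normalization layers (that absence is an assumption of the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width, scale = 6, 64, 1e6
Ws = [rng.normal(0, 0.05, size=(width, width)) for _ in range(depth)]
x = rng.normal(size=width)

h, h_scaled = x, x * scale
for W in Ws:
    h = np.maximum(W @ h, 0.0)                           # baseline ReLU layers
    h_scaled = np.maximum((W * scale) @ h_scaled, 0.0)   # everything scaled by 1e6

ratio = np.abs(h_scaled).max() / np.abs(h).max()
print(f"output scaled up by ~1e{np.log10(ratio):.0f}")   # ~1e42 after only 6 layers
```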

  • ironbound 11 hours ago

    The DeepSeek-V3 paper details a quantisation method that applies scaling after the matmul but before accumulation, to improve precision. This is different from a normal GEMM, where the scaling is left until the end; you can read more in section 3.3 of the paper below.

    https://arxiv.org/html/2412.19437v2#S3
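
    A hedged NumPy sketch of the general idea, with int8 blocks standing in for FP8 tiles and a single scale per K block rather than the paper's exact tiling: each block's low-precision partial product is rescaled and then added into a float32 accumulator, instead of leaving all the scaling until the end.

```python
import numpy as np

def quant_block(x):
    """Symmetric int8 quantization of one block; returns (q, scale)."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def blockwise_gemm(A, B, block=128):
    """A @ B where each K-block's partial product is rescaled to float32
    and accumulated immediately, rather than deferring scaling to the end."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)                    # high-precision accumulator
    for k0 in range(0, K, block):
        qa, sa = quant_block(A[:, k0:k0 + block])
        qb, sb = quant_block(B[k0:k0 + block, :])
        partial = qa.astype(np.int32) @ qb.astype(np.int32)   # low-precision matmul
        C += partial.astype(np.float32) * (sa * sb)           # scale, then accumulate
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 512)).astype(np.float32)
B = rng.normal(size=(512, 64)).astype(np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)
err = np.abs(blockwise_gemm(A, B) - ref).max() / np.abs(ref).max()
print(f"max relative error vs full-precision GEMM: {err:.1e}")
```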

refibrillator 16 hours ago

Note to others reading along: on the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B models (the throughput penalty is not reported for the others).

Tokens/sec with DFloat11 was higher only when compared to running inference with some layers offloaded to the CPU.

Classic comp sci tradeoff between space and speed, no free lunch, etc.

liuliu 13 hours ago

That makes you think: if we could rewind time, maybe we should have just allocated one more bit to half precision (6 exponent, 9 mantissa bits) and not done this whole bfloat16 thing.

brookst 15 hours ago

Thanks for the fantastic explanation!

Would it be more efficient to calculate some kind of per-model or per-layer mean, and then only specify deviations from it, maybe in fp8 or smaller?

hinkley 18 hours ago

Do you think there’s a call for introducing an even smaller float that can pack more values into a SIMD register? Like a 12-bit one?

  • boulos 15 hours ago

    The latest GPUs and TPUs support fp8. It's a big part of the efficiency gain in the latest systems. Blackwell also supports fp4.