Comment by refibrillator 16 hours ago
Note to others reading along: in the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B models (the throughput penalty isn't reported for the others).
With DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to the CPU.
Classic comp sci tradeoff between space and speed, no free lunch, etc.
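For anyone who wants to sanity-check the throughput numbers on their own hardware, here's a minimal tokens/sec benchmark sketch using Hugging Face transformers. The model ID and generation settings are placeholders, and I haven't reproduced the DFloat11 loading step here since that depends on their own library; the idea is just to time the same generation loop under each weight format and compare.

```python
# Minimal tokens/sec benchmark sketch. Assumes a CUDA GPU and the
# `torch` + `transformers` packages; the model ID is a placeholder.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "The space/speed tradeoff in LLM inference"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Warm up once so one-time setup costs don't skew the timing.
model.generate(**inputs, max_new_tokens=8)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```

Run the same loop against a DFloat11 build of the same checkpoint and the gap the appendix reports should show up directly in the two numbers.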