Comment by lsb
That’s wild that with a KV cache and compilation on the Mac CPU you are faster than on an A100 GPU.
> That’s wild that with a KV cache and compilation on the Mac CPU you are faster than on an A100 GPU.
No, the compiled version is actually faster.
From that table, the A100 tok/sec (larger is faster) numbers are:
- Eager: 28
- Compiled: 128
And
- KV cache eager: 26
- KV cache compiled: 99
The KV cache is likely slower on the GPU because the code isn't GPU-optimized. On CPU, the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the tensors on the device instead of `torch.cat`ting them on the fly.
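To illustrate the pre-allocation idea, here is a minimal sketch, not the code from the repo. The class `PreallocatedKVCache`, the function `cat_update`, and the `(batch, heads, seq_len, head_dim)` layout are assumptions made up for this example. The point is only that the buffers are allocated once and then written in place, whereas `torch.cat` allocates and copies a new tensor on every step. Whether this closes the gap on the A100 is not measured here.

```python
import torch


class PreallocatedKVCache:
    """Hypothetical helper: fixed-size key/value buffers that are written in place."""

    def __init__(self, batch, heads, max_len, head_dim,
                 device="cpu", dtype=torch.float32):
        shape = (batch, heads, max_len, head_dim)
        # Allocate once, directly on the target device.
        self.k = torch.zeros(shape, device=device, dtype=dtype)
        self.v = torch.zeros(shape, device=device, dtype=dtype)
        self.pos = 0  # number of token positions filled so far

    def update(self, k_new, v_new):
        # k_new / v_new are assumed to have shape (batch, heads, n_new, head_dim).
        end = self.pos + k_new.shape[2]
        if end > self.k.shape[2]:
            raise ValueError("KV cache is full")
        # In-place slice assignment: no new allocation for the cache itself.
        self.k[:, :, self.pos:end] = k_new
        self.v[:, :, self.pos:end] = v_new
        self.pos = end
        # Return views over the filled part of the buffers.
        return self.k[:, :, :end], self.v[:, :, :end]


def cat_update(k_cache, v_cache, k_new, v_new):
    """Baseline for comparison: grow the cache with torch.cat on every step."""
    if k_cache is None:
        return k_new, v_new
    return (torch.cat([k_cache, k_new], dim=2),
            torch.cat([v_cache, v_new], dim=2))
```

Both variants return the same keys and values; they differ only in how the memory is managed. On a GPU you would pass `device="cuda"` when constructing the cache so the buffers live on the device from the start.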
It could also be an artifact of the small model size not fully taking advantage of the GPU. For example, for the slightly larger Qwen3 0.6B model, the A100 is faster (you can see it by scrolling to the bottom here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11...)