Comment by lsb 5 days ago

That’s wild that with a KV cache and compilation on the Mac CPU you are faster than on an A100 GPU.

ladberg 5 days ago

Given that the compiled version is slower than the eager version on A100, there's definitely something suboptimal happening there.

  • ModelForge 5 days ago

    No, the compiled version is actually faster.

    From that table, the A100 tok/sec (larger is faster) numbers are:

    - Eager: 28

    - Compiled: 128

    And

    - KV cache eager: 26

    - KV cache compiled: 99

    The KV cache version is likely slower because it's not GPU-optimized code. On CPU, the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the tensors on the device instead of `torch.cat`ting them on the fly.
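To illustrate the difference described above, here is a minimal sketch of the two cache strategies. It is not the code from the article: the class names `ConcatKVCache` and `PreallocatedKVCache`, the `(batch, heads, seq_len, head_dim)` layout, and the `max_len` parameter are all assumptions made for this example.

```python
import torch


class ConcatKVCache:
    """Grows the cache with torch.cat, so every step allocates new tensors."""

    def __init__(self):
        self.k = None
        self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, heads, new_tokens, head_dim) -- assumed layout
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            # Allocates a fresh tensor and copies the whole history each call.
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v


class PreallocatedKVCache:
    """Allocates the full buffer once on the device and writes in place."""

    def __init__(self, batch, heads, max_len, head_dim,
                 device="cpu", dtype=torch.float32):
        shape = (batch, heads, max_len, head_dim)
        self.k = torch.empty(shape, device=device, dtype=dtype)
        self.v = torch.empty(shape, device=device, dtype=dtype)
        self.pos = 0  # number of sequence positions filled so far

    def update(self, k_new, v_new):
        end = self.pos + k_new.shape[2]
        if end > self.k.shape[2]:
            raise ValueError("KV cache is full; increase max_len")
        # In-place writes into the existing buffer, no new allocation.
        self.k[:, :, self.pos:end] = k_new
        self.v[:, :, self.pos:end] = v_new
        self.pos = end
        # Return views covering only the filled part of the buffer.
        return self.k[:, :, :end], self.v[:, :, :end]
```

Both classes return the same keys and values for the same inputs. The pre-allocated variant trades a fixed upper bound on sequence length for avoiding a reallocation and full copy on every generated token; whether that closes the gap on an A100 would need to be measured.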

    • ladberg 4 days ago

      Ah yep, I read the labels backwards and meant that. Ty for catching it and for the explanation.

punnerud 5 days ago

Is it because on the Mac the CPU and GPU share memory, while the A100 needs to transfer data to RAM/CPU for the parts that aren’t supported by the GPU?

(My first guess)

Weryj 5 days ago

This would be because the GPU can’t fill its wavefronts and hide memory latency, no? I’m curious about the reason why.