Comment by petermcneeley 3 days ago

In the actual implementation, they are very much like very wide SIMD units on a CPU core. Each warp corresponds to a separate hardware thread, since each warp can execute different instructions.

This mapping is so close that translating GPU code to the CPU is relatively easy, and the result performs well.
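To make the mapping concrete, here is a rough sketch in plain Python, not taken from any real translator. The names `WARP_SIZE`, `run_warp`, and `launch` are made up for illustration, and 32 lanes per warp is an assumption. Each warp is handed to its own worker thread (standing in for a hardware thread), and inside a warp every lane steps through the same instruction, with a mask deciding which lanes are active on each side of a branch.

```python
from concurrent.futures import ThreadPoolExecutor

WARP_SIZE = 32  # assumed lane count per warp; hypothetical constant


def run_warp(warp_id, data, out):
    """Run one warp: all lanes follow the same instruction stream."""
    base = warp_id * WARP_SIZE
    lanes = range(base, min(base + WARP_SIZE, len(data)))

    # The branch condition is evaluated for every lane, giving a mask.
    mask = [data[i] >= 0 for i in lanes]

    # "Then" side: only lanes with the mask set write a result.
    for i, active in zip(lanes, mask):
        if active:
            out[i] = data[i] * 2

    # "Else" side: the remaining lanes run, the others sit idle.
    for i, active in zip(lanes, mask):
        if not active:
            out[i] = -data[i]


def launch(data):
    """Split the data into warps and give each warp its own thread."""
    out = [0] * len(data)
    n_warps = (len(data) + WARP_SIZE - 1) // WARP_SIZE
    with ThreadPoolExecutor() as pool:
        # Warps are independent, so each may follow a different path.
        # list(...) forces completion and surfaces any exception.
        list(pool.map(lambda w: run_warp(w, data, out), range(n_warps)))
    return out
```

Calling `launch([1, -2, 3])` doubles the non-negative entries and negates the negative one. In a real CPU port the two inner loops would become masked SIMD instructions rather than Python loops; the structure (threads over warps, masked lanes within a warp) is the part being illustrated.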