Comment by einpoklum
> warps weren't issued instructions unless they were ready to execute
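This is true, but once an instruction has been issued, it still takes a number of cycles for its execution to conclude, so the SM needs other ready warps (or other independent instructions) to issue in the meantime.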
> it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once
Just replace "most" with "some". It really depends on what kind of kernel you're writing.
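
To put rough numbers on the latency point, here is a back-of-the-envelope sketch in plain C++ (host-side arithmetic only, not CUDA API code). It applies Little's law, work in flight = latency × issue rate, to estimate how many warps a scheduler needs resident to stay busy. The helper name `warps_to_hide_latency` and every number in it are assumptions for illustration, not measurements of any real GPU.

```cpp
// Hypothetical helper, not part of any CUDA API.
// Little's law: instructions that must be in flight = latency * issue rate.
// Each warp contributes as many in-flight instructions as it has
// independent instructions available (its instruction-level parallelism).
constexpr int warps_to_hide_latency(int latency_cycles,
                                    int issues_per_cycle,
                                    int independent_instrs_per_warp) {
    // Round up: a fraction of a warp still means one more whole warp.
    return (latency_cycles * issues_per_cycle + independent_instrs_per_warp - 1)
           / independent_instrs_per_warp;
}

// Assumed figures: 20-cycle latency, one issue per cycle.
// A kernel with no independent instructions per warp needs about 20 warps.
static_assert(warps_to_hide_latency(20, 1, 1) == 20, "ILP 1");
// A kernel with 4 independent instructions per warp needs only about 5.
static_assert(warps_to_hide_latency(20, 1, 4) == 5, "ILP 4");
```

Under these assumed figures, the second kernel hides the same latency with a quarter of the warps, which is one way to see why "more threads per block" is not a universal rule.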