Comment by bassp
I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.
There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).
> I was taught that you want, usually, more threads per block > than each SM can execute, because SMs context switch between > threads (fancy hardware multi threading!) on memory read > stalls to achieve super high throughput.
You were taught wrong...
First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can up to issue up to 4 instructions, one for each of 4 warps per cycle (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".
Sometimes, more than 432 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:
Amount of shared memory used per warp
* Makeup of the instructions to be executed
* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).