Comment by bassp 6 months ago

I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).

einpoklum 6 months ago

> I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

You were taught wrong...

First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can issue up to 4 instructions per cycle, one for each of 4 warps (on NVIDIA hardware for the last 10 years). But there is no such thing as an SM "context switch between threads".

Sometimes, more than 4 × 32 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:

* Amount of shared memory used per warp

* Makeup of the instructions to be executed

* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).
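To put rough numbers on the register-pressure point, here is a minimal Python sketch of the per-thread register ceiling as a function of block size. It assumes 65,536 32-bit registers available to a block and a cap of 255 registers per thread, which matches many recent NVIDIA architectures but should be checked against your own GPU. The helper name `max_regs_per_thread` is made up for illustration, and register allocation granularity is ignored.

```python
# Assumed hardware limits; check the compute-capability table for your GPU.
REGISTERS_PER_BLOCK = 65536   # assumption: 64K 32-bit registers available to one block
MAX_REGS_PER_THREAD = 255     # assumption: architectural per-thread cap


def max_regs_per_thread(threads_per_block: int) -> int:
    """Upper bound on registers per thread for a block size (hypothetical helper)."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    # The register file is shared by every thread in the block,
    # and no thread can exceed the architectural cap.
    return min(MAX_REGS_PER_THREAD, REGISTERS_PER_BLOCK // threads_per_block)


for threads in (128, 256, 512, 1024):
    print(threads, max_regs_per_thread(threads))
```

Under these assumptions the ceiling stays at 255 registers up to 256 threads per block, then falls to 128 at 512 threads and 64 at 1024 threads.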


bassp 6 months ago

Sorry if I was sloppy with my wording, instruction issuance is what I meant :)
I thought that warps weren't issued instructions unless they were ready to execute (ie had all the data they needed to execute the next instruction), and that therefore it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?

- einpoklum 6 months ago
  
  > warps weren't issued instructions unless they were ready to execute
  This is true, but after they've been issued, it still takes a while for the execution to conclude.
  > it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once
  Just replace "most" with "some". It really depends on what kind of kernel you're writing.
  
delifue 6 months ago

The GPU Glossary mentions that a warp scheduler can context switch https://modal.com/gpu-glossary/device-hardware/warp-schedule... but you said there is no such thing as an SM "context switch between threads". Is there some ambiguity in the term "context switch"?


saagarjha 6 months ago

Pretty sure CUDA will limit your thread count to hardware constraints? You can’t just request a million threads.


bassp 6 months ago

You can request up to 1024-2048 threads per block depending on the gpu; each SM can execute between 32 and 128 threads at a time! So you can have a lot more threads assigned to an SM than the SM can run at once
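As a rough illustration of "more threads than the SM can run at once", the Python sketch below converts a block size into warps and compares that with the 4 warps per cycle that einpoklum's comment says an SM can issue for. The warp size of 32 is standard on NVIDIA GPUs; the issue-slot count is taken from that comment rather than measured, and both helper names are hypothetical.

```python
WARP_SIZE = 32           # threads per warp on NVIDIA GPUs
ISSUE_SLOTS_PER_SM = 4   # assumption: warps issued per cycle, as quoted in this thread


def warps_in_block(threads_per_block: int) -> int:
    """Number of warps a block occupies (hypothetical helper)."""
    # Round up: a partially filled warp still takes a whole warp slot.
    return -(-threads_per_block // WARP_SIZE)


def warps_per_issue_slot(threads_per_block: int) -> float:
    """How many warps compete for each per-cycle issue slot."""
    return warps_in_block(threads_per_block) / ISSUE_SLOTS_PER_SM


for threads in (100, 128, 1024):
    print(threads, warps_in_block(threads), warps_per_issue_slot(threads))
```

With these figures a 128-thread block supplies exactly one warp per issue slot, while a 1024-thread block supplies eight, which is the slack a scheduler could draw on while other warps wait on memory.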

- saagarjha 6 months ago
  
  Right, ok. So you mean a handful of warps and not like a plethora of them for no reason.
  
buildbot 6 months ago

Thread counts per block are limited to 1024 (unless I’ve missed a change and Wikipedia is wrong), but total threads per kernel is 1024 * (2^32-1) * 65535 * 65535 ~= 2^74 threads
https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...
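The arithmetic can be checked with a short Python sketch. The first call uses the limits exactly as quoted in this comment. The second call is an assumption on my part: as far as I know, the CUDA programming guide lists 2^31-1 rather than 2^32-1 for the grid's x dimension on current compute capabilities, so verify that against the documentation for your device. The function name is hypothetical.

```python
import math


def max_threads_per_launch(threads_per_block: int, grid_x: int, grid_y: int, grid_z: int) -> int:
    """Product of the per-block thread limit and the three grid-dimension limits."""
    return threads_per_block * grid_x * grid_y * grid_z


# Limits as quoted in the comment above.
quoted = max_threads_per_launch(1024, 2**32 - 1, 65535, 65535)

# Assumption: x-dimension limit of 2^31 - 1 instead of 2^32 - 1.
guide = max_threads_per_launch(1024, 2**31 - 1, 65535, 65535)

print(math.log2(quoted))  # just under 74
print(math.log2(guide))   # just under 73
```

Either way the launch-wide limit is astronomically larger than the per-block limit, which is the point being made.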

- saagarjha 6 months ago
  
  Yeah I’m talking about the limit per-block.
  