Comment by jms55 4 days ago
The weird part of the programming model is that threadblocks don't map 1:1 to warps or SMs. A single threadblock executes on a single SM, but each SM runs multiple warps, and a threadblock can be as small as a single warp or larger than the number of threads the SM can execute at once.
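To make the block-to-warp mapping concrete, here is a minimal sketch of the arithmetic, assuming the hardware simply splits a threadblock into fixed-width warps and rounds up. The helper names `warps_per_block` and `idle_lanes` are made up for illustration and are not part of any GPU API.

```python
import math

WARP_SIZE = 32  # NVIDIA warp width; AMD wavefronts are 32 or 64 lanes wide


def warps_per_block(block_size: int, warp_size: int = WARP_SIZE) -> int:
    """Number of warps needed to hold one threadblock (rounded up)."""
    return math.ceil(block_size / warp_size)


def idle_lanes(block_size: int, warp_size: int = WARP_SIZE) -> int:
    """Lanes in the last warp that have no thread assigned to them."""
    return warps_per_block(block_size, warp_size) * warp_size - block_size


if __name__ == "__main__":
    # Hypothetical block sizes, chosen only to show the rounding behaviour.
    for size in (32, 48, 128, 1024):
        print(f"block={size:5d} warps={warps_per_block(size):3d} "
              f"idle_lanes={idle_lanes(size):3d}")
```

A block size that isn't a multiple of the warp width still occupies a whole extra warp, which is one reason the usual suggestions are multiples of 32 or 64.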
So, how large do you make your threadblocks to get optimal SM/warp scheduling? Well, it depends on resource usage, divergence, etc. Basically: run it, profile, change the threadblock size, profile again, and so on. Repeat on every GPU/platform (if you're targeting multiple GPU platforms and not just CUDA, as games do). It's a huge pain, and the best size is very sensitive to code changes.
People new to GPU programming ask me "how big do I make the threadblock size?" and I tell them to go with 64 or 128 to start, then profile and adjust as needed.
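As a rough illustration of why the answer is "it depends", here is a simplified occupancy model that sweeps a few block sizes. Everything in `SMLimits` is an illustrative assumption (not taken from this thread or from any specific GPU), and the model ignores real-world details like register allocation granularity, so treat it as a sketch of the reasoning rather than a replacement for profiling.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Illustrative per-SM limits (assumptions); look up the real values for your GPU.
    warp_size: int = 32
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536


def resident_blocks(block_size, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Blocks that fit on one SM at once: the tightest of four limits."""
    warps = math.ceil(block_size / sm.warp_size)
    by_warps = sm.max_warps // warps
    by_regs = sm.registers // (regs_per_thread * warps * sm.warp_size)
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks
    return min(sm.max_blocks, by_warps, by_regs, by_smem)


def occupancy(block_size, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Resident warps divided by the SM's warp slots (theoretical occupancy)."""
    warps = math.ceil(block_size / sm.warp_size)
    blocks = resident_blocks(block_size, regs_per_thread, smem_per_block, sm)
    return blocks * warps / sm.max_warps


if __name__ == "__main__":
    # Hypothetical kernel: 40 registers per thread, 8 KiB shared memory per block.
    for block_size in (32, 64, 128, 256, 512, 1024):
        occ = occupancy(block_size, regs_per_thread=40, smem_per_block=8192)
        print(f"block={block_size:5d} occupancy={occ:.3f}")
```

With these made-up numbers, occupancy rises with block size, plateaus, then drops again at the largest size because a single huge block eats most of the register file. Change the kernel's register or shared memory usage and the sweet spot moves, which is why the profile-and-adjust loop is hard to avoid. On CUDA specifically, the runtime's `cudaOccupancyMaxPotentialBlockSize` can suggest a starting block size from the compiled kernel's actual resource usage.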
Two articles on the AMD side of things:
https://gpuopen.com/learn/occupancy-explained
https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...
I was taught that you usually want more threads per block than each SM can execute at once, because SMs switch between warps (fancy hardware multithreading!) on memory read stalls to achieve super high throughput.
There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).
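The trade-off between latency hiding and register pressure can be sketched with some back-of-the-envelope arithmetic. This is a toy model with assumed numbers (a 400-cycle memory latency, 20 cycles of independent work per warp between loads, a 65536-entry register file, a single issue unit), none of which come from the thread or from a specific GPU; the function names are hypothetical.

```python
import math


def warps_to_hide_latency(latency_cycles: int, busy_cycles_per_warp: int) -> int:
    """Warps needed so one is always ready to issue.

    While one warp waits latency_cycles on a load, each other warp can fill
    busy_cycles_per_warp of that time, so we need ceil(latency / busy)
    other warps plus the waiting warp itself.
    """
    return math.ceil(latency_cycles / busy_cycles_per_warp) + 1


def warps_allowed_by_registers(regs_per_thread: int,
                               register_file: int = 65536,  # assumed size
                               warp_size: int = 32) -> int:
    """Warps that fit in the register file at this per-thread register count."""
    return register_file // (regs_per_thread * warp_size)


def can_hide_latency(regs_per_thread: int,
                     latency_cycles: int = 400,        # assumed memory latency
                     busy_cycles_per_warp: int = 20) -> bool:
    """True if register pressure still leaves enough warps to cover a stall."""
    needed = warps_to_hide_latency(latency_cycles, busy_cycles_per_warp)
    return warps_allowed_by_registers(regs_per_thread) >= needed


if __name__ == "__main__":
    for regs in (32, 64, 128):
        print(f"regs/thread={regs:4d} "
              f"warps_allowed={warps_allowed_by_registers(regs):3d} "
              f"hides_latency={can_hide_latency(regs)}")
```

Under these assumptions, doubling the registers per thread halves the number of warps available to cover a stall, so a kernel can go from fully hiding memory latency to sitting idle without the block size changing at all. Real hardware also caps resident warps and has several schedulers per SM, which this sketch ignores.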