Comment by jms55 6 months ago

The weird part of the programming model is that threadblocks don't map 1:1 to warps or SMs. A single threadblock executes on a single SM, but each SM has multiple warps, and the threadblock could be the size of a single warp, or larger than the combined thread count of all warps in the SM.

So, how large do you make your threadblocks to get optimal SM/warp scheduling? Well, it "depends" on resource usage, divergence, etc. Basically: run it, profile, switch the threadblock size, profile again, and so on. Repeat on every GPU/platform (if you're programming for multiple GPU platforms and not just CUDA, like games do). It's a huge pain, and very sensitive to code changes.

People new to GPU programming ask me "how big do I make the threadblock size?" and I tell them go with 64 or 128 to start, and then profile and adjust as needed.
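
The "start with 64 or 128, then profile and adjust" loop can be sketched roughly like this. It is only a minimal illustration: `pick_block_size` and `fake_time_kernel` are hypothetical names, and the cost curve is made up so that the snippet runs without a GPU. In practice `time_kernel` would launch the real kernel and return a measured time.

```python
from typing import Callable, Iterable


def pick_block_size(candidates: Iterable[int],
                    time_kernel: Callable[[int], float],
                    repeats: int = 5) -> int:
    """Return the candidate block size with the lowest best-of-N time."""
    best_size, best_time = None, float("inf")
    for size in candidates:
        # Best-of-N reduces noise from clocks, caches and other processes.
        elapsed = min(time_kernel(size) for _ in range(repeats))
        if elapsed < best_time:
            best_size, best_time = size, elapsed
    if best_size is None:
        raise ValueError("no candidate block sizes given")
    return best_size


def fake_time_kernel(block_size: int) -> float:
    """Stand-in for a real measurement: an invented cost curve, NOT GPU data."""
    return abs(block_size - 128) / 128 + 1.0


if __name__ == "__main__":
    print(pick_block_size([32, 64, 128, 256, 512, 1024], fake_time_kernel))
```

The sweep has to be rerun whenever the kernel or the target GPU changes, which is exactly the pain point described above.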

Two articles on the AMD side of things:

https://gpuopen.com/learn/occupancy-explained

https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...

bassp 6 months ago

I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).
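
A toy model of the latency-hiding argument, for illustration only. It assumes a single scheduler that issues at most one instruction per cycle, and warps that issue `compute_cycles` instructions back to back and then wait `stall_cycles` on memory. Real SMs have several schedulers, pipelined execution and instruction dependencies, so the numbers here are not predictions; `issue_utilization` is a hypothetical helper.

```python
def issue_utilization(num_warps: int, compute_cycles: int, stall_cycles: int,
                      total_cycles: int = 10_000) -> float:
    """Fraction of cycles in which the (single, toy) scheduler issues."""
    ready_at = [0] * num_warps                # first cycle each warp may issue
    remaining = [compute_cycles] * num_warps  # issues left before next stall
    issued = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                issued += 1
                remaining[w] -= 1
                if remaining[w] == 0:
                    # Warp now waits on memory; another warp can take over.
                    ready_at[w] = cycle + 1 + stall_cycles
                    remaining[w] = compute_cycles
                break  # at most one issue per cycle
    return issued / total_cycles


if __name__ == "__main__":
    for warps in (1, 5, 10, 20):
        print(warps, issue_utilization(warps, compute_cycles=4, stall_cycles=36))
```

In this model utilization rises with the number of resident warps until the stalls are fully covered, and extra warps beyond that point add nothing.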


einpoklum 6 months ago

> I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

You were taught wrong...

First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can issue up to 4 instructions per cycle, one for each of 4 warps (on NVIDIA hardware for the last 10 years). But there is no such thing as an SM "context switch between threads".

Sometimes, more than 4 * 32 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:

* Amount of shared memory used per warp
* Makeup of the instructions to be executed
* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).
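
A rough sketch of how these limits interact. The default numbers (65536 registers and 48 KiB of shared memory per SM, 2048 resident threads, 32 resident blocks, a cap of 255 registers per thread) are assumptions that vary by architecture, and real hardware allocates registers and shared memory in coarser granules than this model does. Both function names are hypothetical.

```python
def max_regs_per_thread(threads_per_block: int,
                        regs_per_sm: int = 65536,
                        hw_cap: int = 255) -> int:
    """Upper bound on registers per thread for one block to fit (assumed limits)."""
    return min(hw_cap, regs_per_sm // threads_per_block)


def resident_blocks_per_sm(threads_per_block: int,
                           regs_per_thread: int,
                           smem_per_block: int,
                           regs_per_sm: int = 65536,
                           smem_per_sm: int = 48 * 1024,
                           max_threads_per_sm: int = 2048,
                           max_blocks_per_sm: int = 32) -> int:
    """Blocks that fit on one SM at once; the tightest resource wins."""
    limits = [
        max_blocks_per_sm,
        max_threads_per_sm // threads_per_block,
        regs_per_sm // (regs_per_thread * threads_per_block),
    ]
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)


if __name__ == "__main__":
    for tpb in (128, 256, 512, 1024):
        print(tpb, max_regs_per_thread(tpb))
    print(resident_blocks_per_sm(256, regs_per_thread=64, smem_per_block=0))
```

Under these assumed limits the per-thread register budget stays at the cap up to 256 threads per block and halves with each doubling after that.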

- bassp 6 months ago
  
  Sorry if I was sloppy with my wording, instruction issuance is what I meant :)
  I thought that warps weren't issued instructions unless they were ready to execute (ie had all the data they needed to execute the next instruction), and that therefore it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?
  
  
  einpoklum 6 months ago
  
  > warps weren't issued instructions unless they were ready to execute
  This is true, but after they've been issued, it still takes a while for the execution to conclude.
  > it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once
  Just replace "most" with "some". It really depends on what kind of kernel you're writing.
  
- delifue 6 months ago
  
  The GPU Glossary mentions that a warp scheduler can context switch (https://modal.com/gpu-glossary/device-hardware/warp-schedule...), but you said there is no such thing as an SM "context switch between threads". Is there some ambiguity in the term "context switch"?
  
saagarjha 6 months ago

Pretty sure CUDA will limit your thread count to hardware constraints? You can’t just request a million threads.

- bassp 6 months ago
  
  You can request up to 1024-2048 threads per block depending on the gpu; each SM can execute between 32 and 128 threads at a time! So you can have a lot more threads assigned to an SM than the SM can run at once
  
  
  saagarjha 6 months ago
  
  Right, ok. So you mean a handful of warps and not like a plethora of them for no reason.
  
- buildbot 6 months ago
  
  Thread counts per block are limited to 1024 (unless I've missed a change and Wikipedia is wrong), but total threads per kernel is 1024 * (2^31-1) * 65535 * 65535 ~= 2^73 threads
  https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...
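
The per-launch arithmetic can be checked with a few lines. The default limits below are assumptions (1024 threads per block, a grid x-dimension of 2^31-1, and 65535 for y and z, as documented for recent CUDA compute capabilities); pass different values to see how the exponent moves with the x limit. `max_threads_per_launch` is a hypothetical helper.

```python
import math


def max_threads_per_launch(max_threads_per_block: int = 1024,
                           max_grid_x: int = 2**31 - 1,
                           max_grid_y: int = 65535,
                           max_grid_z: int = 65535) -> int:
    """Product of the (assumed) per-block and per-grid-dimension limits."""
    return max_threads_per_block * max_grid_x * max_grid_y * max_grid_z


if __name__ == "__main__":
    total = max_threads_per_launch()
    print(f"about 2^{math.log2(total):.1f} threads")
```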
  
  
  saagarjha 6 months ago
  
  Yeah I’m talking about the limit per-block.
  

charles_irl 6 months ago

100% -- there's basically no substitute for benchmarking! I find the empiricism kind of comforting, coming from a research science background.

IIUC, even cuBLAS basically just uses a bunch of heuristics that are mostly derived from benchmarking to decide which kernels to use.


einpoklum 6 months ago

> It's a huge pain, and very sensitive to code changes.

Optimization is very often like that. Making things generic, uniform and simple typically has a performance penalty - and you use your GPU because you care about that stuff.


EarlKing 6 months ago

Sounds like the sort of thing that would lend itself to runtime optimization.


jms55 6 months ago

I'm not too informed on the details, but iirc drivers _do_ try to optimize shaders in the background and, when ready, swap in a better version. But I doubt they do stuff like change threadgroup size: the programmer might assume a certain size, and their shader would break if it changed. Also, drivers doing background work means unpredictable performance and stuttering, which developers really don't like.
Someone correct me if I'm wrong, maybe drivers don't do this anymore.

- EarlKing 6 months ago
  
  Well, if the user isn't going to be sharing the GPU with another task then you could push things back to install-time. In other words: At install time you conduct a benchmark on the relevant shaders, rewrite as necessary, recompile, and save the results accordingly. Now the user has a version of your shaders optimized to their particular configuration. Since installation times are already somewhat lengthy anyway you can be reasonably certain that no one is going to miss an extra minute or two needed to conduct benchmarks, especially if it results in installing optimized code.
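
The install-time idea could look something like the sketch below: benchmark once, save the result keyed by the user's configuration, and reuse it on later runs. Everything here is hypothetical (`load_or_tune`, the JSON cache file, the shape of the tuned parameters). A real key would probably combine GPU model and driver version so that a driver update triggers a re-tune.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def load_or_tune(cache_file: Path, device_key: str,
                 tune: Callable[[], Dict[str, int]]) -> Dict[str, int]:
    """Return cached tuning results for this device, benchmarking only on a miss."""
    cache: Dict[str, Dict[str, int]] = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text())
    if device_key not in cache:
        # Cache miss (first install, new GPU, new driver): run the slow benchmark.
        cache[device_key] = tune()
        cache_file.write_text(json.dumps(cache, indent=2))
    return cache[device_key]
```

A call such as `load_or_tune(Path("tuning.json"), "some-gpu/driver-1", run_benchmarks)` would run `run_benchmarks` only the first time that key is seen.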
  
  
  charles_irl 6 months ago
  
  Coming from the neural network world, rather than the shader world, but: I'd say you're absolutely right!
  Right now NNs and their workloads are changing quickly enough that people tend to prefer runtime optimization (like the dynamic/JIT compilation provided by Torch's compiler), but when you're confident you understand the workload and have the know-how, you can do static compilation (e.g. with ONNX, TensorRT).
  I work on a serverless infrastructure product that gets used for NN inference on GPUs, so we're very interested in ways to amortize as much of that compilation and configuration work as possible. Maybe someday we'll even have something like what Redshift has in their query engine -- pre-compiled binaries cached across users.
  
  
  lostmsu 6 months ago
  
  This reminds me of the dreaded Vulkan Shaders Compilation dialog when you try to play some games after a driver update.
  
  
  terribleperson 6 months ago
  
  People complain a lot about shader compilation, but shader compilation on start-up is much nicer than when a game doesn't do that ahead of time and does it when you need those shaders.
  
  
  saagarjha 6 months ago
  
  This is how autotuning often works, yes.
  
amelius 6 months ago

But which programming languages are most amenable to automatic runtime optimization?
Should we go back to FORTRAN?

- EarlKing 6 months ago
  
  The sad answer is... probably none of them. Runtime optimization has always been one of those things that sends most programmers running away screaming, and those who make languages never seem to come from the ranks of those who understand the clear utility of it.
  
- morphle 6 months ago
  
  Squeak Smalltalk has several automatic runtime optimizations and compilers like JIT, parallel load balancing compiler [1], adaptive compiler [2] and a metacircular simulator and byte code virtual machine written in itself that allows you to do runtime optimisations on GPUs. The byte codes are of course replaced with the native GPU instructions at runtime.
  There are dozens of scientific papers and active research is still being done [1].
  I've worked on automatic parallel runtime optimizations and adaptive compilers since 1981. We make reconfigurable hardware (chips and wafers) that also optimises at runtime.
  Truffle/GraalVM is very rigid and overly complicated [6].
  With a meta compiler like Ometa or Ohm we can give any programming language the runtime adaptive compilation for GPUs [3][4].
  I'm currently adapting my adaptive compiler to Apple Silicon M4 GPU and neural engine to unlock the trillions of operations per second these chips can do.
  I can adapt them to more NVIDIA GPUs with the information of the website in the title. Thank you very much charles_irl! I would love to be able to save the whole website in a single PDF.
  I can optimise your GPU software a lot with my adaptive compilers. It will cost less than 100K in labour to speed up your GPU code by a factor of 4-8 at least; sometimes I see a 30-50 times speedup.
  [1] https://www.youtube.com/watch?v=wDhnjEQyuDk
  [2] https://www.youtube.com/watch?v=CfYnzVxdwZE
  [3] https://tinlizzie.org/~ohshima/shadama2/
  [4] https://github.com/yoshikiohshima/Shadama
  [5] http://www.tinlizzie.org/ometa/
  [6] https://github.com/NVIDIA/grcuda
  