Comment by porridgeraisin 7 days ago
A short addition: pre-Volta nvidia GPUs were SIMD, like TPUs are, and not SIMT, which post-Volta nvidia GPUs are.
I'm not aware of any GPU that implements this.
Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.
if (threadIdx.x < 4) {
A;
B;
} else {
X;
Y;
}
Z;
The diagram shows how this executes, in the following order:

Volta:
->|->A ->B ->Z |->
->| ->X ->Y ->Z|->

pre-Volta:
->|->A->B |->Z
->| ->X->Y|->Z
The SIMD equivalent of pre-Volta is:

vslt mask, vid, 4
vopA ..., mask
vopB ..., mask
vopX ..., ~mask
vopY ..., ~mask
vopZ ...
The Volta model is:

vslt mask, vid, 4
vopA ..., mask
vopX ..., ~mask
vopB ..., mask
vopY ..., ~mask
vopZ ...
[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...
[2] https://stackoverflow.com/questions/70987051/independent-thr...
IIUC, Volta brought the ability to run a tail-call state machine (with, let's presume, identically expensive states and a state count less than threads-per-warp) at an average goodput of more than one thread actually active.
Before, it would lose all parallelism, as it couldn't handle different threads having truly separate control flow; it emulated it via predicated execution/lane-masking.
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.
"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").
SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.
What seems to have been added since Volta is some mechanism for fast saving and restoring of a separate program counter for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order. But those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot execute instructions from different CUDA "threads" at the same time unless they perform the same operation, which is the same constraint that exists in any SIMD processor.
Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".
What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However dependencies between the CUDA "threads" of a "warp" shall be avoided whenever possible, because they reduce the achievable performance.
This paper
https://arxiv.org/abs/2407.02944
ventures some guesses as to how Nvidia does this, and runs experiments to confirm them.
I was referring to this portion of TFA
> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.
For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
Thanks, I will look into that.
However, I'm still confused about the original statement. What I had thought was that on pre-Volta GPUs, each thread in a warp has to execute in lock-step, while post-Volta, they can all execute different instructions.
Obviously this is a surface level understanding. How do I reconcile this with what you wrote in the other comment and this one?
SIMT is just a programming model for SIMD.
Modern GPUs are still just SIMD with good predication support at the ISA level.