Comment by porridgeraisin 7 days ago
A short addition: pre-Volta nvidia GPUs were SIMD, like TPUs are, and not SIMT, which post-Volta nvidia GPUs are.
I'm not aware of any GPU that implements this.
Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.
if (threadIdx.x < 4) {
A;
B;
} else {
X;
Y;
}
Z;
The diagram shows how this executes, in the following order:

Volta:
->|->A ->B ->Z |->
->| ->X ->Y ->Z|->

pre-Volta:
->|->A->B |->Z
->| ->X->Y|->Z
The SIMD equivalent of pre-Volta is:

vslt mask, vid, 4
vopA ..., mask
vopB ..., mask
vopX ..., ~mask
vopY ..., ~mask
vopZ ...
The Volta model is:

vslt mask, vid, 4
vopA ..., mask
vopX ..., ~mask
vopB ..., mask
vopY ..., ~mask
vopZ ...
[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...
[2] https://stackoverflow.com/questions/70987051/independent-thr...
IIUC, Volta brought the ability to run a tail-call state machine (with, let's presume, identically expensive states and a state count less than threads-per-warp) at an average goodput of more than one thread actually active.
Before, it would lose all parallelism, as it couldn't handle different threads having truly separate control flow; it emulated it via predicated execution/lane-masking.
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.
"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").
SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.
What seems to have been added since Volta is some mechanism for fast saving and restoring of a separate program counter for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order. But those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot execute instructions from different CUDA "threads" at the same time unless they perform the same operation, which is the same constraint that exists in any SIMD processor.
Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".
What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However dependencies between the CUDA "threads" of a "warp" shall be avoided whenever possible, because they reduce the achievable performance.
This paper
https://arxiv.org/abs/2407.02944
ventures some guesses as to how Nvidia does this, and runs experiments to confirm them.
I was referring to this portion of TFA
> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.
For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
Thanks, I will look into that.
However, I'm still confused about the original statement. What I had thought was that on pre-Volta GPUs, each thread in a warp has to execute in lock-step, while post-Volta, they can all execute different instructions.
Obviously this is a surface level understanding. How do I reconcile this with what you wrote in the other comment and this one?
SIMT is just a programming model for SIMD.
Modern GPUs are still just SIMD with good predication support at the ISA level.