Comment by einpoklum
> that can itself perform actual SIMD instructions?
Mostly, no; it can't really perform actual SIMD instructions itself. If you look at the SASS (the assembly language used on NVIDIA GPUs), I don't believe you'll see anything like that.
In high-level code, you do have expressions involving "vectorized types", which look like they would translate into SIMD instructions, but they get 'serialized' at the single-thread level, i.e. carried out one component at a time.
There are exceptions to this, though, such as FP16 operations, which may work on a pair of FP16 values packed into a single 32-bit register, and some other cases. But that is not the rule.
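
To make the contrast concrete, here is a minimal sketch (my own illustration, not taken from the NVIDIA docs). The kernel names `add_float4` and `add_half2` are made up for this example, and it assumes a GPU and toolkit with native FP16 arithmetic (compute capability 5.3 or later). The first kernel adds two `float4` values component by component; the second adds two packed FP16 pairs with the `__hadd2` intrinsic from `cuda_fp16.h`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// float4 is a plain struct of four floats. The four additions below are
// expected to become four separate scalar FADD instructions per thread.
__global__ void add_float4(const float4* a, const float4* b, float4* out)
{
    float4 x = a[threadIdx.x], y = b[threadIdx.x];
    out[threadIdx.x] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
}

// __half2 packs two FP16 values into one 32-bit register. __hadd2 adds both
// halves at once; on hardware with native FP16 arithmetic this is typically
// a single packed instruction (assumption: compute capability >= 5.3).
__global__ void add_half2(const float2* a, const float2* b, float2* out)
{
    __half2 x = __float22half2_rn(a[threadIdx.x]);  // convert 2 floats -> packed FP16 pair
    __half2 y = __float22half2_rn(b[threadIdx.x]);
    out[threadIdx.x] = __half22float2(__hadd2(x, y));
}

int main()
{
    float4 *a4, *b4, *o4;
    float2 *a2, *b2, *o2;
    cudaMallocManaged(&a4, sizeof(float4));
    cudaMallocManaged(&b4, sizeof(float4));
    cudaMallocManaged(&o4, sizeof(float4));
    cudaMallocManaged(&a2, sizeof(float2));
    cudaMallocManaged(&b2, sizeof(float2));
    cudaMallocManaged(&o2, sizeof(float2));

    *a4 = make_float4(1.f, 2.f, 3.f, 4.f);
    *b4 = make_float4(10.f, 20.f, 30.f, 40.f);
    *a2 = make_float2(1.5f, 2.5f);
    *b2 = make_float2(0.25f, 0.5f);

    // One thread each: the point is what a single thread executes.
    add_float4<<<1, 1>>>(a4, b4, o4);
    add_half2<<<1, 1>>>(a2, b2, o2);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("float4 sum: %g %g %g %g\n", o4->x, o4->y, o4->z, o4->w);
    printf("half2 sum:  %g %g\n", o2->x, o2->y);

    cudaFree(a4); cudaFree(b4); cudaFree(o4);
    cudaFree(a2); cudaFree(b2); cudaFree(o2);
    return 0;
}
```

If you build this with something like `nvcc -arch=sm_70 sketch.cu -o sketch` (the file name and architecture are placeholders, pick whatever matches your GPU) and then run `cuobjdump -sass sketch`, you can check for yourself which of the two kernels ends up with a packed instruction. The exact instruction names depend on the architecture and compiler version.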
Please see https://docs.nvidia.com/cuda/parallel-thread-execution/index....