Comment by einpoklum 5 days ago

> that can itself perform actual SIMD instructions?

Mostly, no; it can't really perform actual SIMD instructions itself. If you look at the SASS (the native assembly language of NVIDIA GPUs), I don't believe you'll see anything like that.

In high-level code, you do have expressions involving "vectorized types", which look like they would translate into SIMD instructions, but they are 'serialized' at the single-thread level.
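
To make the 'serialize' point concrete, here is a minimal sketch in plain C++ (host code, not CUDA). `Float4` and `add` are hypothetical names standing in for a vectorized type such as CUDA's `float4` and a user-written component-wise operator; the sketch only shows the shape of the computation, not what any particular compiler emits.

```cpp
#include <cassert>

// Hypothetical stand-in for a four-component vector type such as CUDA's float4.
struct Float4 {
    float x, y, z, w;
};

// Component-wise add: one expression at the call site, but four independent
// scalar additions carried out by a single thread.
inline Float4 add(const Float4& a, const Float4& b) {
    return Float4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}
```

Calling `add(Float4{1, 2, 3, 4}, Float4{10, 20, 30, 40})` yields the components 11, 22, 33, 44, each from its own scalar add.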

There are exceptions to this though, like FP16 operations, which may work on two FP16 values packed into one 32-bit register, and a few other cases. But that is not the rule.

pklausler 5 days ago

Please see https://docs.nvidia.com/cuda/parallel-thread-execution/index....

einpoklum 4 days ago

The "video instructions" are indeed another exception: operations on sub-lanes of 32-bit values, either 2x16 or 4x8. This is relevant for graphics/video work, where you often have Red, Green, Blue and Alpha channels of 8 bits each. Their use is uncommon (AFAICT) in CUDA compute work.
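
As a rough illustration of a 4x8 sub-lane operation, here is a plain C++ sketch of per-byte wrap-around addition on four 8-bit channels packed into one 32-bit word. `add_4x8` is a hypothetical name; I believe CUDA exposes similar operations through SIMD intrinsics such as `__vadd4`, but treat that as an assumption to check against the documentation. On a GPU this would be a single instruction per thread; here it is emulated with masks.

```cpp
#include <cassert>
#include <cstdint>

// Per-byte wrap-around addition on four 8-bit lanes packed in a 32-bit word.
inline std::uint32_t add_4x8(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = 0x7F7F7F7Fu;
    // Add the low 7 bits of every lane; the carry out of bit 6 lands in bit 7
    // of the same lane, so nothing crosses a lane boundary.
    const std::uint32_t sum = (a & low7) + (b & low7);
    // Fold in the top bit of each lane with XOR, which discards the carry out
    // of the lane and gives wrap-around behaviour.
    return sum ^ ((a ^ b) & ~low7);
}
```

For example, `add_4x8(0x01FF7F10u, 0x01010110u)` gives `0x02008020u`: the `0xFF + 0x01` lane wraps to `0x00` without disturbing its neighbour.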

shaklee3 5 days ago

not true; there are a lot of simd instructions on GPUs

einpoklum 4 days ago

Such as?

- shaklee3 4 days ago
  
  dp4a, ldg. just Google it. there's a whole page of them
  
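
For readers unfamiliar with `dp4a`, mentioned in the reply above: it is a four-way dot product of packed 8-bit integers with a 32-bit accumulate, exposed in CUDA as the `__dp4a` intrinsic. The following is a plain C++ reference sketch of the signed variant, not GPU code; `dp4a_reference` and `pack4` are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack four small integers into the bytes of one 32-bit
// word, lane 0 in the lowest byte.
inline std::int32_t pack4(int b0, int b1, int b2, int b3) {
    const std::uint32_t u =
        (static_cast<std::uint32_t>(b0) & 0xFFu) |
        ((static_cast<std::uint32_t>(b1) & 0xFFu) << 8) |
        ((static_cast<std::uint32_t>(b2) & 0xFFu) << 16) |
        ((static_cast<std::uint32_t>(b3) & 0xFFu) << 24);
    return static_cast<std::int32_t>(u);
}

// Reference semantics: treat a and b as four signed 8-bit lanes each,
// multiply lane by lane, and add the four products to the accumulator c.
inline std::int32_t dp4a_reference(std::int32_t a, std::int32_t b, std::int32_t c) {
    for (int lane = 0; lane < 4; ++lane) {
        const auto ai = static_cast<std::int8_t>(
            (static_cast<std::uint32_t>(a) >> (8 * lane)) & 0xFFu);
        const auto bi = static_cast<std::int8_t>(
            (static_cast<std::uint32_t>(b) >> (8 * lane)) & 0xFFu);
        c += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return c;
}
```

With lanes (1, 2, 3, 4) and (5, 6, 7, -8) and an accumulator of 10, the result is 10 + 5 + 12 + 21 - 32 = 16.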