Comment by camel-cdr
I'm not aware of any GPU that implements this.
Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.
if (threadIdx.x < 4) {
A;
B;
} else {
X;
Y;
}
Z;
The diagram shows how this executes in the following order:
Volta:
->|   ->X   ->Y   ->Z|->
->|->A   ->B   ->Z   |->
pre Volta:
->|      ->X->Y|->Z
->|->A->B      |->Z
The SIMD equivalent of pre Volta is:
vslt mask, vid, 4
vopA ..., mask
vopB ..., mask
vopX ..., ~mask
vopY ..., ~mask
vopZ ...
The Volta model is:
vslt mask, vid, 4
vopA ..., mask
vopX ..., ~mask
vopB ..., mask
vopY ..., ~mask
vopZ ...
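To make the comparison concrete, here is a minimal Python sketch of both schedules. It is a toy model, not real hardware behavior: the 8-lane warp size, the `run` helper, and treating A, B, X, Y, Z as ops that only record their name per lane are all assumptions for illustration.

```python
# Toy model: one "warp" of 8 lanes, one masked instruction issued per step.
# WARP_SIZE, run(), and the op names are illustrative assumptions, not a real ISA.
WARP_SIZE = 8

def run(schedule):
    """Execute (op, mask) pairs one at a time; return per-lane traces and issue count."""
    traces = [[] for _ in range(WARP_SIZE)]
    issued = 0
    for op, mask in schedule:
        issued += 1  # only one type of instruction is issued per step
        for lane in range(WARP_SIZE):
            if mask[lane]:
                traces[lane].append(op)
    return traces, issued

mask = [lane < 4 for lane in range(WARP_SIZE)]  # vslt mask, vid, 4
inv = [not m for m in mask]                     # ~mask
full = [True] * WARP_SIZE                       # unmasked op

# Same five ops in both models; only the issue order differs.
pre_volta = [("A", mask), ("B", mask), ("X", inv), ("Y", inv), ("Z", full)]
volta = [("A", mask), ("X", inv), ("B", mask), ("Y", inv), ("Z", full)]

pre_traces, pre_issued = run(pre_volta)
volta_traces, volta_issued = run(volta)

print(pre_traces[0], pre_traces[7])  # what a lane on each side of the branch executed
print(pre_issued, volta_issued)      # issue slots used by each schedule
print(pre_traces == volta_traces)    # whether every lane saw the same ops in the same order
```

In this model both schedules take the same number of issue slots and every lane ends up with the same sequence of ops; the interleaving changes the order of issue, not how much work is done per step.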
[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...
[2] https://stackoverflow.com/questions/70987051/independent-thr...
IIUC, Volta brought the ability to run a tail-call state machine (presuming identically expensive states and a state count less than threads per warp) at an average goodput of more than one thread actually active.
Before, it would lose all parallelism, as it couldn't handle different threads having truly different/separate control flow, emulating dumb-mode via predicated execution/lane-masking.