Comment by sparkie
In some cases you want to do the opposite - to utilize SIMD.
With AVX-512 for example, trivial branching can be replaced with branchless code using the vector mask registers k0-k7, so an if inside a for is better than the for inside the if, which may have to iterate over a sequence of values twice.
To give a basic example, consider a loop like:
for (int i = 0; i < length; i++) {
    if (values[i] % 2 == 1)
        values[i] += 1;
    else
        values[i] -= 2;
}
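We can convert this to one which operates on 16 ints per loop iteration, with a branch-free loop body in which each int is read from and written to memory only once (assuming length % 16 == 0 and that values is 64-byte aligned, which the aligned load and store require):
/* requires <immintrin.h> and AVX-512F */
__mmask16 consequents;
__mmask16 alternatives;
__m512i results;
__m512i ones = _mm512_set1_epi32(1);
__m512i twos = _mm512_set1_epi32(2);
/* sign bit plus low bit: x % 2 == 1 holds exactly when (x & 0x80000001) == 1 */
__m512i sign_and_low = _mm512_set1_epi32((int)0x80000001u);
for (int i = 0; i < length; i += 16) {
    results = _mm512_load_epi32(&values[i]);
    /* lanes where values[i] % 2 == 1 (there is no hardware integer modulo, so test the bits) */
    consequents = _mm512_cmpeq_epi32_mask(_mm512_and_epi32(results, sign_and_low), ones);
    /* add 1 in the selected lanes, pass the others through unchanged */
    results = _mm512_mask_add_epi32(results, consequents, results, ones);
    /* the remaining lanes take the else path: subtract 2 */
    alternatives = _knot_mask16(consequents);
    results = _mm512_mask_sub_epi32(results, alternatives, results, twos);
    _mm512_store_epi32(&values[i], results);
}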
Ideally, the compiler will auto-vectorize the first example and produce something equivalent to the second in the compiled object.
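A portable way to see the same idea without intrinsics is to write the scalar loop so that its body has no branch at all, which is the kind of shape auto-vectorizers generally have an easier time with. The following is only a minimal sketch: the function names update_branchy and update_branchless are made up for illustration, and whether a given compiler actually vectorizes either loop depends on the compiler, target, and flags.

```c
/* Reference version: the branchy loop from the comment above. */
void update_branchy(int *values, int length) {
    for (int i = 0; i < length; i++) {
        if (values[i] % 2 == 1)
            values[i] += 1;
        else
            values[i] -= 2;
    }
}

/* Branchless rewrite (hypothetical name): turn the condition into a 0/1
 * flag and pick the delta arithmetically instead of with an if/else. */
void update_branchless(int *values, int length) {
    for (int i = 0; i < length; i++) {
        int is_odd = (values[i] % 2 == 1); /* 1 for positive odd values, else 0 */
        values[i] += 3 * is_odd - 2;       /* +1 when is_odd is 1, -2 when it is 0 */
    }
}
```

Both functions should give identical results for any input, including negative odd values, where % 2 yields -1 in C and the else path is taken.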
I am not sure that the before case maps to the article's premise, and I think your optimized SIMD version does line up with the recommendations of the article.
For your example loop, the `if` statements are contingent on the data; they can't be pushed up as-is. If your algorithm were something like:
then I think you'd agree that we should hoist that check out above the `for` statement.
In your optimized SIMD version, you've removed the `if` altogether and are doing branchless computations. This seems very much like the platonic ideal of the article, and I'd expect they'd be a big fan!
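To make the hoisting case concrete, here is a hypothetical loop whose condition does not depend on the data. It is an assumption chosen for illustration, not necessarily the example the reply had in mind: the check tests length rather than values[i], so it is loop-invariant and can move above the `for`. The names update_unhoisted and update_hoisted are invented for this sketch.

```c
/* Hypothetical variant: the condition depends only on length, so it
 * evaluates the same way on every iteration. */
void update_unhoisted(int *values, int length) {
    for (int i = 0; i < length; i++) {
        if (length % 2 == 1)
            values[i] += 1;
        else
            values[i] -= 2;
    }
}

/* Same behavior with the check pushed up: it is evaluated once, and each
 * loop body is branch-free. */
void update_hoisted(int *values, int length) {
    if (length % 2 == 1) {
        for (int i = 0; i < length; i++)
            values[i] += 1;
    } else {
        for (int i = 0; i < length; i++)
            values[i] -= 2;
    }
}
```

With a data-dependent condition such as values[i] % 2 == 1, this transformation is not available, which is why the masked or branchless form is the natural fit there.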