Comment by sparkie

Comment by sparkie 2 months ago

In some cases you want to do the opposite - to utilize SIMD.

With AVX-512 for example, trivial branching can be replaced with branchless code using the vector mask registers k0-k7, so an if inside a for is better than the for inside the if, which may have to iterate over a sequence of values twice.

To give a basic example, consider a loop like:

    for (int i = 0; i < length ; i++) {
        if (values[i] % 2 == 1)
            values[i] += 1;
        else
            values[i] -= 2;
    }

We can convert this to one which operates on 16 ints per loop iteration, with the loop body containing no branches, where each int is only read and written to memory once (assuming length % 16 == 0).

    __mmask16 consequents;
    __mmask16 alternatives;
    __mm512i results;
    __mm512i ones = _mm512_set1_epi32(1);
    __mm512i twos = _mm512_set1_epi32(2);
    for (int i = 0; i < length ; i += 16) {
        results = _mm512_load_epi32(&values[i]); 
        consequents = _mm512_cmpeq_epi32_mask(_mm512_mod_epi32(results, twos), ones);
        results = _mm512_mask_add_epi32(results, consequents, results, ones);
        alternatives = _knot_mask16(consequents);
        results = _mm512_mask_sub_epi32(results, alternatives, results, twos);
        _mm512_store_epi32(&values[i], results);
    }

Ideally, the compiler will auto-vectorize the first example and produce something equivalent to the second in the compiled object.

gameman144 2 months ago

I am not sure that the before case maps to the article's premise, and and I think your optimized SIMD version does line up with the recommendations of the article.

For your example loop, the `if` statements are contingent on the data; they can't be pushed up as-is. If your algorithm were something like:

    if (length % 2 == 1) {
      values[i] += 1;
    } else {
      values[i] += 2;
    }

then I think you'd agree that we should hoist that check out above the `for` statement.

In your optimized SIMD version, you've removed the `if` altogether and are doing branchless computations. This seems very much like the platonic ideal of the article, and I'd expect they'd be a big fan!

Reply View 1 reply

sparkie 2 months ago
The point was more that, you shouldn't always try to remove the branch from a loop yourself, because often the compiler will do a better job.
For a contrived example, we could attempt to be clever and remove the branching from the loop in the first example by subtracting two from every value, then add three only for the odds.
for (int i = 0; i < length ; i++) { values[i] -= 2; values[i] += (values[i] % 2) * 3; }
It achieves the same result (because subtracting two preserves odd/evenness, and nothing gets added for evens), and requires no in-loop branching, but it's likely going to perform no better or worse than what the compiler could've generated from the first example, and it may be more difficult to auto-vectorize because the logic has changed. It may perform better than an unoptimized branch-in-loop version though (depending on the cost of branching on the target).
In regards to moving branches out of the loop that don't need to be there (like your check on the length) - the compiler will be able to do this almost all of the time for you - this kind of thing is standard optimization techniques that most compilers implement. If you are interpreting, the following OPs advice is certainly worth doing, but you should probably not worry if you're using a mature compiler, and instead aim to maximize clarity of code for people reading it, rather than trying to be clever like this.
Reply View | 0 replies

William_BB 2 months ago

My first thought was conditional branches inside the for loop based on the element as well. By any chance, do you know how hard it is for compilers to auto-vectorize something like this? I am generally not sure where the boundary is.

Reply View 1 reply

sparkie 2 months ago

GCC can do better than the example I gave.
https://godbolt.org/z/fo74G7d3W

Reply View | 0 replies