Comment by phire

Comment by phire 9 hours ago

3 replies

The main argument against using CMOV on OoO cores is that it creates data dependencies. The backend can't start executing downstream dependent μops until after the cmov executes.

At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency. Control-flow dependencies are way more expensive, as they might lead to a mispredict and a pipeline flush.

But to understand modern Massively-Out-of-Order cores, you really need to get in the mindset of "The branch predictor is stupidly accurate", and actually optimise for the cases when it is right.

If the data is anything other than completely random, the branch predictor will guess correctly (at least some of the time) and the dependency is now invisible to the backend. The dependency chain is broken and the execution units can execute both segments in parallel.
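To make the dependency argument concrete, here is a minimal sketch in C using binary search, a common illustration that is not from the comment above. The function names are made up for the example, and whether a compiler actually emits a branch or a cmov for either form is not guaranteed. In the branchy version, a correctly predicted branch lets the next iteration's load issue before the current comparison has resolved. In the select version, each load address depends on the result of the previous comparison, so the loads form one serial chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: compilers typically (not always) emit a conditional jump.
 * When the predictor is right, later loads can start speculatively. */
size_t lower_bound_branchy(const int32_t *a, size_t n, int32_t key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Select form: the ternary is a candidate for cmov (assumption: the
 * compiler may still choose a branch). Each new value of base depends
 * on the previous comparison, so there is nothing to mispredict, but
 * also nothing to run ahead on. */
size_t lower_bound_select(const int32_t *a, size_t n, int32_t key) {
    const int32_t *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        base = (base[half - 1] < key) ? base + half : base;
        len -= half;
    }
    /* Short-circuit keeps base[0] from being read when n == 0. */
    return (size_t)(base - a) + (len == 1 && base[0] < key);
}
```

Both functions return the index of the first element not less than key. Which one is faster depends on how predictable the comparisons are for the data at hand.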

--------

So while more CMOVs might help with ROB residency, I'm really not sure that would translate to overall improved performance.

But this does make me wonder if it might be worthwhile designing a μarch that could dynamically swap between executing a CMOV-style instruction as a branch or as a conditional move. If the CMOV is predictable, insert a fake branch into the branch predictor and handle it that way from then on.

yvdriess 3 hours ago

Totally agree. I have experienced 'ideal' circumstances of 33% taken/untaken branches where you will be hard pressed to make cmov perform better on real life workloads. Pass along other data inputs that do predict better and your cmov becomes a liability.

It's pretty hard to make modern compilers reliably emit cmovs in my experience. I had to resort to inline asm.
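For reference, forcing a cmov with inline asm can look roughly like the sketch below. This is not the code referred to above, only an illustration: select_u64 is a hypothetical helper, and it assumes GCC or Clang extended asm in the default AT&T syntax on x86-64. On any other target it falls back to a plain ternary, which the compiler is free to turn into a branch.

```c
#include <stdint.h>

/* Hypothetical helper: returns if_true when cond is non-zero,
 * otherwise if_false. */
static inline uint64_t select_u64(uint64_t cond, uint64_t if_true,
                                  uint64_t if_false) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t result = if_false;
    /* test sets ZF when cond == 0; cmovne copies if_true into result
     * only when ZF is clear. */
    __asm__("test %[c], %[c]\n\t"
            "cmovne %[t], %[r]"
            : [r] "+r"(result)
            : [c] "r"(cond), [t] "r"(if_true)
            : "cc");
    return result;
#else
    /* Fallback with no guarantee about the emitted instruction. */
    return cond ? if_true : if_false;
#endif
}
```

The asm block is opaque to the optimizer, so it also blocks things like constant folding and vectorization around the select. That is a real cost of this approach.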

adgjlsfhk1 8 hours ago

Doing the translation dynamically would be really interesting. Compilers (with some exception for PGO builds) do a really bad job of figuring out whether to go branchless or branched, and programmers only get it right ~half the time.

imtringued an hour ago

>The main argument against using CMOV on OoO cores is that it creates data dependencies. The backend can't start executing downstream dependent μops until after the cmov executes.

That doesn't sound like a very well thought out argument. The moment you are conditional with respect to two independent conditions, you can run both conditional moves in parallel.

>At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency. Control-flow dependencies are way more expensive, as they might lead to a mispredict and a pipeline flush.

The moment you have N parallel branches such as from unrolling a data parallel loop, you have a combinatorial explosion of 2^N possible paths to take. You have to successfully predict through all of them and you certainly can't execute them in parallel anymore.
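A minimal sketch of the unrolled case, with made-up function names and a made-up workload (sum the elements above a threshold). In the select version the four conditions per iteration are independent of each other, so four selects and four accumulators can proceed in parallel with no prediction involved. The branchy version needs one correct prediction per element. Whether the compiler emits cmov, a branch, or SIMD for either form is an assumption, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* One data-dependent branch per element (if the compiler keeps it
 * as a branch). */
int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t t) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > t)
            sum += v[i];
    }
    return sum;
}

/* Unrolled by 4 with selects: each lane has its own condition and its
 * own accumulator, so no lane waits on another lane's comparison. */
int64_t sum_above_select(const int32_t *v, size_t n, int32_t t) {
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += (v[i]     > t) ? v[i]     : 0;
        s1 += (v[i + 1] > t) ? v[i + 1] : 0;
        s2 += (v[i + 2] > t) ? v[i + 2] : 0;
        s3 += (v[i + 3] > t) ? v[i + 3] : 0;
    }
    /* Remainder elements that did not fill a group of four. */
    for (; i < n; i++)
        s0 += (v[i] > t) ? v[i] : 0;
    return s0 + s1 + s2 + s3;
}
```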

Also, you're saying there is a mispredict and a pipeline flush, but those are concepts that relate to prefetching instructions and are completely irrelevant to conditional moves, which do not change the instruction pointer. If you have nothing to execute because you're waiting for a dependent instruction that is currently executing, then you're stalling the pipeline, which is equivalent to executing a NOP instruction. It's a waste of a cycle (not really, because you're waiting for a good reason), but it can't be more expensive than that.