Comment by phire
The main argument against using CMOV on OoO cores is that it creates data dependencies: the backend can't start executing downstream dependent μops until the cmov itself has executed.
At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency; control-flow dependencies look far more expensive, as they can lead to a mispredict and a pipeline flush.
But to understand modern Massively-Out-of-Order cores, you really need to get into the mindset of "the branch predictor is stupidly accurate", and actually optimise for the cases where it is right.
If the data is anything other than completely random, the branch predictor will guess correctly (at least some of the time), and the dependency becomes invisible to the backend: the dependency chain is broken, and the execution units can execute both segments in parallel.
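To make that concrete, here is a minimal C sketch (my own illustration, not from the original comment) of the classic case where this trade-off shows up, a binary search. The branchy version lets a correct prediction issue the next load speculatively, while the branchless version (which compilers may lower to CMOVs) puts every load on a data-dependency chain through the previous compare.

```c
#include <stddef.h>

/* Branchy lower_bound: the load of the next midpoint can issue
 * speculatively as soon as the predictor guesses the direction, so
 * correctly-predicted iterations overlap and hide the load latency. */
size_t bsearch_branch(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)          /* control dependency: speculated away */
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Branchless lower_bound: the ternaries may compile to CMOVs, so the
 * next load address is a data dependency on the current load's result.
 * Each iteration waits for the previous load -> compare -> cmov chain. */
size_t bsearch_cmov(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int less = a[mid] < key;       /* depends on the loaded value */
        lo = less ? mid + 1 : lo;      /* data dependency, no speculation */
        hi = less ? hi : mid;
    }
    return lo;
}
```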
--------
So while more CMOVs might help with ROB residency, I'm really not sure that would translate to overall improved performance.
But this does make me wonder whether it might be worthwhile to design a μarch that could dynamically switch between executing a CMOV-style instruction as a branch or as a conditional move. If the CMOV is predictable, insert a fake branch into the branch predictor and handle it that way from then on.
Totally agree. I have experienced 'ideal' circumstances of 33% taken/not-taken branches where you will be hard-pressed to make cmov perform better on real-life workloads. Feed in other data that does predict better, and your cmov becomes a liability.
It's pretty hard to make modern compilers reliably emit cmovs in my experience. I had to resort to inline asm.
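For reference, here is one way to force it with GCC/Clang extended inline asm on x86-64; the helper name and constraint choices are just illustrative, not something from the comment above.

```c
#include <stdint.h>

/* A sketch of forcing a CMOV via inline asm (GCC/Clang, x86-64).
 * Returns the larger of a and b using cmovl ("move if less"). */
static inline int64_t max_cmov(int64_t a, int64_t b) {
    __asm__("cmp %1, %0\n\t"     /* set flags for a - b   */
            "cmovl %1, %0"       /* if (a < b) a = b;     */
            : "+r"(a)            /* a is read and written */
            : "r"(b)             /* b is read only        */
            : "cc");             /* flags are clobbered   */
    return a;
}
```

The pure-C equivalent `a < b ? b : a` may or may not become a cmov depending on compiler version and surrounding code, which is exactly the unreliability being complained about.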