Comment by monocasa
One thing I'd find interesting to see data on, but don't really know how to measure meaningfully on modern, non-academic cores, is the fact that the reorder buffers I've seen aren't just arrays of single instructions (plus dependency metadata). Instead they look more like rows of small traces: a few ALU/ld-st/etc. instructions and one control flow instruction per row. It ends up looking a lot like a modern CISC microcode word with a triad/quad/etc. of operations and a sequencing op (but in this case the sequencing op is for the main store, not the microprogram store). That means in some cases you have to fill ROB entries with NOPs (which, to be fair, don't actually hit the backend) to account for a control flow op in an inopportune place.
The conventional wisdom is that conditional moves mainly benefit in-order pipelines, but I feel like there could be a benefit from increased ROB residency on OoOE cores as well, with the right architecture.
But like I said, I don't have a good way to prove that one way or the other.
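To make the padding effect concrete, here is a toy model in C. Everything in it is an assumption made for illustration (the `rob_usage` struct, the `pack_rows` function, the rule that a row holds up to `width` non-branch ops plus one control flow op that closes the row); it does not describe any real core.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each ROB row holds up to `width` non-branch ops
 * plus one control flow op, and a control flow op always closes its row. */
typedef struct {
    size_t rows;    /* rows allocated so far */
    size_t padded;  /* non-branch slots left empty because a branch closed the row early */
} rob_usage;

/* Pack a stream of n instructions into rows.
 * is_branch[i] is true when instruction i is a control flow op. */
rob_usage pack_rows(const bool *is_branch, size_t n, size_t width)
{
    rob_usage u = {0, 0};
    size_t used = 0;    /* non-branch slots used in the currently open row */
    bool open = false;  /* is there a row still accepting ops? */

    for (size_t i = 0; i < n; i++) {
        if (is_branch[i]) {
            if (!open) {            /* branch has to start a fresh row */
                u.rows++;
                used = 0;
            }
            u.padded += width - used;  /* remaining slots become NOPs */
            open = false;
        } else {
            if (!open || used == width) {  /* no row, or row is full */
                u.rows++;
                used = 0;
                open = true;
            }
            used++;
        }
    }
    /* A row still open at the end is not counted as padded:
     * later instructions could still fill it. */
    return u;
}
```

In this model, a stream that alternates one ALU op and one branch wastes three of every four non-branch slots at `width = 4`, while replacing those branches with conditional moves would let the rows fill completely. Whether real ROBs behave anything like this is exactly the open question.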
The main argument against using CMOV on OoO cores is that it creates data dependencies: the backend can't start executing downstream dependent μops until after the CMOV executes.
At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency; control-flow dependencies are way more expensive, since they can lead to a mispredict and a pipeline flush.
But to understand modern Massively-Out-of-Order cores, you really need to get into the mindset of "the branch predictor is stupidly accurate", and actually optimise for the cases where it is right.
If the data is anything other than completely random, the branch predictor will guess correctly (at least some of the time) and the dependency is now invisible to the backend. The dependency chain is broken and the execution units can execute both segments in parallel.
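As a rough source-level sketch of the two shapes being compared (the function names `max_branch` and `max_select` are made up for this example, and a compiler is free to turn either one into a branch or a CMOV, so treat this as an illustration of the dependency structure rather than a guarantee about generated code):

```c
#include <stdint.h>

/* Control-flow form: when the branch is predicted correctly, whatever
 * consumes the result can start from the predicted value without
 * waiting for the comparison to resolve. */
int64_t max_branch(int64_t a, int64_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Data-flow form (what a CMOV amounts to): the result depends on a, b
 * and the comparison, so consumers must wait for all three. */
int64_t max_select(int64_t a, int64_t b)
{
    int64_t mask = -(int64_t)(a > b);   /* all ones if a > b, else zero */
    return (a & mask) | (b & ~mask);
}
```

Both functions return the same value; the difference is only in what the backend has to wait for. If `a` comes from a slow load and the branch is predictable, the first form lets dependent work run ahead, while the second serialises it behind the load.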
--------
So while more CMOVs might help with ROB residency, I'm really not sure that would translate to overall improved performance.
But this does make me wonder whether it might be worthwhile designing a μarch that could dynamically switch between executing a CMOV-style instruction as a branch or as a conditional move. If the CMOV is predictable, insert a fake branch into the branch predictor and handle it that way from then on.
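One way to picture the policy is a small confidence counter per CMOV, sketched below in C. This is purely hypothetical: the `cmov_entry` struct, the 2-bit counter, the threshold and the reset-on-change rule are all assumptions picked to keep the example short, not a description of any existing design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CMOV tracking state. */
typedef struct {
    uint8_t confidence;  /* saturating counter, 0..CONF_MAX */
    bool last_cond;      /* condition outcome seen last time */
} cmov_entry;

enum { CONF_MAX = 3, CONF_THRESHOLD = 2 };

/* Should the next instance of this CMOV be issued as a predicted branch? */
bool treat_as_branch(const cmov_entry *e)
{
    return e->confidence >= CONF_THRESHOLD;
}

/* Update the entry once the real condition is known. */
void cmov_update(cmov_entry *e, bool cond)
{
    if (cond == e->last_cond) {
        if (e->confidence < CONF_MAX)
            e->confidence++;     /* same outcome again: looks predictable */
    } else {
        e->confidence = 0;       /* outcome flipped: fall back to plain CMOV */
    }
    e->last_cond = cond;
}
```

With these made-up parameters, a CMOV whose condition repeats a few times in a row gets promoted to a predicted branch, and a single change of outcome demotes it back to a data dependency. A real design would presumably reuse the existing predictor's history rather than a last-outcome check.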