Comment by monocasa

Comment by monocasa 12 hours ago

4 replies

One thing I'd find interesting to see data on, but don't really know how to measure meaningfully on modern, non-academic cores, is that the reorder buffers I've seen aren't just arrays of single instructions (plus dependency metadata). Instead they look more like rows of small traces: a few ALU/ld-st/etc. instructions and one control flow instruction per row. It ends up looking a lot like a modern CISC microcode word with a triad/quad/etc. of operations and a sequencing op (but in this case the sequencing op is for the main store, not the microprogram store). That means in some cases you have to fill ROB entries with NOPs (which, to be fair, don't actually hit the backend) to account for a control flow op in an inopportune place.

The conventional wisdom is that conditional moves mainly help in-order pipelines, but I feel like, with the right architecture, there could be a benefit from improved ROB residency on OoOE cores as well.

But like I said, I don't have a good way to prove that one way or the other.
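To make the row-packing idea concrete, here is a toy model in C. It is only a sketch under loud assumptions: a row holds up to 3 non-control ops plus one control flow op, a control flow op always closes its row, and the names `pack_rows`, `rob_usage`, and `SLOTS_PER_ROW` are made up for illustration. It does not describe any specific real core.

```c
#include <stddef.h>

/* Assumed row shape: 3 ALU/ld-st slots plus one control flow slot. */
#define SLOTS_PER_ROW 3u

struct rob_usage {
    size_t rows;          /* ROB rows consumed */
    size_t padded_slots;  /* non-control slots left empty (NOP padding) */
};

/* 'B' in the stream is a control flow op; any other character is a
 * non-control op (ALU, load/store, conditional move, ...). */
static struct rob_usage pack_rows(const char *stream)
{
    struct rob_usage u = {0, 0};
    size_t filled = 0;

    for (const char *p = stream; *p != '\0'; ++p) {
        if (*p == 'B') {
            /* A control flow op closes the row; unused slots are padding. */
            u.padded_slots += SLOTS_PER_ROW - filled;
            u.rows++;
            filled = 0;
        } else {
            if (filled == SLOTS_PER_ROW) {
                /* Row is full without a branch: close it, start a new one. */
                u.rows++;
                filled = 0;
            }
            filled++;
        }
    }
    if (filled > 0) {
        /* Account for a trailing, partially filled row. */
        u.padded_slots += SLOTS_PER_ROW - filled;
        u.rows++;
    }
    return u;
}
```

In this model the stream `"aBaBaB"` (a branch after every op) takes 3 rows with 6 padded slots, while `"acacaB"` (two of the branches replaced by a non-control op such as a conditional move) takes 2 rows with 1 padded slot. Whether that residency gain matters on real hardware is exactly the open question.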

phire 9 hours ago

The main argument against using CMOV on OoO cores is that it creates data dependencies. The backend can't start executing downstream dependent μops until after the cmov executes.

At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency, since control-flow dependencies are way more expensive: they might lead to a mispredict and a pipeline flush.

But to understand modern Massively-Out-of-Order cores, you really need to get in the mindset of "The branch predictor is stupidly accurate", and actually optimise for the cases when it is right.

If the data is anything other than completely random, the branch predictor will guess correctly (at least some of the time) and the dependency is now invisible to the backend. The dependency chain is broken and the execution units can execute both segments in parallel.
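A binary search is a common place to see this trade-off, so here is a sketch of both shapes in C. The function names are hypothetical, and note the big caveat: the compiler decides whether either version becomes a branch or a cmov, so the source shape only expresses intent, not a guarantee.

```c
#include <stddef.h>

/* Branch-shaped: once the branch is predicted, the next probe address
 * is available speculatively, so successive loads can overlap. */
static size_t lower_bound_branchy(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Select-shaped: the next probe address is data-dependent on the
 * previous comparison, so each load waits for the one before it. */
static size_t lower_bound_select(const int *a, size_t n, int key)
{
    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        /* Intended to compile to a conditional move (not guaranteed). */
        base = (base[half - 1] < key) ? base + half : base;
        len -= half;
    }
    /* n > 0 guards the final read for an empty array. */
    return (size_t)(base - a) + (size_t)(n > 0 && base[0] < key);
}
```

Both return the index of the first element not less than `key`. The difference is only in how the loop-carried dependency is expressed: through control flow in the first, through data in the second.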

--------

So while more CMOVs might help with ROB residency, I'm really not sure that would translate to overall improved performance.

But this does make me wonder if it might be worthwhile designing a μarch that could dynamically switch between executing a CMOV-style instruction as a branch or as a conditional move. If the CMOV is predictable, insert a fake branch into the branch predictor and handle it that way from then on.

  • yvdriess 3 hours ago

    Totally agree. I have experienced 'ideal' circumstances of 33% taken/untaken branches where you will still be hard-pressed to make cmov perform better on real-life workloads. Pass along other data inputs that do predict better and your cmov becomes a liability.

    It's pretty hard to make modern compilers reliably emit cmovs in my experience. I had to resort to inline asm.

  • adgjlsfhk1 8 hours ago

    Doing the translation dynamically would be really interesting. Compilers (with some exception for PGO builds) do a really bad job of figuring out whether to go branchless or branched, and programmers only get it right ~half the time.

  • imtringued an hour ago

    >The main argument against using CMOV on OoO cores is that it creates data dependencies. The backend can't start executing downstream dependent μops until after the cmov executes.

    That doesn't sound like a very well thought out argument. The moment you are conditional with respect to two independent conditions, you can run both conditional moves in parallel.

    >At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency, since control-flow dependencies are way more expensive: they might lead to a mispredict and a pipeline flush.

    The moment you have N parallel branches, such as from unrolling a data-parallel loop, you have a combinatorial explosion of 2^N possible paths to take. You have to successfully predict through all of them, and you certainly can't execute them in parallel anymore.

    Also, you're saying there is a mispredict and a pipeline flush, but those are concepts that relate to prefetching instructions and are completely irrelevant to conditional moves, which do not change the instruction pointer. If you have nothing to execute because you're waiting for a dependent instruction that is currently executing, then you're stalling the pipeline, which is equivalent to executing a NOP instruction. It's a waste of a cycle (not really, because you're waiting for a good reason), but it can't be more expensive than that.
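The independent-conditions case from the last reply can be sketched in C as an unrolled data-parallel loop. The function name `max_each` and the unroll factor of 4 are illustrative assumptions, and as before the compiler is free to emit branches, cmovs, or vector instructions for the selects.

```c
#include <stddef.h>

/* Element-wise maximum. The four selects in the unrolled body depend
 * only on their own inputs, so none has to wait for another. If each
 * were a branch instead, the body would have 2^4 possible paths. */
static void max_each(int *out, const int *a, const int *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = (a[i]     > b[i])     ? a[i]     : b[i];
        out[i + 1] = (a[i + 1] > b[i + 1]) ? a[i + 1] : b[i + 1];
        out[i + 2] = (a[i + 2] > b[i + 2]) ? a[i + 2] : b[i + 2];
        out[i + 3] = (a[i + 3] > b[i + 3]) ? a[i + 3] : b[i + 3];
    }
    /* Tail loop for the remaining 0..3 elements. */
    for (; i < n; ++i)
        out[i] = (a[i] > b[i]) ? a[i] : b[i];
}
```

This is the opposite shape from a loop-carried chain: here nothing downstream waits on a select from an earlier iteration, which is the situation where conditional moves lose the least.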