Comment by brucehoult

Comment by brucehoult 4 days ago

3 replies

> some SiFive cores implement exactly this fusion.

I was not able to open the given link, but it's not true, at least for the U74.

Fusion means that one or more instructions are converted to one internal instruction (µop).

SiFive's optimisation [1] of a short forward conditional branch over exactly one instruction has both instructions executing as normal, the branch in pipe A and the other instruction simultaneously in pipe B. At the final stage if the branch turns out to be taken then it is not in fact physically taken, but is instead implemented by suppressing the register write-back of the 2nd instruction.

There are only a limited set of instructions that can be the 2nd instruction in this optimisation, and loads and stores do not qualify. Only simple register-register or register-immediate ALU operations are allowed, including `lui` and `auipc` as well as C aliases such as `c.mv` and `c.li`

> The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right. I wish to cast doubt on this validity: it is true that the two instruction sequences compute the same thing, but details of the RISC-V memory consistency model mean that the two sequences are very much not equivalent, and therefore a core cannot blindly turn one into the other.

The presented code ...

      mv rd, x0
      beq rs2, x0, skip_next
      mv rd, rs1
    skip_next:
... vs ...

    czero.eqz rd, rs1, rs2
... requires that not only rd != rs2 (as stated) but also that rd != rs1. A better implementation is ...

      mv rd, rs1 // safe even if they are the same register
      bne rs2, x0, skip
      mv rd, x0
    skip:
The RISC-V memory consistency model does not come into it, because there are no loads or stores.

Then switching to code involving loads and stores is completely irrelevant:

      lw x1, 0(x2)
      bne x1, x0, next
    next:
      sw x3, 0(x4)
First of all, this code is completely crazy because the `bne` is fancy kind of `nop` and a core could convert it to a canonical `nop` (or simply drop it).

Even putting the `sw` between the `bne` and the label is ludicrous. There is no branch-free code that does the same thing -- not only in RISC-V but also in arm64 or amd64. SiFive's optimisation will not trigger with a store in that position.

[1] SiFive materials consistently describe it as an optimisation not as fusion e.g. in the description of the chicken bits CSR in the U74 core complex manual.

sxzygz 3 days ago

Thanks for your input. I didn’t know what to make of the article.

  • brucehoult 2 days ago

    Having taken a second look, this article does in fact have a point, but it is actually nothing at all to do with conditional moves in the RISC-V instruction set Zicond extension -- or amd64 or arm64 style conditional moves either, if they were added at some point.

    It is not even about RISC-V but about instruction fusion in general in any ISA with a memory model at least as strong as RVWMO -- which includes x86. I'm not as familiar with the Aarch64 memory model, but I think this probably also applies to it.

    The point here is that if an aggressive implementation wants to implement instruction fusion that removes conditional branches (or indirect branches) to make a branch-free µop -- for example, to turn a conditional branch over a move into something similar to the `czero` instruction -- then in order to maintain memory ordering AS SEEN BY A DIFFERENT CORE the fused µop has to also have `fence r,w` properties.

    That is all.

    It is irrelevant to this whether the actual RISC-V instruction set has a conditional move instruction, or the properties it has if it exists.

    It is irrelevant to the situation where a human programmer or a compiler might choose to transform branchy code into branch-free code. They have a more global view of the program and can make sure things make sense. A CPU core implementing fusion has only a local view.

    Finally, I'll note that instruction fusion is at present hypothetical in RISC-V processors that you can buy today while it has been used in both x86 and Arm chips for a long time.

    Intel's "Core" µarch had fusion of e.g. `cmp;bCC` sequences in 2006, while AMD added it with Bulldozer in 2011. Arm introduced a limited capability -- `CMP r0, #0; BEQ label` is given as an example -- in A53 in 2012 and A57, A72 etc expanded the generality.

    Upcoming RISC-V cores from companies such as Ventana and Tenstorrent are believed to implement instruction fusion for some cases.

    Just for completeness, I'll again repeat that SiFive's U74 optimises execution of a condition branch and a following simple ALU instruction that execute simultaneously in two pipelines, but this is NOT fusion into a single µop.

    • phire 10 hours ago

      > but about instruction fusion in general in any ISA with a memory model at least as strong as RVWMO -- which includes x86

      No... It's kind of an artefact of RISC-V's memory model being weak. x86 side-steps the issue because it insists that stores always occur in program order, allowing it to fuse away conditional branches without issue.

      (Note: the actual hardware implementation of x86 cpus issues the stores anyway, and then rewinds if it later detects a memory ordering violation)

      RISC-V ran into this corner case because it wanted the best of both worlds: A Weak memory model, but still have strong ordering across branches.

      Looks like ARM avoided this issue because its memory model is weaker, branches don't force any ordering, which means the arm compiler might need to insert a few extra memory barrier instructions.

      ---------

      TBH, I don't think this fusing instructions edge case is a big deal. For smaller RISC-V cores, you aren't reordering memory operations in the first place.

      And for larger RISC-V cores, you already need a complex mechanism for dealing with store order violationss, so you just throw your fused come instruction at it. Your core already needs to deal with sync points that aren't proper branches, because non-taken branches also enforce ordering.