RISC-V Conditional Moves
(corsix.org) 63 points by gok 4 days ago
"sufficiently advanced magic core" is a fairly funny term when a lot of other RISCV behavior basically assumes any real processor will have a macro op fusing frontend for specific series of instructions, and even provides recommendations for what fusions cores should implement.
The `xmipscmove` is just the name of the extension. The 'mips' here means MIPS-the-company, and not MIPS-the-ISA. It's supported by the MIPS P8700 CPU which, counterintuitively, is a RISC-V CPU and not MIPS (it's named "MIPS" because the company which designed it is called "MIPS", not because it uses the MIPS architecture).
Surprised an article published on September 28, 2025 does not include any mention of the P (packed integer SIMD) extension.
P adds instructions like integer multiply-accumulate, which have a third register read (for rd). So, they're taking the opportunity to add a few forms of 3-register select instructions:
MVM Move Masked
for each bit i: X(rd)[i] = X(rs2)[i] ? X(rs1)[i] : X(rd)[i]
MVMN Move Masked Not
for each bit i: X(rd)[i] = X(rs2)[i] ? X(rd)[i] : X(rs1)[i]
MERGE Merge
for each bit i: X(rd)[i] = X(rd)[i] ? X(rs2)[i] : X(rs1)[i]
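The three per-bit definitions above reduce to ordinary mask arithmetic. A minimal Python sketch, assuming 32-bit registers modelled as plain integers (the function names and the `XLEN` constant are illustrative, not taken from the P spec):

```python
XLEN = 32                      # assumed register width for this sketch
MASK = (1 << XLEN) - 1

def mvm(rd, rs1, rs2):
    """Move Masked: take rs1 bits where rs2 is 1, keep rd bits elsewhere."""
    return ((rs1 & rs2) | (rd & ~rs2)) & MASK

def mvmn(rd, rs1, rs2):
    """Move Masked Not: keep rd bits where rs2 is 1, take rs1 bits elsewhere."""
    return ((rd & rs2) | (rs1 & ~rs2)) & MASK

def merge(rd, rs1, rs2):
    """Merge: rd is the mask; take rs2 bits where rd is 1, rs1 bits elsewhere."""
    return ((rs2 & rd) | (rs1 & ~rd)) & MASK
```

In all three, the old value of `rd` is an input as well as the output, which is where the third register read comes from.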
Actually I say I'm surprised, but given the way the spec is currently spread around different parts of the internet, it's easy to miss if you're not following the mailing lists!
This is implemented with instruction fusion. We just need to properly document and publish what will end up as "standard instruction fusion patterns" (like the div/rem one).
Adding more instructions is kind of counterproductive for a R(educed)ISC ISA. It has to be weighed with extreme care. Compressed instructions went through for the sake of code density (marketing vs. ARM Thumb instructions).
In the end, programs will want probably to stay conservative and will implement only the core ISA, at best giving some love to some instruction fusion patterns and that's it, unless being built knowingly for a specific risc-v hardware implementation.
> In the end, programs will want probably to stay conservative and will implement only the core ISA
This is probably not the case. The core ISA doesn't include floating point, it doesn't include integer multiply or divide, and it doesn't include atomic instructions or `fence.i`.
What has happened is that most compilers and programs for "normal desktop/laptop/server/phone class systems" all have some baseline set of extensions. Today, this is more or less what we call the "G" extension collection (which is short-hand for IMAFD_Zicsr_Zifencei). Though what we consider "baseline" in "normal systems" will obviously evolve over time (just like how SSE is considered a part of "baseline amd64" these days but was once a new and exotic extension).
Then lower power use cases like MCUs will have fewer instructions. There will be lots of MCUs without stuff like hardware floating point support that won't run binaries compiled for the G extension collection. In MCU use cases, you typically know at the time of compiling exactly what MCU your code will be running on, so passing the right flags to the compiler to make sure it generates only the supported instructions is not an issue.
And then HPC use cases will probably assume more exotic extensions.
And normal "desktop/phone/laptop/server" style use cases will have runtime detection of things like vector instructions in some situations, just like in aarch64/amd64.
It's not known as "G". The standards targeted by the software ecosystem are the RVA20, RVA22, and RVA23 profiles.
https://riscv.org/ecosystem-news/2025/04/risc-v-rva23-a-majo...
> In the end, programs will want probably to stay conservative and will implement only the core ISA
Unlikely; as pointed out in sibling comments, the core ISA is too limited. What might prevail is profiles, specifically profiles for application processors like RVA22U64 and RVA23U64, of which the latter makes a lot more sense IMHO.
As an aside: it's only relevant on microcontrollers nowadays, but ARM T32 (Thumb) code density is really good. Most instructions are 2 bytes, and it's got some clever ways to represent commonly used 32-bit values in 12 bits:
https://developer.arm.com/documentation/ddi0403/d/Applicatio...
Not necessarily lower density. On ARM you would often need cmp and csel, which are two instructions, eight bytes.
RISC-V has cmp-and-branch in a single instruction, which with c.mv normally makes six bytes. If the cmp-and-branch instruction tests one of x8..x15 against zero then that could also be a compressed instruction: making four bytes in total.
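The byte counts in this comparison can be tallied directly. A small sketch, assuming the sequences are exactly the ones named above (the `SIZES` table and `code_bytes` helper are illustrative names):

```python
# Instruction sizes in bytes for the sequences discussed above.
SIZES = {
    "cmp": 4, "csel": 4,        # AArch64: every instruction is 4 bytes
    "beq": 4,                   # RISC-V compare-and-branch, uncompressed
    "c.beqz": 2, "c.mv": 2,     # RISC-V compressed (C extension) forms
}

def code_bytes(seq):
    """Total encoded size of a sequence of mnemonics."""
    return sum(SIZES[op] for op in seq)

arm64_select = code_bytes(["cmp", "csel"])          # compare, then select
riscv_branchy = code_bytes(["beq", "c.mv"])         # branch over a compressed move
riscv_compressed = code_bytes(["c.beqz", "c.mv"])   # x8..x15 tested against zero
```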
ARMv8.7 added some new instructions for int min/max to replace cmp+csel. (I'm surprised it took them so long to add popcnt.)
> publish properly what will end up "standard instruction fusion patterns" (like the div/rem one).
The div/rem one is odd because I saw it suggested in the ISA manual, but I have yet to ever see that pattern crop up in compiled code. Usually it's just in library functions like C stdlib `div()` which returns a quotient and remainder, but why on earth are you calling that library function on a processor that has a divide instruction?
> but why on earth are you calling that library function on a processor that has a divide instruction?
Because they rightfully expect that div() compiles down to the fastest div/rem idiom for the target hardware. Mainstream compilers go to great lengths to optimize calls to the core C math functions.
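For reference, the pattern in question is a `div` followed by a `rem` on the same operands, which a fusing core could serve with a single divide. A rough Python model of what that pair computes for signed operands, assuming a 32-bit register width (`div_rem` and `XLEN` are names made up for this sketch):

```python
XLEN = 32                          # assumed register width
INT_MIN = -(1 << (XLEN - 1))

def div_rem(a, b):
    """Return (quotient, remainder) as RISC-V `div` and `rem` would for signed a, b."""
    if b == 0:
        return -1, a               # divide by zero: quotient all ones, remainder = dividend
    if a == INT_MIN and b == -1:
        return INT_MIN, 0          # signed overflow case
    q = abs(a) // abs(b)           # work on magnitudes so the quotient truncates toward zero
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b            # remainder takes the sign of the dividend
```

Both results fall out of one division, which is why a library `div()` is a natural place for the pattern to appear.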
I've always thought that instead of compare-and-branch, they should have just made it compare, or a better name would be "if": if r1<r2, execute the next instruction. This should have worked like a 16-bit prefix to whatever instructions are supported. RISC-V would have only supported jump, jalr, and branch. Then, as they realized the importance of conditional instructions, they could have just changed the spec to allow "if" to be combined with load, store, add, etc...
IMHO this approach seems to fit modern CPU designs reasonably well. There is no explicit flag or predicate register, but it does require fusing 2 instructions with possibly different operands. But restricting which instructions can use it might help (even better if it's completely orthogonal).
This is what ARM's Thumb-2 has with its various If-Then-Else instructions. One instruction can skip up to four subsequent instructions if the condition fails.
It can also do else clauses, instructions that get executed only when the condition fails.
I'm not sure how well this approach would work on modern CPUs; these days, Thumb-2 is generally only used on small microprocessors, and it's notable that ARM64 didn't carry that feature forwards.
One piece I would find interesting to see data on, but don't really know how to get meaningful information about for modern, non-academic cores, is that the reorder buffers I've seen aren't just arrays of single instructions (and dependency metadata). Instead they look more like rows of small traces of a few alu/ldst/etc instructions, with one control flow instruction per row. It kind of ends up looking a lot like a modern CISC microcode word with a triad/quad/etc of operations and a sequencing op (but in this case the sequencing op is for the main store, not the microprogram store). That means in some cases you have to fill the ROB entries with NOPs (which, to be fair, don't actually hit the backend) in order to account for a control flow op in an inopportune place.
The conventional wisdom is that conditional moves mainly benefit in-order pipelines, but I feel like there could be a benefit to increased ROB residency on OoOE cores as well with the right architecture.
But like I said, I don't have a good way to prove that or not.
The main argument against using CMOV on OoO cores is that it creates data dependencies. The backend can't start executing downstream dependent μops until after the cmov executes.
At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency; control-flow dependencies are way more expensive, as they might lead to a mispredict and pipeline flush.
But to understand modern Massively-Out-of-Order cores, you really need to get in the mindset of "the branch predictor is stupidly accurate", and actually optimise for the cases when it was right.
If the data is anything other than completely random, the branch predictor will guess correctly (at least some of the time) and the dependency is now invisible to the backend. The dependency chain is broken and the execution units can execute both segments in parallel.
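The trade-off described above can be put into a rough expected-cost model. This is a toy sketch only: the function names, the 15-cycle flush penalty and the 3-cycle wait for the condition are made-up illustrative numbers, not measurements of any real core.

```python
def cmov_cost(condition_wait, cmov_latency=1):
    """Cycles added to the dependency chain by a conditional move.

    The consumer has to wait for the condition to be computed every time.
    """
    return condition_wait + cmov_latency

def branch_cost(mispredict_rate, flush_penalty=15):
    """Expected cycles added by a branch: free when predicted, a flush when not."""
    return mispredict_rate * flush_penalty

def break_even_rate(condition_wait, cmov_latency=1, flush_penalty=15):
    """Mispredict rate at which both choices cost the same in this model."""
    return cmov_cost(condition_wait, cmov_latency) / flush_penalty
```

With these assumed numbers, `branch_cost(0.02)` is about 0.3 cycles on average against `cmov_cost(3)` at 4 cycles, and the conditional move only comes out ahead once the mispredict rate passes roughly 27%.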
--------
So while more CMOVs might help with ROB residency, I'm really not sure that would translate to overall improved performance.
But this does make me wonder if it might be worthwhile designing a μarch that could dynamically swap between executing a CMOV-style instruction as a branch or as a conditional move. If the CMOV is predictable, insert a fake branch into the branch predictor and handle it that way from now on.
Doing the translation dynamically would be really interesting. Compilers (PGO builds being a partial exception) do a really bad job of figuring out whether to go branchless or branched, and programmers only get it right ~half the time.
I feel like the author might be slightly missing the point here:
> The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right.
The idea of Zicond afaict is that the compiler transforms select sequences into (usually multiple) Zicond instructions, and cores with more register ports available can fuse Zicond compounds into more complex select macro-ops. It's a 2R1W vocabulary for describing selects which require more than 2 read ports.
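To make the "2R1W vocabulary" concrete, here is a small Python model of the two Zicond instructions and the usual three-instruction select built from them. The Python function names are just for this sketch; the semantics follow the `czero.eqz` / `czero.nez` definitions.

```python
def czero_eqz(rs1, rs2):
    """czero.eqz rd, rs1, rs2: rd = 0 if rs2 == 0 else rs1."""
    return 0 if rs2 == 0 else rs1

def czero_nez(rs1, rs2):
    """czero.nez rd, rs1, rs2: rd = 0 if rs2 != 0 else rs1."""
    return 0 if rs2 != 0 else rs1

def select(cond, a, b):
    """cond ? a : b, built from three 2-read/1-write instructions."""
    t0 = czero_eqz(a, cond)   # a when cond is non-zero, else 0
    t1 = czero_nez(b, cond)   # b when cond is zero, else 0
    return t0 | t1            # or: at most one side was left non-zero
```

Each step reads two registers and writes one; a core with a third read port could, in principle, recognise the whole triple as a single select.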
As an aside, I evaluated Zicond on my scalar 3-stage implementation and found that, at 1 CPI for ALU ops and a 2-cycle taken-branch cost, the branchless sequences GCC produced for Zicond were never better and sometimes worse than the equivalent branching sequence. It really does seem to be targeting bigger cores, or constant-time execution.
> some SiFive cores implement exactly this fusion.
I was not able to open the given link, but it's not true, at least for the U74.
Fusion means that two or more instructions are converted to one internal instruction (µop).
SiFive's optimisation [1] of a short forward conditional branch over exactly one instruction has both instructions executing as normal, the branch in pipe A and the other instruction simultaneously in pipe B. At the final stage if the branch turns out to be taken then it is not in fact physically taken, but is instead implemented by suppressing the register write-back of the 2nd instruction.
There is only a limited set of instructions that can be the 2nd instruction in this optimisation, and loads and stores do not qualify. Only simple register-register or register-immediate ALU operations are allowed, including `lui` and `auipc` as well as C aliases such as `c.mv` and `c.li`.
> The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right. I wish to cast doubt on this validity: it is true that the two instruction sequences compute the same thing, but details of the RISC-V memory consistency model mean that the two sequences are very much not equivalent, and therefore a core cannot blindly turn one into the other.
The presented code ...
mv rd, x0
beq rs2, x0, skip_next
mv rd, rs1
skip_next:
... vs ...
czero.eqz rd, rs1, rs2
... requires that not only rd != rs2 (as stated) but also that rd != rs1. A better implementation is ...
mv rd, rs1 // safe even if they are the same register
bne rs2, x0, skip
mv rd, x0
skip:
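The aliasing constraints can be checked by brute force. The sketch below models both branchy sequences and the `czero.eqz` reference on a three-register file and reports which register assignments disagree; the register names, test values and function names are arbitrary choices for this illustration.

```python
from itertools import product

def czero_eqz_ref(regs, rd, rs1, rs2):
    """Reference semantics: rd = 0 if regs[rs2] == 0 else regs[rs1]."""
    out = dict(regs)
    out[rd] = 0 if regs[rs2] == 0 else regs[rs1]
    return out

def zero_first(regs, rd, rs1, rs2):
    """mv rd, x0; beq rs2, x0, skip_next; mv rd, rs1; skip_next:"""
    r = dict(regs)
    r[rd] = 0
    if r[rs2] != 0:          # branch not taken, so the move executes
        r[rd] = r[rs1]
    return r

def copy_first(regs, rd, rs1, rs2):
    """mv rd, rs1; bne rs2, x0, skip; mv rd, x0; skip:"""
    r = dict(regs)
    r[rd] = r[rs1]
    if r[rs2] == 0:          # branch not taken, so rd is zeroed
        r[rd] = 0
    return r

def failing_aliases(seq):
    """Return the (rd, rs1, rs2) choices where seq disagrees with czero.eqz."""
    names = ["a0", "a1", "a2"]
    bad = set()
    for rd, rs1, rs2 in product(names, repeat=3):
        for vals in product([0, 5, 9], repeat=3):
            regs = dict(zip(names, vals))
            if seq(regs, rd, rs1, rs2) != czero_eqz_ref(regs, rd, rs1, rs2):
                bad.add((rd, rs1, rs2))
    return bad
```

In this model, `failing_aliases(zero_first)` contains only assignments with `rd == rs1` or `rd == rs2`, while `failing_aliases(copy_first)` contains only assignments with `rd == rs2` and `rd != rs1`, so the second sequence still relies on the stated `rd != rs2` condition.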
The RISC-V memory consistency model does not come into it, because there are no loads or stores.
Then switching to code involving loads and stores is completely irrelevant:
lw x1, 0(x2)
bne x1, x0, next
next:
sw x3, 0(x4)
First of all, this code is completely crazy because the `bne` is a fancy kind of `nop`, and a core could convert it to a canonical `nop` (or simply drop it).
Even putting the `sw` between the `bne` and the label is ludicrous. There is no branch-free code that does the same thing -- not only in RISC-V but also in arm64 or amd64. SiFive's optimisation will not trigger with a store in that position.
[1] SiFive materials consistently describe it as an optimisation not as fusion e.g. in the description of the chicken bits CSR in the U74 core complex manual.
Having taken a second look, this article does in fact have a point, but it is actually nothing at all to do with conditional moves in the RISC-V instruction set Zicond extension -- or amd64 or arm64 style conditional moves either, if they were added at some point.
It is not even about RISC-V but about instruction fusion in general in any ISA with a memory model at least as strong as RVWMO -- which includes x86. I'm not as familiar with the AArch64 memory model, but I think this probably also applies to it.
The point here is that if an aggressive implementation wants to implement instruction fusion that removes conditional branches (or indirect branches) to make a branch-free µop -- for example, to turn a conditional branch over a move into something similar to the `czero` instruction -- then in order to maintain memory ordering AS SEEN BY A DIFFERENT CORE the fused µop has to also have `fence r,w` properties.
That is all.
It is irrelevant to this whether the actual RISC-V instruction set has a conditional move instruction, or the properties it has if it exists.
It is irrelevant to the situation where a human programmer or a compiler might choose to transform branchy code into branch-free code. They have a more global view of the program and can make sure things make sense. A CPU core implementing fusion has only a local view.
Finally, I'll note that instruction fusion is at present hypothetical in RISC-V processors that you can buy today while it has been used in both x86 and Arm chips for a long time.
Intel's "Core" µarch had fusion of e.g. `cmp;bCC` sequences in 2006, while AMD added it with Bulldozer in 2011. Arm introduced a limited capability -- `CMP r0, #0; BEQ label` is given as an example -- in A53 in 2012 and A57, A72 etc expanded the generality.
Upcoming RISC-V cores from companies such as Ventana and Tenstorrent are believed to implement instruction fusion for some cases.
Just for completeness, I'll again repeat that SiFive's U74 optimises execution of a conditional branch and a following simple ALU instruction that execute simultaneously in two pipelines, but this is NOT fusion into a single µop.
> but about instruction fusion in general in any ISA with a memory model at least as strong as RVWMO -- which includes x86
No... It's kind of an artefact of RISC-V's memory model being weak. x86 side-steps the issue because it insists that stores always occur in program order, allowing it to fuse away conditional branches without issue.
(Note: the actual hardware implementation of x86 cpus issues the stores anyway, and then rewinds if it later detects a memory ordering violation)
RISC-V ran into this corner case because it wanted the best of both worlds: a weak memory model, but still with strong ordering across branches.
Looks like ARM avoided this issue because its memory model is weaker: branches don't force any ordering, which means the ARM compiler might need to insert a few extra memory barrier instructions.
---------
TBH, I don't think this instruction-fusion edge case is a big deal. For smaller RISC-V cores, you aren't reordering memory operations in the first place.
And for larger RISC-V cores, you already need a complex mechanism for dealing with store order violations, so you just throw your fused cmov instruction at it. Your core already needs to deal with sync points that aren't proper branches, because non-taken branches also enforce ordering.
Yeah, RISC-V has the `Zicond` extension, but it's not a "proper" conditional move in the traditional sense. For the usual situations where the compiler would use a single conditional move on any other ISA it now needs multiple instructions to get around the fact that `Zicond` will set the destination register to zero if the conditional move isn't made. This totally sucks for performance if you don't have a "sufficiently advanced magic core" which can macro-fuse `Zicond` instruction sequences.
That said, RISC-V does have a proper conditional move instruction. And the funny part: it has multiple! `xtheadcondmov` and `xmipscmove` both implement "real" conditional moves. The catch is that those are vendor-specific extensions; contrary to the official narrative that "it doesn't fit the design of RISC-V", apparently the actual hardware vendors see the value of adding real cmovs to their hardware. I wonder how many more vendor-specific extensions it will take before a common cross-vendor extension is standardized, if ever?
(And yes, I'm perfectly aware of why `Zicond` was designed the way it was. I don't really want to get into a discussion whether that's the right design or not long-term.)