Comment by jabl

Comment by jabl 2 days ago

27 replies

From your past posting history, I presume that you're implying this makes RISC-V better?

Do we have any data showing that having a dedicated zero register is better than a short and canonical instruction for zeroing an arbitrary register?

phire 2 days ago

The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity.

You don't need a mov instruction, you just OR with $zero. You don't need a load immediate instruction, you just ADDI/ORI with $zero. You don't need a Neg instruction, you just SUB from $zero. All your Compare-And-Branch instructions get a compare with $zero variant for free.
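To make the point above concrete, here is a minimal sketch of how a few RISC-V assembler pseudoinstructions expand into base instructions that lean on the zero register. The function name `expand_pseudo` and the string-based output are made up for illustration; the expansions themselves follow the RISC-V assembly conventions (where, as a small wrinkle, `mv` is defined as an ADDI with a zero immediate rather than an OR with the zero register).

```python
ZERO = "x0"  # RISC-V hard-wired zero register


def expand_pseudo(op, *args):
    """Expand a few RISC-V pseudoinstructions into base instructions.

    Hypothetical helper for illustration only; it covers a handful of
    cases and is not a real assembler.
    """
    if op == "mv":
        rd, rs = args
        # RISC-V spells register copy as an add of immediate 0.
        return f"addi {rd}, {rs}, 0"
    if op == "li":
        rd, imm = args
        # Only the 12-bit signed immediate case fits in one instruction.
        if not -2048 <= imm <= 2047:
            raise ValueError("immediate does not fit in 12 bits")
        return f"addi {rd}, {ZERO}, {imm}"
    if op == "neg":
        rd, rs = args
        # Negation is a subtraction from zero.
        return f"sub {rd}, {ZERO}, {rs}"
    if op == "beqz":
        rs, label = args
        # Branch-if-zero is an ordinary compare-and-branch against x0.
        return f"beq {rs}, {ZERO}, {label}"
    if op == "j":
        (label,) = args
        # Plain jump is jump-and-link with the link discarded into x0.
        return f"jal {ZERO}, {label}"
    raise ValueError(f"unknown pseudoinstruction: {op}")


if __name__ == "__main__":
    print(expand_pseudo("li", "a0", 5))
    print(expand_pseudo("neg", "a1", "a2"))
    print(expand_pseudo("beqz", "a0", "done"))
```

None of these need their own opcode; the assembler rewrites them, and the hardware only ever sees the base instruction.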

I refuse to say this "zero register" approach is better, it is part of a wide design with many interacting features. But once you have 31 registers, it's quite cheap to allocate one register to be zero, and may actually save encoding space elsewhere. (And encoding space is always an issue with fixed width instructions).

AArch64 takes the concept further: it has a register encoding that sometimes acts as the zero register (when used in ALU instructions) and other times is the stack pointer (when used in memory instructions and a few special stack instructions).

  • phkahler 2 days ago

    >> The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity.

    Which is funny because IMHO RISC-V instruction encoding is garbage. It was all optimized around the idea of fixed length 32-bit instructions. This leads to weird sized immediates (12 bits?) and 2 instructions to load a 32 bit constant. No support for 64 bit immediates. Then they decided to have "compressed" instructions that are 16 bits, so it's somewhat variable length anyway.
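For readers who have not seen the two-instruction sequence mentioned above, here is a rough sketch of how a 32-bit constant is split into a 20-bit upper part (for LUI) and a 12-bit signed lower part (for ADDI). The helper names `split_hi_lo` and `materialize` are hypothetical, and the arithmetic assumes 32-bit wraparound as on RV32; RV64 sign-extends the LUI result, which this sketch does not model.

```python
def split_hi_lo(value):
    """Split a 32-bit constant into (hi20, lo12) for a LUI + ADDI pair.

    ADDI sign-extends its 12-bit immediate, so when bit 11 of the
    constant is set the upper part has to be bumped by one to compensate.
    """
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000  # interpret the low 12 bits as a signed immediate
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo


def materialize(hi, lo):
    """Model what LUI followed by ADDI computes, with 32-bit wraparound."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


if __name__ == "__main__":
    hi, lo = split_hi_lo(0xDEADBEEF)
    print(f"lui  a0, {hi:#x}")
    print(f"addi a0, a0, {lo}")
    assert materialize(hi, lo) == 0xDEADBEEF
```

The sign-extension fix-up is the part that makes the 12-bit immediate feel "weird" in practice: the upper 20 bits emitted are not always the upper 20 bits of the constant.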

    IMHO once all the vector, AI and graphics instructions are nailed down they should make RISC-VI where it's almost the same but re-encoding the instructions. Have sensible 16-bit ones, 32-bit, and use immediate constants after the opcodes. It seems like there is a lot they could do to clean it up - obviously not as much as x86 ;-)

    • kruador 2 days ago

      ARM64 also has fixed length 32-bit instructions. Yes, immediates are normally small and it's not particularly orthogonal as to how many bits are available.

      The largest MOV available is 16 bits, but those 16 bits can be shifted by 0, 16, 32 or 48 bits, so the worst case for a 64-bit immediate is 4 instructions. Or the compiler can decide to put the data in a PC-relative pool and use ADR or ADRP to calculate the address.
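As a rough illustration of the "worst case is 4 instructions" claim above, the sketch below builds a naive MOVZ/MOVK sequence for a 64-bit constant, one instruction per non-zero 16-bit chunk. The function name `movz_movk_sequence` is made up for this example, and real compilers do better in many cases (for instance with MOVN or logical immediates), which this sketch deliberately ignores.

```python
def movz_movk_sequence(value, reg="x0"):
    """Return a naive AArch64 MOVZ/MOVK sequence that loads `value`.

    Illustrative only: emits one instruction per non-zero 16-bit chunk,
    MOVZ for the first (it zeroes the rest of the register) and MOVK
    for the others (it keeps the remaining bits).
    """
    value &= (1 << 64) - 1
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk != 0]
    if not nonzero:
        return [f"movz {reg}, #0x0"]
    insns = []
    for i, (shift, chunk) in enumerate(nonzero):
        op = "movz" if i == 0 else "movk"
        insns.append(f"{op} {reg}, #{chunk:#x}, lsl #{shift}")
    return insns


if __name__ == "__main__":
    for insn in movz_movk_sequence(0x123456789ABCDEF0):
        print(insn)
```

A constant with all four chunks non-zero takes the full four instructions, while something like `0x10000` collapses to a single shifted MOVZ.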

      ADD immediate is 12 bits but can optionally apply a 12-bit left-shift to that immediate, so for immediates up to 24 bits it can be done in two instructions.

      ARM64 decoding is also pretty complex, far less orthogonal than ARM32. Then again, ARM32 was designed to be decodable on a chip with 25,000 transistors, not where you can spend thousands of transistors to decode a single instruction.

    • zozbot234 2 days ago

      There's not a strong case for redoing the RISC-V encoding with a new RISC-VI unless they run out of 32-bit encoding space outright, due to e.g. extensive new vector-like or AI-like instructions. And then they could free up a huge amount of encoding space trivially by moving to a 2-address format throughout with Rd=Rs1 and using a simple instruction fusion approach MOV Rd ← Rs1; OP Rd ← etc. for the former 3-address case.

      (Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.)

      • phkahler 2 days ago

        >> Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.

        I really like the idea of composition or standard prefixes. My favorite is the idea of replacing cmp/branch with "if". Where the condition is a predicate for the following instruction. For RISC-V it would eat a large part of the 16bit opcodes. Some form of load/store might be a good use for the remaining 16bit ops. Other things that might be a good prefix could be encoding data types (8,16,32,64 bit, sign extended, float, double) or a source/destination register. It might be interesting to see how a full ISA might be decomposed into smaller instruction fragments.

    • adgjlsfhk1 2 days ago

      IMO the RISC-V decoding is really elegant (arguably excepting the C extension). Things like 64 bit immediates are almost certainly a bad idea (as opposed to just having a load from memory). Most 64 bit constants in use can be sign extended from much smaller values, and for those that can't, supporting 72 bit (or bigger) instructions just to be able to load a 64 bit immediate will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism), and will only be 2 cycles faster than an L1 cache load (if the instruction is hot). 32 bit immediates would be kind of nice, but the benefit is pretty small. An x86 instruction with a 32 bit immediate is 6 bytes, while the 2 RISC-V instructions are 8 bytes. There have been proposals to add 48 bit instructions, which would let RISC-V have 32 bit immediate support in the same 6 bytes as x86 (and 64 bit loads as 2 instructions totalling 12 bytes vs 10 bytes for x86, in the very rare situations where doing so will be faster than a load).

      ISA design is always a tradeoff, https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le... has some good details, but the TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.

      • Tuna-Fish 2 days ago

        > Things like 64 bit immediates are almost certainly a bad idea (as opposed to just having a load from memory)

        Strongly disagree. Throughput is cheap, latency is expensive. Any time you can fit a constant in the instruction fetch stream is a win. This is especially true for jump targets, because getting them resolved faster both saves power and improves performance.

        > Most 64 bit constants in use can be sign extended from much smaller values

        You should obviously also have smaller load instructions.

        > will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism)

        No, just have more fetch throughput.

        > and will only be 2 cycles faster than a L1 cache load

        Only on tiny machines will L1 cache load be 2 cycles. On a reasonable high-end machine it will be 4-5 cycles, and more critically (because the latency would usually be masked well by OoO), the energy required to engage the load path is orders of magnitude more than just getting it from the fetch.

        And that's when it's not a jump target, when it's a jump target suddenly loading it using a load instruction adds 12+ cycles of latency.

        > TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.

        No. Not even talking about constants, RISC-V makes insane choices for essentially religious reasons. Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?

wongarsu 2 days ago

MIPS for example also has one, along with a similar number of registers (~32). So it's not like RISC-V took a radical new position here; they were able to look back at what worked and what didn't, and decided that for their target a zero register was the right tradeoff. It's certainly the more "elegant" solution. A zero register is useful as an input or output register for all kinds of operations, not just for zeroing.

kevin_thibedeau 2 days ago

It's a definite liability on a machine with only 8 general purpose registers. Losing 12.5% of the register space for a constant would be a waste of hardware.

  • menaerus 2 days ago

    8 registers? Ever heard of register renaming?

    • Polizeiposaune 2 days ago

      Ever heard of a loop that needed to keep more than 7 variables live? Register renaming helps with pipelining and out-of-order execution, but instructions in the program can only reference the architectural registers - go beyond that and you end up needing to spill some values to (architectural) memory.

      There's a reason why AMD added r8-r15 to the architecture, and why Intel is adding r16-r31.

      • menaerus 2 days ago

        I have, but that was not the point? My first point was exactly that there are more ISA registers and not only 8, hence the question mark. My second point was about register renaming which, contrary to what you say, does mitigate the artifacts of running out of registers and spilling variables to stack memory. It does so by eliminating the false dependencies between variables/registers, and xor eax, eax is a great candidate for that.

    • account42 2 days ago

      That's irrelevant, the zero register would be taking a slot in the limited register addressing bits in instructions, not replace a physical register on the chip.

gruez 2 days ago

It's basically the eternal debate of RISC vs CISC (x86). RISC proponents claim RISC is better because it's simpler to decode. CISC proponents retort that CISC means code can be more compact, which helps with cache hits.

  • bluGill 2 days ago

    In the real world there is no CISC or RISC anymore. RISC is always extended to some new feature and suddenly becomes more complex. Meanwhile CISC is just a decoder over a RISC processor. Either way you get the best of both worlds: simple hardware (the RISC internals) and CISC instructions that do what you need.

    Don't get too carried away in the above, x86 is still a lot more complex than ARM or RISC-V. However the complexity is only a tiny part of a CPU and so it doesn't matter.

    • snvzz 2 days ago

      You seem to be confusing ISA and microarchitecture.

      Modern ISAs try really hard to be independent from microarchitecture.

  • snvzz 2 days ago

    >CISC proponents retort that CISC means code can be more compact,

    RISC-V has the most compact code on 64bit, with margin to boot.

    On 32bit, it used to be behind Thumb2, but it's the best as of the bit manipulation and extra compressed extensions circa 2021.

dooglius 2 days ago

I think one could just pick a convention where a particular GP register is zeroed at program startup and just make your own zero register that way, getting all the benefits at very small cost. The microarchitecture AIUI has a dedicated zero register so any processor-level optimizations would still apply.

  • pklausler 2 days ago

    That’s what was done on the CDC 6600 with two handy values, B0 (0) and B1 (1).