Comment by danhor 14 hours ago

RISC-V is even worse: the Cortex-M series has standardized interrupt handling and is built so you can avoid writing any assembly for the startup code.

Meanwhile, the RISC-V spec only defines very basic interrupt functionality; most MCU vendors add their own external interrupt controllers or change their cores to follow the faster Cortex-M style more closely, where the core itself handles stacking/unstacking registers, returning from the interrupt handler on ret, vectoring for external interrupts, and so on.

The low priority of embedded within RISC-V can be seen in how long it took to specify an extension (Zmmul) that only includes multiplication, not division.

Especially for smaller MCUs the debug situation is unfortunate: in the ARM world you can use any CMSIS-DAP debug probe to debug different MCUs over SWD. RISC-V MCUs either have JTAG or a custom pin-reduced variant (as 4 pins for debugging is quite a lot), which is usually supported by only a few debug probes.

RISC-V simply standardizes a whole lot less than ARM, and not sensibly for small embedded systems.

ryao 12 hours ago

Being customizable is one of RISC-V’s strengths. Multiplication can be easily done in software by doing bit shifts and addition in a loop. If an embedded application does not make heavy use of multiplication, you can omit multiplication from the silicon for cost savings.

That said, ARM’s SWD is certainly nice. It appears to be possible to debug the Hazard3 cores in the RP2350 in the same way as the ARM cores:

https://gigazine.net/gsc_news/en/20241004-raspberry-pi-pico-...

  • magicalhippo 11 hours ago

    > If an embedded application does not make heavy use of multiplication, you can omit multiplication from the silicon for cost savings.

    The problem was that the initial extension that included multiplication also included division[1]. A lot of small microcontrollers have multiplication hardware but not division hardware.

    Thus it would make sense to have a multiplication-only extension.

    IIRC the argument was that the CPU should just trap the division instructions and emulate them, but in the embedded world you want to know your performance envelope, so it's better to know explicitly whether hardware division is available or not.

    [1]: https://docs.openhwgroup.org/projects/cva6-user-manual/01_cv...

    • ryao 11 hours ago

      Software division is often faster than hardware division, so your performance remark seems to be a moot point:

      https://libdivide.com/

      • magicalhippo 10 hours ago

        I don't think that library refutes anything of what I said.

        First off, that library requires you to fundamentally change the code by moving some precomputation outside of loops.

        Of course, I can do a similar trick to move the division outside the loop without that library, using simple fixed-point math, which is a very basic optimization technique. So any performance comparison would have to be against that, not against the original code.

        It is also much, much slower if your denominator changes for each invocation:

        > In terms of processor time, pre-computing the proper magic number and shift is on the order of one to three hardware divides, for non-powers of 2.

        If you care about a fast hardware divider, then you're much more likely to have such code rather than the trivially-optimized code like the library example.

        • ryao 8 hours ago

          Good point. I withdraw my remark.
