Comment by magicalhippo

Comment by magicalhippo 2 months ago

> If an embedded application does not make heavy use of multiplication, you can omit multiplication from the silicon for cost savings.

The problem was that the initial extension that included multiplication also included division[1]. A lot of small microcontrollers have multiplication hardware but not division hardware.

Thus it would make sense to have a multiplication-only extension.

IIRC the argument was that the CPU should just trap the division instructions and emulate them, but in the embedded world you want to know your performance envelope, so it is better to know explicitly whether hardware division is available.

[1]: https://docs.openhwgroup.org/projects/cva6-user-manual/01_cv...
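To illustrate why the distinction matters for timing, here is a minimal sketch of the kind of routine an emulation path could fall back on: plain restoring (shift-and-subtract) division. The name `soft_udiv32` is made up for this example and is not taken from any particular trap handler or runtime library. It always runs 32 loop iterations, and a trap-based emulation would add trap entry and exit on top of that, so the cost profile is quite different from a hardware divider.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned 32-bit restoring division.
 * Returns the quotient and, if rem is non-NULL, stores the remainder.
 * Assumes d != 0; the caller is expected to check. */
uint32_t soft_udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0; /* 64-bit so the shift below cannot overflow */

    for (int i = 31; i >= 0; i--) {
        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((n >> i) & 1u);
        /* If the divisor fits, subtract it and set this quotient bit. */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }

    if (rem) {
        *rem = (uint32_t)r;
    }
    return q;
}
```

For example, `soft_udiv32(100, 7, &r)` returns 14 and leaves 2 in `r`.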

ryao 2 months ago

Software division is often faster than hardware division, so your performance remark seems to be a moot point:

https://libdivide.com/


magicalhippo 2 months ago

I don't think that library refutes anything I said.

First off, that library requires you to fundamentally change the code by moving some precomputation outside loops.

Of course, I can do a similar trick without that library, moving the division outside the loop using simple fixed-point math, which is a very basic optimization technique. So any performance comparison would have to be against that, not the original code.

It is also much, much slower if your denominator changes for each invocation:

> In terms of processor time, pre-computing the proper magic number and shift is on the order of one to three hardware divides, for non-powers of 2.

If you care about a fast hardware divider, then you're much more likely to have such code than trivially optimized code like the library example.
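A minimal sketch of the fixed-point trick, assuming 16-bit operands and a divisor that stays constant across the loop. The names `make_reciprocal`, `div_by_reciprocal` and `scale_samples` are made up for this example; this is not libdivide's API. The reciprocal is computed once as ceil(2^32 / d), after which each division becomes a multiply and a shift, which is exact for 16-bit `n` and non-zero 16-bit `d`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ceil(2^32 / d), i.e. the reciprocal of d
 * in 32.32 fixed point. This still costs one real division. */
static uint64_t make_reciprocal(uint16_t d)
{
    assert(d != 0);
    return UINT64_C(0xFFFFFFFF) / d + 1;
}

/* Hypothetical helper: n / d using the precomputed reciprocal m.
 * The product is below 2^48, so it cannot overflow 64 bits. */
static uint16_t div_by_reciprocal(uint16_t n, uint64_t m)
{
    return (uint16_t)((m * n) >> 32);
}

/* Divide every sample by the same d: one division up front,
 * then only multiplies and shifts inside the loop. */
void scale_samples(uint16_t *samples, size_t count, uint16_t d)
{
    const uint64_t m = make_reciprocal(d);

    for (size_t i = 0; i < count; i++) {
        samples[i] = div_by_reciprocal(samples[i], m);
    }
}
```

If `d` changes on every call, `make_reciprocal` has to run every time, so the division is not saved at all. That is the case where a fast hardware divider matters.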

- ryao 2 months ago
  
  Good point. I withdraw my remark.
  