Comment by ryao

I don't think that library refutes anything I said.

First off, that library requires you to fundamentally change the code by moving some precomputation outside the loop.

Of course, I can do a similar trick without that library: move the division outside the loop using simple fixed-point math, which is a very basic optimization technique. So any performance comparison would have to be against that, not against the original code.
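To make the fixed-point trick concrete, here is a rough sketch of what I mean. It is not that library's API; the function names are made up for illustration, and it assumes a compiler with the unsigned __int128 extension (GCC or Clang) and a divisor of at least 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 0.64 fixed-point reciprocal, ceil(2^64 / d) for
 * non-powers of 2 (one more than 2^64 / d for powers of 2).
 * Assumption: d >= 2, because d == 1 would overflow to 0. */
static uint64_t reciprocal_u32(uint32_t d)
{
    assert(d >= 2);
    return UINT64_MAX / d + 1;   /* the only hardware divide */
}

/* Quotient n / d from the precomputed reciprocal m: the high 64 bits of
 * the 128-bit product m * n. Exact for every 32-bit n. */
static uint32_t div_by_reciprocal(uint32_t n, uint64_t m)
{
    return (uint32_t)(((unsigned __int128)m * n) >> 64);
}

/* Divide every element by the same d; the divide is hoisted out of the loop. */
static void divide_all(uint32_t *v, size_t len, uint32_t d)
{
    const uint64_t m = reciprocal_u32(d);
    for (size_t i = 0; i < len; i++)
        v[i] = div_by_reciprocal(v[i], m);
}
```

The loop body is a multiply and a shift, so this hand-rolled version is the baseline a fair benchmark would have to beat.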

It is also much, much slower if your denominator changes for each invocation:

In terms of processor time, pre-computing the proper magic number and shift is on the order of one to three hardware divides, for non-powers of 2.

If you care about a fast hardware divider, then you're much more likely to have code like that, where the denominator keeps changing, than trivially optimizable code like the library's example.
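As an illustration of the changing-denominator case, consider an element-wise divide. This is only a sketch with made-up function names; the reciprocal variant assumes every divisor is at least 2 and a compiler with unsigned __int128 (GCC or Clang).

```c
#include <stddef.h>
#include <stdint.h>

/* Plain version: one hardware divide per element. */
static void divide_pairwise(const uint32_t *n, const uint32_t *d,
                            uint32_t *q, size_t len)
{
    for (size_t i = 0; i < len; i++)
        q[i] = n[i] / d[i];
}

/* Reciprocal version: nothing can be hoisted, so each element still pays
 * a 64-bit hardware divide for the reciprocal, plus a wide multiply.
 * Assumption: every d[i] >= 2. */
static void divide_pairwise_reciprocal(const uint32_t *n, const uint32_t *d,
                                       uint32_t *q, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint64_t m = UINT64_MAX / d[i] + 1;   /* per-element precompute */
        q[i] = (uint32_t)(((unsigned __int128)m * n[i]) >> 64);
    }
}
```

Both functions produce the same quotients, but the second does strictly more work per element, so here the speed of the hardware divider is what matters.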