Comment by deng
> However, a more likely explanation is the use of "volatile" (which only appears in the working version of the code). Without it, the compiler might even have completely removed the loop?
No, because the loop calls cpu_relax(), which is a compiler barrier. It cannot be optimized away.
And yes, reading via the memory bus is much, much slower than a barrier. It's absolutely likely that reading 4 times from main memory on such an old embedded system takes several hundred cycles.
From what I understand the timer registers should be on APB(1) bus which operates at fixed 26MHz clock. That should be much closer to the scale of fast timer clocks compared to cpu_relax() and main CPU clock running somewhere in the range of 0.5-1GHz and potentially doing some dynamic frequency scaling for power saving purpose.
The silliest part of this mess is that 26Mhz clock for APB1 bus is derived from the same source as 13Mhz, 6.5Mhz 3.25Mhz, 1Mhz clocks usable by fast timers.