Comment by rep_lodsb
It's possible that actually reading the register takes (significantly) more time than an empty countdown loop. A somewhat extreme example of that would be x86, where accesses to legacy I/O ports (e.g. for the timer) go through a much lower-clocked emulated ISA bus.
However, a more likely explanation is the use of "volatile" (which only appears in the working version of the code). Without it, the compiler might even have completely removed the loop?
> However, a more likely explanation is the use of "volatile" (which only appears in the working version of the code). Without it, the compiler might even have completely removed the loop?
No, because the loop calls cpu_relax(), which is a compiler barrier. It cannot be optimized away.
And yes, reading over the memory bus is much, much slower than a barrier. It's quite likely that reading four times from main memory on such an old embedded system takes several hundred cycles.
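To illustrate the two points above, here is a minimal sketch, not the code under discussion. It assumes a GCC/Clang-style compiler, and compiler_barrier(), delay_with_barrier() and read_reg_n_times() are hypothetical names made up for the example. The empty asm statement with a "memory" clobber stands in for the compiler-barrier part of cpu_relax(); the real definition is architecture-specific and may do more.

```c
#include <stdint.h>

/* Assumption: GCC/Clang extended asm. An empty asm volatile statement
 * with a "memory" clobber acts as a compiler barrier only; it emits no
 * instructions. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical countdown delay. The asm volatile statement may not be
 * deleted, so the loop has to execute once per count even though its
 * body does no useful work. Returns the number of iterations run. */
unsigned int delay_with_barrier(unsigned int count)
{
    unsigned int spins = 0;

    while (count--) {
        compiler_barrier();
        spins++;
    }
    return spins;
}

/* Hypothetical register poll. Because *reg is volatile-qualified, the
 * compiler must perform every one of the n loads; on real hardware each
 * load would be a bus access, which is where the time goes. Returns the
 * last value read. */
uint32_t read_reg_n_times(const volatile uint32_t *reg, unsigned int n)
{
    uint32_t last = 0;

    while (n--)
        last = *reg;
    return last;
}
```

Either mechanism keeps the loop in the generated code. What differs is the cost per iteration: the barrier itself costs nothing at run time, while each volatile load has to wait for the bus.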