Comment by rep_lodsb
You're right, didn't account for that. Though even when declared volatile, the counter variable would be on the stack, and thus already in the CPU cache (at least 32K according to the datasheet)?
Looking at the assembly code for both versions of this delay loop might clear it up.
The only thing 'volatile' does is ensure that every access to the variable written in the source actually happens as a load or store (which implicitly also forbids optimizing those accesses away). Whether that memory is in a CPU cache is purely a hardware issue and outside the C specification. If you read something like a hardware register, you yourself need to make sure that a hardware cache will not give you old values (by mapping it into a non-cached memory area, or by invalidating the cache before the read). If your for-loop body contains something that acts as a compiler barrier, all that 'volatile' on the counter variable will do is potentially make the for-loop slower.
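To make that concrete, here is a rough sketch of the two kinds of delay loop being compared. This is not the code from the thread: the function names are made up for illustration, and the empty asm statement with a "memory" clobber is a GCC/Clang extension rather than standard C.

```c
#include <stdint.h>

/* Variant 1 (hypothetical name): volatile counter.
   Every read and write of i must be a real memory access,
   so the compiler cannot delete or collapse the loop. */
uint32_t delay_volatile(uint32_t n)
{
    volatile uint32_t i;
    for (i = 0; i < n; i++) {
        /* intentionally empty */
    }
    return i;
}

/* Variant 2 (hypothetical name): plain counter plus a compiler barrier.
   The asm statement emits no instructions, but because it is volatile
   the compiler has to keep one "execution" of it per iteration, so the
   loop survives while i is free to live in a register. */
uint32_t delay_barrier(uint32_t n)
{
    uint32_t i;
    for (i = 0; i < n; i++) {
        __asm__ volatile("" ::: "memory");
    }
    return i;
}
```

Compiling this with something like `gcc -O2 -S` and comparing the two loops should show the difference: the first one typically goes through the stack slot on every iteration, the second one typically just counts in a register.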
There are really very few reasons to ever use 'volatile'. In fact, the Linux kernel even has its own documentation on why you should usually not use it:
https://www.kernel.org/doc/html/latest/process/volatile-cons...