Comment by bob1029
> Secondly, we have got machines equipped with multi-level stores, presenting us problems of management strategy that, in spite of the extensive literature on the subject, still remain rather elusive.
NUMA has only gotten more complicated over time, and the range of latencies is more extreme than ever: on one end we've got L1 hits resolving in about a nanosecond, and on the other we've got cold tapes that can take a whole day to load. Deciding which kind of memory/compute to use in a heterogeneous system (CPU/GPU) is also hard to figure out. Multi-core is likely the most devastating dragon to arrive since this article was written.
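If you want to see that latency range on your own machine, a pointer-chasing microbenchmark is the classic probe. This is a rough sketch, not a rigorous benchmark; the buffer sizes, step count, and the assumption that a random chase defeats the prefetcher are all approximations:

```c
// Sketch: pointer-chasing microbenchmark to expose the latency cliff
// between cache levels. The buffer is walked via a random cyclic
// permutation so each load depends on the previous one (pure serialized
// latency) and the hardware prefetcher can't hide the misses.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   // POSIX clock_gettime

int main(void) {
    // Working sets from 4 KiB (fits in L1) up to 64 MiB (DRAM).
    for (size_t bytes = 4096; bytes <= 64u << 20; bytes *= 4) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));

        // Sattolo's algorithm: build a single-cycle permutation.
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        size_t idx = 0;
        const size_t steps = 10 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++) idx = next[idx];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        // Print idx so the compiler can't optimize the chase away.
        printf("%8zu KiB: %6.1f ns/load (idx=%zu)\n",
               bytes / 1024, ns / steps, idx);
        free(next);
    }
    return 0;
}
```

Compiled with -O2, this typically prints a staircase: roughly a nanosecond per load while the working set fits in L1, climbing toward ~100 ns once it spills into DRAM.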
Premature optimization might be evil, but committing to a data layout up front is the only way to efficiently align the software with the memory architecture, because retrofitting it later amounts to a rewrite. E.g., in a Unity application, moving from GameObjects to ECS is basically starting over.
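The reason the rewrite is so invasive is that the memory layout itself flips. Here's the idea in plain C (not Unity's actual types; the struct fields are made up for illustration), contrasting an array-of-structs "GameObject" layout with the struct-of-arrays layout an ECS pushes you toward:

```c
#include <stddef.h>

// "GameObject" style: array of structs. Updating positions drags the
// whole struct (name, mesh, health...) through the cache, so most of
// each L1 line is cold data you never touch.
struct GameObject {
    float x, y, z;
    float vx, vy, vz;
    char  name[64];          // cold data interleaved with hot data
    int   mesh_id, health;
};

void update_aos(struct GameObject *objs, size_t n, float dt) {
    for (size_t i = 0; i < n; i++) {
        objs[i].x += objs[i].vx * dt;
        objs[i].y += objs[i].vy * dt;
        objs[i].z += objs[i].vz * dt;
    }
}

// "ECS" style: struct of arrays. The update streams only the hot
// components, so every byte pulled into L1 is actually used.
struct Positions  { float *x, *y, *z; };
struct Velocities { float *vx, *vy, *vz; };

void update_soa(struct Positions *p, struct Velocities *v,
                size_t n, float dt) {
    for (size_t i = 0; i < n; i++) {
        p->x[i] += v->vx[i] * dt;
        p->y[i] += v->vy[i] * dt;
        p->z[i] += v->vz[i] * dt;
    }
}
```

Every loop that touched the old struct has to be rewritten against the component arrays, which is why this can't be bolted on after the fact.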
If you could only focus on one aspect, I would keep the average temperature of L1 in mind constantly. If you can keep it semi-warm, nothing else really matters. There are very few problems a modern CPU can't chew through ~instantly as long as the working set fits in L1 and there's no contention with other threads.
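Cache blocking (tiling) is the standard trick for keeping that working set warm: finish all the work on a chunk that fits in L1 before moving on. A sketch, assuming a ~32 KiB L1 (the tile size is a guess you'd tune per machine):

```c
#include <stddef.h>

enum { TILE = 64 };  // 64x64 floats per tile, two tiles ~= 32 KiB of L1

// Naive transpose: writes to dst stride through memory, so by the time
// a cache line is touched again it has long since been evicted.
void transpose_naive(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

// Blocked transpose: identical work, but each TILE x TILE block of src
// and dst stays resident in L1 until it is fully consumed.
void transpose_blocked(float *dst, const float *src, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both functions do the same work; the blocked one just revisits each cache line while it's still warm instead of after eviction.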
This is the same thinking that drives some of us to use SQLite over hosted SQL providers. Thinking not just in terms of the information, but the latency domain the information lives in, is what can unlock those bananas 1000x+ speedups.
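Concretely, an SQLite query is a function call into your own address space. A minimal sketch (the file name and kv schema are hypothetical):

```c
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    sqlite3_stmt *stmt;
    if (sqlite3_open("test.db", &db) != SQLITE_OK) return 1;

    // The entire "round trip" is a call into the same process; a hosted
    // SQL provider puts a network hop between you and data that SQLite
    // can often serve straight from the page cache.
    sqlite3_prepare_v2(db, "SELECT v FROM kv WHERE k = ?1", -1, &stmt, NULL);
    sqlite3_bind_int(stmt, 1, 42);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        printf("v = %s\n", (const char *)sqlite3_column_text(stmt, 0));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```

A hot lookup like this commonly lands in the low microseconds, versus the milliseconds of network round trip a hosted database charges you before it does any work at all.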