Comment by bjackman
> A useful analogy here is the rise of AMD in the datacenter. [...] Large hyperscalers found it worth their time and effort to rewrite extremely low level software to be truly agnostic between AMD and Intel
As someone who works on such low-level software at a hyperscaler, I am skeptical of this comparison. The difference between AMD and Intel is really not that great, and in the biggest areas, open source software (e.g. the kernel, especially KVM, and compilers) is already fully agnostic, in large part thanks to Intel and AMD themselves. Nobody in this space is gonna buy an x86 CPU without full upstream Linux+KVM+LLVM support.
If breaking down the CUDA wall were the same order of magnitude of challenge as Intel vs AMD CPUs, wouldn't we already have broken it down by now? Plus, I don't see any sign of Nvidia helping out with that.
I don't know anything about CUDA though so maybe I'm overestimating the barrier here and the real reason is just that people haven't been sufficiently motivated yet.
What they mean is that they are rewriting low-level synchronization primitives so as not to penalize AMD CPUs. For example, on AMD Rome CPUs the cross-CCD latency of atomic instructions could be as high as 200 nanoseconds, even when the instruction targets a memory location that is supposedly already in cache. Common code patterns, like multiple cores atomically incrementing a single counter, would have borderline-acceptable performance on Intel but terrible performance on AMD.
Or consider things like CPU core allocators, which now need to be CCD-aware when allocating cores within a CPU to a container.