Comment by majke
I had a bit, limited, exposure to cuda. It was before the AI boom, during Covid.
I found it easy to start. Then there was a pretty nice learning curve to get to warps, SM's and basic concepts. Then I was able to dig deeper into the integer opcodes, which was super cool. I was able to optimize the compute part pretty well, without much roadblocks.
However, getting memory loads perfect and then getting closer to hw (warp groups, divergence, the L2 cache split thing, scheduling), was pretty hard.
I'd say CUDA is pretty nice/fun to start with, and it's possible to get quite far for a novice programmer. However getting deeper and achieving real advantage over CPU is hard.
Additionally there is a problem with Nvidia segmenting the market - some opcodes are present in _old_ gpu's (CUDA arch is _not_ forwards compatible). Some opcodes are reserved to "AI" chips (like H100). So, to get code that is fast on both H100 and RTX5090 is super hard. Add to that a fact that each card has different SM count and memory capacity and bandwidth... and you end up with an impossible compatibility matrix.
TLDR: Beginnings are nice and fun. You can get quite far on the optimizing compute part. But getting compatibility for differnt chips and memory access is hard. When you start, chose specific problem, specific chip, specific instruction set.