Comment by korbip
I can share a similar PhD story (the result being visible here: https://github.com/NX-AI/flashrnn). Back then I didn't find any tutorials that covered anything beyond the basics (which are still important). Once you have understood the basic working model and architecture of a GPU, I would recommend the following workflow:

1. First create an environment in which you can actually test your kernels against baselines written in a higher-level language.
2. If you don't already have an urgent project, try to improve or re-implement existing problems (MatMul being the first example). Don't get caught up in wanting to handle all size cases. Take an example just to learn a certain functionality, rather than solving the whole problem, if it's just about learning.
3. Write the functionality you want in increasing complexity. Write loops first, then parallelize these loops over the grid. Use global memory first, then put things into shared memory and registers. Use plain matrix multiplication first, then use mma (TensorCore) primitives to speed things up (see the sketches after this list).
4. Iterate over the CUDA C Programming Guide. It covers all (well, most) of the functionality you will want to learn - but it can't just be read and memorized. You learn it by applying it.
5. Depending on your use case, also consider higher-level abstractions like CUTLASS or ThunderKittens. And if your environment is jax/torch, use Triton first before going down to the CUDA level.
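To make steps 1-3 concrete, here is a minimal sketch of what that starting point can look like: a naive CUDA matmul kernel (one thread per output element, global memory only, the direct translation of the CPU loops onto the grid) verified against a plain CPU baseline. The matrix size, block shape, and tolerance are illustrative assumptions, not part of any particular project.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    // Each thread computes one element C[row][col]: the straight
    // "parallelize the loops over the grid" version, global memory only.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

int main() {
    const int N = 256;  // illustrative size
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes);
    float *hC = (float*)malloc(bytes), *ref = (float*)malloc(bytes);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = rand() / (float)RAND_MAX;
        hB[i] = rand() / (float)RAND_MAX;
    }

    // CPU baseline: the plain triple loop you trust.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k) acc += hA[i * N + k] * hB[k * N + j];
            ref[i * N + j] = acc;
        }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    matmul_naive<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // Compare against the baseline with a loose float tolerance.
    float max_err = 0.0f;
    for (int i = 0; i < N * N; ++i)
        max_err = fmaxf(max_err, fabsf(hC[i] - ref[i]));
    printf("max abs error vs CPU baseline: %g\n", max_err);
    return 0;
}
```

Once this matches the baseline, each later optimization (shared-memory tiling, register blocking, TensorCores) gets re-checked against the same harness, which is the whole point of setting it up first.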
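And for the last part of step 3, a minimal sketch of what stepping up to TensorCore primitives looks like with CUDA's wmma API: one warp computing a single 16x16 output tile with fp16 inputs and fp32 accumulation. This assumes N is a multiple of 16, row-major layouts, and a GPU with compute capability 7.0+; a real kernel would tile the full matrix and stage tiles through shared memory first.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void matmul_wmma_tile(const half* A, const half* B, float* C, int N) {
    // One warp per 16x16 output tile.
    int tile_row = blockIdx.y * 16;
    int tile_col = blockIdx.x * 16;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk the K dimension 16 columns at a time, accumulating on TensorCores.
    for (int k = 0; k < N; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_row * N + k, N);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_col, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_row * N + tile_col, c_frag, N, wmma::mem_row_major);
}

// Launch sketch (assumption): one 32-thread warp per block, one block per tile,
// e.g. matmul_wmma_tile<<<dim3(N / 16, N / 16), 32>>>(dA, dB, dC, N);
```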
Overall, it will involve some pain for sure. And mastering it, including PTX etc., will take a lot of time.