Comment by Alifatisk 3 days ago
So wait, am I understanding this correctly?
Instead of applying only predetermined optimization rules or patterns, the compiler formulates the problem as a search through many possible configurations or versions of the code. Each candidate version can differ in its tiling sizes, thread block configurations, memory access patterns, and instruction sequences, right?
And from my understanding, the “search space” is just the collection of all potential versions of the code (kernels) that the compiler can generate from the original input. So, for example, the space might include:
- Different ways to partition workloads among GPU threads and blocks
- Varying memory access strategies (using shared memory, global memory)
- Various instruction-level optimizations or reordering
- Alternative loop unroll factors or vectorization strategies
The compiler then programmatically produces a large number of candidate kernels by combining different optimizations and configurations. Among these millions of candidates, the compiler tries to find the one that performs best.
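To make this concrete, here's a rough sketch of what I mean by "combining different optimizations and configurations" (all the knob names and the cost function are made up for illustration; a real compiler would compile and time actual kernels):

```python
import itertools

# Hypothetical search-space dimensions (illustrative names, not the
# compiler's actual parameters): tile sizes, threads per block, and
# whether to stage data through shared memory.
TILE_SIZES = [16, 32, 64]
THREADS_PER_BLOCK = [128, 256, 512]
USE_SHARED_MEMORY = [False, True]

def enumerate_candidates():
    """Yield every combination of the tuning knobs above."""
    for tile, threads, shared in itertools.product(
        TILE_SIZES, THREADS_PER_BLOCK, USE_SHARED_MEMORY
    ):
        yield {"tile": tile, "threads": threads, "shared": shared}

def benchmark(config):
    """Stand-in for compiling and timing a kernel with this config.
    A real system would launch the kernel and measure wall-clock time;
    here a toy cost model keeps the example runnable anywhere."""
    cost = 1.0 / config["tile"] + 100 / config["threads"]
    if config["shared"]:
        cost *= 0.8  # pretend shared-memory staging helps
    return cost

candidates = list(enumerate_candidates())
best = min(candidates, key=benchmark)
print(len(candidates), best)
```

Even with only three knobs you get 3 × 3 × 2 = 18 candidates; each extra dimension multiplies the count.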
In that case, can the compiler print out which GPU configuration works best for that computer? And will that configuration be applicable to all computers with the same setup?
This is such an interesting technique.
Your description is exactly right. We create a search space of all possible kernels and find the best ones based on runtime. The best heuristic is no heuristic.
This obviously creates a combinatorial explosion, which we mitigate with smarter search.
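A toy illustration of the blow-up, and of random sampling as one simple mitigation (illustrative only, not the compiler's actual search strategy):

```python
import random

# With k independent tuning knobs of n choices each, the full space
# has n**k points. Here: 8 knobs with 8 options each.
knobs = {f"knob{i}": list(range(8)) for i in range(8)}
full_size = 1
for options in knobs.values():
    full_size *= len(options)
print(full_size)  # 8**8 = 16,777,216 candidate kernels

# Random search evaluates only a small sample of the full product,
# trading optimality for a fixed evaluation budget.
def sample_config(rng):
    return {name: rng.choice(opts) for name, opts in knobs.items()}

rng = random.Random(0)
budget = 100
samples = [sample_config(rng) for _ in range(budget)]
print(len(samples))
```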
The kernels are run on the computer the compiler is running on. Since runtime is our gold standard, the search finds the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.
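A minimal timing harness in the spirit of "runtime is the gold standard" might look like this sketch (Python callables stand in for compiled kernels; a real tuner would launch GPU kernels and use device timers):

```python
import statistics
import time

def time_kernel(fn, warmup=3, repeats=10):
    """Measure a callable's runtime the way an autotuner might:
    warm up first (caches, JIT), then take the median of several
    runs to reduce measurement noise."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Toy candidates standing in for two compiled kernel variants.
def candidate_a():
    sum(range(10_000))

def candidate_b():
    sum(range(50_000))

runtimes = {name: time_kernel(fn)
            for name, fn in [("a", candidate_a), ("b", candidate_b)]}
best = min(runtimes, key=runtimes.get)
print(best)
```

Because the measurement happens on your machine, the winning configuration reflects your hardware; a machine with the same GPU and similar setup should usually pick the same winner.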