Alifatisk 3 days ago

So wait, am I understanding this correctly?

Instead of just applying predetermined optimization rules or patterns, the compiler formulates the problem as a search through many possible versions of the code. Each version can have different arrangements, tiling sizes, thread block configurations, memory access patterns, and instruction sequences, right?

And from my understanding, the “search space” is just a collection of all potential versions of the code (kernels) that the compiler can generate from the original input. So for example, the space might include

- Different ways to partition workloads among GPU threads and blocks

- Varying memory access strategies (using shared memory, global memory)

- Various instruction-level optimizations or reordering

- Alternative loop unroll factors or vectorization strategies

The compiler then programmatically produces a large number of candidate kernels by combining different optimizations and configurations. Among these millions of candidates, the compiler tries to find the one that performs best.
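
Roughly, I picture it as an exhaustive autotuning loop like this (a rough python sketch with made-up knobs and a stubbed timing function, just to make the idea concrete; none of these names come from the actual compiler):

    import itertools

    # hypothetical tunable knobs; every combination is one candidate kernel
    TILE_SIZES = [16, 32, 64, 128]
    BLOCK_DIMS = [(8, 8), (16, 16), (32, 8)]
    UNROLL_FACTORS = [1, 2, 4]

    def time_kernel(tile, block, unroll):
        # stand-in for the real measurement: compile the candidate, run it on
        # the local GPU a few times, and return the median runtime in ms.
        # faked with an arbitrary formula here just so the sketch runs end to end
        return abs(tile - 64) * 0.01 + abs(block[0] * block[1] - 256) * 0.001 + 1.0 / unroll

    best = min(
        itertools.product(TILE_SIZES, BLOCK_DIMS, UNROLL_FACTORS),
        key=lambda cfg: time_kernel(*cfg),
    )
    print("best (tile, block, unroll) for this machine:", best)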

In that case, can the compiler print out which GPU configuration works best for that computer? And will that configuration be applicable to all computers with the same setup?

This is such an interesting technique.

jakestevens2 3 days ago

Your description is exactly right. We create a search space of all possible kernels and find the best ones based on runtime. The best heuristic is no heuristic.

This obviously creates a combinatorial explosion, which we mitigate with smarter search.

The kernels are run on the computer the compiler is running on. Since runtime is our gold standard, the search finds the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.

  • erichocean 2 days ago

    > that we mitigate with smarter search

    aka "a heuristic"

    • jakestevens2 2 days ago

      See my other comments about static profiling of kernels. There are ways of improving the search that keep runtime at the heart of it.

    • jafioti 2 days ago

      mcts / rl isn't really a heuristic. but yes heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.
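
      as a toy illustration (not luminal's actual code), a bandit-style search like ucb1 focuses measurement effort on promising candidates without ever hard-coding a rule for which kernel should win:

          import math, random

          # toy ucb1 search over a few candidate kernels. reward is just negative
          # measured runtime; nothing hand-written says which candidate is "good"
          candidates = ["kernel_a", "kernel_b", "kernel_c", "kernel_d"]
          hidden_ms = {"kernel_a": 5.0, "kernel_b": 3.2, "kernel_c": 4.1, "kernel_d": 3.0}

          def measure(name):
              # stand-in for one noisy runtime measurement on the real gpu
              return hidden_ms[name] + random.gauss(0, 0.2)

          counts = {c: 0 for c in candidates}
          rewards = {c: 0.0 for c in candidates}

          for t in range(1, 201):
              def ucb(c):
                  if counts[c] == 0:
                      return float("inf")
                  return rewards[c] / counts[c] + math.sqrt(2 * math.log(t) / counts[c])
              chosen = max(candidates, key=ucb)
              rewards[chosen] += -measure(chosen)
              counts[chosen] += 1

          print("most-measured (likely fastest):", max(counts, key=counts.get))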

  • UncleOxidant 3 days ago

    How long does this typically take? It sounds time-consuming. Also, it seems like this could be similar to doing a genetic algorithm (GA)?

    • jakestevens2 3 days ago

      That depends on the model architecture and how it was written, since that informs the size of the search space.

      The typical range is 10 minutes to 10 hours. It won't be fast, but you only have to do it once, and then those optimizations are set for every forward pass.

      • sitkack 3 days ago

        Do you learn the capabilities of the underlying hardware relative to the kernel src? You should be able to start predicting perf using learned static profiling.

        • jakestevens2 3 days ago

          Not today, but we will implement memoization of kernels for each hardware backend, yes.
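
          In sketch form that would just be a cache keyed on the kernel and the hardware it was tuned on, so a repeat compilation skips the search (hypothetical names, not the actual implementation):

              import hashlib

              # hypothetical cache: (kernel source hash, hardware id) -> tuned config
              _cache = {}

              def best_config_for(kernel_src, hardware_id, run_search):
                  # run the expensive search only on a cache miss; the same kernel
                  # on the same hardware reuses the stored result
                  key = (hashlib.sha256(kernel_src.encode()).hexdigest(), hardware_id)
                  if key not in _cache:
                      _cache[key] = run_search(kernel_src)
                  return _cache[key]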

    • jakestevens2 3 days ago

      You can also set a time budget for how long you'd like the search to run, to avoid wasting time on diminishing returns.
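
      Conceptually something like this (illustrative python, not the real API):

          import time

          def search_with_budget(candidates, measure, budget_seconds=600):
              # try candidates until the budget runs out, keeping the fastest seen so far
              deadline = time.monotonic() + budget_seconds
              best, best_ms = None, float("inf")
              for cand in candidates:
                  if time.monotonic() > deadline:
                      break  # diminishing returns: stop early, keep the best so far
                  ms = measure(cand)
                  if ms < best_ms:
                      best, best_ms = cand, ms
              return best, best_ms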

  • pilooch 3 days ago

    Is this a bit similar to what TensorRT does, but in a more open manner?

jafioti 3 days ago

yup! we build a search space by iteratively applying rewrite rules in every possible order (using e-graphs to do this efficiently). the rewrites alter stuff like looping / tiling structures and include algebraic rewrites like softmax to online softmax (and then flash attention).
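
as a concrete example of one of those algebraic rewrites, here's plain softmax next to the one-pass "online" form that keeps a running max and rescales the denominator as it goes (illustrative python, not how the compiler represents it):

    import math

    def softmax_naive(xs):
        # pass 1: max, pass 2: exponentiate, pass 3: normalize
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    def softmax_online(xs):
        # single pass over xs to get both the max and the denominator,
        # rescaling the denominator whenever a new max shows up
        m, d = float("-inf"), 0.0
        for x in xs:
            new_m = max(m, x)
            d = d * math.exp(m - new_m) + math.exp(x - new_m)
            m = new_m
        return [math.exp(x - m) / d for x in xs]

    xs = [1.0, 2.0, 3.0, 0.5]
    assert all(abs(a - b) < 1e-9 for a, b in zip(softmax_naive(xs), softmax_online(xs)))

same result, but the max and the denominator now come out of a single pass over the input, which is the stepping stone to flash attention.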

yes, optimized kernels for one system will work on other systems with the same hardware. it's fine to take a long time compiling if you just compile once and run a lot.

  • _0ffh 3 days ago

    Is/will it be possible to just write a model component with Luminal and then use that as a building block in e.g. Torch or JAX?

  • almostgotcaught 3 days ago

    > take a long time compiling

    Lol, NP-hard is still NP-hard no matter how you slice it (especially given vague objective functions).