Comment by jakestevens2 3 days ago

Your description is exactly right. We create a search space of all possible kernels and find the best ones based on runtime. The best heuristic is no heuristic.

This obviously creates a combinatorial problem that we mitigate with smarter search.

The kernels are run on the computer the compiler is running on. Since runtime is our gold standard, it will search for the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.
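As a rough illustration of the approach described above (runtime as the ground truth, measured on the local machine), here is a minimal Python sketch. The candidate kernels and the `pick_fastest` helper are invented for illustration; they are not the project's actual API:

```python
import time

# Two candidate "kernels" for the same op (matrix multiply). In a real
# compiler these would be generated code variants; here they are toy
# pure-Python implementations.

def matmul_naive(a, b):
    # Straightforward triple-loop matmul.
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

def matmul_transposed(a, b):
    # Same op, but transpose b first for better access locality.
    bt = list(map(list, zip(*b)))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

def pick_fastest(candidates, args, reps=3):
    """Time each (name, fn) candidate on this machine; return the fastest."""
    best = None
    for name, fn in candidates:
        t0 = time.perf_counter()
        for _ in range(reps):
            fn(*args)
        elapsed = time.perf_counter() - t0
        if best is None or elapsed < best[0]:
            best = (elapsed, name, fn)
    return best[1], best[2]
```

Because the selection is driven entirely by measured runtime on the machine doing the search, the winner is automatically specific to that hardware target, which is the point being made above.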

erichocean 2 days ago

> that we mitigate with smarter search

aka "a heuristic"

  • jakestevens2 2 days ago

    See my other comments about static profiling of kernels. There are ways of improving the search that keep runtime at the heart of it.

  • jafioti 2 days ago

MCTS/RL isn't really a heuristic. But yes, heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.
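A toy illustration of the division of labor described in this subthread: a cheap heuristic prunes *where* we measure, while measured runtime still decides the winner. This `beam_search` sketch and its parameters are hypothetical, not the project's actual search:

```python
# Beam search over per-op kernel choices (illustrative sketch).
# `score` is a cheap heuristic used only for pruning; `measure` is the
# expensive ground-truth runtime measurement used for the final pick.

def beam_search(op_choices, score, measure, beam_width=4):
    """op_choices: one list of kernel options per op in the graph."""
    beam = [()]  # partial plans: tuples of chosen options
    for options in op_choices:
        # Expand every partial plan by every option for the next op...
        expanded = [plan + (opt,) for plan in beam for opt in options]
        # ...then prune with the heuristic to keep the search tractable.
        expanded.sort(key=score)
        beam = expanded[:beam_width]
    # Ground truth: evaluate the surviving full plans, keep the fastest.
    return min(beam, key=measure)
```

Shrinking `beam_width` trades search quality for speed; as the comment notes, such pruning can be loosened over time as the search itself gets smarter, without ever changing what "best" means.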

UncleOxidant 3 days ago

How long does this typically take? It sounds time consuming. Also, it seems like this could be similar to doing a genetic algorithm (GA)?

  • jakestevens2 3 days ago

    That depends on the model architecture and how it was written since that informs the size of the search space.

    The typical range is 10 minutes to 10 hours. It won't be fast, but you only have to do it once, and then those optimizations are set for every forward pass.

    • sitkack 3 days ago

      Do you learn the capabilities of the underlying hardware relative to the kernel src? You should be able to start predicting perf using learned static profiling.

      • jakestevens2 3 days ago

        Not today, but we will implement memoization of kernels for each hardware backend, yes.

  • jakestevens2 3 days ago

    You can also set a time budget for how long you'd like the search to run, to avoid wasting time on diminishing returns.
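A minimal sketch tying the two replies above together: a wall-clock budget on the search, plus memoizing the winning kernel per hardware target so the search only runs once per (op, backend) pair. All names here (`budgeted_search`, `hardware_key`) are hypothetical stand-ins, not the project's API:

```python
import platform
import time

# Cache of previous search results, keyed by op and hardware target.
_best_kernel_cache = {}  # (op_name, hardware_key) -> chosen kernel fn

def hardware_key():
    # Stand-in for a real backend identifier (GPU arch, driver, etc.).
    return (platform.machine(), platform.processor())

def budgeted_search(op_name, candidates, args, budget_s=1.0):
    """Pick the fastest candidate, but stop once the time budget runs out."""
    key = (op_name, hardware_key())
    if key in _best_kernel_cache:
        return _best_kernel_cache[key]     # reuse an earlier search
    deadline = time.perf_counter() + budget_s
    best_time, best_fn = float("inf"), None
    for fn in candidates:
        # Always measure at least one candidate, then respect the budget.
        if best_fn is not None and time.perf_counter() > deadline:
            break
        t0 = time.perf_counter()
        fn(*args)
        dt = time.perf_counter() - t0
        if dt < best_time:
            best_time, best_fn = dt, fn
    _best_kernel_cache[key] = best_fn
    return best_fn
```

With the cache in place, a tight budget only costs you on the first run; later runs on the same hardware target skip the search entirely, which matches the "pay once, reuse for every forward pass" framing above.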

pilooch 3 days ago

Is this a bit similar to what TensorRT does, but in a more open manner?