Comment by jakestevens2 3 days ago

That depends on the model architecture and how it was written, since both determine the size of the search space.

The typical range is 10 minutes to 10 hours. It won't be fast, but you only have to do it once, and then those optimizations are set for every forward pass.

sitkack 3 days ago

Do you learn the capabilities of the underlying hardware relative to the kernel src? You should be able to start predicting perf using learned static profiling.

jakestevens2 3 days ago

Not today, but we will implement memoization of kernels for each hardware backend, yes.
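
For readers wondering what per-backend kernel memoization could look like, here is a minimal sketch, not the project's actual design. Every name in it (`KernelCache`, `toy_search`, the backend strings, the `tile` field) is hypothetical. It assumes the expensive search can be treated as a function of the backend and the kernel source, so the result can be keyed on that pair and reused.

```python
import hashlib


class KernelCache:
    """Hypothetical cache of tuned kernel configs, keyed by backend and kernel source."""

    def __init__(self, search_fn):
        # search_fn is the expensive search; assumed signature: (backend, kernel_src) -> config
        self._search_fn = search_fn
        self._store = {}
        self.search_calls = 0  # counts how often the expensive search actually ran

    def _key(self, backend, kernel_src):
        # Hash the source so the key stays small even for large kernels.
        digest = hashlib.sha256(kernel_src.encode("utf-8")).hexdigest()
        return (backend, digest)

    def get(self, backend, kernel_src):
        key = self._key(backend, kernel_src)
        if key not in self._store:
            # Cache miss: pay the search cost once for this backend/kernel pair.
            self.search_calls += 1
            self._store[key] = self._search_fn(backend, kernel_src)
        return self._store[key]


def toy_search(backend, kernel_src):
    # Placeholder for a search that could take minutes to hours in practice.
    return {"backend": backend, "tile": 64 if backend == "gpu_a" else 32}


cache = KernelCache(toy_search)
first = cache.get("gpu_a", "matmul")
second = cache.get("gpu_a", "matmul")   # served from the cache, no second search
other = cache.get("gpu_b", "matmul")    # different backend, so a new search runs
```

Keying on the backend as well as the source matters because a config tuned for one device is not assumed to carry over to another.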
