Comment by jafioti
yup! we build a search space by iteratively applying rewrite rules in every possible order (using e-graphs to do this efficiently). the rewrites alter stuff like looping / tiling structures, as well as algebraic rewrites like softmax to online softmax (and then flash attention).
yes optimized kernels for one system will work on other systems with the same hardware. its fine to take a long time compiling if you just compile once and run a lot.
Is/will it be possible to just write a model component with Luminal and then use that as a building block in e.g. Torch or JAX?