Show HN: Luminal – Open-source, search-based GPU compiler

145 points by jafioti 3 days ago

Hi HN, I’m Joe. My friends Matthew, Jake and I are building Luminal (https://luminalai.com/), a GPU compiler for automatically generating fast GPU kernels for AI models. It uses search-based compilation to achieve high performance.

We take high level model code, like you'd have in PyTorch, and generate very fast GPU code. We do that without using LLMs or AI - rather, we pose it as a search problem. Our compiler builds a search space, generates millions of possible kernels, and then searches through it to minimize runtime.

You can try out a demo in `demos/matmul` on a Mac to see how Luminal takes a naive operation, represented in our IR of 12 simple operations, and compiles it to an optimized, tensor-core enabled Metal kernel. Here’s a video showing how: https://youtu.be/P2oNR8zxSAA

Our approach differs significantly from traditional ML libraries in that we ahead-of-time compile everything, generate a large search space of logically-equivalent kernels, and search through it to find the fastest kernels. This allows us to leverage the Bitter Lesson to discover complex optimizations like Flash Attention entirely automatically without needing manual heuristics. The best rule is no rule, the best heuristic is no heuristic, just search everything.

We’re working on bringing CUDA support up to parity with Metal, adding more flexibility to the search space, adding full-model examples (like Llama), and adding very exotic hardware backends.

We aim to radically simplify the ML ecosystem while improving performance and hardware utilization. Please check out our repo: https://github.com/luminal-ai/luminal and I’d love to hear your thoughts!

Alifatisk 3 days ago

So wait, am I understanding this correctly?

Instead of applying just predetermined optimization rules or patterns, the compiler formulates the problem as searching through many possible configurations or versions of the code. Each possible version can have different arrangements, tiling sizes, thread block configurations, memory access patterns, and instruction sequences, right?

And from my understanding, the “search space” is just a collection of all potential versions of the code (kernels) that the compiler can generate from the original input. So for example, the space might include

- Different ways to partition workloads among GPU threads and blocks

- Varying memory access strategies (using shared memory, global memory)

- Various instruction-level optimizations or reordering

- Alternative loop unroll factors or vectorization strategies

The compiler then programmatically produces a large number of candidate kernels by combining different optimizations and configurations. Among these millions of candidates, the compiler tries to find the one that performs best.

In that case, can the compiler print out which gpu configuration works the best for that computer? And will that configuration be applicable to all computers with the same setup?

This is such an interesting technique.


jakestevens2 3 days ago

Your description is exactly right. We create a search space of all possible kernels and find the best ones based on runtime. The best heuristic is no heuristic.
This obviously creates a combinatorial problem that we mitigate with smarter search.
The kernels are run on the computer the compiler is running on. Since runtime is our gold standard it will search for the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.
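
A minimal sketch of the "measure everything, keep the fastest" idea, with an optional time budget. This is not Luminal's code: the toy tiled matmul, the `search_fastest` helper, and the `budget_s` parameter are all hypothetical names made up for illustration, and the only thing being searched here is a single tile size rather than a full space of kernels.

```python
import time

def matmul_tiled(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `tile`."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c

def search_fastest(candidates, run, budget_s):
    """Time each candidate once on this machine and keep the fastest.

    Stops early (after at least one measurement) once `budget_s` seconds are spent.
    Returns (best_candidate, best_seconds, timings).
    """
    start = time.perf_counter()
    best, best_t = None, float("inf")
    timings = {}
    for cand in candidates:
        if timings and time.perf_counter() - start > budget_s:
            break  # time budget exhausted, keep the best found so far
        t0 = time.perf_counter()
        run(cand)
        elapsed = time.perf_counter() - t0
        timings[cand] = elapsed
        if elapsed < best_t:
            best, best_t = cand, elapsed
    return best, best_t, timings

n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
tile_sizes = [1, 2, 4, 8, 16, 32]

best_tile, best_time, timings = search_fastest(
    tile_sizes, lambda t: matmul_tiled(a, b, n, t), budget_s=5.0)
print(f"fastest tile size on this machine: {best_tile} ({best_time * 1e3:.2f} ms)")
```

The winner depends on the machine it runs on, which is the point: the measured runtime is the only cost function. A real search would repeat each measurement several times to reduce noise.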

- UncleOxidant 3 days ago
  
  How long does this typically take? It sounds time consuming. Also, it seems like this could be similar to doing a GA?
  
  
  jakestevens2 3 days ago
  
  That depends on the model architecture and how it was written since that informs the size of the search space.
  The typical range is 10 mins to 10 hours. It won't be fast but you only have to do it once and then those optimizations are set for every forward pass.
  
  
  jakestevens2 3 days ago
  
  You can also set a time budget for how long you'd like the search to run for to avoid wasting time on diminishing returns.
  
- erichocean 2 days ago
  
  > that we mitigate with smarter search
  aka "a heuristic"
  
  
  jakestevens2 2 days ago
  
  See my other comments about static profiling of kernels. There are ways of improving the search that keep runtime at the heart of it.
  
  
  jafioti 2 days ago
  
  mcts / rl isn't really a heuristic. but yes heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.
  
  
  gregorygoc 2 days ago
  
  Exactly, I was going to ask about this bit…
  
- pilooch 2 days ago
  
  Is this a bit similar to what TensorRT does, but in a more open manner?
  
jafioti 2 days ago

yup! we build a search space by iteratively applying rewrite rules in every possible order (using e-graphs to do this efficiently). the rewrites alter stuff like looping / tiling structures, as well as algebraic rewrites like softmax to online softmax (and then flash attention).
yes, optimized kernels for one system will work on other systems with the same hardware. it's fine to take a long time compiling if you just compile once and run a lot.
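
A toy sketch of building a search space by applying rewrite rules in every possible order. It is deliberately naive: it stores every equivalent expression in a plain set instead of sharing structure the way an e-graph does, and the three rules, the tuple encoding, and the multiplication-count cost are assumptions made up for illustration, not Luminal's rules or cost function.

```python
def rewrites(e):
    """Yield every expression reachable from `e` by one rule application,
    at the root or inside a sub-expression. Expressions are (op, lhs, rhs)
    tuples with string leaves."""
    if not isinstance(e, tuple):
        return
    op, x, y = e
    # Rule 1: commutativity of + and *.
    yield (op, y, x)
    # Rule 2: distribute, x * (p + q) -> x*p + x*q.
    if op == "*" and isinstance(y, tuple) and y[0] == "+":
        yield ("+", ("*", x, y[1]), ("*", x, y[2]))
    # Rule 3: factor, x*p + x*q -> x * (p + q).
    if (op == "+" and isinstance(x, tuple) and isinstance(y, tuple)
            and x[0] == "*" and y[0] == "*" and x[1] == y[1]):
        yield ("*", x[1], ("+", x[2], y[2]))
    # Apply the same rules inside each child.
    for x2 in rewrites(x):
        yield (op, x2, y)
    for y2 in rewrites(y):
        yield (op, x, y2)

def search_space(start, limit=10_000):
    """Close `start` under the rewrite rules (capped at `limit` expressions)."""
    seen = {start}
    frontier = [start]
    while frontier and len(seen) < limit:
        e = frontier.pop()
        for r in rewrites(e):
            if r not in seen:
                seen.add(r)
                frontier.append(r)
    return seen

def evaluate(e, env):
    """Evaluate an expression with variable values taken from `env`."""
    if not isinstance(e, tuple):
        return env[e]
    op, x, y = e
    lhs, rhs = evaluate(x, env), evaluate(y, env)
    return lhs + rhs if op == "+" else lhs * rhs

def cost(e):
    """Stand-in cost: number of multiplications in the expression."""
    if not isinstance(e, tuple):
        return 0
    return int(e[0] == "*") + cost(e[1]) + cost(e[2])

start = ("+", ("*", "a", "b"), ("*", "a", "c"))   # a*b + a*c
space = search_space(start)
best = min(space, key=cost)
print(cost(start), "->", cost(best))  # 2 -> 1
```

Every expression in `space` computes the same value because each rule preserves equivalence, so the search only has to rank candidates, never re-verify them.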

- _0ffh 2 days ago
  
  Is/will it be possible to just write a model component with Luminal and then use that as a building block in e.g. Torch or JAX?
  
- almostgotcaught 2 days ago
  
  > take a long time compiling
  Lol np-hard is still np-hard no matter how you slice it (especially given vague objective functions).
  
  
  jafioti 2 days ago
  
  np-hard is still solvable with constraints. look at Go.
  
  
  gregorygoc 2 days ago
  
  What about it?
  

diggan 3 days ago

> Luminal can run Q8 Llama 3 8B on M-series Macbooks at 15-25 tokens per second. The goal is to become the fastest ML framework for any model on any device.

Great that some numbers are provided, but in isolation, I'm not sure what they provide. It would be helpful to also share what tok/s you'd get with llama.cpp or something else on the same hardware, so we can actually understand if it's faster or not :) Also including the prompt processing would be a bonus!


Reubend 2 days ago

Yeah those numbers look very low to me for something that's supposed to represent a state of the art optimization technique. I think that's lower than other implementations, although it depends on the MacBook.
Nonetheless this project looks very cool, and I hope they can continue improving it to the point where it indeed beats human-led optimizations.

jafioti 3 days ago

a lot of the search is still being optimized, so we don't yet match the tps of super hand-optimized kernels like llama.cpp has. i want to make a perf tracking page to see improvements over time and prevent regressions.


sakras 3 days ago

I see you guys are using Egg/Egglog! I've been mildly interested in egraphs for quite a while, glad to see they're gaining traction!


PoignardAzur 2 days ago

Right, my first thought when reading the blurb was "kinda sounds like e-graphs?"

- jafioti 2 days ago
  
  e-graphs are awesome! none of this would be possible without them.
  

aleinin 3 days ago

Cool project! How do you think about targeting hardware-specific ISAs directly? There’s an interesting paper from Citadel (https://arxiv.org/pdf/1804.06826) that highlights inefficiencies in nvcc for the Volta architecture. Do you see Luminal’s search-based paradigm eventually extending beyond outperforming handwritten kernels, towards actually competing with NVIDIA’s compiler optimizations at the PTX level?


jafioti 3 days ago

yep! currently we're emitting cuda / metal but once the search is better, i want to directly emit ptx / low-level asm on other hardwares.

- Lerc 2 days ago
  
  I don't suppose you have an eye towards verilog in the long term?
  I'm curious as to the breadth of possibilities that could be searched. I would imagine something like this could invent flash attention if it cast its net wide enough, but that is a pretty broad net. [Edit: I scrolled back and saw flash attention was explicitly mentioned, cool stuff]
  
  
  bojle 2 days ago
  
  Equality saturation (something that luminal uses at its core) is a topic for hardware synthesis and verification too. Something like dynamic hardware generation (instead of kernel generation). For example, see this thesis [1] by Samuel Coward of Imperial.
  [1] https://samuelcoward.co.uk/assets/pdf/Thesis_Imperial.pdf
  
  
  jafioti 2 days ago
  
  you suppose correctly ;)
  

AkashKarnatak 3 days ago

Very cool project. Earlier tinygrad used to have ~25 ops but now it has grown to 86, and I believe that is primarily to support hardware features like tensor cores and TMA. I don't think luminal supports tensor cores as of now; how do you think the ops will evolve as the library matures?


jafioti 3 days ago

we do support tensor cores, but the ops are only part of the search space, so there's virtually no overhead for them. the frontend and main ir is only 12 ops, and we can add hardware-specific ops in to the search space and only add in a bit of code in the codegen pass to support them.


helltone 2 days ago

I have a background in program analysis, but I'm less familiar with the kind of kernels you are optimising.

- Can you give some more insight on why 12 ops suffice for representing your input program?

- With such a small number of ops, isn't your search space full of repeat patterns? I understand the will to have no predefined heuristics, but it seems that learning some heuristics/patterns would massively help reduce the space.


jafioti 2 days ago

we're just optimizing linear algebra, which is mostly made up of patterns of simple ops. for instance, matmul is just broadcasted multiply -> sum reduce.
the search does common subexpression elimination by default. if two patterns are unioned in the search space, it applies that union to every occurrence of that pattern at the same time, so using e-graphs it helps keep the search space smaller.
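
To make the "matmul is just broadcasted multiply -> sum reduce" point concrete, here is a small pure-Python sketch. The function names `broadcast_multiply` and `sum_reduce_last` are hypothetical and are not the names of Luminal's primitive ops; nested lists stand in for tensors.

```python
def broadcast_multiply(a, b):
    """Expand A (m x k) and B (k x n) to a common (m, n, k) shape and multiply
    elementwise: p[i][j][kk] = a[i][kk] * b[kk][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[[a[i][kk] * b[kk][j] for kk in range(k)]
             for j in range(n)]
            for i in range(m)]

def sum_reduce_last(p):
    """Sum over the last axis, collapsing (m, n, k) to (m, n)."""
    return [[sum(cell) for cell in row] for row in p]

a = [[1, 2],
     [3, 4]]
b = [[5, 6],
     [7, 8]]

c = sum_reduce_last(broadcast_multiply(a, b))
print(c)  # [[19, 22], [43, 50]]
```

Written this way, the intermediate (m, n, k) tensor is fully materialized, which is exactly the kind of naive form a compiler would want to fuse and tile away.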

- helltone 2 days ago
  
  Right I think I see it.
  This is insanely cool.
  But then there are performance tradeoffs in reusing intermediates vs recomputing that I think you can't represent.
  Some of these may affect numerical stability btw. See eg https://herbie.uwplse.org/
  There is so much potential in this project.
  
  
  jafioti 2 days ago
  
  ah i see the confusion. we do common subexpression elimination of the terms in the search space (which allows single application of rewrites to apply to many repeat patterns) but the search can choose to re-use patterns of terms when we extract dags after the search space is built. so various levels of recomputation are searched.
  right now since we're profiling kernels, and we have a reference output of the unoptimised version, we can directly measure deviation of profiled outputs "for free" since we're already computing them for runtime. tbh this isn't what i want long term, i want to bake numerical stability natively into the search space to only extract dags that would produce stable outputs. hopefully that'll be solved soon.
  

warangal 2 days ago

Pretty cool project! I have also been trying to do something similar with very limited (abstract) ops akin to fundamental computer instructions. I'm just using the numpy backend for now to test the theory, but the neat thing is that most of the complexity lies in the abstract space, like deciding which memory accesses could be coalesced even before generating the final code for a specific backend! As far as I know, most DL compilers struggle to generate optimal code as models get bigger and bigger. Halide was/is a very cool project that sped up many kernels just by finding better cache/memory access patterns. If you happen to share more insights about your project through blog posts or a whitepaper, that would be really helpful.


cedws 2 days ago

Around the time DeepSeek R1 was released there was chatter about how DeepSeek had used an “undocumented” PTX instruction to squeeze as much performance as possible from their hardware. My understanding is that it wasn’t any kind of secret instruction but just a novel way that they put the instruction together.

Would Luminal be capable of rediscovering this trick?


jafioti 2 days ago

hopefully! i dont know the exact trick they used, but the idea is to design the search space such that that trick is discoverable.


ttoinou 2 days ago

Is it possible that, with all the models you’re testing, you’re going to find simple rules to optimize kernels so that we won’t need a meta optimizer in the future? We could then just code something straightforward that applies the most important optimizations. Maybe the current search always ends up with the same kind of code in the end.


jakestevens2 2 days ago

See my comment on a deeper thread about this. Eventually we will implement static profiling for common kernels so the search doesn't actually have to manually run all of them; many will have a known runtime that we can tie to them.


JonChesterfield a day ago

This might make a reasonably good correctness fuzzer for the underlying compiler. Lots of input code meant to calculate the same thing, report when a pair are found that calculate a different result.
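
A rough sketch of that differential-testing idea: run several supposedly equivalent variants on the same inputs and report any that disagree with a reference. Everything here is hypothetical (the variants, the deliberately broken one, the `find_mismatches` helper, and the tolerances); a real harness would compare compiled kernels rather than Python functions.

```python
import math

def reference(xs):
    """Reference: sum of squares, accumulated left to right."""
    return sum(x * x for x in xs)

def variant_reversed(xs):
    """Same computation with the accumulation order reversed."""
    return sum(x * x for x in reversed(xs))

def variant_pairwise(xs):
    """Same computation as a pairwise (tree) reduction."""
    vals = [x * x for x in xs]
    while len(vals) > 1:
        vals = [sum(vals[i:i + 2]) for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0.0

def variant_buggy(xs):
    """Deliberately wrong: silently drops the last element."""
    return sum(x * x for x in xs[:-1])

def find_mismatches(ref, variants, inputs, rel_tol=1e-9, abs_tol=1e-12):
    """Return the names of variants whose output deviates from `ref` on any input."""
    bad = []
    for name, fn in variants.items():
        for xs in inputs:
            if not math.isclose(fn(xs), ref(xs), rel_tol=rel_tol, abs_tol=abs_tol):
                bad.append(name)
                break
    return bad

variants = {
    "reversed": variant_reversed,
    "pairwise": variant_pairwise,
    "buggy": variant_buggy,
}
inputs = [[float(i % 7) - 3.0 for i in range(n)] for n in (1, 5, 64)]

mismatches = find_mismatches(reference, variants, inputs)
print(mismatches)  # ['buggy']
```

A tolerance is needed because reordering floating-point reductions can legitimately change the last few bits, so exact equality would flag correct variants.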


fancyfredbot 2 days ago

This is a good idea. Do you use a cost model for the search or are you actually executing kernels? What kind of heuristics do you use to avoid the search space becoming intractable?


jafioti 2 days ago

we're working on techniques like mcts and RL (e.g. AlphaGo) to manage the search space, but you'd be surprised how far you can get if you carefully design the search space to prevent explosions.

matthewjgunton 2 days ago

our cost function right now is just the latency of the kernel. we execute on the hardware, as it is really the only accurate way to see how fast the kernel will run


efnx 3 days ago

Cool! How is this project different from the tuning process in TVM?


jafioti 3 days ago

basically autotuning on steroids. instead of searching single dimensions of optimization (tile sizing, etc.) we search through full algebraic rewrites (like rewriting softmax to online softmax) and various loop / tiling structures in the same unified search space.
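
For readers unfamiliar with the softmax to online softmax rewrite, here is a small pure-Python sketch of the two forms. It only illustrates the algebraic equivalence; the function names are made up here and nothing below is taken from Luminal's rewrite rules.

```python
import math

def softmax_naive(xs):
    """Three passes: find the max, sum the shifted exponentials, normalize."""
    m = max(xs)
    d = sum(math.exp(x - m) for x in xs)
    return [math.exp(x - m) / d for x in xs]

def softmax_online(xs):
    """Two passes: the running max and running denominator are updated together,
    rescaling the denominator whenever a new max is seen."""
    m = float("-inf")   # running max
    d = 0.0             # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

xs = [0.5, -1.0, 3.0, 2.0]
naive = softmax_naive(xs)
online = softmax_online(xs)
print(max(abs(p - q) for p, q in zip(naive, online)))  # tiny, rounding-level difference
```

Fusing the max and sum into one streaming pass is the step that makes blockwise attention kernels possible, which is why this rewrite sits on the path to Flash Attention.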


GregarianChild 2 days ago

How is this different from superoptimisation?

Also, how do you ensure that newly generated kernels are correct w.r.t. the original naive kernel that you use as specification?


jafioti 2 days ago

very similar to superoptimisation, but most superoptimisers try to tackle turing-complete code. by just doing a very limited space of computation (linear algebra with 12 primitive ops) the search remains tractable.
the search space is designed to remain logically equivalent at all times, by virtue of how it's built (applying rewrite rules we know don't break logical equivalence).

- GregarianChild 2 days ago
  
  If the search space never leaves the programs that are equivalent to the original specification, that will probably limit the optimisations you can discover. (E.g. if you start out with standard matmul, you will not discover Strassen's algorithm.) This is not a criticism, I'm just trying to understand your algorithm.
  
  
  jafioti 2 days ago
  
  could be...im not opposed to looking into this to see if there's no possible trajectory from naive to strassen's without leaving logical equivalency.
  all the optimizations for matmul so far have been straightforward trajectories from naive (tiling, smem caching, tensor core offload, etc.)
  
  
  GregarianChild 2 days ago
  
  There is an old CACM post that explains how to use a bit of randomness to avoid only doing semantics preserving program changes.
  https://cacm.acm.org/research/stochastic-program-optimizatio...
  

latchkey 2 days ago

Neat. I love supporting things like this. If you'd like some free compute time on MI300x, reach out.


giacomoforte 2 days ago

Hey, I have been following your project for a while, because I'm kinda interested in program synthesis. Anyway, my question is: how scalable is the search process itself? Is it a good fit for GPU clusters? I guess benchmarking candidate kernels takes much longer than generating them, or not?


jafioti 2 days ago

yep, parallelized profiling across many devices is definitely something i want to add.


dvdplm 3 days ago

This is very cool. Do you have any advice on papers to read to understand the details of search based compilation a bit more?


jafioti 3 days ago

a lot of the ideas luminal is built on are here: https://arxiv.org/abs/2304.04332


UncleOxidant 3 days ago

When you say (in the video) that you can target more exotic hardware, what about things like FPGA accelerators (maybe taking advantage of TVM's FPGA backend)?

Also, what about CUDA alternatives like ROCm?


matthewjgunton 3 days ago

Yup. We are totally hardware agnostic

- matthewjgunton 3 days ago
  
  i should add this also applies to the language too. we currently support Metal (Apple's language) and CUDA, with extensions planned for others
  