Comment by benreesman 2 days ago
It's really difficult to do Python projects in a sound, reproducible, reasonably portable way. uv sync will, in general, only build you a package set that it can promise to build again.
But it can't in general build torch-tensorrt or flash-attn, because it has no way of knowing whether Mercury was in retrograde when you ran pip. They are trying to thread a delicate and economically pivotal needle: the Python community prizes privatizing the profits and socializing the costs of "works on my box".
The cost of making the software deployable, secure, repeatable, and reliable didn't go away! It just became someone else's problem, at a later time, in a different place, with far fewer degrees of freedom.
Doing this in a way that satisfies serious operations people without alienating the "works on my box...sometimes" crowd is The Lord's Work.
> But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip.
This is a self-inflicted wound, since flash attention insists on building a native C++ extension, which is completely unnecessary in this case.
What you can do is the following:
1) Compile your CUDA kernels offline.
2) Include those compiled kernels in a package you push to PyPI.
3) Call into the kernels with pure Python, without going through a C++ extension.
I do this for the CUDA kernels I maintain and it works great.
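A minimal sketch of that pattern, using CuPy's RawModule to load a cubin that was compiled offline (the kernel name, file name, and launch shape here are made up for illustration):

    import cupy as cp
    import numpy as np

    # Kernel compiled offline ahead of time, e.g.:
    #   nvcc -cubin -arch=sm_80 -o vector_add.cubin vector_add.cu
    # where vector_add.cu declares: extern "C" __global__ void vector_add(...)
    mod = cp.RawModule(path="vector_add.cubin")
    vector_add = mod.get_function("vector_add")

    n = 1 << 20
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)
    out = cp.empty_like(x)

    # Launch: one thread per element, no C++ Python extension involved.
    threads = 256
    blocks = (n + threads - 1) // threads
    vector_add((blocks,), (threads,), (x, y, out, np.int32(n)))

The same ship-a-prebuilt-kernel idea works with other pure-Python launch paths too; the point is that the compiled artifact depends on the GPU architecture and CUDA version, not on the Python or PyTorch ABI.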
Flash attention currently publishes 48 (!) different packages[1] for different combinations of pytorch and C++ ABI. With this approach it would only have to publish one, and it would work for every combination of Python and pytorch.
[1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...