Comment by kouteiheika 2 days ago
> But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip.
This is a self-inflicted wound, since flash attention insists on building a native C++ extension, which is completely unnecessary in this case.
What you can do is the following:
1) Compile your CUDA kernels offline.
2) Include those compiled kernels in a package you push to PyPI.
3) Call into the kernels with pure Python, without going through a C++ extension.
I do this for the CUDA kernels I maintain and it works great.
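To make step 3 concrete, here is a minimal sketch of one way it could look, using nothing but ctypes and the CUDA driver API. This is an illustration, not necessarily the exact setup described above. Assumptions: the helper names (`load_cuda_driver`, `pack_kernel_args`, `launch`) are made up for this example, the kernel image is a cubin or PTX blob read from a file shipped inside the wheel, and a CUDA context is already current on the calling thread (for example because PyTorch has already initialized CUDA).

```python
import ctypes


def load_cuda_driver():
    """Return a ctypes handle to the CUDA driver library, or None if it is absent."""
    for name in ("libcuda.so.1", "nvcuda.dll"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None


def check(result, what):
    """Raise if a driver call did not return CUDA_SUCCESS (which is 0)."""
    if result != 0:
        raise RuntimeError(f"{what} failed with CUresult {result}")


def pack_kernel_args(*args):
    """Build the void** array cuLaunchKernel expects: one pointer per argument.

    Each argument must be a ctypes object (c_void_p for device pointers,
    c_int / c_float / ... for scalars). The returned list keeps them alive.
    """
    holders = list(args)
    params = (ctypes.c_void_p * len(holders))(
        *[ctypes.addressof(a) for a in holders]
    )
    return params, holders


def launch(driver, image, kernel_name, grid, block, args, stream=None):
    """Load a precompiled kernel image (bytes) and launch one kernel from it.

    Hypothetical helper: a real package would load the module once and cache it.
    Assumes a CUDA context is already current on this thread.
    """
    module = ctypes.c_void_p()
    check(driver.cuModuleLoadData(ctypes.byref(module), image), "cuModuleLoadData")

    function = ctypes.c_void_p()
    check(
        driver.cuModuleGetFunction(ctypes.byref(function), module, kernel_name.encode()),
        "cuModuleGetFunction",
    )

    params, _keepalive = pack_kernel_args(*args)
    check(
        driver.cuLaunchKernel(
            function,
            grid[0], grid[1], grid[2],
            block[0], block[1], block[2],
            0,                        # dynamic shared memory in bytes
            ctypes.c_void_p(stream),  # None means the default stream
            params,
            None,                     # no "extra" launch options
        ),
        "cuLaunchKernel",
    )
```

With PyTorch, the device pointers would be `ctypes.c_void_p(tensor.data_ptr())` and the stream would be `torch.cuda.current_stream().cuda_stream`. Nothing here links against libtorch or depends on the C++ ABI, which is why a single package could cover every Python and PyTorch combination.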
Flash attention currently publishes 48 (!) different packages[1], for different combinations of pytorch and C++ ABI. With this approach it would only have to publish one, and it would work for every combination of Python and pytorch.
[1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...
While shipping binary kernels may be a workaround for some users, it goes against what many people would consider "good etiquette" for various valid reasons, such as hackability, security, or providing free (as in liberty) software.