Comment by einpoklum
> It's a huge pain, and very sensitive to code changes.
Optimization is very often like that. Making things generic, uniform and simple typically carries a performance penalty, and you're using a GPU precisely because you care about performance.