Comment by ncruces
Did you look at Algorithm 1? The setup code is a pair of splats. There's no consideration for alignment; it works with unaligned data just fine. So what do you mean? What threshold would you expect for it to be worth it (assuming it's useful at all)?
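To make "a pair of splats" and "works with unaligned data" concrete, here is a rough, portable sketch. It is not the code under discussion: it assumes a substring search that broadcasts the first and last needle bytes, and `vec16`, `splat`, `loadu`, `match_mask` and `find_sketch` are hypothetical names that stand in for real SIMD intrinsics.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit SIMD register (16 byte lanes). */
enum { LANES = 16 };
typedef struct { uint8_t b[LANES]; } vec16;

/* Splat: broadcast one byte to every lane. */
static vec16 splat(uint8_t c) { vec16 v; memset(v.b, c, LANES); return v; }

/* Unaligned load: memcpy places no alignment requirement on p. */
static vec16 loadu(const void *p) { vec16 v; memcpy(v.b, p, LANES); return v; }

/* Bitmask of lanes where a matches x AND b matches y. */
static unsigned match_mask(vec16 a, vec16 x, vec16 b, vec16 y) {
    unsigned m = 0;
    for (int i = 0; i < LANES; i++)
        m |= (unsigned)((a.b[i] == x.b[i]) & (b.b[i] == y.b[i])) << i;
    return m;
}

/* Assumed search shape: returns the first occurrence of needle, or NULL. */
const char *find_sketch(const char *hay, size_t n, const char *needle, size_t k) {
    if (k == 0) return hay;
    if (n < k) return NULL;

    /* The whole setup: two splats, nothing depends on alignment. */
    const vec16 first = splat((uint8_t)needle[0]);
    const vec16 last  = splat((uint8_t)needle[k - 1]);

    size_t i = 0;
    /* Block loop: both loads must stay inside the haystack. */
    for (; i + k - 1 + LANES <= n; i += LANES) {
        unsigned m = match_mask(loadu(hay + i), first,
                                loadu(hay + i + k - 1), last);
        while (m) {
            unsigned lane = 0;
            while (!((m >> lane) & 1u)) lane++;   /* lowest candidate lane */
            if (memcmp(hay + i + lane, needle, k) == 0)
                return hay + i + lane;
            m &= m - 1;                           /* clear that candidate */
        }
    }
    /* Scalar tail for the positions the block loop could not cover. */
    for (; i + k <= n; i++)
        if (memcmp(hay + i, needle, k) == 0) return hay + i;
    return NULL;
}
```

With this shape, haystacks shorter than `k + 15` bytes never enter the block loop at all, which is why the threshold question only concerns the cost of the two splats.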
Dude, I cannot be more explicit than that.
That said, when you work on such code, use assembly; the C code is only for reference (have a look at the dav1d AV1 decoder).