Comment by measurablefunc
Comment by measurablefunc 3 hours ago
There are many operations more complicated than comparing every token to every other token, & the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, & so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.
While you are correct at a higher level, comparing bigrams or trigrams would take less compute, not more, because there are fewer of them in a given text.
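To put rough numbers on that, here is a minimal sketch, assuming "bigrams" and "trigrams" mean contiguous n-grams and that every unit is compared against every other unit, as in unmasked attention. The helper names `ngrams` and `pairwise_comparisons` are made up for illustration and do not come from any library.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pairwise_comparisons(items):
    """Count ordered all-to-all comparisons (assumption: full, unmasked attention)."""
    return len(items) ** 2


tokens = "the quick brown fox jumps over the lazy dog".split()

# n = 1 is plain tokens, n = 2 bigrams, n = 3 trigrams.
counts = {n: pairwise_comparisons(ngrams(tokens, n)) for n in (1, 2, 3)}
print(counts)  # {1: 81, 2: 64, 3: 49}
```

A text of n tokens has n - k + 1 contiguous k-grams, so the count does shrink, but only slightly, and the comparison stays quadratic in the length of the text. This only covers contiguous n-grams; it says nothing about comparing arbitrary groups of tokens.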