Comment by noosphr
Yes, and it works in theory.
Less so in practice. You saturate the memory of a B200 with a few dozen tokens once the attention order goes above 4. Training is even worse.
To paraphrase Knuth: high-order polynomials are much more unimaginably large than mere infinity.
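A rough back-of-the-envelope sketch of why this blows up, not a measurement. It assumes an order-k attention materializes a dense score tensor with n^k entries per head for n tokens (ordinary attention being k = 2), fp16 scores, and a 192 GB memory budget standing in for a B200. The function names, the head count, and the budget are all illustrative assumptions; real implementations also need room for weights, activations, and (for training) gradients, so the usable token count would be lower still.

```python
# Assumption: an order-k attention stores one score per k-tuple of tokens,
# so the dense score tensor has n_tokens ** order entries per head.
def attention_score_bytes(n_tokens, order, bytes_per_elem=2, heads=1):
    """Bytes needed for the dense score tensor alone (fp16 by default)."""
    return heads * bytes_per_elem * n_tokens ** order


def max_tokens(budget_bytes, order, bytes_per_elem=2, heads=1):
    """Largest n whose score tensor still fits in budget_bytes."""
    n = 0
    while attention_score_bytes(n + 1, order, bytes_per_elem, heads) <= budget_bytes:
        n += 1
    return n


if __name__ == "__main__":
    budget = 192 * 10**9  # assumed memory budget in bytes, illustrative only
    for order in range(2, 8):
        # heads=32 is a hypothetical head count chosen for illustration
        print(order, max_tokens(budget, order, heads=32))
```

Each extra order multiplies the tensor size by another factor of n, so the sequence length that fits shrinks roughly like the k-th root of the budget.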