Comment by janalsncm
The null hypothesis is that more compute or a bigger network = better results. Conv operations make sense on images because the data is naturally two-dimensional, so applying an operation across a sliding window makes sense.
Skimming the paper, I don’t see them testing against a baseline like a normal decoder with an extra layer.
I don’t see the same logic applying to an embedding, where the individual indexes are arbitrary. Adjacent indexes in an embedding have no inherent relationship, unlike adjacent pixels in an image.
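To make the point concrete, here’s a toy numpy sketch (mine, not from the paper): a sliding-window convolution bakes in the assumption that index order matters, so shuffling the indexes changes its output, whereas an order-agnostic operation like a sum doesn’t care.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal(16)            # a toy 16-dim vector (hypothetical)
kernel = np.array([0.25, 0.5, 0.25])     # small smoothing kernel

def conv1d(x, k):
    # valid-mode sliding window across adjacent indexes
    return np.convolve(x, k, mode="valid")

perm = rng.permutation(len(emb))
out_orig = conv1d(emb, kernel)
out_perm = conv1d(emb[perm], kernel)

# An order-agnostic op is unchanged by reordering the indexes...
assert np.isclose(emb.sum(), emb[perm].sum())
# ...but the conv feature map is not even a reordering of the original,
# because "adjacent" meant nothing in the permuted vector:
assert not np.allclose(np.sort(out_orig), np.sort(out_perm))
```

On pixels, that adjacency assumption is exactly what you want; on arbitrary index positions it’s just structure the data doesn’t have.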
They do have a weak relationship, in that earlier-index tokens were encountered earlier during the formation of the vocabulary, so they are similar in typicality.