Comment by janalsncm
The null hypothesis is more compute or bigger network = better results. Conv operations make sense on images because the data is naturally 2 dimensional, so applying an operation across a sliding window makes sense.
Skimming the paper, I don’t see them testing against e.g. a normal decoder with an extra layer or something.
I don’t see the same logic applying on an embedding, where the individual indexes matter. Adjacent indexes in an embedding have no relationship, unlike adjacent pixels in an image.
Convolutions are used in many non-image applications, including language (eg dilated convolutions have been popular for some time) and 1D cases. The paper I linked references the hyena operator, which is literally a convolution replacement for attention (though it’s often used in hybrid architectures like the one I linked).