janalsncm 2 days ago

The null hypothesis is that more compute or a bigger network = better results. Conv operations make sense on images because the data is naturally 2-dimensional, so applying an operation across a sliding window makes sense.
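
A minimal sketch of what I mean by the sliding window (PyTorch, purely illustrative, not code from the paper):

    import torch
    import torch.nn as nn

    image = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
    # One shared 3x3 kernel slides over every spatial position, so nearby
    # pixels are always combined by the same local filter.
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
    features = conv(image)
    print(features.shape)  # torch.Size([1, 16, 32, 32])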

Skimming the paper, I don’t see them testing against e.g. a normal decoder with an extra layer or something.

I don’t see the same logic applying to an embedding, where the individual indexes matter on their own, not their neighborhoods. Adjacent indexes in an embedding have no relationship, unlike adjacent pixels in an image.
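
A quick way to see that, under the usual dot-product-similarity assumption (a sketch, not from the paper): permute the embedding dimensions consistently across all vectors and nothing changes, whereas shuffling an image's pixels destroys it.

    import torch

    d = 64
    a, b = torch.randn(d), torch.randn(d)
    perm = torch.randperm(d)  # arbitrary reordering of the embedding dimensions
    # Dot-product similarity is invariant to a consistent permutation,
    # i.e. adjacency of indexes carries no information.
    assert torch.allclose(a @ b, a[perm] @ b[perm])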

pizza 2 days ago

They do have a weak relationship, in that earlier-index tokens were encountered earlier during the formation of the vocabulary, so they are similar in typicality.

  • janalsncm 2 days ago

    No, if you check the diagram (page 2), these are literally indexes into the KV vectors, not positional indexes in the text. If it were the text, I would agree with you.

    • pizza 18 hours ago

      Oh, I thought you were talking about the unorderedness of embedding indices in a general context, and I was responding with the specific case of vocab embedding indices having a correlation - my apologies

jwilber 2 days ago

Convolutions are used in many non-image applications, including language (e.g. dilated convolutions have been popular for some time) and 1D cases. The paper I linked references the Hyena operator, which is literally a convolution-based replacement for attention (though it’s often used in hybrid architectures like the one I linked).
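
For a concrete picture, here's roughly what a dilated 1D convolution over a token sequence looks like (a PyTorch sketch with made-up shapes, in the spirit of WaveNet-style language convolutions, not the Hyena operator itself):

    import torch
    import torch.nn as nn

    seq = torch.randn(1, 256, 128)  # (batch, embed_dim, seq_len)
    # dilation=4 spaces the 3 kernel taps apart, giving an effective
    # receptive field of 1 + (3 - 1) * 4 = 9 positions per output step.
    conv = nn.Conv1d(256, 256, kernel_size=3, dilation=4, padding=4)
    out = conv(seq)
    print(out.shape)  # torch.Size([1, 256, 128]) -- sequence length preserved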