Comment by jsenn 2 days ago

This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every previous position, so adjacency in later layers doesn’t necessarily correspond to adjacency in the input tokens.
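A minimal sketch of that point (generic causal self-attention, not any particular model; the identity Q/K/V projections are a simplification I'm assuming for brevity):

```python
import numpy as np

def causal_self_attention(x):
    """x: (seq_len, d) embeddings. Each output row is a weighted
    average over positions 0..i, so after one layer "position i"
    is no longer just input token i."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (seq_len, seq_len) similarities
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -np.inf)      # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over visible positions
    return weights @ x                            # convex combination of earlier rows

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = causal_self_attention(x)
# out[0] equals x[0] (position 0 can only attend to itself), but every
# later row blends all earlier tokens, not just the token at that index.
```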

A borderline tautological answer might be: “because the network learns that putting related things next to each other makes the convolutions more useful.”