Comment by montebicyclelo 14 hours ago
> The results we obtained in Section 7 imply that, at least on simple datasets like TinyStories and Wikipedia, LLM predictions contain much quantifiable structure insofar that they often can be described in terms of our simple statistical rules
> we find that for 79% and 68% of LLM next-token distributions on TinyStories and Wikipedia, respectively, their top-1 predictions agree with those provided by our N-gram rulesets
Two prediction methods may have completely different mechanisms but still agree some of the time, because they are both predicting the same thing.
It seems a fairly large proportion of language can be predicted by a simpler model. But the remaining fraction is the difficult part: the part that simple `n-gram` models are bad at, and transformers are really good at.
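To make the "top-1 agreement" idea concrete, here's a minimal sketch with two toy predictors (a unigram and a bigram counter) standing in for the paper's N-gram rulesets and the LLM; the corpus, function names, and models are all illustrative, not the paper's actual setup:

```python
from collections import Counter, defaultdict

# Toy token stream standing in for TinyStories/Wikipedia.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Model A: unigram -- always predicts the globally most frequent token.
unigram = Counter(corpus)
def unigram_top1(context):
    return unigram.most_common(1)[0][0]

# Model B: bigram -- most frequent token following the last context token,
# falling back to the unigram guess for unseen contexts.
bigram = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev][nxt] += 1
def bigram_top1(context):
    counts = bigram.get(context[-1])
    return counts.most_common(1)[0][0] if counts else unigram_top1(context)

# Agreement rate: how often two mechanically different predictors
# produce the same top-1 guess, simply because both track the same data.
contexts = [corpus[:i] for i in range(1, len(corpus))]
agree = sum(unigram_top1(c) == bigram_top1(c) for c in contexts)
print(f"top-1 agreement: {agree}/{len(contexts)} = {agree / len(contexts):.0%}")
```

Even these two trivially different mechanisms agree on many positions, which is the caveat above: agreement rates alone don't tell you the mechanisms are the same.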
I've always thought that LLMs are still just statistical machines, and that their output is reminiscent of the superpermutation problem, though not an exact match.
I just like to think of it as a high-dimensional view of the relationships between various words: the output is the result of continuing the path taken through that high-dimensional space, where each point's probability of being selected changes with each token in the sequence.
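A minimal numeric sketch of that picture, with random vectors standing in for learned embeddings and the "state" crudely taken as the mean of the visited points (a real transformer computes a far richer function of the sequence):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]
dim = 8

# Each token is a point in a high-dimensional space (random here;
# a real model learns these embeddings).
embeddings = rng.normal(size=(len(vocab), dim))

def next_token_distribution(path_ids):
    """Probability of each token given the path taken so far."""
    state = embeddings[path_ids].mean(axis=0)   # crude summary of the path
    logits = embeddings @ state                 # similarity of each token to the state
    probs = np.exp(logits - logits.max())       # softmax
    return probs / probs.sum()

# Continue the path: each new token shifts the state, which reshapes the
# distribution used to pick the next point -- the "changing probabilities"
# described above.
path = [vocab.index("the")]
for _ in range(5):
    probs = next_token_distribution(path)
    path.append(int(rng.choice(len(vocab), p=probs)))
print(" ".join(vocab[i] for i in path))
```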
Unfortunately, as far as I can understand it, there's no real thought or logic going on there in the simplest cases. Though for more complex models or different architectures, anything that fundamentally changes how the model explores a path through that space could, I suppose, be implementing something like thought or logic.
It's why they need to outsource mathematics for the most part.