Comment by measurablefunc

Comment by measurablefunc 3 hours ago

There are lots more complicated operations than comparing every token to every other token & the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, & so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.

vlovich123 3 hours ago

While you are correct at a higher level, comparing bigrams/trigrams would be less compute, not more, because there are fewer of them in a given text.


measurablefunc 3 hours ago

I'm correct on the technical level as well: https://chatgpt.com/s/t_698293481e308191838b4131c1b605f1

- refulgentis 2 hours ago
  
  That math is for comparing all n-grams for all n <= N simultaneously, which isn't what was being discussed.
  For any fixed n-gram size, the complexity is still O(N^2), same as standard attention.
  
  
  measurablefunc an hour ago
  
  I was talking about all n-gram comparisons.
  
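For anyone who wants to check the counts in the exchange above, here is a minimal sketch. It assumes contiguous n-grams and that every n-gram is compared against every n-gram of the same size (itself included); the function names are made up for illustration and are not from any library.

```python
def ngram_count(seq_len: int, n: int) -> int:
    """Number of contiguous n-grams in a sequence of seq_len tokens."""
    return max(seq_len - n + 1, 0)


def pairwise_comparisons(seq_len: int, n: int) -> int:
    """All-pairs comparisons between n-grams of one fixed size n."""
    return ngram_count(seq_len, n) ** 2


def all_sizes_comparisons(seq_len: int) -> int:
    """All-pairs comparisons summed over every n-gram size from 1 to seq_len."""
    return sum(pairwise_comparisons(seq_len, n) for n in range(1, seq_len + 1))


if __name__ == "__main__":
    for seq_len in (8, 64, 512):
        print(
            seq_len,
            pairwise_comparisons(seq_len, 1),  # token vs. token
            pairwise_comparisons(seq_len, 2),  # bigram vs. bigram
            all_sizes_comparisons(seq_len),    # every n at once
        )
```

Under these assumptions a fixed n gives (N - n + 1)^2 comparisons, which is quadratic in the sequence length N and only slightly smaller than the token-pair count. Summing over every n from 1 to N gives N(N + 1)(2N + 1)/6, which grows cubically.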

refulgentis 3 hours ago

That skips an important part: the "deep" in "deep learning".

Attention already composes across layers.

After layer 1, you're not comparing raw tokens anymore. You're comparing tokens-informed-by-their-context. By layer 20, you're effectively comparing rich representations that encode phrases, relationships, and abstract patterns. The "higher-order" stuff emerges from depth. This is the whole point of deep networks, and attention.
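A toy sketch of that point, in pure Python. It assumes a single attention head with identity query/key/value projections plus a residual connection, and it leaves out learned weights, layer norm, and the MLP block, so it is an illustration rather than a real transformer layer. `attention_layer` and `stack` are hypothetical helper names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(tokens):
    """Toy self-attention: identity Q/K/V projections plus a residual connection."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all token vectors: the "context" for this position.
        mixed = [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        out.append([x + c for x, c in zip(q, mixed)])
    return out


def stack(tokens, depth):
    """Apply the toy attention layer `depth` times."""
    for _ in range(depth):
        tokens = attention_layer(tokens)
    return tokens


if __name__ == "__main__":
    # Two sequences that differ only in their first token.
    seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    seq_b = [[0.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
    print(stack(seq_a, 1)[2])
    print(stack(seq_b, 1)[2])
```

The last token is identical in both inputs, but its representation after one layer differs because it has absorbed its context. A second layer therefore compares those context-informed vectors, not the raw tokens.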

TL;DR for rest of comment: people have tried shallow-and-wide instead of deep, it doesn't work in practice. (rest of comment fleshes out search/ChatGPT prompt terms to look into to understand more of the technical stuff here)

A shallow network can approximate any continuous function to arbitrary accuracy (universal approximation theorem), but it may need exponentially more neurons. Deep networks can represent the same functions with far fewer parameters. There's formal work on "depth separation": functions that deep nets compute efficiently but that shallow nets need exponential width to match.
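One concrete case is the sawtooth construction that shows up in the depth-separation literature, sketched here under simplifying assumptions: scalar input on [0, 1], ReLU units, and hand-picked weights with no training. Composing a two-ReLU "tent" map `depth` times uses about 2 * depth units and produces 2^depth linear pieces. The helper names are hypothetical.

```python
def relu(x: float) -> float:
    return max(x, 0.0)


def tent(x: float) -> float:
    """Tent map on [0, 1] built from two ReLU units: rises to 1 at x = 0.5, then falls to 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x: float, depth: int) -> float:
    """Compose the tent map `depth` times (one two-unit layer per composition)."""
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth: int, grid: int = 2 ** 12) -> int:
    """Count linear pieces on [0, 1] by counting slope changes on a dyadic grid.

    The breakpoints sit at multiples of 1 / 2**depth, so this is exact
    as long as 2**depth <= grid.
    """
    ys = [deep_sawtooth(i / grid, depth) for i in range(grid + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(grid)]
    pieces = 1
    for a, b in zip(slopes, slopes[1:]):
        if abs(a - b) > 1e-9:
            pieces += 1
    return pieces


if __name__ == "__main__":
    for depth in (1, 3, 5, 8):
        print(depth, 2 * depth, count_linear_pieces(depth))
```

A one-hidden-layer ReLU network on a scalar input with m units can have at most m + 1 linear pieces, so matching the depth-k sawtooth exactly takes at least 2^k - 1 hidden units, versus roughly 2k units in the deep version.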

Empirically, people have tried shallow-and-wide vs. deep-and-narrow many times, across many domains. Deep wins consistently for the same parameter budget. This is part of why "deep learning" took off: the depth is load-bearing.

For transformers specifically, stacking attention layers is crucial. A single attention layer, even with more heads or bigger dimensions, doesn't match what you get from depth. The representations genuinely get richer in ways that width alone can't replicate.
