ec109685 4 days ago

Is this the infinite monkey Shakespeare trope?

  • throwup238 3 days ago

    More like the universal approximation theorem extended to computation rather than network complexity: https://en.wikipedia.org/wiki/Universal_approximation_theore...

    • immibis 3 days ago

      The universal approximation theorem is good to know because it says there's no theoretical upper bound to a function-approximating NN's accuracy. In practice it says nothing about what can be realistically achieved, though.
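
      For reference, the one-hidden-layer form of the statement (a sketch; sigma is any fixed non-polynomial activation): for every continuous f on a compact set K and every eps > 0 there exist N and weights a_i, w_i, b_i with

          \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} a_i \, \sigma(w_i^\top x + b_i) \Big| < \varepsilon

      The catch is exactly the practical one: the theorem says nothing about how large N must be or how to find the weights.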

  • nopinsight 3 days ago

    A key difference is that the way LMMs (Large Multimodal Models) generate output is far from random. These models can imitate or blend existing information, or imitate (and probably blend) known reasoning methods found in the training data. The latter is a key distinguishing feature of the new OpenAI o1 models.

    Thus, the signal-to-noise ratio of their output is generally way better than infinite monkeys.

    Arguably, humans rely on similar modes of "thinking" most of the time as well.

  • CamperBob2 3 days ago

    Yeah. Monkeys. Monkeys that write useful C and Python code that needs a bit less revision every time there's a model update.

    Can we just give the "stochastic parrot" and "monkeys with typewriters" schtick a rest? It made for novel commentary three or four years ago, but at this point, these posts themselves read like the work of parrots. They are no longer interesting, insightful, or (for that matter) true.

    • visarga 3 days ago

      If you think about it, humans necessarily use abstractions, from the edge detectors in the retina to concepts like democracy. But do we really understand? All abstractions leak, and nobody knows the whole stack. For all the poorly grasped abstractions we are using, we are also just parroting. How often do we do things because "that is how they are done", never wondering why?

      Take ML itself, people are saying it's little more than alchemy (stir the pile). Are we just parroting approaches that have worked in practice without real understanding? Is it possible to have centralized understanding, even in principle, or is all understanding distributed among us? My conclusion is that we have a patchwork of partial understanding, stitched together functionally by abstractions. When I go to the doctor, I don't study medicine first, I trust the doctor. Trust takes the place of genuine understanding.

      So humans, like AI, use distributed and functional understanding, we don't have genuine understanding as meant by philosophers like Searle in the Chinese Room. No single neuron in the brain understands anything, but together they do. Similarly, no single human understands genuinely, but society together manages to function. There is no homunculus, no centralized understander anywhere. We humans are also stochastic parrots of abstractions we don't really grok to the full extent.

      • throwaway290 3 days ago

        > My conclusion

        Are you saying you understood something? Was it genuine? Do you think LLM feels the same thing?

      • kaechle 3 days ago

        Great points. We're pattern-matching shortcut machines, without a doubt. In most contexts, not even good ones.

        > When I go to the doctor, I don't study medicine first, I trust the doctor. Trust takes the place of genuine understanding.

        The ultimate abstraction! Trust is highly irrational by definition. But we do it all day every day, lest we be classified as psychologically unfit for society. Which is to say, mental health is predicated on a not-insignificant amount of rationalizations and self-deceptions. Hallucinations, even.

    • kaechle 3 days ago

      Every time I read "stochastic parrot," my always-deterministic human brain surfaces this quote:

      > “Most people are other people. Their thoughts are someone else's opinions, their lives a mimicry, their passions a quotation.”

      - Oscar Wilde, a great ape with a pen

      • OKRainbowKid 3 days ago

        Reading this quote makes me wonder why I should believe that I am somehow special or different, and not just another "other".

        • HeatrayEnjoyer 3 days ago

          That's just it. We're not unique. We've always been animals running on instinct in reaction to our environment. Our instincts are more complex than other animals but they are not special and they are replicable.

    • ffsm8 3 days ago

      > novel commentary three or four years ago,

      ChatGPT was released in November 2022. That's one year and 10 months ago. Their marketing started in the summer of the same year, still far off from 3-4 years.

      • Banou 3 days ago

        But ChatGPT wasn't the first; OpenAI had a coding playground with GPT-2, and you could already code with it even before that, around 2020, so I'd say it has been 3-4 years

      • killerstorm 3 days ago

        GPT-3 paper announcement got 200 comments on HN back in 2020.

        It doesn't matter when marketing started, people were already discussing it in 2019-2020.

        Stochastic parrot: The term was coined by Emily M. Bender in the 2021 artificial intelligence research paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell.

      • [removed] 3 days ago
        [deleted]
    • hegFdH 3 days ago

      The infinite monkey post was in response to this claim, which, like the universal approximation theorem, is useless in practice:

      "We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient."

      Like an LLM, you omit the context and browbeat people with the "truth" you want to propagate. Together with the many politically forbidden terms since 2020, let us now also ban "stochastic parrot" in order to have a goodbellyfeel newspeak.

      • chaosist 3 days ago

        There is also a problem of "stochastic parrot" being constantly used in a pejorative sense as opposed to a neutral term to keep grounded and skeptical.

        Of course, it is an overly broad stroke that doesn't quite capture all the nuance of the model but the alternative of "come on guys, just admit the model is thinking" is much worse and has much less to do with reality.

    • 93po 3 days ago

      AI news article comments bingo card:

      * Tired ClosedAI joke

      * Claiming it's a predictive text engine that isn't useful for anything

      * Safety regulations are either good or bad, depending on who's proposing them

      * Fear mongering about climate impact

      * Bringing up Elon for no reason

      * AI will never be able to [some pretty achievable task]

      * Tired arguments from pro-IP / copyright sympathizers

      • kmeisthax 3 days ago

        > Tired ClosedAI joke

        > Tired arguments from pro-IP / copyright sympathizers

        You forgot "Tired ClosedAI joke from anti-IP / copyleft sympathizers".

        Remember that the training data debate is orthogonal to the broader debate over copyright ownership and scope. The first people to start complaining about stolen training data were the Free Software people, who wanted a legal hook to compel OpenAI and GitHub to publish model weights sourced from GPL code. Freelance artists took that complaint and ran with it. And while this is technically an argument that rests on copyright for legitimacy, the people who actually own most of the copyrights - publishers - are strangely interested in these machines that steal vast amounts of their work.

      • larodi 3 days ago

        Interestingly, there's one missing that seems well warranted, unless everyone here is a super-smart, math-professor-level genius:

        These papers become increasingly difficult to properly comprehend.

        …and thus perhaps the plethora of arguably nonsensical follow ups.

        • CamperBob2 3 days ago

          > These papers become increasingly difficult to properly comprehend.

          Feed it to ChatGPT and ask for an explanation suited to your current level of understanding (5-year-old, high-school, undergrad, comp-sci grad student, and so on.)

          No, really. Try it.

      • aurareturn 3 days ago

        >* Claiming it's a predictive text engine that isn't useful for anything

        This one is very common on HN and it's baffling. Even if it is a predictive text engine, who the hell cares, as long as it achieves its goals? If an LLM were actually a bunch of dolphins typing on a keyboard made for dolphins, I couldn't care less, as long as it does what I need it to do. For people who continue to repeat this on HN: why? I just want to know, out of curiosity.

        >* AI will never be able to [some pretty achievable task]

        Also very common on HN.

        You forgot the "AI will never be able to do what a human can do in the exact way a human does it so AI will never achieve x".

tsimionescu 3 days ago

One question, if anyone knows the details: does this prove that there exists a single LLM that can approximate any function to arbitrary precision given enough CoT, or does it prove that for every function, there exists a Transformer that fits those criteria?

That is, does this prove that a single LLM can solve any problem, or that for any problem, we can find an LLM that solves it?
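
In symbols, the two readings are (M ranging over transformers, f over target functions):

    \exists M \; \forall f : \; M \text{ solves } f    (one universal model)

    \forall f \; \exists M : \; M \text{ solves } f    (a model per problem)

The first is strictly stronger; universal-approximation-style results usually only give the second.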

  • jstanley 3 days ago

    Doesn't the latter imply the former?

    If it's possible to find an LLM for any given problem, then find an LLM for the problem "find an LLM for the problem and then evaluate it" and then evaluate it, and then you have an LLM that can solve any problem.

    It's the "Universal Turing Machine" for LLMs.

    I wonder what's the LLM equivalent of the halting problem?

    • progval 3 days ago

      > It's the "Universal Turing Machine" for LLMs.

      A closer analogy is the Hutter Search (http://hutter1.net/ai/pfastprg.pdf), as it is also an algorithm that can solve any problem. And it is probably too inefficient to use in practice, like the Hutter Search.

    • detourdog 3 days ago

      In the late ‘80s they were called expert systems.

      Most demonstrations were regarding troubleshooting large systems, industrial processes, and education.

    • [removed] 3 days ago
      [deleted]
shawntan 4 days ago

Theoretical results exist that try to quantify the number of CoT tokens needed to reach different levels of computational expressibility: https://arxiv.org/pdf/2310.07923

TL;DR: Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size. For a field that constantly harps on parallelism and compute efficiency, this requirement seems prohibitive.

We really need to get away from constant depth architectures.

  • benkuykendall 3 days ago

    > Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.

    So, as stated, this is impossible since it violates the Time Hierarchy Theorem.

    The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens. Which is... not a particularly impressive bound? Any non-trivial fixed neural network can, for instance, compute the NAND of two inputs. And any poly-time computable function can be computed with a polynomial number of NAND gates.
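
    To make the NAND point concrete, here is a minimal sketch (hand-picked weights, purely illustrative, not taken from the paper): a single threshold unit already computes NAND, and a polynomial number of NAND gates computes any poly-time function.

        # Single threshold unit computing NAND(a, b) for a, b in {0, 1}.
        # Weights and bias are hand-picked: output = step(1.5 - a - b).
        def nand(a, b):
            w1, w2, bias = -1.0, -1.0, 1.5
            return 1 if w1 * a + w2 * b + bias > 0 else 0

        # Truth table check: NAND is 0 only on (1, 1).
        assert [nand(a, b) for a in (0, 1) for b in (0, 1)] == [1, 1, 1, 0]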

    • shawntan 3 days ago

      > The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens.

      You're right.

      Re: NAND of two inputs. Isn't this doable even by a single layer (no hidden layers) neural network?

      Re: Polynomial computable function. I'm assuming this makes no assumption of constant-depth.

      Because my entire point was that the result of this paper is not actually impressive AND is covered by a previous paper. Hopefully that's clearer.

[removed] 4 days ago
[deleted]
__loam 4 days ago

> We have mathematically proven that transformers can solve any problem

We should require that you've passed an algorithms and a thermodynamics class before you can post.

  • nopinsight 3 days ago

    To be clear I think the tweet is a bit exaggerated (and the word ‘performance’ there doesn’t take into account efficiency, for example) but I don’t have the time to read the full paper (just skimmed the abstract and conclusion). I quoted the tweet by an author for people to discuss since it’s still a fairly remarkable result.

  • bonoboTP 3 days ago

    This is an accepted ICLR paper by authors from Stanford, Toyota and Google. That's not a guarantee for anything, of course, but they likely know basic algorithms and the second law. You can certainly argue against their claims, but you need to put in the legwork.

    • __loam 3 days ago

      I don't think I should need to argue with the absurd claim that these can solve any problem.

riku_iki 3 days ago

> Remarkably, constant depth is sufficient.

I think the article also says log(n) embedding size (width?) is required, where n is the size of the input.

candiddevmike 4 days ago

> We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.

That seems like a bit of a leap here to make this seem more impressive than it is (IMO). You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.

Wake me up when an LLM figures out stable fusion or room-temperature superconductors.

  • krackers 4 days ago

    I think you're misrepresenting the study. It builds upon previous work that examines the computation power of the transformer architecture from a circuit-complexity perspective. Previous work showed that the class of problems that a "naive" Transformer architecture could compute was within TC0 [1, 2] and as a consequence it was fundamentally impossible for transformers to solve certain classes of mathematical problems. This study actually provides a more realistic bound of AC0 (by analyzing the finite-precision case) which rules out even more problems, including such 'simple' ones as modular parity.

    We also had previous work that hinted that part of the reason why chain-of-thought works from a theoretical perspective is that it literally allows the model to perform types of computations it could not under the more limited setting (in the same way jumping from FSMs to pushdown automata allows you to solve new types of problems) [3].

    [1] https://news.ycombinator.com/item?id=35609652 [2] https://blog.computationalcomplexity.org/2023/02/why-cant-li... [3] https://arxiv.org/abs/2305.15408
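
    For readers less familiar with these classes, the standard picture is (PARITY being the canonical function outside AC0, by Furst-Saxe-Sipser/Hastad):

        \mathsf{AC}^0 \subsetneq \mathsf{TC}^0 \subseteq \mathsf{NC}^1, \qquad \mathrm{PARITY} \in \mathsf{TC}^0 \setminus \mathsf{AC}^0

    so tightening the single-forward-pass bound from TC0 to AC0 newly rules out parity-like problems.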

    • shawntan 4 days ago

      Generally, literature on the computational power of the SAME neural architecture can differ on their conclusions based on their premises. Assuming finite precision will give a more restrictive result, and assuming arbitrary precision can give you Turing completeness.

      From a quick skim this seems like it's making finite precision assumptions? Which doesn't actually tighten previous bounds, it just makes different starting assumptions.

      Am author of [1].

      • krackers 4 days ago

        Ah my bad, great catch! I've updated my comment accordingly.

        • shawntan 3 days ago

          You can't really be blamed though, the language in the paper does seem to state what you originally said. Might be a matter of taste but I don't think it's quite accurate.

          The prior work they referenced actually did account for finite precision cases and why they didn't think it was useful to prove the result with those premises.

          In this work they simply argued from their own perspective why finite precision made more sense.

          The whole sub-field is kinda messy and I get quoted differing results all the time.

          Edit: Also, your original point stands, obviously. Sorry for nitpicking on your post, but I also just thought people should know more about the nuances of this stuff.

  • Horffupolde 4 days ago

    It is actually impressive.

    One could argue that writing enabled chain of thought across generations.

  • Veedrac 3 days ago

    > Wake me up when a LLM figures out stable fusion or room temperature superconductors.

    Man, the goalposts these days.

    • FeepingCreature 3 days ago

      "I love [goalposts]. I love the whooshing noise they make as they go by." --Douglas Adams, slightly adjusted

  • whimsicalism 3 days ago

    it's a TCS result.

    seems like many commenting don't know about computability

  • WalterSear 3 days ago

    > You can say the same thing about humans

    1. Holy shit.

    2. You can't apply Moore's law to humans.

    • Tostino 3 days ago

      You can't apply it to chips any more either.

      Density has continued to increase, but so have prices. The 'law' was tied to the price-to-density ratio, and it's been almost a decade now since it died.

    • gryn 3 days ago

      > 2. You can't apply Moore's law to humans.

      not with that attitude. /s

      if you take reproduction into account and ignore all the related externalities you can definitely double your count of transistors (humans) every two years.

  • aurareturn 3 days ago

    > You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.

    Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?

    Say humans could solve quantum gravity with 100 years of thinking by 10,000 really smart people, and one AGI is equal to one really smart person. Scale up enough compute for 1 million AGIs and we could solve quantum gravity in a year.

    The major assumption here is that transformers can indeed solve every problem humans can.
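
    The implicit arithmetic, for what it's worth (assuming, heroically, that the work parallelizes perfectly across thinkers):

        10{,}000 \text{ people} \times 100 \text{ years} = 10^6 \text{ person-years} = 10^6 \text{ AGIs} \times 1 \text{ year}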

    • wizzwizz4 3 days ago

      > Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?

      But it can't. There isn't enough planet.

      > The major assumption here is that transformers can indeed solve every problem humans can.

      No, the major assumptions are (a) that ChatGPT can, and (b) that we can reduce the resource requirements by many orders of magnitude. The former assumption is highly dubious, and the latter is plainly false.

      Transformers are capable of representing any algorithm, if they're allowed to be large enough and to run for long enough. That doesn't give them any special algorithm-finding ability, and finding the correct algorithms is the hard part of the problem!

      • aurareturn 3 days ago

        > But it can't. There isn't enough planet.

        How much resource are you assuming an AGI would consume?

    • visarga 3 days ago

      > Scale enough compute for 1 million AGI and we can solve quantum gravity in a year.

      That is wrong, it misses the point. We learn from the environment, we don't secrete quantum gravity from our pure brains. It's a RL setting of exploration and exploitation, a search process in the space of ideas based on validation in reality. A LLM alone is like a human locked away in a cell, with no access to test ideas.

      If you take child Einstein and put him on a remote island, and come back 30 years later, do you think he would impress you with his deep insights? It's not the brain alone that made Einstein so smart. His environment also made a major contribution.

      • exe34 3 days ago

        if you told child Einstein that light travels at a constant speed in all inertial frames and taught him algebra, then yes, he would come up with special relativity.

        in general, an AGI might want to perform experiments to guide its exploration, but it's possible that the hypotheses that it would want to check have already been probed/constrained sufficiently. which is to say, a theoretical physicist might still stumble upon the right theory without further experiments.

        • westurner 3 days ago

          Labeling observations with something richer than a list of column-label strings at the top would make it possible to mine them for insights, or to produce a universal theory that covers what has actually been observed rather than the presumed limits of theory.

          CSVW is CSV on the Web as Linked Data.

          With 7 metadata header rows at the top, a CSV could be converted to CSVW, with URIs for units like metre or meter or feet.

          If a ScholarlyArticle publisher does not indicate that a given CSV or better :Dataset that is :partOf an article is a :premiseTo the presented argument, a human grad student or an LLM needs to identify the links or textual citations to the dataset CSV(s).

          Easy: identify all of the pandas.read_csv() calls in a notebook (see the sketch below).

          Expensive: find the citation in a PDF, search for the text in "quotation marks", and try to guess which search result contains the dataset premise to an article;

          Or: identify each premise in the article, pull the primary datasets, run an unbiased AutoML report to identify linear and nonlinear variance relations, and test the data-dredged causal chart before or after manually reading the abstract.
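
          The "easy" case is a few lines. A rough sketch, assuming standard .ipynb JSON; find_read_csv_calls is a hypothetical helper name, not an existing tool:

              import json, re

              # Hypothetical helper: list the file/URL argument of every
              # pandas.read_csv(...) call in a Jupyter notebook, as a cheap way
              # to locate the datasets the notebook actually loads.
              def find_read_csv_calls(notebook_path):
                  with open(notebook_path) as f:
                      nb = json.load(f)
                  pattern = re.compile(r"""(?:pd|pandas)\.read_csv\(\s*['"]([^'"]+)['"]""")
                  paths = []
                  for cell in nb.get("cells", []):
                      if cell.get("cell_type") == "code":
                          paths += pattern.findall("".join(cell.get("source", [])))
                  return paths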

      • aurareturn 3 days ago

        The assumption is that the AGI can solve any problem humans can - including learning from the environment, if that is what is needed.

        But I think you're missing the point of my post. I don't want to devolve this topic into yet another argument centered around "but AI can't be AGI or can't do what humans can do because so and so".

        • visarga 3 days ago

          I often see this misconception that compute alone will lead us to surpass human level. No doubt it is inspired by the "scaling laws" we heard so much about. People forget that imitation is not sufficient to surpass human level.

DarkNova6 3 days ago

The more interesting question is whether the ability to reason and solve problems scales linearly or logarithmically.

[removed] 3 days ago
[deleted]
m3kw9 3 days ago

Sort of like a quantum superposition state? So here is an idea: use quantum computing to produce all possible inferences, and some not-yet-invented algorithm to collapse them to the final result.

[removed] 3 days ago
[deleted]
tooltower 3 days ago

Constant-depth circuits can solve everything? I feel like I missed some important part of circuit complexity. Or this is BS.

  • shawntan 3 days ago

    Using CoT implicitly increases the depth of the circuit. But yes, poorly worded.
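
    Roughly (a back-of-envelope sketch, not a claim from the paper): each generated token is another full forward pass, so T CoT tokens through a depth-d transformer unroll into a computation of serial depth about

        \mathcal{O}(T \cdot d)

    which is how a "constant depth" model buys extra sequential computation.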