lsy 3 days ago

Note that for the purposes of this paper a “problem” just means a formally decidable problem or a formal language, and the proof is that by creatively arranging transformers you can make individual transformer runs behave like individual Boolean circuits. However, this is a long way from any practical application of transformers: for one thing, most problems we care about are not stated as formal languages, and we already have a vastly more efficient way to implement Boolean circuits.

  • shawntan 3 days ago

    If a "problem we care about" is not stated as a formal language, does it mean it does not exist in the hierarchy of formal languages? Or is it just as yet unclassified?

    • tsimionescu 3 days ago

      It means that there are two problems: one, to formalize the problem as stated while capturing all relevant details, and two, to solve the resulting formal problem. Until you solve problem one, you can't use formal methods to say anything about the problem (it's not even clear a priori that a given problem is solvable at all).

      Unfortunately, the task of formalizing an informal problem is itself an informal problem that we don't know how to formalize, so we can't say much about it. So overall, we can't say much about how hard the general problem "given a problem statement from a human, solve that problem" is, whether any particular system (including a human!) can solve it, or how long that might take and with what resources.

      • viraptor 3 days ago

        > task of formalizing an informal problem is itself an informal problem

        I couldn't find details about this - do you know of a paper or some resource which digs into that idea?

    • wslh 3 days ago

      My 2 cents: Since LLMs (Large Language Models) operate as at least a subset of Turing machines (which recognize recursively enumerable languages), the chain of thought (CoT) approach could be equivalent to or even more expressive than that subset. In fact, CoT could perfectly be a Turing machine.

      If we leave CoT aside for a moment, it's worth exploring the work discussed in the paper "Neural Networks and the Chomsky Hierarchy"[1], which analyzes how neural networks (including LLMs) map onto different levels of the Chomsky hierarchy, with a particular focus on their ability to recognize formal languages across varying complexity.

      [1] https://ar5iv.labs.arxiv.org/html/2207.02098v1

      • flir 3 days ago

        > In fact, CoT could perfectly be a Turing machine.

        Are we going to need an infinite number of LLMs, arranged on a tape?

  • julienreszka 3 days ago

    > most problems we care about are not stated as formal languages

    then a way forward would be to translate them into a formal language

larodi 3 days ago

I'm waiting for the people of AI to discover syllogism and inference in its original PROLOG sense, which this CoT abomination basically tries to achieve. Interestingly, if all logical content were translated to rules, and only those rules were fed into the LLM training set, what would the result be? And can the probabilistic magic be made to actually follow reason, without all the dice?

  • trescenzi 3 days ago

    Right, we’ve now gotten to the stage of this AI cycle where we start using the new tool to solve problems old tools could already solve. Saying a transformer can solve any formally decidable problem if given enough tape isn’t saying much. It’s a cool proof, I don’t mean to deny that, but it doesn’t mean much practically, as we already have more efficient tools that can do the same.

    • marcosdumay 3 days ago

      What I don't get is... didn't people prove that in the 90s for any multi-layer neural network? Didn't people prove transformers are equivalent in the transformers paper?

  • sunir 3 days ago

    I was thinking about the graphrag paper and prolog. I’d like to extract predicates. The source material will be inconsistent and contradictory and incomplete.

    Using the clustering (community) model, an LLM can summarize the opinions as a set of predicates, which don’t have to agree, along with some general weight of how much people agree or disagree with them.

    The predicates won’t be suitable for symbolic logic because the language will be loose. However an embedding model may be able to connect different symbols together.

    Then you could attempt multiple runs through the database of predicates because there will be different opinions.

    Then one could attempt to reason using these loosely stitched predicates. I don’t know how good the outcome would be.

    I imagine this would be better in an interactive decision making tool where a human is evaluating the suggestions for the next step.

    This could be better for planning than problem solving.
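
    A rough sketch of what that pipeline could look like, assuming hypothetical llm_summarize and embed helpers standing in for the LLM call and the embedding model (illustrative only, not a working system):

        from dataclasses import dataclass
        import numpy as np

        @dataclass
        class Predicate:
            text: str            # e.g. "remote work improves productivity"
            weight: float        # rough agreement score from the community summary
            vector: np.ndarray   # embedding used to stitch similar predicates together

        def extract_predicates(community_docs, llm_summarize, embed):
            # Summarize each community's opinions into weighted, possibly
            # contradictory predicates.
            predicates = []
            for docs in community_docs:
                for text, weight in llm_summarize(docs):   # LLM proposes (claim, agreement) pairs
                    predicates.append(Predicate(text, weight, embed(text)))
            return predicates

        def related(p, q, threshold=0.8):
            # Loosely connect predicates whose embeddings are close (cosine similarity),
            # since the loose language rules out exact symbolic matching.
            sim = float(np.dot(p.vector, q.vector) /
                        (np.linalg.norm(p.vector) * np.linalg.norm(q.vector)))
            return sim >= threshold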

  • pkoird 3 days ago

    I've said this before and I'll say it again: Any sufficiently advanced LLM is indistinguishable from Prolog.

  • detourdog 3 days ago

    I’m surprised that understanding how a thought unfolds is being considered not relevant to the answer. I have done a lot of problem solving in groups and alone. How thoughts develop seems fundamental to understanding the solutions.

    The story regarding the banning of terms that can be used with a reasoning system is a big red flag to me.

    This sort of knee jerk reaction displays immature management and an immature technology product.

sigmoid10 3 days ago

>Remarkably, constant depth is sufficient.

How would that be remarkable, when it is exactly what the Universal Approximation Theorem already states? Since transformers also use fully connected layers, none of this should really come as a surprise. But from glancing at the paper, they don't even mention it.

  • nexustext 3 days ago

    It's 'remarkable' because (a) academic careers are as much about hype as science, (b) arxiv doesn't have peer review process to quash this, (c) people take arxiv seriously.

  • logicchains 3 days ago

    >How would that be remarkable, when it is exactly what he Universal Approximation Theorem already states

    Only with infinite precision, which is highly unrealistic. Under realistic assumptions, fixed-depth transformers without chain-of-thought are very limited in what they can express: https://arxiv.org/abs/2207.00729 . Chain of thought increases the class of problems which fixed-depth transformers can solve: https://arxiv.org/abs/2310.07923

  • IshKebab 3 days ago

    The universal approximation theorem has no practical relevance.

wodenokoto 3 days ago

But didn't we already know that NNs can solve any computable problem? The interesting thing is whether they can be trained to solve any (computable) problem.

  • imhoguy 3 days ago

    I don't know why I read that as "HN"; indeed, HN can solve any problem.

  • tossandthrow 3 days ago

    Feed forward NNs can approximate all functions f: X -> Y only for closed domains.

    But recurrent neural networks can solve any computational problem, given enough precision.

    • roboboffin 3 days ago

      Does that mean that when we reduce the precision of a NN, for example using bfloat16 instead of float32, we reduce the set of computational problems that can be solved?

      How would that compare with a biological neural network with presumably near-infinite precision?

    • wodenokoto 3 days ago

      On the first day of an introduction to NNs we were asked to create all the logic gates using artificial neurons, and then told "If you have all gates, you can do all computations".

      I've got to admit, I'm sorta sticking to that at face value, because I don't know enough computer science to a) discern if that is true and b) know what "f: X -> Y only for closed domains" means.
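
      For what it's worth, that exercise is easy to reproduce. A minimal sketch with threshold units (illustrative only; it says nothing about what trained networks actually learn):

          def neuron(weights, bias, inputs):
              # A single threshold (perceptron-style) unit: fires iff w.x + b > 0.
              return int(sum(w * x for w, x in zip(weights, inputs)) + bias > 0)

          def NAND(a, b):
              # NAND alone is universal: any Boolean circuit can be built from it.
              return neuron([-1, -1], 1.5, [a, b])

          def AND(a, b):
              return neuron([1, 1], -1.5, [a, b])

          def OR(a, b):
              return neuron([1, 1], -0.5, [a, b])

          def XOR(a, b):
              # XOR is not linearly separable, so it needs a second layer (gate composition).
              return AND(OR(a, b), NAND(a, b))

          assert [NAND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [1, 1, 1, 0]
          assert [XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]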

      • tossandthrow 3 days ago

        I think the easiest way to think about this is in terms of natural numbers, ie. 1, 2, 3, 4.

        When you only have a fixed width, ie. a static feed forward network, you have an upper limit to the data you can represent and compute on.

        Eg. if the highest number you can represent is 1.000, then you will need a new NN if you want to do computations on 1.001.

        ... or use an inductive structure, like a recurrent neural network has.
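
        A toy illustration of the contrast, with plain Python standing in for the two shapes (nothing neural here, just fixed arity versus folding over a sequence of any length):

            def fixed_width_sum(x1, x2, x3):
                # Hard-wired for exactly three inputs; a fourth number has
                # nowhere to go without building a new "network".
                return x1 + x2 + x3

            def recurrent_sum(xs):
                # The same cell is reused at every step, so the input length is unbounded.
                state = 0
                for x in xs:
                    state = state + x   # one recurrent step
                return state

            print(recurrent_sum(range(1, 1002)))   # handles 1001 inputs, or any other length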

nopinsight 3 days ago

In the words of an author:

"What is the performance limit when scaling LLM inference? Sky's the limit.

We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.

http://arxiv.org/abs/2402.12875 (ICLR 2024)"

https://x.com/denny_zhou/status/1835761801453306089

  • ec109685 3 days ago

    Is this the infinite monkey Shakespeare trope?

    • throwup238 3 days ago

      More like the universal approximation theorem extended to computation rather than network complexity: https://en.wikipedia.org/wiki/Universal_approximation_theore...

      • immibis 3 days ago

        The universal approximation theorem is good to know because it says there's no theoretical upper bound to a function-approximating NN's accuracy. In practice, though, it says nothing about what can be realistically achieved.

    • nopinsight 3 days ago

      A key difference is that the way LMMs (Large Multimodal Models) generate output is far from random. These models can imitate/blend existing information, or imitate and probably blend known reasoning methods in the training data. The latter is a key distinguishing feature of the new OpenAI o1 models.

      Thus, the signal-to-noise ratio of their output is generally way better than infinite monkeys.

      Arguably, humans rely on similar modes of "thinking" most of the time as well.

    • CamperBob2 3 days ago

      Yeah. Monkeys. Monkeys that write useful C and Python code that needs a bit less revision every time there's a model update.

      Can we just give the "stochastic parrot" and "monkeys with typewriters" schtick a rest? It made for novel commentary three or four years ago, but at this point, these posts themselves read like the work of parrots. They are no longer interesting, insightful, or (for that matter) true.

      • visarga 3 days ago

        If you think about it, humans necessarily use abstractions, from the edge detectors in the retina to concepts like democracy. But do we really understand? All abstractions leak, and nobody knows the whole stack. For all the poorly grasped abstractions we are using, we are also just parroting. How many times are we doing things because "that is how they are done", never wondering why?

        Take ML itself, people are saying it's little more than alchemy (stir the pile). Are we just parroting approaches that have worked in practice without real understanding? Is it possible to have centralized understanding, even in principle, or is all understanding distributed among us? My conclusion is that we have a patchwork of partial understanding, stitched together functionally by abstractions. When I go to the doctor, I don't study medicine first, I trust the doctor. Trust takes the place of genuine understanding.

        So humans, like AI, use distributed and functional understanding, we don't have genuine understanding as meant by philosophers like Searle in the Chinese Room. No single neuron in the brain understands anything, but together they do. Similarly, no single human understands genuinely, but society together manages to function. There is no homunculus, no centralized understander anywhere. We humans are also stochastic parrots of abstractions we don't really grok to the full extent.

      • kaechle 3 days ago

        Every time I read "stochastic parrot," my always-deterministic human brain surfaces this quote:

        > “Most people are other people. Their thoughts are someone else's opinions, their lives a mimicry, their passions a quotation.”

        - Oscar Wilde, a great ape with a pen

      • ffsm8 3 days ago

        > novel commentary three or four years ago,

        ChatGPT was released in November 2022. That's one year and 10 months ago. Their marketing started in the summer of the same year, still far off from 3-4 years.

      • hegFdH 3 days ago

        The infinite monkey post was in response to this claim, which, like the universal approximation theorem, is useless in practice:

        "We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient."

        Like an LLM, you omit the context and browbeat people with the "truth" you want to propagate. Together with the many politically forbidden terms since 2020, let us now also ban "stochastic parrot" in order to have a goodbellyfeel newspeak.

        • chaosist 3 days ago

          There is also a problem of "stochastic parrot" being constantly used in a pejorative sense as opposed to a neutral term to keep grounded and skeptical.

          Of course, it is an overly broad stroke that doesn't quite capture all the nuance of the model, but the alternative of "come on guys, just admit the model is thinking" is much worse and has much less to do with reality.

      • 93po 3 days ago

        AI news article comments bingo card:

        * Tired ClosedAI joke

        * Claiming it's predictive text engine that isn't useful for anything

        * Safety regulations are either good or bad, depending on who's proposing them

        * Fear mongering about climate impact

        * Bringing up Elon for no reason

        * AI will never be able to [some pretty achievable task]

        * Tired arguments from pro-IP / copyright sympathizers

  • tsimionescu 3 days ago

    One question, if anyone knows the details: does this prove that there exists a single LLM that can approximate any function to arbitrary precision given enough CoT, or does it prove that for every function, there exists a Transformer that fits those criteria?

    That is, does this prove that a single LLM can solve any problem, or that for any problem, we can find an LLM that solves it?

    • jstanley 3 days ago

      Doesn't the latter imply the former?

      If it's possible to find an LLM for any given problem, then find an LLM for the problem "find an LLM for the problem and then evaluate it" and then evaluate it, and then you have an LLM that can solve any problem.

      It's the "Universal Turing Machine" for LLMs.

      I wonder what's the LLM equivalent of the halting problem?

      • progval 3 days ago

        > It's the "Universal Turing Machine" for LLMs.

        A closer analogy is the Hutter Search (http://hutter1.net/ai/pfastprg.pdf), as it is also an algorithm that can solve any problem. And, like the Hutter Search, such a construction is probably too inefficient to use in practice.

      • detourdog 3 days ago

        In the late ‘80s they were called expert systems.

        Most demonstrations were regarding troubleshooting large systems, industrial processes, and education.

      • [removed] 3 days ago
        [deleted]
  • shawntan 3 days ago

    Theoretical results exist that try to quantify the number of CoT tokens needed to reach different levels of computational expressibility: https://arxiv.org/pdf/2310.07923

    TL;DR: Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size. For a field that constantly harps on parallelism and compute efficiency, this requirement seems prohibitive.

    We really need to get away from constant depth architectures.

    • benkuykendall 3 days ago

      > Getting to Turing completeness can require polynomial CoT tokens, wrt the input problem size.

      So, as stated, this is impossible since it violates the Time Hierarchy Theorem.

      The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens. Which is... not a particularly impressive bound? Any non-trivial fixed neural network can, for instance, compute the NAND of two inputs. And any polynomial computable function can be computed with a polynomial number of NAND gates.

      • shawntan 3 days ago

        > The actual result of the paper is that any poly-time computable function can be computed with poly-many tokens.

        You're right.

        Re: NAND of two inputs. Isn't this doable even by a single layer (no hidden layers) neural network?

        Re: Polynomial computable function. I'm assuming this makes no assumption of constant-depth.

        Because my entire point was that the result of this paper is not actually impressive AND is covered by a previous paper. Hopefully that's clearer.

  • [removed] 3 days ago
    [deleted]
  • __loam 3 days ago

    > We have mathematically proven that transformers can solve any problem

    We should require that you've passed an algorithms and a thermodynamics class before you can post.

    • nopinsight 3 days ago

      To be clear I think the tweet is a bit exaggerated (and the word ‘performance’ there doesn’t take into account efficiency, for example) but I don’t have the time to read the full paper (just skimmed the abstract and conclusion). I quoted the tweet by an author for people to discuss since it’s still a fairly remarkable result.

    • bonoboTP 3 days ago

      This is an accepted ICLR paper by authors from Stanford, Toyota and Google. That's not a guarantee for anything, of course, but they likely know basic algorithms and the second law. You can certainly argue against their claims, but you need to put in the legwork.

      • __loam 3 days ago

        I don't think I should need to argue with the absurd claim that these can solve any problem.

  • riku_iki 3 days ago

    > Remarkably, constant depth is sufficient.

    I think the article also says a log(n) embedding size (width?) is required, where n is the size of the input.

  • DarkNova6 3 days ago

    The more interesting question is whether the ability to reason and solve problems scales linearly or logarithmically.

  • candiddevmike 3 days ago

    > We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.

    That seems like a bit of a leap here to make this seem more impressive than it is (IMO). You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.

    Wake me up when an LLM figures out stable fusion or room temperature superconductors.

    • krackers 3 days ago

      I think you're misrepresenting the study. It builds upon previous work that examines the computation power of the transformer architecture from a circuit-complexity perspective. Previous work showed that the class of problems that a "naive" Transformer architecture could compute was within TC0 [1, 2] and as a consequence it was fundamentally impossible for transformers to solve certain classes of mathematical problems. This study actually provides a more realistic bound of AC0 (by analyzing the finite-precision case) which rules out even more problems, including such 'simple' ones as modular parity.

      We also had previous work that hinted that part of the reason why chain-of-thought works from a theoretical perspective is that it literally allows the model to perform types of computations it could not under the more limited setting (in the same way jumping from FSMs to pushdown automata allows you to solve new types of problems) [3].

      [1] https://news.ycombinator.com/item?id=35609652

      [2] https://blog.computationalcomplexity.org/2023/02/why-cant-li...

      [3] https://arxiv.org/abs/2305.15408

      • shawntan 3 days ago

        Generally, literature on the computational power of the SAME neural architecture can differ on their conclusions based on their premises. Assuming finite precision will give a more restrictive result, and assuming arbitrary precision can give you Turing completeness.

        From a quick skim this seems like it's making finite precision assumptions? Which doesn't actually tighten previous bounds, it just makes different starting assumptions.

        Am author of [1].

    • Horffupolde 3 days ago

      It is actually impressive.

      One could argue that writing enabled chain of thought across generations.

    • Veedrac 3 days ago

      > Wake me up when a LLM figures out stable fusion or room temperature superconductors.

      Man, the goalposts these days.

      • FeepingCreature 3 days ago

        "I love [goalposts]. I love the whooshing noise they make as they go by." --Douglas Adams, slightly adjusted

    • whimsicalism 3 days ago

      it's a TCS result.

      seems like many commenting don't know about computability

    • WalterSear 3 days ago

      > You can say the same thing about humans

      1. Holy shit.

      2. You can't apply Moore's law to humans.

      • Tostino 3 days ago

        You can't apply it to chips any more either.

        Density has continued to increase, but so have prices. The 'law' was tied to the price to density ratio, and it's been almost a decade now since it died.

      • gryn 3 days ago

        > 2. You can't apply Moore's law to humans.

        not with that attitude. /s

        if you take reproduction into account and ignore all the related externalities you can definitely double your count of transistors (humans) every two years.

    • aurareturn 3 days ago

      > You can say the same thing about humans, provided they are allowed to think across as many years/generations as needed.

      Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?

      Say humans can solve quantum gravity in 100 years of thinking by 10,000 really smart people. If one AGI is equal to 1 really smart person, scale enough compute for 1 million AGIs and we can solve quantum gravity in a year.
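
      (The implicit arithmetic: 10,000 people × 100 years = 1,000,000 person-years, so 1,000,000 AGI-equivalents running for one year would cover the same budget, assuming the work parallelizes perfectly.)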

      The major assumption here is that transformers can indeed solve every problem humans can.

      • wizzwizz4 3 days ago

        > Isn’t this a good thing since compute can be scaled so that the LLM can do generations of human thinking in a much shorter amount of time?

        But it can't. There isn't enough planet.

        > The major assumption here is that transformers can indeed solve every problem humans can.

        No, the major assumptions are (a) that ChatGPT can, and (b) that we can reduce the resource requirements by many orders of magnitude. The former assumption is highly dubious, and the latter is plainly false.

        Transformers are capable of representing any algorithm, if they're allowed to be large enough and run long enough. That doesn't give them any special algorithm-finding ability, and finding the correct algorithms is the hard part of the problem!

      • visarga 3 days ago

        > Scale enough compute for 1 million AGI and we can solve quantum gravity in a year.

        That is wrong, it misses the point. We learn from the environment, we don't secrete quantum gravity from our pure brains. It's a RL setting of exploration and exploitation, a search process in the space of ideas based on validation in reality. A LLM alone is like a human locked away in a cell, with no access to test ideas.

        If you take child Einstein and put him on a remote island, and come back 30 years later, do you think he would impress you with his deep insights? It's not the brain alone that made Einstein so smart. It's also his environment that made a major contribution.

  • [removed] 3 days ago
    [deleted]
  • m3kw9 3 days ago

    Sort of like a quantum superposition state? So here is an idea: use quantum computing to produce all possible inferences, and some not-yet-invented algorithm to collapse them to the final result.

  • [removed] 3 days ago
    [deleted]
  • tooltower 3 days ago

    Constant depth circuits can solve everything? I feel like I missed some important part of circuit complexity. Or this is BS.

    • shawntan 3 days ago

      Using CoT implicitly increases the depth of the circuit. But yes, poorly worded.

JSDevOps 3 days ago

So given infinite time and resources it can solve any problem? Hardly groundbreaking, is it?

HarHarVeryFunny 3 days ago

Sure, in the same sense that an infinitely long tape lets a Turing machine solve arbitrary problems. In theory at least, if one had the right program.

  • falcor84 3 days ago

    It's not clear to me what you're saying; isn't the whole deal here that by performing RL on the CoT (given sufficient size and compute) it would converge to the right program?

    • HarHarVeryFunny 3 days ago

      I was really saying two things:

      1) The theoretical notion that a fixed-depth transformer + COT can solve arbitrary problems involving sequential computation is rather like similar theoretical notions of a Turing machine as a universal computer, or of an ANN with a hidden layer being able to represent arbitrary functions... it may be true, but at the same time not useful

      2) The Turing machine, just like the LLM+COT, is only as useful as the program it is running. If the LLM+COT is incapable of runtime learning and is just trying to mimic some reasoning heuristics, then that is going to limit its function, even if theoretically such an "architecture" could do more if only it were running a universal AGI program

      Using RL to encourage the LLM to predict continuations according to some set of reasoning heuristics is what it is. It's not going to make the model follow any specific reasoning logic, but is presumably hoped to generate a variety of continuations that the COT "search" will be able to utilize to arrive at a better response than it otherwise would have done. More of an incremental improvement (as reflected in the benchmark scores it achieves) than "converging to the right program".

    • __loam 3 days ago

      Sometimes reading hackernews makes me want to slam my head on the table repeatedly. Given sufficient size and compute is one of the most load bearing phrases I've ever seen.

      • falcor84 3 days ago

        But it is load bearing. I mean, I personally can't stop being amazed at how with each year that passes, things that were unimaginable with all the world's technology a decade ago are becoming straightforward to run on a reasonably priced laptop. And at this stage, I wouldn't bet even $100 against any particular computational problem being solved in some FAANG datacenter by the end of the decade.

mg 3 days ago

Has it been publicly benchmarked yet, if this approach:

    Hello LLM, please solve this task: <task>
Can be improved by performing this afterwards?

    for iteration in range(10):
        Hello LLM, please solve this task: <task>
        Here is a possible solution: <last_reply>
        Please look at it and see if you can improve it.
        Then tell me your improved solution.
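
A minimal sketch of that loop in code, assuming a hypothetical ask_llm(prompt) wrapper around whatever chat API is in use (no particular provider or signature implied):

    def iterative_refine(task, ask_llm, rounds=10):
        # Naive self-refinement: feed the model's own answer back as a candidate solution.
        reply = ask_llm(f"Hello LLM, please solve this task: {task}")
        for _ in range(rounds):
            reply = ask_llm(
                f"Hello LLM, please solve this task: {task}\n"
                f"Here is a possible solution: {reply}\n"
                "Please look at it and see if you can improve it.\n"
                "Then tell me your improved solution."
            )
        return reply
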
  • lorepieri 2 days ago

    Not sure if it has been benchmarked, but I've called this technique the "for-loop of thought". :)

  • Kiro 3 days ago

    Isn't that the whole reason that o1 works?

    • ben_w 3 days ago

      I think o1 is more like "pretend you're doing a job interview, think step by step and show your working".

      I tried something similar to the suggested iterative loop on a blog post I'd authored but wanted help copy editing; first few were good enough, but then it got very confused and decided the blog post wasn't actually a blog post to be edited and instead that what I really wanted to know was the implications of Florida something something Republican Party.

      Benchmark would be neat, because all I have is an anecdote.

tossandthrow 3 days ago

> We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.

This is also the case with plain and regular RNNs

  • baq 3 days ago

    Now just need an autoregressive transformer <==> RNN isomorphism paper and we're golden

    • logicchains 3 days ago

      Plain RNNs are theoretically weaker than transformers with COT: https://arxiv.org/abs/2402.18510 .

      • tossandthrow 3 days ago

        The paper says transformers perform better than RNNs, which is not surprising.

        However, they are both, theoretically, Turing complete computers. So they are equally expressive.

seydor 3 days ago

'can'

But will they? I believe the frontier has moved to making them make sense instead of just making infinite language.

The infinite monkey problem is not solved yet

scotty79 3 days ago

Chain of thought GPT is sort of a Turing machine with a tape that it's allowed to write to for purposes other than outputting the answer.

cpldcpu 3 days ago

Can any of these tools do anything that GitHub Copilot cannot do? (Apart from using other models?) I tried Continue.dev and cursor.ai, but it was not immediately obvious to me. Maybe I am missing something workflow specific?

floppiplopp 3 days ago

They have also mathematically proven that transformers are great randomness generators.

empath75 3 days ago

Is this more general than LLMs? Is it possible to do something Chain-of-Thought-like in a transformer model that _isn't_ trained on language?

glial 3 days ago

Apologies if this is a dumb question, but aren't all computations inherently serial? In that a Turing machine performs operations serially?

  • joe_the_user 3 days ago

    Aren't all computations inherently serial?

    No. "inherently serial" refers to problems that are specified serially and can't be spend up by parallel processing. The sum of a set of N numbers is an example of a problem that is not inherently serial. You can use parallel reduction to perform the computation in O(log(N)) time on an idealized parallel computer but it takes O(N) time on an idealized serial computer.

    And, it turns, exactly which problems are really are inherently serial is somewhat challenging problem.
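
    For illustration, here is that pairwise (tree) reduction written serially in Python; each pass over the pairs is one round that could run entirely in parallel, which is where the O(log(N)) depth comes from:

        def tree_sum(xs):
            # Each while-iteration is one "parallel round": all pairs could be added
            # simultaneously, so an idealized parallel machine needs only O(log N)
            # rounds instead of the O(N) steps of a serial left-to-right sum.
            xs = list(xs)
            while len(xs) > 1:
                reduced = [a + b for a, b in zip(xs[::2], xs[1::2])]
                if len(xs) % 2:            # carry the unpaired element to the next round
                    reduced.append(xs[-1])
                xs = reduced
            return xs[0] if xs else 0

        assert tree_sum(range(1, 101)) == 5050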

    • visarga 3 days ago

      > The sum of a set of N numbers is an example of a problem that is not inherently serial.

      But addition with floats (not reals) is non-associative.

      • immibis 3 days ago

        They didn't say floats, and the sum of a set of floats is not uniquely defined as a float for the reason you stated, at least not without specifying a rounding mode. Most people use "round to whatever my naïve code happens to do", which has many correct answers. Adding up a set of floats with only the usual 0.5 ULP imprecision is, yes, non-trivial.
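
        A quick demonstration of the non-associativity being discussed:

            a, b, c = 0.1, 0.2, 0.3
            print((a + b) + c)                  # 0.6000000000000001
            print(a + (b + c))                  # 0.6
            print((a + b) + c == a + (b + c))   # False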

  • tromp 3 days ago

    Turing Machines are just one of many computational models. Others offer more parallelism. Two examples:

    In lambda calculus, disjoint redexes can be reduced in parallel.

    And in interaction nets, all active pairs can be reduced in parallel [1].

    [1] https://en.wikipedia.org/wiki/Interaction_nets

  • ants_everywhere 3 days ago

    You can model parallel computation by an arbitrary finite product of Turing machines. And then, yes, you can simulate that product on a single Turing machine. I think that's the sort of thing you have in mind?

    But I'm not aware of what "inherently serial" means. The right idea likely involves talking about complexity classes. E.g. how efficiently does a single Turing machine simulate a product of Turing machines? An inherently serial computation would then be something like a problem where the simulation is significantly slower than running the machines in parallel.

  • ninetyninenine 3 days ago

    Yeah, it's talking about an approach for LLMs where the output of an LLM is fed back in as input, again and again and again, and this produces way more accurate output.

tonii141 3 days ago

A random generator of tokens can also solve any problem if you give it enough time and memory.

qmatch 3 days ago

Is this similar to the Universal Approximation Theorem?

CarRamrod 3 days ago

Damn, we just used our entire Round A acquiring an infinite amount of bananas and typewriter ink. The boss is not going to like this.

  • nopinsight 3 days ago

    No worries! With the magic bananas and ink you've acquired, those monkeys will surely produce output with a signal-to-noise ratio rivaling the best LLMs.

    I’m sure your startup will achieve the coveted Apeicorn status soon!

  • dotancohen 3 days ago

    Naturally.

    It's the printer ink that is forbiddingly expensive. And the bananas are carbon neutral.

  • imjonse 3 days ago

    Hopefully not Cavendish, as those are too sugary for monkeys and you'll just get hallucinations.

  • bryanrasmussen 3 days ago

    did you get both infinite bananas and infinite typewriter ink, or was there a limited supply of typewriter ink? If the first, it was worth it.

    • [removed] 3 days ago
      [deleted]
theshrike79 3 days ago

Are we getting to a point where the LLM will just answer "42" and we need to figure out the question? =)