Comment by joshjob42
https://ai-2027.com/research/compute-forecast
In section 4 they discuss their projections for model size, the state of inference chips in 2027, etc. It's largely in line with expectations on capacity: they project the use of only ~10k of the latest-gen wafer-scale inference chips by late 2027, roughly 1M H100-equivalents, which doesn't seem at all impossible. Earlier on they also discuss expected growth in chip efficiency and in spending; the latter is only ~10x over the next 2.5 years, not unreasonable in absolute terms given the many tens of billions of dollars flooding in.
So on the "can we train the AI" front, they are mostly just projecting 2.5 more years of the growth in scale we've been seeing.
The reason they predict a fairly hard takeoff is that they expect distillation, some algorithmic improvements, and an iterated loop of creating synthetic data, training on it, and then making more synthetic data to enable significant efficiency improvements in the underlying models (still largely in line with developments over the last two years). In particular, they expect a 10T-parameter model in early 2027 to be basically human-equivalent, and they expect it to "think" at about the rate humans do, ~10 words/second. That would require ~300 TFLOP/s of compute, or ~0.1 H100e. One of their inference chips could therefore run ~1000 copies (or fewer copies faster, etc.), giving them the capacity for millions of human-equivalent researchers (or 100k 40x-speed researchers) in early 2027.
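A quick back-of-the-envelope check of those per-copy numbers. The 2-FLOPs-per-parameter-per-token rule of thumb, the ~13 tokens/s for 10 words/s, and the ~3e15 FLOP/s effective throughput per "H100e" are my assumptions, not figures from the forecast:

```python
# Sanity check of the inference arithmetic above.
# Assumptions (mine): a forward pass costs ~2 FLOPs per parameter per
# token; 10 words/s is ~13 tokens/s; "1 H100e" is ~3e15 FLOP/s of
# effective inference throughput; 10k chips ~ 1M H100e -> ~100 H100e/chip.

PARAMS = 10e12            # 10T-parameter model
TOKENS_PER_SEC = 13       # ~10 words/s
FLOPS_PER_PARAM = 2       # multiply-accumulate per parameter per token
H100E_FLOPS = 3e15        # assumed effective FLOP/s per H100-equivalent
CHIP_H100E = 100          # assumed H100e per wafer-scale inference chip

flops_per_copy = PARAMS * TOKENS_PER_SEC * FLOPS_PER_PARAM  # FLOP/s per copy
h100e_per_copy = flops_per_copy / H100E_FLOPS
copies_per_chip = CHIP_H100E / h100e_per_copy

print(f"{flops_per_copy / 1e12:.0f} TFLOP/s per copy")  # ~260, vs ~300 quoted
print(f"{h100e_per_copy:.2f} H100e per copy")           # ~0.09, vs ~0.1 quoted
print(f"{copies_per_chip:.0f} copies per chip")         # ~1150, vs ~1000 quoted
```

Within rounding, the quoted ~300 TFLOP/s, ~0.1 H100e, and ~1000 copies per chip all fall out of the same three assumptions.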
They further expect distillation of such models (and more expensive models overseeing much smaller but still-good ones) to squeeze the effective compute needed down to just 2T parameters and ~60 TFLOP/s each, or ~5000 human-equivalents per inference chip, making for up to 50M human-equivalents by late 2027.
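The same envelope extended to the distilled model and the full fleet (again, the per-chip throughput and FLOPs-per-parameter figures are my assumptions, not the forecast's):

```python
# Fleet-level arithmetic for the distilled 2T-parameter model.
# Assumptions (mine): 2 FLOPs per parameter per token, ~13 tokens/s,
# and a wafer-scale chip worth ~100 H100e at ~3e15 FLOP/s each.

DIST_PARAMS = 2e12        # distilled 2T-parameter model
TOKENS_PER_SEC = 13       # ~10 words/s
CHIP_FLOPS = 100 * 3e15   # assumed FLOP/s per wafer-scale inference chip
N_CHIPS = 10_000          # fleet size projected for late 2027

flops_per_dist_copy = 2 * DIST_PARAMS * TOKENS_PER_SEC  # ~52 TFLOP/s, vs ~60 quoted
dist_copies_per_chip = CHIP_FLOPS / flops_per_dist_copy
fleet_copies = N_CHIPS * dist_copies_per_chip

print(f"{dist_copies_per_chip:.0f} copies per chip")    # ~5800, vs ~5000 quoted
print(f"{fleet_copies / 1e6:.0f}M human-equivalents")   # ~58M, vs ~50M quoted
```

So the ~5000 per chip and ~50M fleet figures are slightly conservative relative to the raw arithmetic, consistent with some overhead being assumed.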
This is probably the biggest open question and the place where the most criticism seems to me to be warranted. Their hardware timelines are pretty reasonable, but one could easily expect needing 10-100x, or perhaps even 1000x, more compute than they describe to achieve Nobel-winner AGI or superintelligence.
I don't believe so. I think every important input that needs to scale for significant advances in the LLM paradigm is at or near the end of the steep part of the sigmoid:
1) useful training data available on the internet
2) the number of humans "manually" creating more training data
3) parameter scaling
4) "easy" algorithmic inventions
5) available and buildable compute
"Just" needing a few more algorithmic inventions to keep the graphs exponential is a cop-out. It is already obvious that just scaling parameters and compute is not enough.
I personally predict that scaling LLMs to solve all physical tasks (e.g. cleaning robots) or intellectual pursuits (they suck at multiplication) will not work out.
We will get better specialized tools by collecting data from specific, high-economic-value, constrained tasks and automating them, but scaling a (multimodal) LLM to solve everything in a single model will not be economically viable. We will get more natural interfaces for many tasks.
This is how I think right now as an ML researcher; it will be interesting to see how wrong I was in two years.
EDIT: an addition about the latest algorithmic advances:
- DeepSeek-style GRPO requires a ladder of scored problems, progressively more difficult and appropriate, to get useful gradients. For open-ended problems (which most interesting problems are) we have no such ladders, and it doesn't work. What it is good for is learning to generate code for leetcode-style problems with a good number of well-made unit tests.
- Test-time inference just adds an insane amount of extra compute after training to brute-force double-check the sanity of answers.
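To make the first point concrete: GRPO standardizes each sampled answer's reward against its group, so on a problem far off the model's current ability (every sample scores 0), every advantage is exactly zero and no learning signal flows. A minimal sketch of just the advantage computation (simplified; real GRPO feeds these into token-level policy gradients):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward standardized within its
    group of sampled completions for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A problem at the right rung of the ladder: mixed outcomes, useful signal.
rung_ok = grpo_advantages([0, 0, 1, 1])    # [-1, -1, 1, 1]

# An open-ended or too-hard problem: all samples fail, zero signal.
rung_missing = grpo_advantages([0, 0, 0, 0])  # [0, 0, 0, 0]

print(rung_ok, rung_missing)
```

This is why a "ladder" matters: without problems where the model sometimes succeeds and sometimes fails, the group statistics wash out and the gradient vanishes.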
Neither will keep the graphs exponential.
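And the second point in miniature: best-of-N style test-time schemes pay for accuracy with linearly more inference passes. A toy simulation (the "model" is just a coin that is right 60% of the time; the names and numbers here are illustrative, not from any real system):

```python
import random

def solve_once(p_correct, rng):
    """Stand-in for one full inference pass: correct with probability p."""
    return 1 if rng.random() < p_correct else 0

def majority_vote(n_samples, p_correct, rng):
    """Self-consistency / best-of-N: sample N answers, take the majority.
    Cost is n_samples full inference passes, i.e. linear in N."""
    votes = sum(solve_once(p_correct, rng) for _ in range(n_samples))
    return 1 if 2 * votes > n_samples else 0

rng = random.Random(0)
TRIALS = 2000
for n in (1, 9, 81):
    acc = sum(majority_vote(n, 0.6, rng) for _ in range(TRIALS)) / TRIALS
    print(f"N={n:3d}: ~{n}x inference compute, accuracy ~ {acc:.2f}")
```

Accuracy climbs from ~0.6 toward ~1.0, but only by multiplying the compute bill 9x and then 81x per query, which is the commenter's point: it buys reliability, not a new exponential.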