Comment by RossBencina 10 hours ago

The SemiAnalysis article that you linked to stated:

"OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024, highlighting the significant technical hurdle that Google’s TPU fleet has managed to overcome."

Given the overall quality of the article, that is an uncharacteristically convoluted sentence. At the risk of stating the obvious, "that was broadly deployed" (or not) is contingent on many factors, most of which are not of the GPU vs. TPU technical variety.

alecco 2 hours ago

My reading between the lines is that OpenAI's "GPT-5" is really a GPT-4-generation model. That would be consistent with it being unimpressive, and not the leap forward Altman promised.

nbardy 7 hours ago

This is misleading. They had 4.5, which was a new scaled-up training run. It was a huge model and was only served to Pro users, but the biggest models are always used as teacher models for smaller models; that's how distillation works. It would be stupid not to use the biggest model you have as the teacher, and a waste, since they already have the weights.
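
For concreteness, here's a minimal sketch of the teacher-student distillation step being described (model names, shapes, and the temperature are hypothetical, not OpenAI's actual recipe):

    # Minimal knowledge-distillation sketch; `teacher` and `student`
    # are hypothetical decoder LMs returning (batch, seq, vocab) logits.
    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, tokens, T=2.0):
        with torch.no_grad():          # teacher is frozen, forward only
            t_logits = teacher(tokens)
        s_logits = student(tokens)
        # Student matches the teacher's temperature-smoothed token
        # distribution via KL divergence (the classic soft-label loss).
        return F.kl_div(
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)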

They would have taken some time to calculate the efficiency gains of pretraining vs. RL, resumed the GPT-4.5 run for whatever budget made sense, and then spent the rest on RL.

Sure, they chose not to serve the large base models anymore, for cost reasons.

But I’d guess Google is doing the same. Gemini 2.5 samples very fast and seems way too small to be their base pretrain. The efficiency gains in pretraining scale with model scale, so it makes sense to train the largest model possible. But then the models end up super sparse and oversized, and make little sense to serve for inference without distillation.

In RL the efficiency calculus is very different, because you have to run inference on the model to draw online samples. So smaller models start to make more sense to scale.
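
A rough back-of-envelope for that point, using the standard ~2 FLOPs per parameter per generated token for decoder inference (the parameter counts below are made up for illustration):

    # Why online RL favors small models: rollout cost scales with
    # parameter count, since you must autoregressively sample the policy.
    def sample_flops(n_params, n_tokens):
        return 2 * n_params * n_tokens   # ~2 FLOPs/param/token

    big   = sample_flops(1e12, 1000)     # hypothetical 1T-param teacher
    small = sample_flops(5e10, 1000)     # hypothetical 50B-param student
    print(f"rollout cost ratio: {big / small:.0f}x")  # -> 20x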

Big model => distill => RL

That makes the most theoretical sense nowadays for spending a training budget efficiently.

So they already did train a big model: 4.5. Not using it would have been absurd, and they have a known recipe they could resume scaling if the returns justified it.

  • barrell 8 minutes ago

    My understanding of 4.5 was that it was released long, long after the initial training run finished. It also had an older knowledge cutoff than the newer 4o models.