Comment by mynti 5 days ago

They trained it in 33 days for ~$20M (which apparently covers not only the infrastructure but also salaries over a 6-month period). And the model is coming close to Qwen and DeepSeek. Pretty impressive.

zamadatix 4 days ago

The price of training another same-class model always seems to be dropping through the floor, but training models which score much better seems to be hitting a brick wall.

E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.
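
A quick sanity check of that 70% figure, as a minimal sketch using the standard Elo expected-score formula (the only inputs are the two lmarena ratings above; nothing here is specific to lmarena's own methodology):

    def elo_win_prob(rating_a, rating_b):
        # Expected score (win probability) of A against B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    print(elo_win_prob(1488, 1346))  # ~0.69, i.e. roughly a 70% win rate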

The exception seems to be net new benchmarks/benchmark versions. These start out low and then either quickly get saturated or hit a similar wall after a while.

  • gwern 4 days ago

    > E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.

    Why do you care about LM Arena? It has so many problems, and the fact that no one would suggest using GPT-4o for doing math or coding right now, or much of anything, should tell you that a 'win rate of 70%' does not mean whatever it looks like it means. (Does GPT-4o solve roughly as many Erdos questions as gemini-3-pro...? Can it write roughly as good poetry?)

    • zamadatix 4 days ago

      It'd certainly be odd if people were recommending old LLMs which score worse, even if marginally. That said, 4o is really a lot more usable than you're making it out to be.

      The particular benchmark in the example is fungible, but you have to pick something to make a representative example. No matter which one you pick, someone always has a reason: "oh, it's not THAT benchmark you should look at". The benchmarks in the charts from the post exhibit the same pattern described above.

      If someone were making new LLMs that were consistently solving Erdos problems at rapidly increasing rates, then they'd be showing that off rather than showing how the models score the same or slightly better on benchmarks. Instead, the progress looks more like: years after we were first surprised LLMs could write poetry, we can now massage an answer to one Erdos problem out of them. Maybe by the end of the year, a few. The progress has definitely become very linear and relatively flat compared to around the initial 4o release. I'm just hoping that's a temporary thing rather than a sign it'll get even flatter.

      • nl 4 days ago

        Progress has not become linear. We've just hit the limits of what we can measure and explain easily.

        One year ago coding agents could barely do decent auto-complete.

        Now they can write whole applications.

        That's much more difficult to show than an Elo score based on how much people like emojis and bold text in their chat responses.

        Don't forget Llama 4 led LMArena and turned out to be very weak.

      • refulgentis 4 days ago

        Frankly, this reads as a lot of words that amount to an excuse for using only LMArena, and the rationale is quite clear: it’s for an unrelated argument that isn’t going to ring true to people, especially an audience of programmers who just spent the last year watching the AI go from being able to make coherent file edits to doing multi-hour work.

        LMArena is, de facto, a sycophancy and Markdown usage detector.

        Two others you can trust, off the top of my head, are LiveBench.ai and Artificial Analysis. Or even Humanity’s Last Exam results. (Though, frankly, I’m a bit suspicious of them. Can’t put my finger on why. It was just a rather rapid hill climb for a private benchmark over the last year.)

        FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.

    • DoctorOetker 3 days ago

      It's very sad that there is so much gaming of metrics with LLMs.

      If we wish to avoid everyone creating benchmarks for themselves, then instead of predetermined benchmarks (public ones allow gaming, while publicly scored private ones require blind trust) we could use gradient descent on sentences to find disagreements between models, and then present them to human domain experts.

      At least it could be public without the possibility of leaking (since the model creators don't yet know all the possible disagreements between LLMs, nor which ones will be selected for review by human experts).
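
      A minimal sketch of the simpler search variant of this idea (no actual gradient descent on sentences): query two models over a pool of candidate prompts, score how much their answers diverge, and queue the most divergent prompts for human experts. The generate_a/generate_b callables are hypothetical stand-ins, and string similarity is only a crude proxy for real disagreement:

          from difflib import SequenceMatcher

          def disagreement(prompt, generate_a, generate_b):
              # 1.0 = completely different answers, 0.0 = identical answers.
              ans_a, ans_b = generate_a(prompt), generate_b(prompt)
              return 1.0 - SequenceMatcher(None, ans_a, ans_b).ratio()

          def pick_for_expert_review(prompts, generate_a, generate_b, k=20):
              # Rank candidate prompts by how strongly the two models disagree.
              ranked = sorted(prompts, key=lambda p: disagreement(p, generate_a, generate_b), reverse=True)
              return ranked[:k]  # hand these to human domain experts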

  • Zababa 4 days ago

    >E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.

    I think in that specific case that says more about LMArena than about the newer models. Remember that GPT-4o was so loved by people that when GPT-5 replaced it there was a lot of backlash against OpenAI.

    One of the popular benchmarks right now is METR, which shows some real improvement with newer models, like Opus 4.5. Another way of getting data is anecdotes: lots of people are really impressed with Opus 4.5 and Codex 5.2 (but those are hard to disentangle from people getting better with the tools, the scaffolding (Claude Code, Codex) getting better, and lots of other stuff). SWE-bench is still not saturated (less than 75% I think).

  • YetAnotherNick 4 days ago

    > The exception seems to be net new benchmarks/benchmark versions.

    How is this an exception? If a genius and a kindergarten student both take a test on adding two single-digit numbers, how is that result relevant? Even though adding single-digit numbers is in the class of possible tests.

    We can only look at non-saturated tests.

  • lumost 3 days ago

    It’s becoming clear that training a frontier model is a capex/infra problem. This problem involves data acquisition, compute, and salaries for the researchers familiar with the little nuances of training at this scale.

    For the same class model, you can train on more or less the same commodity datasets. Over time these datasets become more efficient to train on as errata are removed and the data is cleaner. The cost of dataset acquisition can be amortized and sometimes drops to 0 as the dataset is open sourced.

    Frontier models mean acquiring fresh datasets at unknown costs.

  • esskay 3 days ago

    Training costs might be coming down, but the cost of hardware that can run these models is still obscenely high and rising. We're still nowhere near a point where it's realistically feasible to run a home LLM that doesn't feel like it's suffering from severe brain damage.

tgrowazay 4 days ago

> 2048 Nvidia B300 GPU

With an average price of $6/hour, that is $12,288/hour for the whole cluster.

Times 33 days times 24 hours, it comes out to about $9.7MM, assuming no discounts.

That leaves $10.3MM over 6 months for salaries, which is 103 employees at $200k/year or 51 employees at $400k/year.
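
The same back-of-the-envelope math as a snippet, in case anyone wants to tweak the assumed $6/GPU-hour rate (all inputs are the figures quoted above):

    gpus, price_per_gpu_hour = 2048, 6.0           # 2048 B300s at an assumed ~$6/hr
    cluster_per_hour = gpus * price_per_gpu_hour   # $12,288/hour for the whole cluster
    compute_cost = cluster_per_hour * 33 * 24      # 33 days of training -> ~$9.73M
    remainder = 20_000_000 - compute_cost          # ~$10.27M of the $20M left over
    print(remainder / (200_000 / 2))               # ~103 people at $200k/yr for 6 months
    print(remainder / (400_000 / 2))               # ~51 people at $400k/yr for 6 months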

  • zamadatix 3 days ago

    It mentions it took 4 models to get there, so would that mean there were additional runs (and other steps/overheads) which were part of the cost in that time, separate from just the salaries?

jychang 4 days ago

They didn't do something stupid like Llama 4's "one active expert", but 4 of 256 is very sparse. It's not going to get close to DeepSeek or GLM-level performance unless they trained on the benchmarks.

I don't think that was a good move. No other models do this.
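
To make "4 of 256" concrete, here is a toy top-k MoE router. The 4-of-256 figure is the one discussed in this thread; the 8-of-256 comparison reflects my understanding of DeepSeek-V3's public config (8 routed experts active out of 256, plus a shared expert), so treat the numbers as illustrative rather than authoritative:

    import numpy as np

    def route(router_logits, k):
        # Top-k gating: indices of the k experts this token is sent to.
        return np.argsort(router_logits)[-k:]

    rng = np.random.default_rng(0)
    logits = rng.normal(size=256)        # one token's router scores over 256 experts
    print(route(logits, 4), 4 / 256)     # this model: ~1.6% of routed experts per token
    print(route(logits, 8), 8 / 256)     # DeepSeek-V3 style: ~3.1% per token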

esafak 3 days ago

I tried it a bit yesterday and it was pretty dumb: it failed to understand the order of jobs in a GitHub Actions workflow, i.e., a DAG. And that concluded my testing.

Der_Einzige 4 days ago

I'll straight up accuse them of muddying the waters on purpose. To get to the point of executing a successful training run like that, you have to count every failed experiment and experiment that gets you to the final training run. By that definition they spent well over $100 million to train this model, and any definition which doesn't include the failed runs leading up to the successful one at the end is at best disingenuous and at worst an outright lie designed to trick investors into dumping Nvidia.

No, DeepSeek did not spend only $5.5 million on DeepSeek V3. No, Gemini was not "entirely trained on TPUs". They did hundreds of experiments on GPUs to get to the final training run done entirely on TPUs. GCP literally has millions of GPUs and you bet your ass that the Gemini team has access to them and uses them daily. DeepSeek's total cost to make DeepSeek V3 is also in the $100-400 million range when you count all of what's needed to get to the final training run.

Edit: (Can't post cus this site's "posting too fast" thing is really stupid/bad)

The only way I can get reliable information out of folks like you is to loudly proclaim something wrong on the internet. I'm just going to even more aggressively do that from now on to goad people like you to set the record straight.

Even if they only used TPUs, they sure as shit spent orders of magnitude more than they claim due to "count the failed runs too"

  • querez 4 days ago

    > No, Gemini was not "entirely trained on TPUs". They did hundreds of experiments on GPUs to get to the final training run done entirely on TPUs. GCP literally has millions of GPUs and you bet your ass that the Gemini team has access to them and uses them daily.

    You are wrong. Gemini was definitely trained entirely on TPUs. Of course your point that "you need to count failed experiments, too" is correct. But you seem to have misconceptions about how DeepMind operates and what infra it possesses. DeepMind (like basically all of Google's internal stuff) runs on Borg, an internal cloud system, which is completely separate (and different) from GCP. DeepMind does not have access to any meaningful GCP resources, and Borg barely has any GPUs. At the time I left DeepMind, the amount of TPU compute available was probably 1000x to 10000x larger than the amount of GPU compute. You would never even think of seriously using GPUs for neural net training; it's too limited (in terms of available compute) and expensive (in terms of internal resource allocation units), and frankly less well supported by internal tooling than TPUs. Even for small, short experiments, you would always use TPUs.

    • YetAnotherNick 4 days ago

      Using TPUs has the same opportunity cost as using GPUs. Just because they built something doesn't mean it's cheaper, and if it were, they could rent TPUs out more cheaply and save the billions of dollars they're paying Nvidia.

      A big segment of the market just uses GPU/TPU to train LLMs, so they don't exactly need flexibility if some tool is well supported.

      • querez 3 days ago

        I assume TPU TCO is significantly lower than GPU TCO. At the same time, I also assume that market demand for GPUs is higher than for TPUs (external tooling is just more suited to GPUs -- e.g. I'm not sure what the PyTorch-on-TPU story is these days, but I'd be astounded if it's on par with their GPU support). So moving all your internal teams to TPUs means that all the GPUs can be allocated to GCP.

        • YetAnotherNick 3 days ago

          That just doesn't make sense. If you make significantly more money renting out TPUs, why not rent them cheaper to shift customers over (and save the billions you are giving to Nvidia)? TPUs right now aren't significantly cheaper for external customers.

          Again, I am talking about LLM training/inference, which I'd guess is more than half of the current workload, and for which the switching cost is close to 0.

    • hansvm 4 days ago

      At least outside the blessed teams, we used GPUs when we were allowed, else CPUs. TPUs were basically banned in YT since they were reserved for higher-priority purposes. Gemini was almost certainly trained on TPUs, but I guarantee an ungodly amount of compute has gone into training neural nets with CPUs and GPUs.

  • Zababa 4 days ago

    >To get to the point of executing a successful training run like that, you have to count every failed experiment and experiment that gets you to the final training run.

    I get the sentiment, but then do you count all the other experiments that were done by that company before it specifically tried to train this model? All the experiments its people did previously at other companies? They rely on all of that experience to train models.

    You could say "count everything that has been done since the last model release", but then, for the same amount of effort/GPUs, if you release 3 models does that divide each model's cost by 3?

    Genuinely curious how you think about this. I think saying "the model cost is the final training run" is fine, since it seems to be the standard ever since DeepSeek V3, but I'd be interested if you have alternatives. Possibly "actually, don't even talk about model cost, as it will always be misleading and you can never really spend the same amount of money to get the same model"?

  • maziyar 3 days ago

    I think it's very flattering to have done something with $20M that is so good people think it must have cost $100M!

iberator 4 days ago

Why even do such a thing if there is free Google, ChatGPT, and a dozen more models? A waste of money toward the ultimate goal: global loss of jobs and destroying the Earth.