Comment by Der_Einzige 4 days ago

I'll straight up accuse them of muddying the waters on purpose. To get to the point of executing a successful training run like that, you have to count every failed experiment and every experiment that got you to the final training run. By that definition they spent well over $100 million to train this model, and any definition that doesn't include the failed runs leading up to the successful one is at best disingenuous and at worst an outright lie designed to trick investors into dumping Nvidia.

No, DeepSeek did not spend only $5.5 million on DeepSeek V3. No, Gemini was not "entirely trained on TPUs": they did hundreds of experiments on GPUs to get to the final training run, which was done entirely on TPUs. GCP literally has millions of GPUs and you bet your ass that the Gemini team has access to them and uses them daily. DeepSeek's total cost to make DeepSeek V3 is also in the $100-400 million range once you count everything needed to get to the final training run.

Edit: (can't post a reply because this site's "posting too fast" limit is really stupid/bad)

The only way I can get reliable information out of folks like you is to loudly proclaim something wrong on the internet. I'm just going to do that even more aggressively from now on to goad people like you into setting the record straight.

Even if they only used TPUs, they sure as shit spent orders of magnitude more than they claim once you count the failed runs too.

querez 4 days ago

> No, Gemini was not "entirely trained on TPUs": they did hundreds of experiments on GPUs to get to the final training run, which was done entirely on TPUs. GCP literally has millions of GPUs and you bet your ass that the Gemini team has access to them and uses them daily.

You are wrong. Gemini was definitely trained entirely on TPUs. Your point that you need to count failed experiments too is of course correct. But you seem to have misconceptions about how DeepMind operates and what infra it possesses. DeepMind (like pretty much all Google-internal stuff) runs on Borg, an internal cloud system which is completely separate (and different) from GCP. DeepMind does not have access to any meaningful GCP resources, and Borg barely has any GPUs. At the time I left DeepMind, the amount of TPU compute available was probably 1000x to 10000x larger than the amount of GPU compute. You would never even think of seriously using GPUs for neural net training: they're too limited (in terms of available compute), too expensive (in terms of internal resource allocation units), and frankly less well supported by internal tooling than TPUs. Even for small, short experiments, you would always use TPUs.

  • YetAnotherNick 4 days ago

    Using a TPU has the same opportunity cost as a GPU. Just because they built something doesn't mean it's cheaper for them. If it were, they could rent TPUs out more cheaply and save the billions of dollars they're paying Nvidia.

    A big segment of the market just uses GPUs/TPUs to train LLMs, so they don't exactly need flexibility as long as the tooling is well supported.

    • querez 3 days ago

      I assume TPU TCO is significantly lower than GPU TCO. At the same time, I also assume that market demand for GPUs is higher than for TPUs (external tooling is just more suited to GPUs -- e.g. I'm not sure what the PyTorch-on-TPU story is these days, but I'd be astounded if it's on par with their GPU support). So moving all your internal teams to TPUs means that all the GPUs can be allocated to GCP.
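
      For what it's worth, here is a minimal sketch of what the PyTorch-on-TPU path looks like externally these days, via the torch_xla package (this is an assumption about current external tooling on my part, not a description of anything Google-internal):

        # Hypothetical minimal torch_xla training step; assumes a TPU VM with the
        # torch_xla package installed. On CUDA you would just .to("cuda"); on TPU
        # you go through the lazy XLA device and force execution explicitly.
        import torch
        import torch_xla.core.xla_model as xm

        device = xm.xla_device()                      # TPU core exposed as an XLA device
        model = torch.nn.Linear(512, 512).to(device)
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)

        x = torch.randn(32, 512, device=device)
        y = torch.randn(32, 512, device=device)

        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        xm.optimizer_step(opt, barrier=True)          # step + flush the pending lazy graph

      That extra ceremony (a separate package, a lazy device, explicit graph barriers) is the kind of gap I mean.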

      • YetAnotherNick 3 days ago

        That just doesn't make sense. If you make significantly more money renting out TPUs, why not rent them out more cheaply to shift customers over (and save the billions you're giving to Nvidia)? Right now TPUs aren't significantly cheaper for external customers.

        Again, I'm talking about LLM training/inference, which, if I had to guess, is more than half of the current workload, and for which the switching cost is close to zero.

  • hansvm 4 days ago

    At least outside the blessed teams, we used GPUs when we were allowed to, and CPUs otherwise. TPUs were basically banned in YT since they were reserved for higher-priority purposes. Gemini was almost certainly trained on TPUs, but I guarantee an ungodly amount of compute has gone into training neural nets with CPUs and GPUs.

Zababa 4 days ago

> To get to the point of executing a successful training run like that, you have to count every failed experiment and every experiment that got you to the final training run.

I get the sentiment, but then, do you count all the other experiments the company ran before it specifically tried to train this model? All the experiments its people ran at previous companies? They rely on that experience to train models.

You could say "count everything that has been done since the last model release", but then, for the same amount of effort/GPUs, if you release 3 models, does that divide each model's cost by 3?

Genuinely curious how you think about this. I think saying "the model cost is the cost of the final training run" is fine, since it seems to have been the standard ever since DeepSeek V3, but I'd be interested if you have alternatives. Possibly "don't even talk about model cost, since it will always be misleading and you can never really spend the same amount of money to get the same model"?
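
As a toy illustration of the accounting question (all numbers hypothetical, just to make the divide-by-3 point concrete):

    # Hypothetical numbers only, to make the accounting question concrete.
    final_run_cost = 20e6        # the headline "final training run" figure
    shared_rnd_compute = 300e6   # failed runs + ablations since the last release
    models_released = 3          # models shipped off that shared R&D

    final_run_only = final_run_cost
    amortized = final_run_cost + shared_rnd_compute / models_released

    print(f"final-run-only: ${final_run_only:,.0f}")  # $20,000,000
    print(f"amortized:      ${amortized:,.0f}")       # $120,000,000

The two conventions just answer different questions.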

maziyar 3 days ago

I think it's very flattering to have done something with $20M that is so good people assume it must have cost $100M!