Comment by smy20011 2 days ago

43 replies

Did they? Deepseek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

If you had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.5 billion in talent; DeepSeek would invest $1 billion in GPUs and $2 billion in talent.

I would argue that the latter approach (Deepseek's) is more scalable. It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

mike_hearn 2 days ago

We don't actually know how much money DeepSeek spent or how much compute they used. The numbers being thrown around are suspect; the paper they published didn't reveal the costs of all the models, nor the R&D cost it took to develop them.

In any AI R&D operation, the bulk of the compute goes into running experiments, not into the final training run for whatever models they choose to make available.

  • wallaBBB 2 days ago

    One thing I (intuitively) don't doubt: they spent less money developing R1 than OpenAI spent on marketing, lobbying, and management compensation.

    • pertymcpert 2 days ago

      What makes you say that? Do you think Chinese top tier talent is cheap?

      • wallaBBB 2 days ago

        I did not refer to the talent directly contributing to the technical progress.

        P.S. To clarify: I wasn't referring to the talent at OpenAI. And yes, I have very little doubt that the talent at DeepSeek is a lot cheaper than the things I listed above for OpenAI. I'd be interested in a breakdown of OpenAI's costs, to see whether even their technical talent costs more than the things I mentioned.

        • pertymcpert 2 days ago

          Do you think $1.5M a year in compensation is cheap? That's in the range of what OpenAI offers.

      • anonzzzies 2 days ago

        What counts as cheap? Compared to the US, yes. Talent almost everywhere is 'cheap' compared to the US, unless they move to the US.

        • pertymcpert 2 days ago

          How experienced are you with Chinese AI talent compensation?

      • victorbjorklund 2 days ago

        I'm sure the salaries at Deepseek in China were lower than the salaries at OpenAI.

      • amunozo 2 days ago

        Definitely cheaper than American top tier talent

        • pertymcpert 2 days ago

          How much cheaper? I'm curious, because I've seen the offers that Chinese tech companies make, and they're in the millions for top talent.

  • tw1984 2 days ago

    > The numbers being thrown around are suspect; the paper they published didn't reveal the costs of all the models, nor the R&D cost it took to develop them.

    Did any lab release such a figure? It would be interesting to see.

sigmoid10 2 days ago

>It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.

The article explains how, in reality, the opposite is true, especially when you look at it long term. Compute power grows exponentially; humans do not.

  • llm_trw 2 days ago

    If the bitter lesson were true, we'd be getting SOTA results out of two-layer neural networks using tanh as the activation function.

    It's a lazy blog post that should be thrown out after a minute of thought by anyone in the field.

    • sigmoid10 a day ago

      That's not how the economics work. There has been a lot of research showing that deeper nets are more efficient. So if you spend a ton of compute money on a model, you'll want the best output - even though you could just as well build something shallow that may well be state of the art for its depth, but can't hold up against the competition on real tasks.

      • llm_trw a day ago

        Which is my point.

        You need a ton of specialized knowledge to use compute effectively.

        If we had infinite memory and infinite compute, we'd just throw every problem of length n at a tensor of size R^(n^n).

        The issue is that we don't have enough memory in the world to store that tensor for something as trivial as MNIST (and won't until the 2100s). And as you can imagine, the exponentiated exponential grows a bit faster than the exponential, so we never will.
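
        A rough back-of-the-envelope (my own numbers, assuming MNIST's 28x28 = 784-pixel inputs and one byte per entry) shows the scale of the problem:

          import math

          # Entries in a dense tensor of shape R^(n^n) for an MNIST-sized input.
          n = 28 * 28                             # 784 pixels
          log10_entries = n * math.log10(n)       # log10(n^n) = n * log10(n)
          print(f"~10^{round(log10_entries)} entries")  # roughly 10^2269
          # For comparison, all storage ever manufactured is on the order of
          # 10^23 bytes, so no plausible hardware growth closes that gap.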

        • sigmoid10 a day ago

          Then how does this invalidate the bitter lesson? It's like you're saying if aerodynamics were true, we'd have planes flying like insects by now. But that's simply not how it works at large scales - in particular if you want to build something economical.

  • OtherShrezzing 2 days ago

    Humans don't grow exponentially indefinitely. But there are only on the order of 100k AI researchers employed in the big labs right now, while there are around 20 million software engineers globally and around 200k math graduates per year.

    The number of humans who could feasibly work on this problem is pretty high, and the labs could grow an order of magnitude and still only be tapping into the top 1-2% of engineers & mathematicians. They could grow two orders of magnitude before they'd absorbed all of the above-average engineers & mathematicians in the world.

    • sigmoid10 2 days ago

      I'd actually say the market is stretched pretty thin by now. I've been an AI researcher for a decade, and what passes as an AI researcher or engineer these days is borderline worthless. You can get a lot of people who can use scripts and middleware like frontend Lego sets to build things, but I'd say there are fewer than 1k people in the world right now who can actually, meaningfully improve algorithmic design. There are a lot more people out there who do systems design and cloud ops, so it's only when you choose to go for scaling that you'll find a plentiful supply of human brainpower.

      • llm_trw 2 days ago

        Do you know where people who are interested in research congregate? Every forum, meetup, or journal gets overwhelmed by bullshit within a year of being good.

  • smy20011 2 days ago

    Humans do write code that scales with compute.

    Real performance is always raw performance * software efficiency. You can use shitty software and waste all those FLOPs.

  • alecco 2 days ago

    Algorithmic improvements in new fields are often bigger than hardware improvements.

stpedgwdgfhgdd 2 days ago

Large teams are very hard to scale.

There is a reason why startups innovate and large companies follow.

mirekrusin 2 days ago

DeepSeek's innovations are applicable to xAI's setup - the results simply multiply with their compute scale.

DeepSeek didn't have option A or B available; extreme optimisation was the only option they had to work with.

It's weird that people present those two approaches as mutually exclusive.

PeterStuer 2 days ago

It's not an either/or. Your hiring of talent is only limited by your GPU spend if you can't hire because you ran out of money.

In reality pushing the frontier on datacenters will tend to attract the best talent, not turn them away.

And in talent, it is the quality rather than the quantity that counts.

A 10x algorithmic breakthrough will compound with a 10x scale-out in compute, not hinder it.

I am a big fan of Deepseek, Meta and other open model groups. I also admire what the Grok team is doing, especially their astounding execution velocity.

And it seems like Grok 2 is scheduled to be open-sourced as promised.

  • smy20011 2 days ago

    Not that simple. It could cause a resource curse [1] for developers: why optimize the algorithm when you have nearly infinite resources? For DeepSeek, their constraints are one of the reasons they achieved a breakthrough. One of their contributions, FP8 training, was a way to train models on GPUs whose FP32 performance is limited due to export controls (rough sketch of the idea below).

    [1]: https://www.investopedia.com/terms/r/resource-curse.asp#:~:t...
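
    For illustration only, here is a minimal sketch of the per-tensor scaling idea behind FP8 training - not DeepSeek's actual recipe, which uses much finer-grained block scaling and fused kernels - assuming PyTorch >= 2.1 with float8 dtypes:

      import torch

      # Quantize to FP8 (e4m3) with a per-tensor scale, then dequantize.
      # Real FP8 training keeps master weights in higher precision; this only
      # shows where the precision (and memory/bandwidth) savings come from.
      def quantize_fp8(x: torch.Tensor):
          amax = x.abs().max().clamp(min=1e-12)
          scale = 448.0 / amax                    # 448 is the max normal value of e4m3
          return (x * scale).to(torch.float8_e4m3fn), scale

      def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
          return x_fp8.to(torch.float32) / scale

      x = torch.randn(8, 8)
      x_q, s = quantize_fp8(x)
      print((dequantize_fp8(x_q, s) - x).abs().max())  # small quantization error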

  • krainboltgreene 2 days ago

    Have fun hiring any talent after three years of advertising to students that all programming/data jobs are going to be obsolete.

SamPatt 2 days ago

R1 came out while Grok 3's training was still ongoing. They shared their techniques freely, so you would expect the next round of models to incorporate as many of those techniques as possible. The bump from combining them with the extra compute shows up in the next cycle.

If Musk really can get 1 million GPUs and they incorporate some algorithmic improvements, it'll be exciting to see what comes out.

dogma1138 2 days ago

Deepseek didn’t seem to invest in talent as much as it did in smuggling restricted GPUs into China via 3rd countries.

Also, not for nothing, scaling compute 100x or even 1000x is much easier than scaling talent 10x or even 2x, since you don't need workers, you need discovery.

  • tw1984 2 days ago

    Talent is not something you can just freely pick up from your local Walmart.

oskarkk 2 days ago

> While xAI's model isn't a substantial leap beyond Deepseek R1, it utilizes 100 times more compute.

I'm not sure if it's close to 100x more. xAI had 100K Nvidia H100s, while this is what SemiAnalysis writes about DeepSeek:

> We believe they have access to around 50,000 Hopper GPUs, which is not the same as 50,000 H100, as some have claimed. There are different variations of the H100 that Nvidia made in compliance to different regulations (H800, H20), with only the H20 being currently available to Chinese model providers today. Note that H800s have the same computational power as H100s, but lower network bandwidth.

> We believe DeepSeek has access to around 10,000 of these H800s and about 10,000 H100s. Furthermore they have orders for many more H20’s, with Nvidia having produced over 1 million of the China specific GPU in the last 9 months. These GPUs are shared between High-Flyer and DeepSeek and geographically distributed to an extent. They are used for trading, inference, training, and research. For more specific detailed analysis, please refer to our Accelerator Model.

> Our analysis shows that the total server CapEx for DeepSeek is ~$1.6B, with a considerable cost of $944M associated with operating such clusters. Similarly, all AI Labs and Hyperscalers have many more GPUs for various tasks including research and training than they commit to an individual training run due to centralization of resources being a challenge. X.AI is unique as an AI lab with all their GPUs in 1 location.

https://semianalysis.com/2025/01/31/deepseek-debates/

I don't know how much slower these GPUs are, but if they have 50K of them, that doesn't sound like 100x less compute to me. Also, a company that has N GPUs and trains on them for 2 months can achieve the same results as a company that has 2N GPUs and trains for 1 month. So DeepSeek could spend longer training to offset the fact that they have fewer GPUs than competitors.
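
To put that last point in rough numbers, treating total training compute as simply GPUs x time (my own toy figures, ignoring interconnect and efficiency differences):

  # GPU-month accounting: N GPUs for 2 months equals 2N GPUs for 1 month.
  def gpu_months(gpus: int, months: float) -> float:
      return gpus * months

  print(gpu_months(50_000, 2) == gpu_months(100_000, 1))  # True: same total compute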

  • cma 2 days ago

    Having 50K of them isn't the same thing as having 50K in one high-bandwidth cluster, right? x.AI has all theirs so far in one connected cluster, and all homogeneous H100s, right?

wordofx 2 days ago

DeepSeek was a crypto mining operation before they pivoted to AI. They have an insane number of GPUs lying around, so we have no idea how much compute they have compared to xAI.

  • oskarkk 2 days ago

    Do you have any sources for that? When I searched "DeepSeek crypto mining", the first result was your comment; the other results were just about the wide tech market selloff after DeepSeek appeared (which also affected crypto). As far as I know, they had many GPUs because their parent company had been using AI algorithms for trading for many years.

    https://en.wikipedia.org/wiki/High-Flyer

    • wordofx 2 days ago

      You know crypto mining is illegal in China, right? Of course they avoid mentioning it. Discussion boards in China had ex-employees mention doing crypto mining, but it's all been wiped.

  • miki123211 2 days ago

    Crypto GPUs have nothing to do with AI GPUs.

    Crypto mining is an embarrassingly parallel problem, requiring little to no communication between GPUs. To a first approximation, in crypto, 10x-ing the number of "cores" per GPU, 10x-ing the number of GPUs per rig, and 10x-ing the number of rigs you own are basically equivalent. An infinite number of extremely slow GPUs would do just as well as one infinitely fast GPU. This is why consumer GPUs are great for crypto.

    AI is the opposite: you need extremely fast communication between GPUs. This means getting as much memory per GPU as possible (to make communication less necessary) and putting all the GPUs in one datacenter.
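
    As a toy, single-process illustration of that difference (plain Python standing in for real hardware): mining-style work needs no coordination between workers, while data-parallel training has to combine gradients across every GPU at every step.

      import hashlib

      # Embarrassingly parallel: each worker scans its own nonce range and never
      # needs to talk to the others while searching.
      def mine(nonces, prefix="000"):
          return [n for n in nonces
                  if hashlib.sha256(str(n).encode()).hexdigest().startswith(prefix)]

      # Data-parallel training: every step, all workers must average their gradients
      # (an all-reduce), so the links between GPUs sit on the critical path.
      def all_reduce_mean(grads_per_gpu):
          k = len(grads_per_gpu)
          return [sum(vals) / k for vals in zip(*grads_per_gpu)]

      print(len(mine(range(100_000))))                  # a couple dozen "lucky" nonces
      print(all_reduce_mean([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]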

    Consumer GPUs, which were used for crypto, don't support the fast communication technologies needed for AI training, and they don't come in the 80GB memory versions that AI labs need. This is Nvidia's price differentiation strategy.