cmiles8 16 hours ago

AWS keeps making grand statements about Trainium, but not a single customer comes on stage to say how amazing it is. Everyone I talked to who tried it said there were too many headaches and they moved on. AWS pushes it hard, but “more price performant” isn’t a benefit if it’s a major PITA to deploy and run relative to other options. Chips without a quality developer experience aren’t gonna work.

Seems AWS is using this heavily internally, which makes sense, but I’m not seeing it get traction outside of that. Glad to see Amazon investing there, though.

phamilton 15 hours ago

The inf1/inf2 spot instances are so unpopular that they cost less than the equivalent CPU instances: exact same (or better) hardware, but 10-20% cheaper.

We're not quite seeing that on the trn1 instances yet, so someone is using them.

  • kcb 13 hours ago

    Heh, I was looking at an EKS cluster recently that was using the Cast AI autoscaler. I was scratching my head because there were a bunch of inf instances. Then I realized it must be the cheap spot pricing.
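
A quick way to sanity-check the spot-pricing claim above is to pull recent quotes from the EC2 API. Below is a minimal boto3 sketch; the region and the inf2.xlarge/m6i.xlarge pairing are illustrative choices of mine, not an authoritative "equivalent" match.

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # illustrative region

resp = ec2.describe_spot_price_history(
    # Hypothetical comparison pair: an Inferentia2 instance vs. a
    # general-purpose CPU instance of similar size.
    InstanceTypes=["inf2.xlarge", "m6i.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
)

# Keep only the most recent quote per (instance type, availability zone).
latest = {}
for entry in resp["SpotPriceHistory"]:
    key = (entry["InstanceType"], entry["AvailabilityZone"])
    if key not in latest or entry["Timestamp"] > latest[key]["Timestamp"]:
        latest[key] = entry

for (itype, az), entry in sorted(latest.items()):
    print(f"{itype:12s} {az:12s} ${entry['SpotPrice']}/hr")
```

The interesting signal is simply whether the inf2 rows come in below the CPU rows, as phamilton describes.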

giancarlostoro 16 hours ago

Not just AWS; looks like Anthropic uses it heavily as well. I assume they get plenty of handholding from Amazon, though. I'm surprised any cloud provider doesn't invest drastically more in their SDK and tooling; nobody will use your cloud if they literally can't.

  • cmiles8 16 hours ago

    Well, AWS says Anthropic uses it, but Anthropic isn’t exactly jumping up and down telling everyone how awesome it is, which tells you everything you need to know.

    If Anthropic walked out on stage today and said how amazing it was and how they’re using it, the announcement would carry a lot more weight. Instead… crickets from Anthropic in the keynote.

    • cobolcomesback 14 hours ago

      AWS has built 20 data centers in Indiana full of half a million Trainium chips explicitly for Anthropic. Anthropic is using them heavily. The press announcement Anthropic made about Google TPUs is essentially the same one they made a year ago about Trainium. Hell, even in the Google TPU press release they explicitly mention that they are still using Trainium as well.

      • VirusNewbie 14 hours ago

        Can you link to the press releases? The only one I'm aware of from Anthropic says they will use Trainium for future LLMs, not that they are using them.

    • hustwindmaple 10 hours ago

      I met an AWS engineer a couple of weeks ago, and he said Trainium is actually being used for Anthropic model inference, not for training. Inferentia is basically defective Trainium chips that nobody wants to use.

    • teruakohatu 16 hours ago

      > Anthropic isn’t exactly jumping up and down telling everyone how awesome it is, which tells you everything you need to know.

      You can’t really read into that. They are unlikely to let their competitors know if they have a slight performance/$ edge by going with AWS tech.

      • cmiles8 15 hours ago

        With GCP announcing they built Gemini 3 on TPUs, the opposite is true. Anthropic is under pressure to show they don’t need expensive GPUs. They’d be catching up at this point, not leaking some secret sauce. There’s no reason for them not to boast on stage today unless there’s nothing to boast about.

  • IshKebab 14 hours ago

    > I'm surprised any cloud provider doesn't invest drastically more in their SDK and tooling

    I used to work for an AI startup. This is where Nvidia's moat is: the tens of thousands of man-hours that have gone into making the entire AI ecosystem work well with Nvidia hardware and not much else.

    It's not that they haven't thought of this; it's just that they don't want to hire another 1k engineers to do it.

  • logicchains 13 hours ago

    > I'm surprised any cloud provider doesn't invest drastically more in their SDK and tooling; nobody will use your cloud if they literally can't.

    Building an efficient compiler from high-level ML code to a TPU is actually quite a difficult software engineering feat, and it's not clear that Amazon has the kind of engineering talent needed to build something like that. Not like Google, which has developed multiple compilers and language runtimes.
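
For context on what that compiler surface looks like today: AWS's Neuron SDK puts its ahead-of-time compiler (neuronx-cc) behind a PyTorch tracing call. The sketch below follows the documented torch_neuronx flow, assuming an inf2/trn1 host with the Neuron SDK installed; the toy model is illustrative, not a real workload.

```python
import torch
import torch_neuronx  # AWS Neuron SDK; requires an inf2/trn1 host

# Illustrative toy model standing in for a real workload.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example = torch.rand(1, 128)

# trace() hands the model to the Neuron compiler and returns a module whose
# forward pass runs on NeuronCores. Input shapes are fixed at compile time,
# one of the developer-experience sharp edges this thread is complaining about.
neuron_model = torch_neuronx.trace(model, example)

# The compiled module is a TorchScript artifact that can be saved and
# reloaded for serving.
torch.jit.save(neuron_model, "model_neuron.pt")
print(neuron_model(example).shape)  # torch.Size([1, 10])
```

Whether that trace-and-recompile loop amounts to a quality developer experience is, of course, exactly what this thread is arguing about.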
