Comment by KaiserPro
Same explanation but with less mysticism:
Inference is (mostly) stateless. So unlike training, where you need memory coherence across something like 100k machines and somehow have to avoid the near-certainty of machine failure, you just need to route mostly small amounts of data to a bunch of big machines.
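To make the contrast concrete, here's a toy sketch of what stateless routing buys you (the replica names and health check are made up for illustration): any request can go to any healthy replica, and a failure just means a retry, not a blown run.

    # Toy sketch: stateless requests can go to any healthy replica.
    # No gradient sync, no shared optimizer state, no checkpoint coordination.
    import random

    REPLICAS = ["inference-node-01", "inference-node-02", "inference-node-03"]

    def is_healthy(replica: str) -> bool:
        return True  # placeholder health check

    def route(request: dict) -> str:
        # If the chosen box dies mid-request, just retry on another one.
        healthy = [r for r in REPLICAS if is_healthy(r)]
        return random.choice(healthy)

    print(route({"prompt": "hello"}))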
I don't know what the specs of their inference machines are, but where I worked the machines research used were all 8-GPU monsters. So long as your model fitted in (combined) VRAM, the job was a good 'un.
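As a purely illustrative back-of-the-envelope for that "fits in combined VRAM" check (the node spec, model size, and quantization below are assumptions, not anyone's real numbers):

    # Rough fit check, illustrative numbers only (not any vendor's real spec).
    gpus_per_node = 8
    vram_per_gpu_gb = 80                # assumed 80 GB class accelerator
    params_billion = 400                # assumed model size
    bytes_per_param = 1                 # assumed 8-bit quantization

    weights_gb = params_billion * bytes_per_param       # ~400 GB of weights
    total_vram_gb = gpus_per_node * vram_per_gpu_gb     # 640 GB per node

    # Leave headroom for KV cache and activations before calling it a fit.
    print(weights_gb, "GB of weights vs", total_vram_gb, "GB combined VRAM")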
To scale, the secret ingredient was industrial amounts of cash. Sure, we had DGXs (fun fact: Nvidia sent literal gold-plated DGX machines), but they weren't dense, and they were very expensive.
Most large companies have robust RPC and orchestration, which means the hard part isn't routing the message, it's making the model fit in the boxes you have. (That's not my area of expertise, though.)
> Inference is (mostly) stateless. ... you just need to route mostly small amounts of data to a bunch of big machines.
I think this might just be the key insight. The big advantage of doing batched inference at huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free, since at any given moment they're shared across a huge number of requests. You "only" pay for the request-specific raw compute and the memory storage and bandwidth for the activations. And the proprietary models are now huge, highly quantized, extreme-MoE models, where the former factor (model size) is huge and the latter (request-specific compute) has been correspondingly minimized; where it hasn't been, you're definitely paying "pro" pricing for it. I think this goes a long way towards explaining how inference at scale can work better than it does locally.
(There are "tricks" you could do locally to try and compete with this setup, such as storing model parameters on disk and accessing them via mmap, at least when doing token gen on CPU. But of course you're paying for that with increased latency, which you may or may not be okay with in that context.)