Comment by redwood
Imagine if OpenAI weren't locked into Azure's stack.
A lot of people seem to think multi-cloud is an unrealistic dream. But using the best-in-class primitives available in each cloud is not an unreasonable thing to do.
Why would they create a bespoke abstraction layer instead of just relying on k8s?
There is only pain on the path of recreating it: the result will end up almost as complex as k8s, and it will be hell to hire and train for. Best to just use something battle-tested with a large pool of people already trained on it. Even better: their own LLM has gobbled up all the available content about k8s, which helps their engineers. K8s's complexity arose for reasons discovered while growing the stack, and anyone building a bespoke equivalent will run into the same ones. It's also pretty modular, since you can pick and choose the parts you actually need for your cluster.
Wasting manpower to recreate a bespoke Kubernetes doesn't sound great for a company burning billions per quarter; it's just more waste.
OpenAI is absolutely not locked into the Azure stack. https://en.m.wikipedia.org/wiki/Stargate_LLC
All organizations that are reasonably large, and for which cloud costs are a large portion of expenses, have an abstraction layer to switch between providers. Otherwise it's impossible to negotiate better deals: you can't play multiple cloud providers against each other for a better rate.
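To make that concrete, here's a minimal sketch in Python of what such an abstraction layer can look like (all class and provider names here are illustrative, not any real company's internals; the backends are stubbed in memory where a real version would call each cloud's SDK): application code depends only on a provider-neutral interface, so switching providers after a renegotiation is a config change, not a rewrite.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Provider-neutral blob storage interface the app codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class S3Store(ObjectStore):
    """Stub: a real implementation would call the AWS SDK (boto3)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

class AzureBlobStore(ObjectStore):
    """Stub: a real implementation would call azure-storage-blob."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

# One registry entry per cloud; the rest of the codebase never
# mentions a specific provider.
PROVIDERS: dict[str, type[ObjectStore]] = {
    "aws": S3Store,
    "azure": AzureBlobStore,
}

def make_store(provider: str) -> ObjectStore:
    """Pick the backend from config; this is the only switch point."""
    return PROVIDERS[provider]()

# Swapping "azure" for "aws" here is the entire migration, as far as
# the calling code is concerned.
store = make_store("azure")
store.put("checkpoints/ckpt-1", b"weights")
assert store.get("checkpoints/ckpt-1") == b"weights"
```

The leverage in negotiations comes precisely from the switch point being this small: credibly cheap to move means credible threats to move.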
OpenAI is locked into _someone's_ compute resources... the one that is the cheapest. AFAIK OpenAI doesn't have much (any?) of their own hardware. With the mega-clouds buying up all the GPUs and building datacentres, you have to 'partner' with someone. Most likely the one that gives you the biggest discounts. The amount of compute that OpenAI needs dwarfs almost any other consideration.