Comment by ktpsns

Comment by ktpsns 2 days ago

5 replies

Typical large scale high performance computing clusters are at a size of 10k nodes (for instance Jupiter and SuperMUC in Germany) [1]. These centers are quite remarkably big buildings. I wonder how much 1M node single k8s clusters there are in the world right now. Most likely at the hyperscalers.

[1] what is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing. Then we talk about order of 100k cores to be scheduled.

stackskipton 2 days ago

I doubt any Hyperscalers are running 1M Node clusters either. They probably just have groups of clusters at each datacenter and some overall scheduler that determines which cluster is best suited for workload during deployment then connects to that cluster and schedules the workload.

  • [removed] 9 hours ago
    [deleted]
  • merb 6 hours ago

    Some hyperscalers even have services for that. Which even makes it possible to have cross cluster ingress. And other things. And it makes it possible to have multiple cluster ingress different regions that somewhat work together.

osigurdson 10 hours ago

>> [1] what is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing

I'm sure they mean actual servers / not just cores. Even in traditional HPC it isn't abstracted to the level of individual cores usually since most HPC jobs care about memory bandwidth - even with Infiniband or other techniques throughput / latency is much worse than on a single machine. Of course, multiple machines are connected (usually using MPI / Infiniband) but important to try to minimize communication between nodes where possible.

For AI workloads, they are running GPUs - so 10K+ cores on a single device so even less likely to be talking about cores here.