Comment by mota7

Comment by mota7 9 hours ago

3 replies

Is there really that big a different in TFLOPS between the GB100 and GB202 chips? The GB100 has fewer SMs than the GB202, so I'm confused about where the 10x performance would be coming from?

godelski 7 hours ago

You're asking a really good question but it's not a question with an easy answer.

There's a lot more to performance computing than FLOPs. FLOPs are you good high level easy to understand metric but it's a small part of the story when you're in the weeds.

To help make sense of this, look at CPU frequencies. I think most people on HN know that two CPU with the same frequency can have dramatically different outcomes on benchmarks, right? You might know how some of these come down to things like IPC (instructions per cycle) or the cache structures. There's even more but we know it's not so easy to measure, right?

On a GPU all that is true but there's only more complexity. Your GPU is more similar to a whole motherboard where your PCIe connection is a really really fast network connection. There's lots of faults to this analogy but this closer than just comparing TFLOPs.

Nvidia's moat has always been "CUDA". Quotes because even that is a messier term than most think (Cutlass, CuBLAS, cuDNN, CuTe, etc). The new cards are just capable of things the older ones aren't. Mix between hardware and software.

I know this isn't a great answer but there is none. You'll probably get some responses and many of them will have parts of the story but it's hard to paint a real good picture in a comment. There's no answer that is both good and short.

  • saagarjha 3 hours ago

    No, GPUs are a lot simpler. You can mostly just take the clock rate and scale it directly for the instruction being compared.

saagarjha 3 hours ago

There's a 2x performance hit from the weird restriction on fp32 accumulation, plus the fact that 5090 has "fake" Blackwell (no tcgen05) which limits the size and throughput of matrix multiplication through the tensor cores.