Comment by neilmovva

Comment by neilmovva 6 hours ago

I was surprised to see 5090's theoretical BF16 TFLOPs at just 209.5. That's not even 10% of the server Blackwell (B200 is 2250, and GB200 is 2500). B200 costs around $30-40k per GPU, so they are pretty close in performance per dollar.

Starting with 4090, NVIDIA limits the performance of tensor cores on gaming cards, specifically for ops that might be used in ML training. FP8 and FP16 matmuls run at full speed if accumulating in FP16 (I've never seen anyone use this), but only half speed when accumulating in FP32. This restriction is not present for lower precision matmuls like FP4, and is removed entirely on the workstation-class cards like RTX Pro 6000.
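
To put rough numbers on the half-rate restriction, here is a minimal sketch. It assumes the 209.5 TFLOPs figure quoted above is the FP32-accumulate rate (the comment does not say which mode it refers to), so the FP16-accumulate rate would be twice that. The constant and function names are illustrative only.

```python
# Assumption: 209.5 TFLOPs is the 5090's BF16/FP16 rate with FP32 accumulation.
FP32_ACC_TFLOPS = 209.5
# Per the comment, FP32 accumulation runs at half the FP16-accumulate speed.
FP32_ACC_FRACTION = 0.5
# B200 figure as quoted in the comment.
B200_TFLOPS = 2250


def implied_full_rate(limited_tflops, fraction=FP32_ACC_FRACTION):
    """Throughput the same matmul would reach without the accumulation limit."""
    return limited_tflops / fraction


full = implied_full_rate(FP32_ACC_TFLOPS)
print(f"implied FP16-accumulate rate: {full:.1f} TFLOPs")
print(f"share of B200 even at full rate: {full / B200_TFLOPS:.1%}")
```

Under that assumption, lifting the limit would still leave the card well short of the server part, so the restriction explains only part of the gap.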

It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models could have been cheaper to run on 3090 than A100). They are generous with memory bandwidth though, nearly 2TB/s on 5090 is amazing!

mota7 4 hours ago

Is there really that big a difference in TFLOPS between the GB100 and GB202 chips? The GB100 has fewer SMs than the GB202, so I'm confused about where the 10x performance would be coming from.

godelski 2 hours ago

You're asking a really good question, but it's not a question with an easy answer.
There's a lot more to performance computing than FLOPs. FLOPs are your good, high-level, easy-to-understand metric, but they're a small part of the story when you're in the weeds.
To help make sense of this, look at CPU frequencies. I think most people on HN know that two CPUs with the same frequency can have dramatically different outcomes on benchmarks, right? You might know how some of these come down to things like IPC (instructions per cycle) or the cache structures. There's even more, but we know it's not so easy to measure, right?
On a GPU all that is true, but there's only more complexity. Your GPU is more similar to a whole motherboard where your PCIe connection is a really, really fast network connection. There are lots of faults in this analogy, but it's closer than just comparing TFLOPs.
Nvidia's moat has always been "CUDA". Quotes because even that is a messier term than most think (CUTLASS, cuBLAS, cuDNN, CuTe, etc). The new cards are just capable of things the older ones aren't. It's a mix of hardware and software.
I know this isn't a great answer, but there is none. You'll probably get some responses and many of them will have parts of the story, but it's hard to paint a real good picture in a comment. There's no answer that is both good and short.

steinvakt2 6 hours ago

Isn't the 5090 FE (roughly 2500 USD in my country) pretty good FLOP value? It has 32 GB of VRAM, and flash attention pushes it even further ahead of Apple/MPS with its relatively cheap "VRAM".

neilmovva 5 hours ago

Not really:
5090: 210 TF / $2k == 105 TF/$k
B200: 2250 TF / $40k == 56 TF/$k
Getting only 2x the FLOPs per dollar probably isn't worth the hassle of having to rack 10x as many GPUs, while having no NVLink.
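
The arithmetic above can be reproduced with a short sketch. It uses only the figures quoted in this comment (210 and 2250 peak TFLOPs, $2k and $40k); the dictionary layout and function name are illustrative, and the prices are the commenter's rough estimates rather than list prices.

```python
# Figures as quoted in the comment: peak BF16 TFLOPs and rough price in $k.
gpus = {
    "5090": {"tflops": 210, "price_k": 2},
    "B200": {"tflops": 2250, "price_k": 40},
}


def tflops_per_k(spec):
    """Peak TFLOPs bought per $1000 spent."""
    return spec["tflops"] / spec["price_k"]


value = {name: tflops_per_k(spec) for name, spec in gpus.items()}
ratio = value["5090"] / value["B200"]
# How many 5090s it takes to match one B200 on paper.
cards_needed = gpus["B200"]["tflops"] / gpus["5090"]["tflops"]

for name, v in value.items():
    print(f"{name}: {v:.1f} TF/$k")
print(f"5090 advantage: {ratio:.2f}x, cards per B200: {cards_needed:.1f}")
```

With these inputs the 5090 comes out a little under 2x ahead per dollar, and it takes roughly 10 to 11 cards to match one B200 on paper, which is where the racking and NVLink concern comes from.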

- steinvakt2 9 minutes ago
  
  Sure, but when spending 20x more, getting almost twice the compute per buck seems expected
  
- lossolo 4 hours ago
  
  That's one of the reasons they removed NVLink from consumer cards (which supported it before). There’s also the issue of power consumption (1x B200 vs 10x 5090).
  
gautamcgoel 5 hours ago

Do you have a source for that B200 price?

laidoffamazon 4 hours ago

Isn't the new trend to train in lower precision anyway?

neilmovva 3 hours ago

Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes this works, especially if we're talking about MXFP8 as supported on Blackwell [0].
What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway is running at full throughput on 5090.
[0]: https://arxiv.org/abs/2506.08027
[1]: https://arxiv.org/abs/2502.20586
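
A toy illustration of why the accumulator precision matters, using only the Python standard library. This is a sketch under simplifying assumptions, not a model of how tensor cores accumulate: it rounds the running sum to IEEE half or single precision after every add via `struct`, and the helper names `round_to` and `dot` are made up for this example.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to the nearest value representable in `fmt`
    ('e' = IEEE half precision, 'f' = IEEE single precision)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dot(a, b, acc_fmt):
    """Dot product whose running sum is rounded to `acc_fmt` after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to(acc_fmt, acc + x * y)
    return acc


n = 4096
a = [1.0] * n
b = [1.0] * n

fp16_acc = dot(a, b, "e")  # accumulator kept in half precision
fp32_acc = dot(a, b, "f")  # accumulator kept in single precision
print(fp16_acc, fp32_acc)
```

The half-precision accumulator stops growing once the running sum is large enough that adding 1.0 rounds back to the same value, while the single-precision accumulator reaches the exact total. Real training workloads have smaller and more varied terms, but the same loss of small contributions is why FP32 accumulation is usually kept.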

- laidoffamazon 2 hours ago
  
  Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was performing low-precision pretraining rather than just finetuning further. I didn't realize you still need accumulation at a higher precision anyway.
  
storus 3 hours ago

Only GPU-poors run Q-GaLore and similar tricks.

- Twirrim 13 minutes ago
  
  Even the large cloud AI services are focusing on this, because it drives down the average "cost per query", or whatever you want to call it. For inference, arguably even more than for training, the smaller and more efficient they can get it, the better their bottom line.
  