Comment by smilekzs
Single-chip specs according to:
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/abou...
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/...
Eight NeuronCore-v4 cores that collectively deliver:
2,517 MXFP8/MXFP4 TFLOPS
671 BF16/FP16/TF32 TFLOPS
2,517 FP16/BF16/TF32 sparse TFLOPS
183 FP32 TFLOPS
HBM: 144 GiB HBM @ 4.9 TB/sec (4 stacks)
SRAM: 32 MiB * 8 = 256 MiB (ignoring 2 MiB * 8 = 16 MiB of PSUM which is not really general-purpose nor DMA-able)
Interconnect: 2560 GB/s (I think bidirectional, i.e. Jensen Math™)
----
For a 3nm process node, the FLOP/s is _way_ lower than the competition. Compare to B200, which does 2250 BF16 TFLOPS, x2 for FP8, x4 for FP4. TPU7x does 2307 BF16 TFLOPS, x2 for FP8 (no native FP4). HBM capacity also lags behind (144 GiB in 4 stacks vs ~192 GiB in 6 stacks for both TPU7x and B200).
The main redeeming qualities seem to be: software-managed SRAM size (double that of TPU7x; GPUs have L2 so not directly comparable) and on-paper raw interconnect BW (double that of TPU7x and more than B200).
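
To put rough numbers on the comparison, here is a minimal Python sketch that only uses the figures quoted in this comment. The key `neuron_v4_chip` is a placeholder name for the chip described above, the FP8 figures for B200 and TPU7x are derived from the "x2" multipliers rather than taken from a spec sheet, and the ~192 GiB HBM figure is approximate, so treat the output as back-of-envelope.

```python
# Per-chip figures as quoted in the comment (dense TFLOPS, HBM capacity in GiB).
# "neuron_v4_chip" is a placeholder label, not an official product name.
chips = {
    "neuron_v4_chip": {"bf16_tflops": 671, "fp8_tflops": 2517, "hbm_gib": 144},
    # FP8 derived as 2x the quoted BF16 number; HBM is the approximate ~192 GiB.
    "B200": {"bf16_tflops": 2250, "fp8_tflops": 2 * 2250, "hbm_gib": 192},
    "TPU7x": {"bf16_tflops": 2307, "fp8_tflops": 2 * 2307, "hbm_gib": 192},
}


def relative(metric: str, baseline: str, ours: str = "neuron_v4_chip") -> float:
    """Return ours / baseline for the given metric."""
    return chips[ours][metric] / chips[baseline][metric]


if __name__ == "__main__":
    for baseline in ("B200", "TPU7x"):
        for metric in ("bf16_tflops", "fp8_tflops", "hbm_gib"):
            print(f"{metric:12s} vs {baseline:6s}: {relative(metric, baseline):.2f}x")
```

On these numbers the gap is widest for BF16 (roughly 0.3x of either competitor) and narrower for FP8 (a bit over half), with HBM capacity at 0.75x.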