Comment by smilekzs

Single-chip specs according to:

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/abou...

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/...

Eight NeuronCore-v4 cores that collectively deliver:

    2,517 MXFP8/MXFP4 TFLOPS
    671 BF16/FP16/TF32 TFLOPS
    2,517 FP16/BF16/TF32 sparse TFLOPS
    183 FP32 TFLOPS

HBM: 144 GiB @ 4.9 TB/s (4 stacks)

SRAM: 32 MiB * 8 = 256 MiB (ignoring 2 MiB * 8 = 16 MiB of PSUM, which is not really general-purpose or DMA-able)

Interconnect: 2560 GB/s (I think bidirectional, i.e. Jensen Math™)
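
A quick sanity check of the totals above, plus one derived figure: peak FLOPs per byte of HBM bandwidth. This is only a sketch using the quoted peak numbers; the variable names are mine, and the FLOPs-per-byte ratio is my own derived metric, not something the linked docs state.

```python
# Figures copied from the spec list above (vendor peak numbers, not measurements).
NUM_CORES = 8
SRAM_MIB_PER_CORE = 32   # software-managed SRAM per core
PSUM_MIB_PER_CORE = 2    # PSUM per core, excluded from the headline SRAM total
BF16_TFLOPS = 671        # dense BF16/FP16/TF32, whole chip
MXFP8_TFLOPS = 2517      # MXFP8/MXFP4, whole chip
HBM_TB_PER_S = 4.9       # HBM bandwidth, whole chip

sram_total_mib = SRAM_MIB_PER_CORE * NUM_CORES
psum_total_mib = PSUM_MIB_PER_CORE * NUM_CORES

# TFLOP/s divided by TB/s gives peak FLOPs per byte moved from HBM.
bf16_flops_per_hbm_byte = BF16_TFLOPS / HBM_TB_PER_S
mxfp8_flops_per_hbm_byte = MXFP8_TFLOPS / HBM_TB_PER_S

print(f"SRAM total: {sram_total_mib} MiB (+ {psum_total_mib} MiB PSUM)")
print(f"BF16 FLOPs per HBM byte:  {bf16_flops_per_hbm_byte:.0f}")
print(f"MXFP8 FLOPs per HBM byte: {mxfp8_flops_per_hbm_byte:.0f}")
```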

----

For a 3nm part, the FLOP/s is _way_ lower than the competition's. Compare B200, which does 2250 BF16 TFLOPS (x2 for FP8, x4 for FP4), and TPU7x, which does 2307 BF16 TFLOPS (x2 for FP8, no native FP4). HBM also lags behind (144 GiB in 4 stacks vs ~192 GiB in 6 stacks for both TPU7x and B200).
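
To put numbers on "way lower", here is a rough sketch that turns the figures quoted in this comment into ratios. The `this_chip` label and the `ratio` helper are just names I made up for illustration; all inputs are on-paper peak numbers, and the ~192 GiB HBM figure is approximate.

```python
# Dense peak TFLOPS and HBM capacity as quoted in this comment.
# FP8/FP4 for B200 and TPU7x are derived from the "x2" / "x4" multipliers.
chips = {
    "this_chip": {"bf16": 671, "fp8": 2517, "fp4": 2517, "hbm_gib": 144},
    "B200": {"bf16": 2250, "fp8": 2250 * 2, "fp4": 2250 * 4, "hbm_gib": 192},
    "TPU7x": {"bf16": 2307, "fp8": 2307 * 2, "fp4": None, "hbm_gib": 192},
}


def ratio(metric, other):
    """Return this chip's figure divided by the other chip's, or None if unsupported."""
    theirs = chips[other][metric]
    if theirs is None:  # e.g. TPU7x has no native FP4
        return None
    return chips["this_chip"][metric] / theirs


for other in ("B200", "TPU7x"):
    for metric in ("bf16", "fp8", "fp4", "hbm_gib"):
        r = ratio(metric, other)
        shown = "n/a" if r is None else f"{r:.2f}x"
        print(f"{metric:8s} vs {other:6s}: {shown}")
```

On these numbers the chip lands at roughly 0.3x of either competitor in BF16, a bit over half in FP8, and 0.75x in HBM capacity.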

The main redeeming qualities seem to be software-managed SRAM size (double that of TPU7x; GPUs have L2, so not directly comparable) and on-paper raw interconnect BW (double that of TPU7x and more than B200).