Comment by ThatMedicIsASpy 8 days ago

All-around winner in what? For $500 you can get a lot more cores.

All-around winning, $500, 8 cores makes no sense.

This thing has a premium gaming price tag because there is nothing close to it other than their own 7800X3D.

sliken 8 days ago

In theory, yes. But in the real world the bottleneck is the same 128-bit wide memory interface that has been standard since way back in the dual-core era.

Fewer cache misses (on popular workloads) cut power and raise performance enough that few workloads benefit from 12-16 cores.

Thus the M3 Max (with a 512-bit wide memory system) has class-leading single-core and multi-core scores.

  • 0xQSL 8 days ago

    I'm not so sure about memory actually being the bottleneck for these 8-core parts. If memory bandwidth were the bottleneck it should show up in benchmarks with higher DRAM clocks. I can't find any good application benchmarks, but computerbase.de tested gaming at DDR5-7800 vs DDR5-6000 and didn't find much of a difference [1]

    The Apple chips are APUs and need a lot of their memory bandwidth for the GPU. Are there any good resources on how much of this bandwidth is actually used in common CPU workloads? Can the CPU even max out half of the 512-bit bus?

    [1] https://www.computerbase.de/artikel/prozessoren/amd-ryzen-7-...

    • sliken 8 days ago

      Well there's much more to memory performance than bandwidth. Generally applications are relatively cache friendly, thus the X3D helps a fair bit, especially with more intensive games (ones that barely hit 60 fps, not the silly game benchmarks that hit 500 fps).

      Generally CPUs have relatively small reorder windows, so a cache miss hurts badly: 80 ns of latency @ 5 GHz is 400 clock cycles, and something north of 1600 instructions that could have been executed. If one in 20 operations is a cache miss, that's a serious impediment to reaching any decent fraction of peak performance. The pain of those cache misses is part of why the X3D does so well; even a few fewer cache misses can increase performance a fair bit.
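      A quick back-of-the-envelope makes the cost concrete (a sketch using the hypothetical figures above: 80 ns miss latency, 5 GHz clock, 4-wide issue, one miss per 20 operations):

```python
# Back-of-the-envelope cost of a last-level-cache miss.
# All figures are illustrative, taken from the comment above.
LATENCY_NS = 80    # main-memory miss latency
CLOCK_GHZ = 5      # core clock
ISSUE_WIDTH = 4    # instructions issued per cycle

stall_cycles = LATENCY_NS * CLOCK_GHZ            # 400 cycles per miss
lost_instructions = stall_cycles * ISSUE_WIDTH   # 1600 issue slots per miss

# With one miss every 20 ops, each op drags 400/20 = 20 stall cycles
# behind it, swamping the ~0.25 cycles it needs on its own.
MISS_RATE = 1 / 20
cycles_per_op = (1 / ISSUE_WIDTH) + MISS_RATE * stall_cycles
print(stall_cycles, lost_instructions, cycles_per_op)
```

      The arithmetic shows why shaving even a small fraction of misses (as the X3D's extra cache does) moves the needle so much.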

      With 8c/16 threads, having only 2 (DDR4) or 4 (DDR5) cache misses pending on a 128-bit wide system means that in any given 80-100 ns window only 2 or 4 cores can resume after a cache miss. DDR5-6000 vs DDR5-7800 doesn't change that much: you still wait the 80-100 ns, you just get the cache line in 8 transfers (16 for DDR5) @ 7800 MT/s instead of 8 (16 for DDR5) @ 6000 MT/s. So the faster DDR5 means more bandwidth (good for GPUs), but not more cache transactions in flight (good for CPUs).
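      To see why the higher data rate barely matters for a single miss, compare the burst time for one cache line against the fixed latency (a sketch, assuming a 32-bit DDR5 subchannel, 16-transfer bursts, and ~80 ns of latency as above):

```python
# Time to burst one 64-byte cache line over a 32-bit DDR5 subchannel
# (16 transfers) at two data rates, versus a fixed ~80 ns miss latency.
LINE_BYTES = 64
SUBCHANNEL_BYTES = 4                        # 32-bit DDR5 subchannel
TRANSFERS = LINE_BYTES // SUBCHANNEL_BYTES  # 16 transfers per line
LATENCY_NS = 80                             # assumed miss latency

def burst_ns(mts):
    """Nanoseconds to move one cache line at `mts` megatransfers/sec."""
    return TRANSFERS / mts * 1e3

total_6000 = LATENCY_NS + burst_ns(6000)  # ~82.7 ns per miss
total_7800 = LATENCY_NS + burst_ns(7800)  # ~82.1 ns per miss
# The ~0.6 ns saved by DDR5-7800 is noise next to the 80 ns wait.
```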

      With better memory systems (like the Apple M3 Max) you could have 32 cache misses per 80-100 ns. I believe about half of those are reserved for the GPU, but even 16 would mean that all of the 9800X3D's 16 threads could resolve a cache miss per 80-100 ns instead of just 2 or 4.
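      Following this model, the throughput you can sustain from misses alone is a Little's-law product of misses-in-flight, line size, and latency (a sketch; the miss counts and the 90 ns midpoint are assumptions taken from the numbers in this thread):

```python
# Little's law applied to cache misses: sustainable demand bandwidth
# equals (misses in flight) * (line size) / (miss latency).
LINE_BYTES = 64
LATENCY_NS = 90  # midpoint of the 80-100 ns window above

def miss_bandwidth_gbs(outstanding, latency_ns=LATENCY_NS):
    """GB/s retired with `outstanding` cache misses in flight."""
    return outstanding * LINE_BYTES / latency_ns  # bytes/ns == GB/s

narrow = miss_bandwidth_gbs(4)   # ~2.8 GB/s with 4 misses pending
wide = miss_bandwidth_gbs(16)    # ~11.4 GB/s with 16 misses pending
# Quadrupling misses-in-flight quadruples miss throughput; raising
# the transfer clock alone would not.
```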

      That's part of why an M4 Max does so well on multithreaded code. The M4 Max does better on Geekbench 6 multithread than not only the 9800X3D (with 16 threads) but also a 9950X (with 16c/32 threads). Pretty impressive for a low-TDP chip that fits in a thin/light laptop with great battery life and competes well against Zen 5 chips with a 170-watt TDP that often use water cooling.

      • Dylan16807 7 days ago

        > only 2 (DDR4) or 4 (DDR5) cache misses pending with a 128 bit wide system

        Isn't that the purpose of banks and bank groups, letting a bunch of independent requests work in parallel on the same channel?

        • sliken 7 days ago

          DIMMs are dumb. Not sure, but maybe Rambus helped improve this. DIMMs are synchronous, and each memory channel can have a single request pending. So upon a cache miss in the last-level cache (usually L3) you send a row and column, wait 60 ns or so, then get a cache line back. Each memory channel can only have a single memory transaction (read or write) in flight. The memory controller (usually sitting between the L3 and RAM) can have numerous cache misses pending, each waiting for the right memory channel to free up.

          There are minor tweaks; I believe you can send a row and column, then on subsequent accesses send only the column. There are also slight differences in memory pages (a DIMM page != kernel page) that decrease latency with locality. But the differences are minor and don't really move the needle on a main-memory latency of 60 ns (not including the L1/L2/L3 latencies, which have to miss before you even get to the memory controller).

          There are of course smarter connections, like AMD's HyperTransport or more recently Infinity Fabric (IF), that are async and can have many memory transactions in flight. But sadly the DIMMs are not connected to HT/IF. IBM's OMI is similar: a fast async serial interface, with an OMI connection to each RAM stick.

    • [removed] 7 days ago
      [deleted]
    • wmf 8 days ago

      For AMD I think Infinity Fabric is the bottleneck so increasing memory clock without increasing IF clock does nothing. And it's also possible that 8 cores with massive cache simply don't need more bandwidth.

      • sliken 8 days ago

        My understanding is the single-CCD chips (like the 9800X3D) have 2 IF links, while the dual-CCD chips (like the 9950X) have 1 each. Keep in mind these CCDs are shared with Turin (12-channel), Threadripper Pro (8-channel), Siena (6-channel), and Threadripper (4-channel).

        The higher-CCD configurations have 1 IF link per chip, the lower have 2 IF links per chip. Presumably AMD wouldn't bother with the 2-IF-link chips unless it helped.

jandrese 8 days ago

The benchmarks in the article suggest that more cores are largely wasted on real world applications.

  • ThatMedicIsASpy 8 days ago

    Yes, so buy according to your needs? 8 cores do not cost $500.

    • behringer 8 days ago

      They do when those cores are 2 to 4 times faster than the rest.