Comment by dragontamer 14 hours ago
The triple decoder is one unique feature. The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO. Very well done.
Another unique feature is the L2 cache shared between 4 cores. This means thread communication across those 4 cores has much lower latency.
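You can see this yourself with a minimal sketch (my own, not anything from Intel): two threads bounce a cache line back and forth, pinned to chosen cores. The core IDs 0 and 1 below are placeholders; which logical CPUs actually share an L2 cluster is machine-specific, so check /sys/devices/system/cpu/cpu*/cache/ first. A pair inside one E-core cluster should show noticeably lower round-trip latency than a pair that crosses clusters.

```c
// Minimal cross-core latency sketch (Linux, gcc -O2 -pthread).
// Two threads ping-pong a shared atomic; compare core pairs that do
// and don't share an L2 cluster. Core IDs below are placeholders.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Atomic int flag = 0;  // the cache line the two threads bounce

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin_to_cpu(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;  // spin until ping stores 1
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    // Placeholder IDs: try two cores in the same cluster, then two
    // in different clusters, and compare the printed latency.
    int ping_cpu = 0, pong_cpu = 1;
    pin_to_cpu(ping_cpu);

    pthread_t t;
    pthread_create(&t, NULL, pong, &pong_cpu);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;  // spin until pong stores 0
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (double)(b.tv_nsec - a.tv_nsec);
    printf("avg round trip: %.1f ns\n", ns / ITERS);
    return 0;
}
```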
I've had lots of debates with people online about this design vs hyperthreading. The overall finding from Intel seems to be that highly threaded tasks use fewer resources per thread (cache, reorder-buffer entries, etc.).
Big cores (Intel P cores or AMD Zen 5) can obviously split into 2 hyperthreads, but what if that division is still too big? Four E cores give you 4 threads in roughly the same die area as one P core.
That works because the L2 cache is shared/consolidated, and the other resources (reorder buffers, register files, etc.) are all much smaller on the E core.
It's an interesting design. I'd still think that growing the cores to 4-way SMT (like Xeon Phi) or 8-way SMT (like POWER10) would be the more conventional way to split up resources, though. But obviously I don't work at Intel and can't make these kinds of decisions.
> The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO
Not just small loops. It can reach 9x instruction decode on almost any control flow pattern. It just looks at the next 3 branch targets and starts decoding at each of them. As long as there is a branch every 32ish instructions (presumably a taken branch?), Skymont can keep all three uop queues full and Rename/dispatch can consume uops at a sustained rate of 8 uops per cycle.
And in typical code, blocks with more than 32 instructions between branches are somewhat rare.
But Skymont has a brilliant trick for dealing with long runs of branchless code too: it simply inserts dummy branches into the branch predictor, breaking those runs into shorter blocks that fit into the 32-entry uop queues. The 3 decoders will start decoding the long block at three different positions, leap-frogging over each other until the entire block is decoded and stuffed into the queues.
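To make the leap-frogging concrete, here's a toy simulation I sketched up. It is not Intel's actual implementation, just the numbers from this discussion (3 clusters of 3-wide decode, 32-entry uop queues, 8-wide rename) plus a simplifying assumption of 1 uop per instruction. Long branchless runs are pre-split into 32-instruction chunks to stand in for the dummy-branch trick, and the three clusters decode successive chunks in parallel while rename drains them oldest-first:

```c
// Toy model of clustered decode (not Intel's real design; parameters
// taken from the discussion above). Assumes 1 instruction = 1 uop.
#include <stdio.h>

#define CLUSTERS  3   // decode clusters
#define DECODE_W  3   // instructions each cluster decodes per cycle
#define QUEUE_CAP 32  // uop queue entries behind each cluster
#define RENAME_W  8   // uops renamed/dispatched per cycle
#define MAXBLK    64

int main(void) {
    // Distances to the next predicted-taken branch: an arbitrary stream,
    // including one 100-instruction branchless run.
    int raw[] = {12, 7, 100, 20, 5, 33, 9};
    int nraw = sizeof raw / sizeof raw[0];

    // "Dummy branch" trick: pre-split long runs so every block fits a queue.
    int todo[MAXBLK], queue[MAXBLK] = {0}, nblk = 0;
    for (int i = 0; i < nraw; i++)
        for (int len = raw[i]; len > 0; len -= QUEUE_CAP)
            todo[nblk++] = len < QUEUE_CAP ? len : QUEUE_CAP;

    int head = 0, renamed = 0, cycles = 0;
    while (head < nblk) {
        // Each cluster decodes one of the next CLUSTERS blocks in parallel
        // (the leap-frogging: younger blocks decode while older ones drain).
        for (int c = 0; c < CLUSTERS; c++) {
            int b = head + c;
            if (b >= nblk || todo[b] == 0) continue;
            int n = todo[b] < DECODE_W ? todo[b] : DECODE_W;
            if (queue[b] + n > QUEUE_CAP) continue;  // queue full: stall
            todo[b]  -= n;
            queue[b] += n;
        }
        // Rename consumes up to RENAME_W uops per cycle, oldest block first.
        int budget = RENAME_W;
        while (budget > 0 && head < nblk) {
            if (queue[head] == 0) {
                if (todo[head] == 0) { head++; continue; }  // block finished
                break;  // oldest block still decoding
            }
            queue[head]--; budget--; renamed++;
        }
        cycles++;
    }
    printf("%d instructions through rename in %d cycles (%.2f per cycle)\n",
           renamed, cycles, (double)renamed / cycles);
    return 0;
}
```

With the 100-instruction run in the stream, you can watch the three clusters chew through its split chunks concurrently, and once the queues fill, rename bursts at the full 8 uops per cycle even though each individual decoder is only 3 wide.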
This design is absolutely brilliant. It seems to entirely solve the problem of decoding x86, with far fewer resources than a uop cache. I suspect the approach will scale to almost unlimited numbers of decoders running in parallel, shifting the bottlenecks to other parts of the design (branch prediction and everything past decode).