Comment by phire

Comment by phire 6 months ago

> The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO

Not just small loops. It can reach 9x instruction decode on almost any control flow pattern. It just looks at the next 3 branch targets and starts decoding at each of them. As long as there is a branch every 32ish instructions (presumably a taken branch?), Skymont can keep all three uop queues full and Rename/dispatch can consume uops at a sustained rate of 8 uops per cycle.

And in typical code, blocks with more than 32 instructions between branches are somewhat rare.

But Skymont has a brilliant trick for dealing with long runs of branchless code too: It simply inserts dummy branches into the branch predictor, breaking them into shorter blocks that fit into the 32 entry uop queues. The 3 decoders will start decoding the long block at three different positions, leap-frogging over each-other until the entire block is decoded and stuffed into the queues.

This design is absolutely brilliant. It seems to entirely solve the issue decoding X86, with far less resources than a uop cache. I suspect the approach will scale to almost unlimited numbers of decoders running in parallel, shifting the bottlenecks to other parts of the design (branch prediction and everything post decode)

rayiner 6 months ago

Thanks for the explanation. I was wondering how the heck Intel did to make a 9-way decode x86–a low power core of all things. Seems like an elegant approach.

Reply View 2 replies

dragontamer 6 months ago

The important bit: Intel E-cores now have 3x decoders each with the ability for 3-wide decode. When they work as a team, they can perform 9 decodes per clock tick (which then bottlenecks to 8 renamed uops in the best case scenario, and more than likely ~4 or ~3 more typical uops).

Reply View | 1 reply
- phire 6 months ago
  
  3-4 uops per cycle is more of an average throughput than a typical throughput.
  The average is dragged down by many cycles that don't decoded/rename any uops. Either waiting for bytes to decode (icache miss, etc) or rename is blocked because the ROB is full (probably stalled on a dcache miss).
  So you want a quite wide frontend so that whenever you are unblocked, you can drag the average up again.
  
  Reply View | 0 replies