Comment by dragontamer

E-cores can typically execute ~4 instructions per clock tick in highly optimized code.

P-cores go up to something like 6 instructions per clock tick. Better, yes, but not dramatically better. The real difference is that P-cores have far more resources than E-cores: deeper reorder buffers for more out-of-order execution, bigger branch predictors, larger register files, larger caches, etc.

So P-cores should get close to that max of 6 instructions per clock tick on more kinds of code. E-cores have much smaller caches (and other resources), so they'll run out sooner and start stalling on memory, which means something like 0.1 instructions per clock tick or slower.

----------

But guess what? If a fancy P-core is memory-bound (like a lot of Blender code, since modern 3D scenes can run to dozens of GBs), then that fancy P-core runs out of resources too and drops to 0.1 IPC (instructions per clock) as well.

If both P-cores and E-cores are stalled out waiting on memory, you'd rather have 32x E-cores all executing at 0.1 IPC than only 8x P-cores executing at 0.1 IPC.
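
To put rough numbers on that, here's a back-of-the-envelope sketch. The `aggregate_ipc` helper is just made up for illustration, the 0.1 IPC figure is the ballpark from above rather than a measurement, and this ignores clock-speed differences and the fact that all those cores share the same memory bandwidth:

```python
def aggregate_ipc(cores, ipc_per_core):
    """Total instructions retired per clock tick, summed across all cores."""
    return cores * ipc_per_core

STALLED_IPC = 0.1  # ballpark for a core that is waiting on memory

e_total = aggregate_ipc(32, STALLED_IPC)  # 32 stalled E-cores
p_total = aggregate_ipc(8, STALLED_IPC)   # 8 stalled P-cores

print(f"32x E-cores: {e_total:.1f} instructions per tick")
print(f" 8x P-cores: {p_total:.1f} instructions per tick")
print(f"E-core advantage: {e_total / p_total:.1f}x")
```

When every core is stalled at the same IPC, total throughput just scales with core count, so under these assumptions the 32 E-cores come out about 4x ahead.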

It's going to be a complex world moving forward: modern CPUs are growing far more complex and it's not clear what the tradeoffs will be. But this reality of E-cores and P-cores stalling out and waiting on memory is just how modern code works in too many cases.

And remember, 4x E-cores are roughly equivalent in area/cost to 1x P-core. So there's no contest in overall instructions per second between E-cores and P-cores. The E-cores are simply better for throughput, even if the individual threads run slower.
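
Here's the same argument as quick arithmetic, normalized to the silicon area of one P-core. All the constants are the rough figures from this comment (not measured data), the function name is hypothetical, and it assumes both core types run at the same clock and that the code scales perfectly across threads, which real workloads often don't:

```python
# Rough figures from the comment, not benchmarks.
E_PEAK_IPC = 4.0        # E-core, highly optimized code
P_PEAK_IPC = 6.0        # P-core, highly optimized code
STALLED_IPC = 0.1       # either core type when memory-bound
E_CORES_PER_P_AREA = 4  # E-cores that fit in one P-core's area

def ipc_per_p_core_area(ipc_per_core, cores_in_area):
    """Instructions per clock tick delivered by one P-core's worth of silicon."""
    return ipc_per_core * cores_in_area

best_case_e = ipc_per_p_core_area(E_PEAK_IPC, E_CORES_PER_P_AREA)
best_case_p = ipc_per_p_core_area(P_PEAK_IPC, 1)
stalled_e = ipc_per_p_core_area(STALLED_IPC, E_CORES_PER_P_AREA)
stalled_p = ipc_per_p_core_area(STALLED_IPC, 1)

print(f"Peak:    4x E-cores = {best_case_e:.1f} IPC, 1x P-core = {best_case_p:.1f} IPC")
print(f"Stalled: 4x E-cores = {stalled_e:.1f} IPC, 1x P-core = {stalled_p:.1f} IPC")
```

Under those assumptions the E-cores win per unit area in both the best case and the memory-stalled case. The P-core only wins when you care about how fast a single thread runs.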