Comment by rwmj

Comment by rwmj 13 hours ago

10 replies

Slightly off topic, but if I'm aiming to get the fastest 'make -jN' for some random C project (such as the kernel) should I set N = #P [threads] + #E, or just the #P, or something else? Basically, is there a case where using the E cores slows a compile down? Or is power management a factor?

I timed it on the single Intel machine I have access to with E-cores and setting N = #P + #E was in fact the fastest, but I wonder if that's a general rule.

adrian_b 12 hours ago

On my tests on an AMD Zen 3 (a 5900X), I have determined that with SMT disabled, the best performance is with N+1 threads, where N is the number of cores.

On the other hand, with SMT enabled, the best performance has been obtained with 2N threads, where N is the same as above, i.e. with the same number of threads as supported in hardware.

For example, on a 12C/24T 5900X, it is best to use "make -j13" if SMT is disabled, but "make -j24" if SMT is enabled.

For software project compilation, enabling SMT is always a win, which is not always the case for other applications, i.e. for those where the execution time is dominated by optimized loops.

Similarly, I expect that for the older Meteor Lake, Raptor Lake and Alder Lake CPUs the best compilation speed is achieved with 2 x #P + #E threads, even if this should improve the compilation time by only something like 10% over that achieved with #P + #E threads. At least the published compilation benchmarks are consistent with this expectation.

EDIT: I see from your other posting that you have used a notation that has confused me, i.e. that by #P you have meant the number of SMT threads that can be executed on P cores, not the number of P cores.

Therefore what I have written as 2 x #P + #E is the same as what you have written as #P + #E.

So your results are the same with what I have obtained on AMD CPUs. With SMT enabled, the optimal number of threads is the total number of threads supported by the CPU, where the SMT cores support multiple threads.

Only with SMT disabled an extra thread over the number of cores becomes beneficial.

The reason for this difference in behavior is that with SMT disabled, any thread that is stalled by waiting for I/O leaves a CPU core idle, slowing the progress. With SMT enabled, when any thread is stalled, the threads are redistributed to balance the workload and no core remains idle. Only 1 of the P cores runs a thread instead of 2, which reduces its throughput by only a few percent. Avoiding this small loss in throughput by running an extra thread does not provide enough additional performance to compensate the additional overhead caused by an extra thread.

wmf 12 hours ago

Power management is a factor because the cores "steal" power from each other. However the E-cores are more efficient so slowing down P-cores and giving some of their power to the E-cores increases the efficiency and performance of the chip overall. In general you're better off using all the cores.

  • jeffbee 11 hours ago

    I suggest this depends on the exact model you are using. On Alder/Raptor Lake, the E-cores run at 4.5GHz which is completely futile, and in doing so they heat their adjacent P-cores, because 2x E-core clusters can easily draw 135W or more. This can significantly cut into the headroom of the nearby P. Arrow Lake-S has rearranged the E-cores.

saurik 13 hours ago

Did you test at least +1 if not *1.5 or something? I would expect you to occasionally get blocked on disk I/O and would want some spare work sitting hot to switch in.

  • rwmj 13 hours ago

    Let me test that now. Note I only have 1 Intel machine so any results are very specific to this laptop.

      -j           time (mean ± σ)
      12 (#P+#E)   130.889 s ±  4.072 s
      13 (..+1)    135.049 s ±  2.270 s
       4 (#P)      179.845 s ±  1.783 s
       8 (#E)      141.669 s ±  3.441 s
    
    Machine: 13th Gen Intel(R) Core(TM) i7-1365U; 2 x P-cores (4 threads), 8 x E-cores
    • wtallis 11 hours ago

      Your processor has two P cores, and ten cores total, not twelve. The HyperThreading (SMT) does not make the two P cores into four cores. Your experiment with 4 threads will most likely result in using both P cores and two E cores, as no sane OS would double up threads on the P cores before the E cores were full with one thread each.

      • rwmj 11 hours ago

        The hyperthreading should cover up memory latency, since the workload (compiling qemu) might not fit into L3 cache. Although I take your point that it doesn't magically create two core-equivalents.

        • gonzo 8 hours ago

          “Hyperthreading” is a write pipe hack.

          If the core stalls on a write then the other thread gets run.

      • jeffbee 11 hours ago

        I am sure rwmj was smart enough to use `taskset` to make this experiment meaningful.

        • rwmj 11 hours ago

          Hehe, if only :-( However I do want to know what's best with the default Linux scheduler and just using 'make' rather than more complicated commands.