Comment by adrian_b
The M4 has ">2x better performance per watt" than either Intel or AMD only in single-threaded applications or in applications with a small number of active threads. There the advantage of the M4 is that it can reach the same or a higher speed at a lower clock frequency (i.e. the Apple cores have a higher IPC).
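To make the lightly-threaded case concrete, here is a rough sketch using the textbook CMOS dynamic power approximation P ~ C * V^2 * f. Everything numeric in it is an assumption for illustration: the linear voltage/frequency curve, the capacitance factors and the two cores are made up and are not measurements of any Apple, Intel or AMD part.

```python
def dynamic_power(capacitance, voltage, freq_ghz):
    """Textbook CMOS dynamic power approximation, in arbitrary units."""
    return capacitance * voltage ** 2 * freq_ghz


def voltage_for(freq_ghz, v_min=0.6, slope=0.12):
    """Hypothetical linear voltage/frequency curve; real curves are chip-specific."""
    return v_min + slope * freq_ghz


def perf_per_watt(ipc, freq_ghz, capacitance):
    """Performance is modeled as IPC * f; leakage and uncore power are ignored."""
    perf = ipc * freq_ghz
    power = dynamic_power(capacitance, voltage_for(freq_ghz), freq_ghz)
    return perf / power


# Two made-up cores with the same single-thread performance (IPC * f = 6.0).
# The wider core is assumed to switch 30% more capacitance per cycle.
narrow = perf_per_watt(ipc=1.2, freq_ghz=5.0, capacitance=1.0)
wide = perf_per_watt(ipc=1.5, freq_ghz=4.0, capacitance=1.3)
ratio = wide / narrow

print(f"narrow: {narrow:.3f}  wide: {wide:.3f}  ratio: {ratio:.2f}")
```

In this toy model the wider core comes out ahead at equal performance because the lower clock lets it run at a lower voltage, and power depends on the square of the voltage. How large the gain is depends entirely on the assumed curve and capacitance.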
For multithreaded applications, where all available threads are active, Apple's advantage in performance per watt becomes much lower than "2x", and in fact much lower than 1.5x, because it is then determined mostly by the superior CMOS manufacturing process used by Apple, while the influence of the CPU microarchitecture is small.
The big Apple cores have a much better IPC than the competition, i.e. they do more work per clock cycle, so when at most a few cores are active they can run at lower clock frequencies and therefore at lower supply voltages. However, the performance per die area of such big cores is modest. For a complete chip the die area is limited, so the best multithreaded performance is obtained with cores that have the maximum performance per area, because more of them can be crammed into a given die area. The cores with maximum performance per area are cores with intermediate IPC, neither too low nor too high, like ARM Cortex-X4, Intel Skymont or AMD Zen 5 compact. The AMD core has a higher IPC than the other two, which would have led to a lower performance per area, but that is compensated by its wider vector execution units. Bigger cores like ARM Cortex-X925 and Intel Lion Cove have very poor performance per area.
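The area argument can be sketched with a toy calculation. The core names, areas and per-core throughput figures below are hypothetical, chosen only so that the intermediate core has the best performance per area. The model also ignores caches, uncore, power limits and imperfect multithreaded scaling.

```python
def chip_throughput(die_area_mm2, core_area_mm2, perf_per_core):
    """Return (core count, total throughput) for a fixed core-area budget."""
    n_cores = int(die_area_mm2 // core_area_mm2)
    return n_cores, n_cores * perf_per_core


# Hypothetical cores: name -> (area in mm^2, per-core throughput in arbitrary units)
cores = {
    "small, low IPC": (1.0, 1.0),     # perf/area = 1.0
    "medium, mid IPC": (2.0, 2.6),    # perf/area = 1.3
    "big, high IPC": (4.5, 3.6),      # perf/area = 0.8
}

budget_mm2 = 36.0  # assumed area available for cores
results = {
    name: chip_throughput(budget_mm2, area, perf)
    for name, (area, perf) in cores.items()
}
best = max(results, key=lambda name: results[name][1])

for name, (n, total) in results.items():
    print(f"{name}: {n} cores, total throughput {total:.1f}")
print("best multithreaded choice:", best)
```

With these made-up numbers the big core wins on single-thread speed, but the medium core gives the highest total throughput for the same area, which is the trade-off described above.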