Comment by dragontamer

Comment by dragontamer a year ago

Not... quite. I think you've got the cause-and-effect backwards.

Programmers who happen to write multithreaded programs don't need powerful cores; they want more cores. A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Programmers who happen to write powerful single-threaded programs need powerful cores. For example, AMD's "X3D" line of CPUs famously has 96MB of L3 cache, and video games that run on these very powerful cores get much better performance.

It's not "Programmers should change their code to fit the machine". From Intel's perspective, CPU designers should change their core designs to match the different kinds of programmers. Single-threaded (or low-thread) programmers... largely represented by the Video Game programmers... want P-cores. But not very many of them.

Multithreaded programmers... represented by Graphics and a few others... want E-cores. Splitting a P-core into "only" 2 threads is not sufficient; they want 4x or even 8x more cores. Because there are multiple communities of programmers out there, dedicating design teams to creating entirely different cores is a worthwhile endeavor.

--------

> Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async

Power-efficiency is going to be incredibly difficult moving forward.

It should be noted that E-cores are not very power-efficient though. They're area-efficient, i.e. cheaper for Intel to make. Intel can sell 4x as many E-cores for roughly the same price/area as 1x P-core.

E-cores are cost-efficient cores. I think they happen to use slightly less power, but I'm not convinced that power-efficiency is their particular design goal.

If your code benefits from cache (ie: big cores), it's probable that the lowest power cost would come from running on large caches (like P-cores or Zen5 or Zen5 X3D). Communicating with RAM always costs more power than just communicating with caches, after all.

If your code does NOT benefit from cache (ie: Blender regularly has 100GB+ scenes for complex movies), then all of those spare resources on P-cores are useless, as nothing fits anyway and the core will be spending almost all of its time waiting on RAM to do anything. So the E-core will be more power efficient in this case.

delusional a year ago

> A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Is this true? In most of my work I'd usually rather have a single serializable thread of execution. Any parallelism usually comes with added overhead of synchronization, and added mental overhead of having to think about parallel execution. If I could freely pick between 4 IPC worth of single core or 1 IPC per core with 4 cores I'd pretty much always pick a single core. The catch is that we're usually not trading like for like. Meaning I can get 3 IPC worth of single core or 4 IPC spread over 4 cores. Now I suddenly have to weigh the overhead and analyze my options.

Would you ever rather have multiple cores or an equivalent single core? Intuitively it feels like there's some mathematics here.
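
One way to put numbers on this is Amdahl's law. The sketch below is only a rough model: it assumes a fraction `p` of the work parallelizes perfectly, ignores synchronization overhead entirely, and treats "IPC" as a simple speed multiplier. The function names are made up for illustration.

```python
def effective_speed(per_core_speed: float, cores: int, parallel_fraction: float) -> float:
    """Amdahl's law: overall speed relative to a single core of speed 1.0."""
    serial_fraction = 1.0 - parallel_fraction
    # Serial part runs on one core; parallel part is split evenly across all cores.
    time = serial_fraction / per_core_speed + parallel_fraction / (per_core_speed * cores)
    return 1.0 / time


def break_even_fraction(big_speed: float, small_speed: float, cores: int) -> float:
    """Parallel fraction at which `cores` small cores match one big core."""
    # Solve small_speed / ((1 - p) + p / cores) == big_speed for p.
    return (1.0 - small_speed / big_speed) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        one_big = effective_speed(3.0, 1, p)
        four_small = effective_speed(1.0, 4, p)
        print(f"p={p}: 1 core @ 3 IPC -> {one_big:.2f}, 4 cores @ 1 IPC -> {four_small:.2f}")
    print(f"break-even p = {break_even_fraction(3.0, 1.0, 4):.3f}")
```

Under these assumptions, four cores at 1 IPC only beat one core at 3 IPC once roughly 89% of the work is parallel (p = 8/9), and any synchronization cost pushes that threshold higher.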

dzaima a year ago

Indeed a single thread is most simple to reason about, but if you have a single task that can already use 2 cores uniformly, going to 8 cores (assuming enough workload) should be a pretty clean 4x speedup (as long as you don't run into memory bandwidth limits, but that'd cap the single-threaded code too).
But the performance difference between E-cores and P-cores is way less than 4x; the OP article shows a 1.6x/1.7x difference in SPEC for Skymont vs Lion Cove, and 1.3x/1.7x for Crestmont vs Redwood Cove; and some searching around for past generations gives numbers around 1.4x.
Increasing core counts being a much more area- and energy-efficient way for hardware to provide more total performance than making the individual cores faster is a pretty fundamental thing.

magicalhippo a year ago

For stuff like path tracing you have to work very hard not to thrash the caches, so you're often just waiting for memory.
That's why such workloads get near-linear scaling when using hyper-threads, unlike workloads like LLMs, which are memory-bandwidth bound.

dragontamer a year ago

E-cores can typically execute ~4-instructions per clock tick in highly optimized code.
P-cores go up to like... 6-instructions. Better yes, but not dramatically better. The real issue is that P-cores have far more resources than E-cores: deeper reorder buffers to perform more out-of-order execution. Deeper branch prediction, more register files, larger caches, etc. etc.
So P-cores should be hitting the max of 6-instructions per clock tick on more kinds of code. E-cores have much smaller caches (and other resources) so they'll run out and start stalling out to memory-limitations, which is like 0.1 instructions per clock tick or slower.
----------
But guess what? If a fancy P-core is memory-bound (like a lot of Blender code, due to the large-scale dozens+ GBs nature of modern 3d scenes), then those fancy P-cores run out of resources and are 0.1 IPC as well.
If both P-cores and E-cores are stalled out waiting on memory, you'd rather have 32x E-Cores all executing at 0.1 IPC, rather than only 8x P-cores executing at 0.1 IPC.
It's going to be a complex world moving forward: modern CPUs are growing far more complex and it's not clear what the tradeoffs will be. But this reality of E-cores and P-cores stalling out and waiting on memory is just how modern code works in too many cases.
And remember, 4x E-cores are equivalent in area/cost to 1x P-core. So there's no contest in terms of overall instructions-per-second for E-cores vs P-cores. The E-cores simply are better, even if the individual threads run slower.
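
The arithmetic behind that claim, as a rough sketch. The IPC figures (about 4 for an E-core, about 6 for a P-core, about 0.1 when stalled on memory) and the 4:1 area ratio are the ballpark numbers from this comment, not measurements. The names are made up for illustration, and it assumes the work spreads evenly across every core.

```python
E_CORES_PER_P_CORE_AREA = 4  # area ratio quoted in the comment (assumed, not measured)


def aggregate_ipc(cores: int, ipc_per_core: float) -> float:
    """Total instructions per clock across all cores, assuming perfect scaling."""
    return cores * ipc_per_core


if __name__ == "__main__":
    # Same silicon area: 4 E-cores vs 1 P-core.
    print("compute-bound:", aggregate_ipc(E_CORES_PER_P_CORE_AREA, 4.0), "vs", aggregate_ipc(1, 6.0))
    print("memory-bound: ", aggregate_ipc(E_CORES_PER_P_CORE_AREA, 0.1), "vs", aggregate_ipc(1, 0.1))
    # The 32 E-core vs 8 P-core case, both stalled on memory.
    print("32 E vs 8 P:  ", aggregate_ipc(32, 0.1), "vs", aggregate_ipc(8, 0.1))
```

In both the compute-bound and the memory-bound case the E-core cluster comes out ahead per unit of area, but only for code that actually keeps all of those cores busy.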

adrian_b a year ago

Obviously it is easier to write any program as a single sequential thread, because you do not need to think about the dependencies between program statements. When you append a statement, you assume that all previous statements have been already executed, so the new statement can access without worries any data it needs.
The problem is that the speed of a single thread is limited and there exists no chance to increase it by significant amounts.
As long as we continue to use silicon, there will be negligible increases in clock frequency. Switching to other semiconductors might bring us double the clock frequency 10 years from now, but there will never again be a decade like that from 1993 to 2003, when clock frequencies increased 50 times.
The slow yearly increase in instructions per clock cycle is obtained by making the hardware do more and more of the work that has not been done by the programmer or the compiler, i.e. by extracting from the sequential program the separate chains of dependent instructions that should have been written as distinct threads, in order to execute them concurrently.
This division of a single instruction sequence into separate threads is extremely inefficient when done at runtime by hardware. Because of this the CPU cores with very high IPC have lower and lower performance per area and per power with the increase of the IPC. Low performance per area and per power means low multithreaded performance.
So the CPU cores with very good single-threaded performance, like Intel Lion Cove or Arm Cortex-X925 have very poor multi-threaded performance and using many of them in a CPU would be futile, because in the same limited area one could put many more small CPU cores, achieving a much higher total performance.
This is why such big CPU cores that are good for single-threaded applications must be paired with smaller CPU cores, like Intel Skymont or Arm Cortex-X4, in order to obtain a good multi-threaded performance.
Writing the program as a single thread is easy and of course it should always be done so when the achieved performance is good enough on the current big superscalar CPU cores.
On the other hand, whenever the performance is not sufficient, there is no way to increase it a lot otherwise than by decomposing the work that must be done into multiple concurrent activities.
The easy case is that of iterations, which frequently provide large amounts of work that can be done concurrently. Moreover, with iterations there are many tools that can create concurrent threads automatically, like OpenMP or NVIDIA CUDA.
Where there are no iterations, one may need to do much more work to identify the dependencies between activities, in order to determine which may be executed concurrently, because they do not have functional dependencies between them.
However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.
Nevertheless even in such cases one can question whether the specification of the program is truly what the end user needs. For instance, when computing a hash over a file, the actual goal is normally not the computation of the hash, but to verify whether the file is the same as another (where the other file may be a past version of the same file, to detect modification, or an apparently distinct file coming from another source, when deduplication is desired). In such cases, it does not really matter which hash function is used, so it may be acceptable to replace the hash algorithm with another that allows concurrent computation, solving the performance problem.
Similar reformulations of the problem that must be solved may help in other cases where initially it appears that it is not possible to decompose the workload into concurrent tasks.
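
The hash reformulation described above can be sketched as follows. This is a minimal illustration, not a standard algorithm: chunks are hashed independently with SHA-256 and the chunk digests are then hashed together. The chunk size, the length prefix, and the name `chunked_digest` are arbitrary choices made for this sketch. It relies on CPython's `hashlib` releasing the GIL while hashing large buffers, so the threads can overlap.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for this sketch


def chunked_digest(data: bytes, workers: int = 4) -> str:
    """Hash fixed-size chunks independently, then hash the list of chunk digests.

    Note: the result is NOT equal to hashlib.sha256(data). It is a different,
    two-level hash, which is fine when the goal is only to compare files.
    """
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the result does not depend on scheduling.
        leaf_digests = list(pool.map(lambda c: hashlib.sha256(c).digest(), chunks))
    root = hashlib.sha256()
    root.update(len(data).to_bytes(8, "big"))  # bind the total length into the result
    for digest in leaf_digests:
        root.update(digest)
    return root.hexdigest()


if __name__ == "__main__":
    payload = bytes(range(256)) * 20000  # about 5 MB of sample data
    print(chunked_digest(payload))
```

Both sides of a comparison have to use the same chunk size, since changing it changes the digest.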

- clavigne a year ago
  
  > However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.
  Even in that case, you would probably benefit from having many cores because the user is probably running other things on the same machine, or the program is running on a runtime with eg garbage collector threads etc. I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!
  
  dragontamer a year ago
  
  > I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!
  But that happens all the time in video game code.
  Video games may have many threads running, but there's usually a single-thread bottleneck. To the point that P-cores and massively huge Zen5 cores are so much better for video games.
  JavaScript (ie: rendering webpages) is single-thread bound, which is probably why the Phone makers have focused so much on making bigger cores as well. Yes, there are plenty of opportunities for parallelism in web browsers and webpages. But most of the work is in the main JavaScript thread called at the root.
  
rasz a year ago

> A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Nah, Blender programmer will prefer one core with AVX-512 instead of 4 without it.

dkjaudyeqooe a year ago

That's just more parallelism, they'll take their parallelism wherever they can get it.
It remains to be seen if the future is more SIMD or more, smaller general processors. Arguably the latter are more flexible but maybe not as efficient as the former.

dragontamer a year ago

I mean, eventually yeah.
I like Zen5 as much as the next guy, but it should be noted that even today's most recent version of Blender is AVX (256-bit) only. That means E-cores remain the optimal core to work with for a lot of Blender stuff.
Hopefully AMD Zen5 AVX512 becomes more popular. Maybe it'll become more popular as Intel rolls out AVX10 (a somewhat compatible instruction set).

- adgjlsfhk1 a year ago
  
  Would Blender benefit from the bits of AVX-512 other than the width? I would think the approximate sqrt instructions might be useful.
  
  dragontamer a year ago
  
  AVX512 is one of the best instruction sets I've seen. No joke.
  There's all kinds of things AVX512 would help out in Blender. But those ways are incompatible with older AVX2 or SSE code. The question is if Blender will be willing to support SSE, AVX, and AVX512 code paths. Each new codepath is more maintenance and more effort.
  AVX512 has more registers: not just 32x 512-bit registers (AVX normally has 16x 256-bit registers), but also the kmask registers (64-bits that take the place of old boolean logic that used to be done on the 256-bit registers). This alone should give far more optimizations for the compiler to automatically find.
  There's also VPCOMPRESSB and VPEXPANDB, Conflict-detection, and other instructions that make new SIMD data-structures far more efficient to implement. But this requires deep understanding that very few programmers have yet.
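
To give a feel for what the compress/expand pair does, here is a scalar emulation in Python of the zero-masking behaviour. It is only a model of the semantics: the real instructions operate on a whole vector register under a kmask in one go, and the function names here are made up for illustration.

```python
def compress(lanes: list[int], mask: list[bool]) -> list[int]:
    """Pack the lanes selected by mask contiguously at the front; zero-fill the rest."""
    kept = [value for value, selected in zip(lanes, mask) if selected]
    return kept + [0] * (len(lanes) - len(kept))


def expand(packed: list[int], mask: list[bool]) -> list[int]:
    """Inverse: place consecutive packed values into the lanes selected by mask."""
    source = iter(packed)
    return [next(source) if selected else 0 for selected in mask]


if __name__ == "__main__":
    lanes = [10, 11, 12, 13, 14, 15, 16, 17]
    mask = [True, False, True, True, False, False, True, False]
    packed = compress(lanes, mask)
    print(packed)
    print(expand(packed, mask))
```

The typical use is branch-free filtering: compute a mask with a vector compare, then compress the survivors into a dense output buffer.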
  
seanmcdirmid a year ago

> A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

Don’t they really want GPU threads for that? You wouldn’t get by with just weaker cores.

dragontamer a year ago

Cloth physics in Blender are stored in RAM (as scenes and models can grow very large, too large for a GPU).
Figuring out which vertices for a physics simulation need to be sent to the GPU would be time, effort, and PCIe traffic _NOT_ spent running the cloth physics.
Furthermore, once all the data is figured out and blocked out in cache, it's... cached. Cloth physics only interacts with a small number of close, nearby objects. Yeah, you _could_ calculate this small portion and send it to the GPU, but the CPU is really good at just traversing trees and storing the most recently used stuff in L1, L2, and L3 caches automatically (without any need of special code).
All in all, I expect something like Cloth physics (which is a calculation Blender currently does on CPU-only), is best done CPU only. Not because GPUs are bad at this algorithm... but instead because PCIe transfers are too slow and cloth physics is just too easily cached / benefited by various CPU features.
It'd be a lot of effort to translate all that code to GPU and you likely won't get large gains (like Raytracing/Cycles/Rendering gets for GPU Compute).

- seanmcdirmid a year ago
  
  NVIDIA's PhysX has its own cloth physics abstractions: https://docs.nvidia.com/gameworks/content/gameworkslibrary/p..., so I'm sure it is a thing we do on GPUs already, if only for games. These are old demos anyways:
  https://www.youtube.com/watch?v=80vKqJSAmIc
  I wonder what the difference is between the cloth physics you are talking about and the one NVIDIA has been doing for I think more than a decade now? Is it scale? It sounds like, at least, there are alternatives that do it on the GPU and there are questions if Blender will do it on the GPU:
  https://blenderartists.org/t/any-plans-to-make-cloth-simulat...
  
  dragontamer a year ago
  
  Cloth / Hair physics in those games were graphics-only physics.
  They could collide with any mesh that was inside of the GPU's memory. But those calculations cannot work on any information stored on CPU RAM. Well... not efficiently anyway.
  ---------
  When the Cloth simulator in Blender runs, it generates all kinds of information the CPU needs for other steps. In effect, Blender's cloth physics serves as an input to animation frames, which is all CPU-side information.
  Again: I know cloth physics executes on GPUs very well in isolation. But I'd be surprised if BLENDER's specific cloth physics would ever be efficient on a GPU. Because as it turns out, the calculations kind of don't matter in the big picture. There are a lot of other things you need to do after those calculations (animations, key frames, and other such interactions). And if all that information is stored randomly in 100GB of CPU RAM, it'd be very hard to untangle that data and get it to a GPU (and back).
  In a Video Game PHYSX setting, you just display the cloth physics to the screen. In Blender, a 3d animation program, you have to do a lot more with all that information and touch many other data-structures.
  PCIe is very slow compared to RAM.
  