Comment by mlyle 6 months ago

> should have highest performance priority so that internally it stays on p cores

Every application will decide that it wants P cores: it isn't punished for its battery or energy impact, and it wants to win over other applications so that users have a better experience with it.

And even if the request isn't made in bad faith, the application doesn't know what else is running on the system.

Also, these decisions tend to be unduly influenced by microbenchmarks and then don't carry over to the real system.

> which threads are sharing a lot of memory

But if they're not super active, should the scheduler really change what it's doing? And doesn't the size of that L2 matter? The sharing doesn't matter if, e.g., the data is going to get evicted before there's any benefit from it.

In the end, if you don't know pretty specific details of the environment you'll run on (what the hardware is like, what the load is like, what the data set size is like, and what else will be running on the machine), it is probably better to leave this decision to the scheduler.

If you do know all those things, and it's worth tuning this stuff in depth, odds are you're in HPC and you already know what the machine is like.

Salgat 6 months ago

To clarify, where things get scheduled is up to the OS or runtime; all you're doing is setting relative priority. If everything has the same priority, then it's all just as likely to run on E cores.
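
For a concrete feel of "relative priority as a hint", here is a minimal sketch using Unix niceness from Python's standard library. Niceness is only a stand-in chosen for illustration; the thread does not name a specific API, and the helper names `current_niceness` and `deprioritize` are hypothetical. The call tells the scheduler how to weigh the process against others and says nothing about which cores it lands on.

```python
import os


def current_niceness() -> int:
    """Return this process's niceness (a lower value means higher priority)."""
    # os.nice(0) adds zero to the niceness, so it only reads the value.
    # Assumption: a Unix-like OS; os.nice is not available on Windows.
    return os.nice(0)


def deprioritize(increment: int = 5) -> int:
    """Ask the scheduler to favor other work; return the new niceness."""
    # An unprivileged process can only raise its niceness (lower its priority),
    # so this is a one-way hint for the lifetime of the process.
    return os.nice(increment)
```

A background job might call `deprioritize()` once at startup and leave core placement entirely to the OS.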


mlyle 6 months ago

And then, what's the point?
A system that encourages everyone to jack everything up is pointless.
A system to tell the OS that the developer anticipates that data is shared and super hot will mostly be lied to (by accident or on purpose).
There are edge cases: database servers, HPC, etc., where you believe that the system has a sole occupant that can predict loading.
But libnuma and the underlying ACPI SRAT/SLIT/HMAT tables are a pretty good fit for these use cases.

- Salgat 6 months ago
  
  If you lie about the nature of your application, you'll only hurt performance in this configuration. You're not telling the OS what cores to run on; you're simply giving hints as to how the program behaves. It's no different from telling the threadpool manager how many threads to create or whether a thread is long-lived. It's a platform-agnostic hint to help performance. And remember, this is all optional, just like the threadpool example that already exists in most major languages. Are you going to argue that programs shouldn't have access to core count information on the CPU too? They'll just shoot themselves in the foot, as you said.
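
The threadpool comparison above can be sketched roughly as follows, assuming Python's standard library. The helpers `pool_size` and `run_all` and the `reserve` parameter are hypothetical names for illustration; the point is only that the program reads the core count and sizes its pool, while the OS still decides where each thread runs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pool_size(reserve: int = 0) -> int:
    """Pick a worker count from the reported core count, never below one."""
    # os.cpu_count() may return None when the count cannot be determined.
    total = os.cpu_count() or 1
    return max(1, total - reserve)


def run_all(tasks, reserve: int = 0):
    """Run zero-argument callables on a pool sized from the core count."""
    with ThreadPoolExecutor(max_workers=pool_size(reserve)) as pool:
        # map() preserves input order in its results.
        return list(pool.map(lambda task: task(), tasks))
```

Note that the reported count says nothing about P versus E cores or about what else is running on the machine.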
  
  
  mlyle 6 months ago
  
  Again, there are already explicit ways for programs to exert fine control; this stuff is already declared in ACPI and libnuma, and higher-level shims exist over it. But generally you want to know both how the entire machine is being used and pretty detailed information about working set sizes before attempting this.
  Most things that have tried to set affinities have ended up screwing it up.
  There's no need to put an easier user interface on the footgun or to make the footgun cross-platform. These interfaces provide opportunities for small wins (generally <5%) and big losses. If you're in a supercomputing center or a hyperscaler running your own app, this is worth it; if you're writing a DBMS that will run on tens of thousands of dedicated machines, it may be worth it. But usually you don't understand the way your software will be deployed well enough to know if this is a win.
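
For reference, this is roughly what the explicit affinity control being warned about looks like, as a minimal sketch assuming Linux and Python's `os.sched_getaffinity` / `os.sched_setaffinity`. The helper names `allowed_cpus` and `pin_to` are hypothetical. The sketch knows nothing about cache topology, working set size, or other tenants, which is exactly the missing information described above.

```python
import os


def allowed_cpus() -> set:
    """Return the set of CPU indices this process may currently run on."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))  # 0 means the calling process
    # Assumption: on platforms without the call, treat every CPU as allowed.
    return set(range(os.cpu_count() or 1))


def pin_to(cpus) -> set:
    """Restrict this process to the given CPUs; return the resulting set."""
    # Only keep CPUs we are actually allowed to use, so the call cannot
    # request a core that a container or cgroup has already excluded.
    target = set(cpus) & allowed_cpus()
    if not target or not hasattr(os, "sched_setaffinity"):
        return allowed_cpus()  # nothing valid to pin to; leave it to the OS
    os.sched_setaffinity(0, target)
    return allowed_cpus()
```

Pinning narrows what the scheduler can do; it cannot widen it, so a wrong guess only removes options.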
  