Comment by mlyle 6 months ago

Again, there are already explicit ways for programs to exert fine control; this stuff is already exposed through ACPI and libnuma, and higher-level shims exist over it. But generally you want to know both how the entire machine is being used and pretty detailed information about working set sizes before attempting this.
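
For reference, a minimal sketch of what that explicit control can look like from Python on Linux. `os.sched_getaffinity` and `os.sched_setaffinity` are real standard-library calls (Linux only); `run_pinned` is a hypothetical helper name, and the choice of CPUs is entirely up to the caller, which is exactly the part that is easy to get wrong.

```python
import os


def run_pinned(cpus, fn):
    """Run fn() with the caller restricted to `cpus`, then restore the old mask.

    Hypothetical helper for illustration; falls back to running unpinned
    when the platform or sandbox does not allow changing affinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return fn()  # no affinity API on this platform

    old_mask = os.sched_getaffinity(0)  # 0 means "the caller"
    wanted = set(cpus) & old_mask       # never request CPUs we are not allowed to use
    if not wanted:
        return fn()

    try:
        os.sched_setaffinity(0, wanted)
    except OSError:
        return fn()  # not permitted here: run unpinned rather than fail

    try:
        return fn()
    finally:
        os.sched_setaffinity(0, old_mask)  # always undo the pinning
```

Note that nothing here knows what else is running on the machine or how big the working set is; it only narrows the scheduler's options.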

Most things that have tried to set affinities have ended up screwing it up.

There's no need to put an easier user interface on the footgun or to make the footgun cross-platform. These interfaces provide opportunities for small wins (generally <5%) and big losses. If you're in a supercomputing center or a hyperscaler running your own app, this is worth it; if you're writing a DBMS that will run on tens of thousands of dedicated machines, it may be worth it. But usually you don't understand how your code will be deployed well enough to know if this is a win.

Salgat 6 months ago

In the context of the future of heterogeneous computing, where your standard pc will have thousands of cores of various capabilities and locality, I very much disagree.

  • mlyle 6 months ago

    > where your standard pc will have thousands of cores

    Thousands of non-GPU cores, intended to run normal tasks? I doubt it.

    Thousands of special-purpose cores running different programs like managing power, managing networks, managing RGB lighting? Maybe, but that doesn't really benefit from this.

    Thousands of cores including GPU cores? What you're talking about in labelling locality isn't sufficient to address this problem, and isn't really even a significant step towards its solution.

    • Salgat 6 months ago

      CPUs are trending towards heterogeneous many-core implementations. 16 cores were considered server-exclusive a few decades ago; now we're at a heterogeneous 24 cores on an Intel Core i9-14900K. The biggest limit right now is on the software side, hence my original comment. I wouldn't be surprised if someday the CPU and GPU become combined to overcome the memory wall, with many different types of specialized cores depending on the use case.

      • mlyle 6 months ago

        The software side is limited, somewhat intrinsically (there tend to be a lot of things we want to do in order, so Amdahl's law wins).
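
        To put rough numbers on Amdahl's law, here is a small sketch. The 90% parallel fraction is an assumed figure for illustration, not a measurement of any real workload, and `amdahl_speedup` is just a name for the standard formula 1 / ((1 - p) + p / n).

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: 90% parallelizable, 10% strictly in order.
for cores in (24, 1000):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
# prints:
# 24 7.27
# 1000 9.91
```

        Going from 24 cores to 1000 buys less than 1.4x under that assumption, and no core count gets past 10x.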

        And even when you aren't intrinsically limited by that, optimal placement doesn't reduce contention that much (assuming you're not ping-ponging a single cache line every operation or something dumb like that).

        But the hardware side, too: we're not getting transistors that quickly anymore, and we don't want anything too much smaller than an Intel E-core. Even if we stack in 3D, all that net wafer area is not cheap and isn't getting cheaper quickly.