Comment by lazide 15 hours ago

9 replies

Eh, in this case not splitting them up to compute them in parallel is the smartest thing to do. Locking overhead alone is going to dwarf every other cost involved in that computation.
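A rough way to see the point, assuming the work really is as small as `2 + 2`: compare doing the addition inline with handing it to another thread. The helper names (`add_inline`, `add_async`, `time_ns`) are made up for illustration, and this measures thread dispatch and synchronization cost in general rather than any particular locking scheme.

```cpp
#include <chrono>
#include <future>

// Hypothetical stand-in for the trivial work discussed above.
int add(int a, int b) { return a + b; }

// Does the addition on the calling thread.
int add_inline(int a, int b) { return add(a, b); }

// Does the same addition on another thread and waits for the result.
int add_async(int a, int b) {
    std::future<int> result = std::async(std::launch::async, add, a, b);
    return result.get();
}

// Returns the total wall-clock nanoseconds spent calling f() reps times.
template <typename F>
long long time_ns(F f, int reps) {
    volatile int sink = 0;  // keeps the calls from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) sink = f();
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling `time_ns([] { return add_inline(2, 2); }, 1000)` and then the same with `add_async` would typically show the asynchronous version costing far more per call, though the exact numbers depend on the machine and the standard library.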

gdwatson 15 hours ago

Yeah, I think the dream was more like, “The compiler looks at a map or filter operation and figures out whether it’s worth the overhead to parallelize it automatically.” And that turns out to be pretty hard, with potentially painful (and nondeterministic!) consequences for failure.

Maybe it would have been easier if CPU performance didn’t end up outstripping memory performance so much, or if cache coherency between cores weren’t so difficult.

  • eptcyka 13 hours ago

    Spawning threads or using a thread pool implicitly would be pretty bad - it would be difficult to reason about performance if the compiler were to make these choices for you.

  • lazide 14 hours ago

    I think it has shaken out the way it has because compile-time optimizations to this extent require knowing runtime constraints/data at compile time. For non-trivial situations that is impossible, as the code will be run with too many different types of input data, with too many different cache sizes, etc.

    The CPU has better visibility into the actual runtime situation, so can do runtime optimization better.

    In some ways, it’s like a bytecode/JVM type situation.

    • PinkSheep 9 hours ago

      If we can write code to dispatch different code paths (as has been done for decades for SSE, and later AVX, support within one binary), then we can write code to parallelize large array execution based on heuristics. That's not much different from busy spins falling back to sleep/other mechanisms when the fast path fails to secure a lock after ca. 100-1000 attempts.

      For the trivial example of 2+2 like above, of course, this is a moot discussion. The commenter should've led with a better example.
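The spin-then-fall-back pattern mentioned above can be sketched in a few lines of C++. This is only an illustration: the function name `lock_spin_then_block` and the default of 1000 attempts are assumptions, and real lock implementations usually add backoff or pause instructions between attempts.

```cpp
#include <cassert>
#include <mutex>

// Tries the cheap, non-blocking path up to max_spins times, then gives up
// and blocks in the OS. Returns true if the lock was acquired while
// spinning, false if it had to fall back to the blocking call.
// Either way, the caller owns the mutex on return and must unlock it.
bool lock_spin_then_block(std::mutex& m, int max_spins = 1000) {
    assert(max_spins >= 0);
    for (int i = 0; i < max_spins; ++i) {
        if (m.try_lock()) return true;  // fast path succeeded
    }
    m.lock();  // slow path: let the OS put this thread to sleep
    return false;
}
```

The heuristic here is just a fixed attempt count chosen up front, which is the same kind of runtime decision being proposed for parallelizing large arrays.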

      • lazide 9 hours ago

        Sure, but it’s a rare situation (by code path) where it will beat the CPU’s auto optimization, eh?

        And when that happens, almost always the developer knows it is that type of situation and will want to tune things themselves anyway.

maccard 5 hours ago

I think you’re fixating on the very specific example. Imagine if instead of 2 + 2 it was multiplying arrays of large matrices. The compiler or runtime would be smart enough to figure out if it’s worth dispatching the parallelism or not for you. Basically auto-vectorisation, but for parallelism.

  • lazide 5 hours ago

    Notably - in most cases, there is no way the compiler can know at compile time which of these scenarios is going to happen.

    At runtime, the CPU can figure it out though, eh?

    • maccard 3 hours ago

      I mean, theoretically it's possible. A super basic example would be if the data is known at compile time, it could be auto-parallelized, e.g.

          int buf_size = 10000000;
          auto vec = make_large_array(buf_size);
          for (const auto& val : vec)
          {
              do_expensive_thing(val);
          }
      
      This could clearly be parallelised. C++ doesn't do that today, but we can see that it would be valid.

      If I replace it with

          int buf_size = 10000000;
          cin >> buf_size;
          auto vec = make_large_array(buf_size);
          for (const auto& val : vec)
          {
              do_expensive_thing(val);
          }

      the compiler could generate some code that looks like:

          if buf_size >= SOME_LARGE_THRESHOLD {
              DO_IN_PARALLEL
          } else {
              DO_SERIAL
          }

      With some background logic for managing threads, etc. In a C++-style world where "control" is important it likely wouldn't fly, but if this was python...

          arr_size = 10000000
          buf = [None] * arr_size
          for x in buf:
              do_expensive_thing(x)
      
      could be parallelised at compile time.
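Written by hand, the size-threshold dispatch described above might look roughly like the following. Everything here is a sketch under assumptions: `for_each_maybe_parallel` and `kParallelThreshold` are invented names, the cutoff value is arbitrary rather than measured, and `fn` is assumed to be safe to call concurrently on different elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cutoff; a real value would have to be measured per workload.
constexpr std::size_t kParallelThreshold = 100000;

// Applies fn to every element, picking the serial or parallel path by size.
// Returns the number of threads that did the work, so the choice is visible.
template <typename T, typename Fn>
std::size_t for_each_maybe_parallel(std::vector<T>& data, Fn fn,
                                    std::size_t threshold = kParallelThreshold) {
    // Small (or nearly empty) inputs: not worth the thread overhead.
    if (data.size() < threshold || data.size() < 2) {
        for (auto& value : data) fn(value);
        return 1;
    }

    // hardware_concurrency() may return 0 when unknown, so clamp to 1.
    std::size_t n_threads =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    n_threads = std::min(n_threads, data.size());

    // Split the index range into contiguous chunks, one per worker.
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &fn, begin, end] {
            for (std::size_t i = begin; i < end; ++i) fn(data[i]);
        });
    }
    for (auto& worker : workers) worker.join();
    return workers.size();
}
```

The decision is made at runtime from the actual size, which is the part a compiler could in principle generate; choosing a good threshold is the part that depends on the machine and the cost of `fn`.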

      • lazide 3 hours ago

        Which no one really does (data is generally provided at runtime). Which is why ‘super smart’ compilers kinda went nowhere, eh?