Comment by gdwatson
Yeah, I think the dream was more like, “The compiler looks at a map or filter operation and figures out whether it’s worth the overhead to parallelize it automatically.” And that turns out to be pretty hard, with potentially painful (and nondeterministic!) consequences for failure.
Maybe it would have been easier if CPU performance didn’t end up outstripping memory performance so much, or if cache coherency between cores weren’t so difficult.
I think it has shaken out the way it has, is because compile time optimizations to this extent require knowing runtime constraints/data at compile time. Which for non-trivial situations is impossible, as the code will be run with too many different types of input data, with too many different cache sizes, etc.
The CPU has better visibility into the actual runtime situation, so can do runtime optimization better.
In some ways, it’s like a bytecode/JVM type situation.