> But the CPU then needs to use a built-in "hardware JIT" to translate the sequential command streams it receives into its internal data-flow operations. Something that's extremely inefficient and overly complicated; besides, it conceptually can't even help much. There is only so much parallelism extractable from an inherently sequential description of computation… (The craziest part about this is that we have plenty of information available at the high level which we happily erase during compilation, just so we can present a "nice flat command stream" to the CPU. That's imho pure insanity.)
Nonsense
Ca. 30 years ago there were (serious!) efforts to fix this at the CPU level: the whole VLIW CPU / smart-compiler-with-explicit-parallelism approach. Intel Itanium, Transmeta, Elbrus, to name a few. It didn't work out well.
Systems research might produce something that looks good but doesn't work out well in the end.
But I already tried to explain why I think those things failed: it's all about the software.
You can't build a "sufficiently smart compiler" if the task is to parallelize an inherently sequential command stream.
This endeavor failed because it was embedded in the wrong overall approach.
The whole point is that people didn't think in terms of systems design any more at that time! They looked at "the hardware problem" in isolation.
The expectation was that you could take a program written for "a fast PDP-7", add some "smart compiler" magic on top, and have it run (hopefully) faster on an improved hardware architecture. This didn't work out; imho to no surprise.
I think the problem lies in the mental model of computers that has dominated our perception for some time now. At least when it comes to general computation, we're almost exclusively left with the model of a sequential command-stream machine: CPUs execute (interpret) a sequence of instructions, and those instructions may alter an "infinite" (or at least "gigantic enough") global state in arbitrary ways. That's the basic model.
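Just to make that model concrete, here is a minimal sketch (toy opcodes, nothing real) of what "a sequence of instructions mutating one big flat state" boils down to. All the hardware ever gets to see is this flat stream; any structure the programmer had in mind is already gone:

```go
package main

import "fmt"

// A toy "sequential command stream machine" (made-up opcodes, purely for
// illustration): one instruction after the other, each free to mutate the
// big flat global state (here: mem).
type Op int

const (
	LOAD  Op = iota // reg[a] = mem[b]
	ADD             // reg[a] = reg[b] + reg[c]
	STORE           // mem[a] = reg[b]
)

type Instr struct {
	op      Op
	a, b, c int
}

func run(prog []Instr, mem []int64) {
	var reg [16]int64
	for _, ins := range prog { // strictly one at a time; each step may depend on all previous ones
		switch ins.op {
		case LOAD:
			reg[ins.a] = mem[ins.b]
		case ADD:
			reg[ins.a] = reg[ins.b] + reg[ins.c]
		case STORE:
			mem[ins.a] = reg[ins.b]
		}
	}
}

func main() {
	mem := []int64{0, 2, 3}
	// mem[0] = mem[1] + mem[2]
	prog := []Instr{
		{LOAD, 1, 1, 0},
		{LOAD, 2, 2, 0},
		{ADD, 0, 1, 2},
		{STORE, 0, 0, 0},
	}
	run(prog, mem)
	fmt.Println(mem[0]) // 5
}
```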
As long as your software is basically written with this model as its background, you can't "fix this at the hardware level", at least not without real magic.
What we do now is mitigate the problem as best we can, with methods that have imho reached completely insane levels.
I think we could find much better solutions if we took a step back and rethought the machine as a whole. I mean, hardware/software co-design is a thing, just not when it comes to general computation, and especially not combined with OS / runtime (software) design.
My current personal favorite idea for how things could be improved and simplified is to see a computer as a graph of event-stream processors. This mental picture scales nicely across different "zoom levels": from raw hardware to the runtime level to (distributed?) application design. Only it would require languages that compile down to such a model efficiently; a command-stream language wouldn't map well, or likely even acceptably (the same problem as with the old "more exotic" hardware designs).
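To make that less abstract, here is a minimal sketch of the mental picture, using Go channels and goroutines as stand-ins for event streams and processors (the names and the trivial linear topology are of course just made up):

```go
package main

import "fmt"

// A toy "graph of event-stream processors": each node consumes events from
// an input stream, does purely local work, and emits events downstream.
// There is no big mutable global state; the wiring between nodes IS the program.
func node(f func(int) int, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for ev := range in {
			out <- f(ev)
		}
	}()
	return out
}

func main() {
	// Event source: 1, 2, 3
	src := make(chan int)
	go func() {
		for i := 1; i <= 3; i++ {
			src <- i
		}
		close(src)
	}()

	// Wire up a small graph: src -> double -> increment -> sink.
	doubled := node(func(x int) int { return x * 2 }, src)
	result := node(func(x int) int { return x + 1 }, doubled)

	for ev := range result { // sink: prints 3, 5, 7
		fmt.Println(ev)
	}
}
```

The point is that the edges of the graph stay explicit all the way down, instead of being erased into a flat command stream that the hardware then has to reverse-engineer.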
BTW: for number crunching, VLIW and compiler-guided parallelism work quite fine. Vector CPUs, anybody? Those were the fastest computers once, afaik. And from MMX up to AVX the whole idea is a big mainstream success. (We're not even talking about GPUs here of course, that's not the topic; nobody tried to replace "typical" CPUs with GPUs, so this doesn't count.)
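A tiny illustration of why this works for number crunching but not in general (Go syntax purely for illustration; whether a particular compiler actually vectorizes the first shape is a separate question, but it is the shape vector units and VLIW like):

```go
package main

import "fmt"

// Independent iterations: c[i] depends only on a[i] and b[i], so the work can
// be done 4, 8, 16... elements at a time (SIMD/VLIW friendly).
func addVec(a, b, c []float64) {
	for i := range c {
		c[i] = a[i] + b[i]
	}
}

// Loop-carried dependency with a non-associative update: every step needs the
// previous result, so no compiler, however smart, can meaningfully widen it.
// The parallelism simply isn't in the description.
func iterate(x, c float64, n int) float64 {
	for i := 0; i < n; i++ {
		x = x*x + c
	}
	return x
}

func main() {
	a := []float64{1, 2, 3, 4}
	b := []float64{10, 20, 30, 40}
	c := make([]float64, len(a))
	addVec(a, b, c)
	fmt.Println(c, iterate(0.5, 0.1, 8))
}
```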