I imagine that if you compiled an idiomatic program in a language like this down to machine code, the result would look quite different from the assembly generated for an idiomatic imperative program solving the same task. For instance, even if you optimised your compiler to keep the top of the stack in registers as often as possible, the "calling conventions" of your functions would be all over the place.
At the same time, presumably, common processor architectures have all undergone decades of targeted optimisation to improve their performance on the typical imperative code that is encountered in system and user programs. What are the implications of this? Should we expect a theoretical cap on the real-world performance of concatenative programs that is well below what we can achieve with mainstream programming languages?
If you restrict the input language sufficiently so that all functions have a known number of inputs and outputs and all conditionals leave the stack balanced (as Cat does by virtue of its static type system), you can convert a stack program into SSA form pretty easily and apply the same optimizations as you would with an applicative language.
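As a rough sketch of what that conversion looks like (not Cat's actual pipeline; the word set and program below are made up for illustration): if every word has a known arity, you can symbolically execute the stack, hand each pushed value a fresh SSA name, and emit one instruction per word.

    /* Stack program -> SSA-style three-address code, assuming every word
       has a fixed, known arity.  Words and program are illustrative only. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* hypothetical program: 2 3 add dup mul   =>  (2+3)*(2+3) */
        const char *prog[] = { "2", "3", "add", "dup", "mul" };
        int n = sizeof prog / sizeof prog[0];

        int stack[64];            /* each slot holds an SSA value number */
        int sp = 0, next = 0;

        for (int i = 0; i < n; i++) {
            const char *w = prog[i];
            if (w[0] >= '0' && w[0] <= '9') {       /* literal: define a new value */
                printf("v%d = %s\n", next, w);
                stack[sp++] = next++;
            } else if (strcmp(w, "dup") == 0) {     /* stack shuffling emits no code */
                stack[sp] = stack[sp - 1];
                sp++;
            } else {                                /* binary word: pop two, push one */
                int b = stack[--sp], a = stack[--sp];
                printf("v%d = %s v%d, v%d\n", next, w, a, b);
                stack[sp++] = next++;
            }
        }
        printf("; result in v%d\n", stack[sp - 1]);
        return 0;
    }

Every value gets exactly one definition, and stack shuffles like dup only copy names rather than emitting code, so the output is already in SSA form and the usual optimisations apply from there.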
I beg to differ... CPUs have undergone decades of targeted optimisation, but not for typical code. Rather, for optimal code.
CPUs have been designed to let you write C that runs fast, not to make typical C run fast. Maybe the CPU-wallahs wanted the latter, but they haven't really achieved it — avoiding pipeline stalls and unpredicted branches can make factor-of-100 differences, and that's not something "typical C" (or Java, for that matter) does.
I mean: if you want to avoid pipeline stalls from reads that go all the way to main memory, writing code like typical C isn't your first solution.
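To make the branch half of that concrete, here's a toy sketch (the exact ratio is very CPU- and compiler-dependent, so I won't quote numbers): the same filter-and-sum written with a data-dependent branch on random data, and written so the compiler can usually turn the choice into a conditional move.

    /* Branchy vs. branchless filter-and-sum.  On random input the branchy
       loop mispredicts roughly half the time; the branchless loop does the
       same work without a data-dependent branch in the hot path. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1u << 20;
        int *v = malloc(n * sizeof *v);
        for (size_t i = 0; i < n; i++)
            v[i] = rand() & 0xff;

        long branchy = 0;
        for (size_t i = 0; i < n; i++)
            if (v[i] >= 128)                        /* unpredictable on random data */
                branchy += v[i];

        long branchless = 0;
        for (size_t i = 0; i < n; i++)
            branchless += (v[i] >= 128) ? v[i] : 0; /* usually compiled to a cmov/select */

        printf("%ld %ld\n", branchy, branchless);
        free(v);
        return 0;
    }

Typical application code is full of the first pattern.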
This seems wrong.
I work on compilers; most CPU design advances from the last several decades have been aimed at making bad code run fast.
This is usually a problem for people who write 'fast code': they have to hire a bunch of compiler devs to stop the compiler and the CPU from invalidating each other's optimizations, and to implement one-off domain-specific optimizations.
This is how the vast majority of compiler people are employed nowadays.
I have no doubt that CPU development (like compiler development) aims to make bad code run fast. However, the result is what's intended, modulated by what's possible.
Thirty years ago a single misplaced read that went all the way to main memory couldn't take as much time as a hundred well-conceived instructions, even in the worst case. Now it can. A badly chosen data structure that forces a loop over a linked list ends up sucking. I don't think that's anyone's intention. No one wanted choosing the right data structure to become a bigger advantage. CPU frequencies outran main-memory frequencies and the number of instructions issued per cycle grew, but not because anyone wanted to punish linked-list iterators.
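To sketch the data-structure point (with a deliberately naive allocation pattern; a real heap is messier): summing the same values from a contiguous array and from a singly linked list. The array walk streams through consecutive cache lines and is friendly to the prefetcher; each step of the list walk has to finish the previous load before it even knows the next address.

    /* Contiguous array vs. pointer-chasing linked list, same data, same sum. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { long value; struct node *next; };

    long sum_array(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)      /* sequential reads, prefetcher-friendly */
            s += a[i];
        return s;
    }

    long sum_list(const struct node *p) {
        long s = 0;
        for (; p; p = p->next)              /* each iteration waits on the previous load */
            s += p->value;
        return s;
    }

    int main(void) {
        size_t n = 1u << 20;
        long *a = malloc(n * sizeof *a);
        struct node *head = NULL;
        for (size_t i = 0; i < n; i++) {
            a[i] = (long)i;
            struct node *nd = malloc(sizeof *nd);   /* one allocation per element */
            nd->value = (long)i;
            nd->next = head;
            head = nd;
        }
        printf("%ld %ld\n", sum_array(a, n), sum_list(head));
        return 0;
    }

Nobody set out to make the second loop slow; it just fell further and further behind as memory latency failed to keep up.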
I think what I really wanted to say was: CPU vendors try to improve. The things they try to improve don't add up to "make bad code run fast", even if much of their motivation may be that. "Increase the number of instructions issued per cycle" or "improve the branch predictor" bends the effect towards code that is written with some awareness of the CPU, i.e. good code. Bad code is, IMO, almost synonymous with code written without that awareness.