When we talk about support for threading/concurrent programming in programming languages, it is less about how best to reach the theoretical limits of your system, especially when you are free to architect the whole software stack towards that goal. In that case, your statements might apply.
It is about how easily a programmer who deals with a certain subtask in a system can utilize more cores for that task. I'm not talking about supercomputing, but about a smartphone or a typical PC. There, most cores usually sit idle, but when the user triggers an action, you want to be able to use as many cores as will speed up the computation. Language support for parallelism makes a huge difference there. In Go I can write a function to do a certain computation, and quite often it is trivial to spread several calls across goroutines.
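To make the goroutine point concrete, here is a minimal sketch of spreading independent calls across goroutines. The names `compute` and `computeAll` are hypothetical, and `compute` is only a stand-in for whatever CPU-heavy function you actually have; it assumes the calls do not depend on each other.

```go
package main

import (
	"fmt"
	"sync"
)

// compute is a hypothetical stand-in for an expensive, independent
// computation (here: the sum of squares from 1 to n).
func compute(n int) int {
	sum := 0
	for i := 1; i <= n; i++ {
		sum += i * i
	}
	return sum
}

// computeAll runs compute for every input in its own goroutine and
// returns the results in input order.
func computeAll(inputs []int) []int {
	results := make([]int, len(inputs))
	var wg sync.WaitGroup
	for i, n := range inputs {
		wg.Add(1)
		go func(i, n int) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			results[i] = compute(n)
		}(i, n)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

func main() {
	fmt.Println(computeAll([]int{10, 100, 1000}))
}
```

The Go runtime multiplexes these goroutines onto its pool of OS threads, so the caller does not manage threads directly. Whether this actually speeds things up depends on how heavy each call is relative to the scheduling overhead.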
You are not factoring in the cost of context switches, and that many user applications today are memory-bound and not CPU-bound.
It's one of the secrets exploited by the M1 chip: its line fill buffers (LFB) can fill many more cache lines concurrently than those of Intel chips, and these are now 128-byte cache lines instead of 64-byte cache lines.
Which context switches? With the Go model, I have exactly one thread per CPU, no context switches. And if you are memory-bound, why have more CPUs?
But sure, there is a reason why the M1 has such stellar performance: it has one of the fastest single-thread performances, and many applications do not manage to load more than 4 cores for common tasks. That is partially also a consequence of this being difficult in many programming languages, but easy in some, which are only slowly gaining traction.
> Which context switches? With the Go model, I have exactly one thread per CPU, no context switches.
Not in the user application model you were describing. Those threads would need to coordinate and communicate (for example, back to the user interface), and that implies context switches.
However, for independent processes, each additional CPU adds memory bandwidth (according to the NUMA model) because there's a concurrency limit to each CPU's LFB that puts an upper bound of 6 GB/s on filling cache lines for cache misses (even if the bandwidth of your memory system is actually much higher): https://www.eidos.ic.i.u-tokyo.ac.jp/~tau/lecture/parallel_d...
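A rough way to see where a per-core bound of that size can come from is Little's law: sustained bandwidth is at most (outstanding cache-line fills × line size) / memory latency. The sketch below uses assumed, illustrative numbers (10 fill buffers, about 100 ns latency), not measurements of any particular chip, and `maxFillBandwidthGBs` is a hypothetical helper name.

```go
package main

import "fmt"

// maxFillBandwidthGBs estimates the upper bound on one core's cache-miss
// fill bandwidth from Little's law:
//   bandwidth = (concurrent fills * bytes per line) / latency
// Bytes per nanosecond is numerically the same as GB/s (10^9 bytes/s).
func maxFillBandwidthGBs(buffers, lineBytes int, latencyNs float64) float64 {
	return float64(buffers*lineBytes) / latencyNs
}

func main() {
	// Assumed values for illustration only.
	fmt.Printf("64-byte lines:  %.1f GB/s\n", maxFillBandwidthGBs(10, 64, 100))
	fmt.Printf("128-byte lines: %.1f GB/s\n", maxFillBandwidthGBs(10, 128, 100))
}
```

Under these assumptions a single core tops out at roughly 6.4 GB/s with 64-byte lines, regardless of how fast the memory system is. Doubling the line size, or adding cores that each bring their own fill buffers, raises that ceiling.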