I am not sure we are disagreeing on much, but the 4-core i7 in my dev MacBook is a whole lot different from the dual-socket, 56-core machines we run on.
The optimizations that need to happen don't happen locally; they get tuned on a node in the cluster. Look at all the work Goto has done on GotoBLAS.
We agree on HPC, but I also agree with Linus about non-HPC loads. Software and developers are always more expensive than hardware, but scaling hardware beyond a certain point (the number of servers, or the GPUs you need) drives hardware and maintenance costs up, so the difference becomes negligible or the maintenance becomes unsustainable. This is why everyone is trying to run everything faster within the same power budget. In the end, past a certain point, everyone wants to run native code on the backend to reap the power of the hardware they have. This is why I think Linus is right about ARM. It's not that I don't support them, but they need to be able to run some desktops or "daily driver" computers that support development. Java's motto was "write once, run anywhere", which was not enough to stop the migration to x86. Behavioral uniformity is peace of mind, and a very big one TBH.
What I wanted to say is: unless the code you are writing consists of interdependent threads and needs more threads than your laptop can run, you can do 99% of the optimization on your laptop. And if the job is single-threaded or the threads are independent, the per-core performance you get on your laptop is very similar to what you get on the server.
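A rough way to check the per-core claim is to time the same single-threaded kernel on both machines. The snippet below is only a minimal sketch: the names kernel and per_core_rate are made up for illustration, the workload is a toy floating-point loop rather than a real job, and it says nothing about memory bandwidth, turbo behavior, or interdependent threads.

```python
import time


def kernel(n: int) -> float:
    """Toy CPU-bound, single-threaded work: partial sum of 1/i^2 (hypothetical workload)."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += 1.0 / (i * i)
    return acc


def per_core_rate(n: int = 200_000, repeats: int = 5) -> float:
    """Return loop iterations per second, using the fastest of `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(n)
        best = min(best, time.perf_counter() - start)
    return n / best


if __name__ == "__main__":
    # Run this on the laptop and on a cluster node, then compare the two numbers.
    print(f"{per_core_rate():.3e} iterations/s on one core")
```

Taking the fastest of several runs reduces noise from other processes; if the two machines report similar rates, laptop-side tuning of independent or single-threaded work should carry over reasonably well.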
For BLAS stuff I use Eigen, hence I don't have experience with xBLAS and libFLAME, sorry.
From a hardware perspective, a laptop and a server are not that different: just some different controllers and resiliency features.