Linus is mostly wrong except for HPC. Very few dev pipelines for folks result in...

bayindirh · on Feb 22, 2019

--- I accidentally deleted this comment, so, I've re-written it. ---

Disclaimer: I'm a HPC system administrator in a relatively big academic supercomputer center. I also develop scientific applications to run on these clusters.

> Linus is mostly wrong except for HPC. Very few dev pipelines for folks result in native executables. The vast majority of code is delivered as either source (python, ruby, etc) or bytcode, JVM, Scalia, etc.

Scientific applications targeted for HPC environments contain the most hardcore CPU optimizations. They are compiled according to CPU architecture and the code inside is duplicated and optimized for different processor families in some cases. Python is run with PyPy with optimized C bindings, JVM is generally used in UI or some very old applications. Scala is generally used in industrial applications.

> And the Xeon class machines folks deploy to in data center envs is a world apart from their MacBooks.

No, they don't. Xeon servers generally have more memory bandwidth, and more resiliency checks (ECC, platform checks, etc.). Considering the MacBook Pro have a same-generation CPU with your Xeon server with a relatively close frequency, per core performance will be very similar. There won't be special instructions, frequency enhancing gimmicks, or different instruction latencies. If you optimize well, you can get the same server performance from your laptop. Your server will scale better, and will be much more resilient in the end, but the differences end there.

> Even those creating native binaries, this is done through ci/cd pipelines.

Cross compilation is a nice black box which can add behavioral differences to your code which you cannot test in-house. Especially if you're doing leading/cutting edge optimizations in the source code level.

unilynx · on Feb 22, 2019

Isn't turbo boost an issue when comparing/profiling? my experience with video generation/encoding run of about 30 sec was that my macbook outperformed the server xeons... if left to cool down for a few minutes between test runs. otherwise a testrun of 30 seconds would suddenly jump up to over a minute.

the xeons though always took about 40 seconds.. but were consistent in that runtime (and were able to do more of the same runs in parallel without loosing performance)

always attributed that to the turboboost..

bayindirh · on Feb 22, 2019

> Isn't turbo boost an issue when comparing/profiling?

No. In HPC world, profiling is not always done over "timing". Instead, tools like perf are used to see CPU saturation, instruction hit/retire/miss ratios. Same for cache hits and misses. For more detailed analysis, tools like Intel Parallel Studio or its open source equivalents are used. Timings are also used, but for scaling and "feasibility" tests to test whether the runtime is acceptable for that kind of job.

OTOH, In a healthy system room environment, server's cooling system and system room temperature should keep the server's temperature stable. This means your timings shouldn't deviate too much. If lots of cores are idle, you can expect a lot of turbo boost. For higher core utilization, you should expect no turbo boost, but no throttling. If timings start to deviate too much, Intel's powertop can help.

> my experience with video generation/encoding run of about 30 sec was that my macbook outperformed the server xeons...

If the CPUs are from the same family, and speed are comparable, your servers may have turbo boost disabled.

> otherwise a testrun of 30 seconds would suddenly jump up to over a minute.

This seems like thermal throttling due to overheating.

> the xeons though always took about 40 seconds.. but were consistent in that runtime (and were able to do more of the same runs in parallel without loosing performance)

Servers' have many options for fine tuning CPU frequency response and limits. The servers may have turbo boost disabled, or if you saturate all the cores, turbo boost is also disabled due to in-package thermal budget.

If you have any more questions, I'd do my best to answer.

plasticchris · on Feb 22, 2019

Sounds like thermal throttling. I don't think there's any reason turbo can't be continuous if thermals are under control, see https://en.wikichip.org/wiki/intel/frequency_behavior

sitkack · on Feb 24, 2019

I am not sure we are disagreeing on much, but the 4 core i7 in my dev MacBook is a whole lot different than the dual socket, 56 core machines we run on.

Optimizations that need to happen, don’t happen locally, they get tuned on a node in the cluster. Look at all the work Goto has done on Goto Blas.

bayindirh · on Feb 24, 2019

We agree on HPC, however I also agree with Linus about non-HPC loads. Software and developers are always more expensive than hardware, but scaling beyond a certain point in hardware (number of servers, or the GPUs you need) drives the hardware and maintenance cost up, hence the difference becomes negligible, or the maintenance becomes unsustainable. This is why everyone is trying to run everything faster with the same power budget. At the end, after a certain point, everyone wants to run native code at the backend to reap the power of the hardware they help. This is why I think Linus is right about ARM. That's not I'm not supporting them, but they need to be able to run some desktops or "daily driver" computers which support development. Java's motto was write once, run everywhere, which was not enough to stop migration to x86. Behavioral uniformity is peace of mind, and is a very big peace TBH.

What I wanted to say is, unless the code you are writing consists of interdependent threads and the minimum thread count is higher than your laptop, you can do 99% of the optimization on your laptop. On the other hand, if the job is single threaded or the threads are independent, the performance you obtain in your laptop per core is very similar to the performance you get on the server.

For BLAS stuff I use Eigen, hence I don't have experience with xBLAS and libFLAME, sorry.

From a hardware perspective, a laptop and a server is not that different. Just some different controllers and resiliency features.

paulmd · on Feb 22, 2019

Even in a bytecode language, there is no guarantee that an application is write-once run-everywhere. I converted a small app that was running on Windows with Oracle JDK to run on Linux with OpenJDK and it was not plug-and-play. It was close, but there were a few errors particularly surrounding path resolution (and yes, the Windows application was already using Unix-style paths, this was actually a difference in how paths were resolved). Similarly, there are small differences between Tomcat and Jetty and so on.

This wasn't showstopping by any means, but it did take a couple of hours to tweak it until it ran properly, and this was just a small webapp not really doing anything exceptional.

Our main line-of-business app (on Java) runs on SPARC/Solaris in production, so we have on-premises test servers so we can test this... and yes, there have been quite a few instances where we identified significant performance anomalies between developer machines running x86/Windows and our Sparc/Solaris test environment, and had to go rewrite some troublesome functions.

pjmlp · on Feb 22, 2019

The same can happen on the same hardware just by switching versions of the same toolchain.

So Linus position is a bit of straw man.

paulmd · on Feb 26, 2019

Correct, we need to stabilize all these factors in order to ensure stable, bug free deployment. A Good Post.

Oh, you meant that just because there is one other thing that you might slip up and forget to control for, we shouldn't bother trying to control anything? No, wait, that's actually A Very Bad Opinion.

rkangel · on Feb 22, 2019

He calls that out though "even if you're only running perl scripts". It's not the cross-compilation that's a factor, it's wanting the environment to be as similar as possible.

Even if your code is Java bytecode, that's still running on a different build of the JVM, on a different build of the OS (possibly a different OS). There is opportunity for different errors to crop up. They might be rare, but they'll be surprising and costly when they happen exactly because of that.

MaxBarraclough · on Feb 22, 2019

The question then is how successful the JVM is. I think you're underestimating it. Torvalds' attitude is certainly justified regarding plenty of other types of software though -- just building a C++ project on a different distro can be a pain.

Someone else [0] points out that Java (in the right context at least) is so successful in isolating the developer from the underlying platform, that it isn't a problem if the developer isn't even permitted to know what OS/hardware their code will run on.

Could they accidentally write code that depends on some quirk of the underlying platform? I think it's not that likely. Nowhere near as likely as in C/C++, where portability is a considerable uphill battle that takes skill and attention on the part of the developer.

> They might be rare, but they'll be surprising and costly when they happen exactly because of that.

Ok, but you can say the same for routine software updates. It's a question of degree.

[0] https://news.ycombinator.com/item?id=19229224

disiplus · on Feb 22, 2019

he just said its not worth it and as a developer if i could chose that i develop on the platform that will run my code i will chose it even if its slightly more expensive. granted that on both i could be same level of productive.

we had those problems when developing in scripting language on windows a code that will run on linux because at some point we needed something that called native and would make us problems with different behavior. after some of that experience we tried to get everybody the same environment that is close to what will run in production.

TechieKid · on Feb 22, 2019

Usage of punctuation would make your attempt at communication more likely to serve its purpose.

disiplus · on Feb 22, 2019

thx, noted. was on my phone and wanted to reply quickly.