An ISA is valuable only if it has many users, i.e., lots of application code optimized for it.
The hardware's job is then to run that code at good perf/cost.
OpenPOWER has little software.
ARM has a lot of software.
So from that POV, ARM is already many orders of magnitude more valuable than OpenPOWER.
But it doesn't end there. Do you need some software to be extremely optimized for ARM? ARM can do this for you at a reasonable price; no need to hire.
For OpenPOWER, by contrast, you need to hire 50-100 Facebook-grade engineers at $400k/year, and it'll take them >3 years to produce a chip design, which then needs to be verified, etc., and then built, so you'll need a fab, whether specialized in OpenPOWER or not. A fab churns out 40k chips/month, so how many chips per month do these companies actually need?
With ARM, you pick one of the many ARM design shops, and there is little left for you to do. You get 5 engineers, and they just customize an ARM design to your needs. They ship in 1 year instead of 3. Next year, ARM gives you a way to update your chip to the next generation, and if you need some other feature by then, ARM gives you that too. And if you need software, like a C library, profilers, or math libraries, all of that is supplied by Arm.
And they take royalties on the chips you build, and... and...
So ARM is many orders of magnitude better in perf/$ than OpenPOWER. The hardware is not only better, it's cheaper, has more software and tools, and comes with teams of experts ready to help your team, etc.
Porting most software to ARM64, Power, or RISC-V involves typing some variation of "make." Only a small percentage of software written in C/C++ or ASM is problematic. Anything in a higher level language like Go or a newer language like Rust is generally 100% portable.
Switching from X86_64 to ARM64 (M1) for my desktop dev system was trivial.
Endian-ness used to bite, but today virtually everything is little-endian above embedded. Power and some ARM support both modes but almost always run in little-endian mode (e.g. ppc64le).
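For anyone who hasn't been bitten: a minimal sketch in C of the classic endianness pitfall, i.e., reinterpreting raw bytes instead of converting explicitly (the byte values here are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>  /* ntohl */

int main(void) {
    /* Four bytes as they might arrive off the wire (big-endian order). */
    const uint8_t wire[4] = {0x12, 0x34, 0x56, 0x78};

    /* Non-portable: the result depends on host byte order. */
    uint32_t naive;
    memcpy(&naive, wire, sizeof naive);
    printf("naive:    0x%08x\n", naive);

    /* Portable: explicitly convert from network (big-endian) byte order. */
    uint32_t portable;
    memcpy(&portable, wire, sizeof portable);
    portable = ntohl(portable);
    printf("portable: 0x%08x\n", portable);
    return 0;
}
```

The naive read prints 0x78563412 on little-endian hosts (x86, ppc64le, most ARM) and 0x12345678 on big-endian ones; the ntohl version is the same everywhere.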
- Have you ever computed the sine of a floating-point number in C (sinf)?
- Have you ever multiplied a matrix with a vector, or a matrix with a matrix (GEMM) using BLAS?
- Have you ever computed an FFT?
- Have you used C++ barriers? Or pthreads? Or mutexes?
An optimized implementation achieves ~100% of a CPU's theoretical peak performance on all of those, and such implementations are tailored to each CPU model. Generic versions perform at well below 100%, often at ~0% (0.1%, 0.001%, etc.) of theoretical peak.
On any running system there is software doing those things all the time, and running at ~0% of peak just means increased power consumption, latency, time to finish, etc.
Somebody has to write the software that does these things for the actual hardware, so that you can then call it from Python.
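Concretely, these are the kinds of calls in question; here is a minimal sketch in C, assuming a CBLAS implementation (OpenBLAS, MKL, ESSL, ...) is installed. Whether it runs near peak is decided entirely by the libm and BLAS implementations underneath, not by this code:

```c
#include <math.h>
#include <stdio.h>
#include <cblas.h>   /* from OpenBLAS, MKL, ESSL, ... */

int main(void) {
    /* libm: sinf is tuned per CPU in a good libm (SIMD, FMA, ...). */
    float s = sinf(1.0f);

    /* BLAS: C = 1.0*A*B + 0.0*C for 2x2 row-major matrices (DGEMM). */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,       /* M, N, K */
                1.0, A, 2,     /* alpha, A, lda */
                B, 2,          /* B, ldb */
                0.0, C, 2);    /* beta, C, ldc */

    printf("sinf(1.0f) = %f, C[0][0] = %f\n", s, C[0]);  /* C[0][0] == 19 */
    return 0;
}
```

NumPy, SciPy, etc. bottom out in exactly these libraries, which is why the optimized implementations matter even if you only ever call them from Python.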
IBM has dozens of "open source" bounties open for PowerPC, and they pay real $$$, but nobody implements them.
---
Porting software to PowerPC is only as simple as running make if the libraries your software uses (the C standard library, libm, BLAS, etc.) all have optimized implementations, which isn't the case.
So when considering PowerPC, you have to divide the paper numbers by 100 if you want the actual numbers that normal code recompiled with make gets in practice. And then you have to invest extra $$$ into improving that software, because nobody will do it for you.
Er, no. I do that stuff (well, I'm not clever enough for C++ generally, and it would be OpenMP rather than plain pthreads) on the sort of nodes that Sierra uses. However they mostly use the GPUs, for which POWER9 has particular support. Then I can tell there isn't currently any GEMV or FFT running on this system, and not "all the time" even on our HPC nodes.
While it isn't necessarily clear what peak performance means, MKL or OpenBLAS, for instance, only gets ~100% of serial peak on large DGEMM if you count 90% as 100; ESSL is similar. I haven't measured GEMV (ultimately memory-bound), but I got ~75% of hand-optimized DGEMM performance on Haswell with pure C, and I'd expect similar on POWER if I measured. Those orders of magnitude are orders off, even for, say, the reference BLAS. I don't know why I need Python, but the software clearly exists -- all those things and more (like a vectorized libm). You can even compile assorted x86 intrinsics on POWER, though I don't know how well they perform relative to equivalent x86; I think you're typically better off with an optimizing compiler anyway.
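For what it's worth, "fraction of serial peak" can be estimated with a sketch along these lines; the clock and flops-per-cycle figures are placeholders to replace with your CPU's values, and you'd run with OPENBLAS_NUM_THREADS=1 to pin the comparison to one core:

```c
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Placeholder per-core numbers -- substitute your CPU's. */
#define GHZ         3.0    /* core clock in GHz */
#define FLOPS_CYCLE 16.0   /* e.g. 2 FMA pipes x 4 doubles x 2 flops */

int main(void) {
    const int n = 2000;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * n * n * n / secs * 1e-9;   /* DGEMM = 2*n^3 flops */
    double peak   = GHZ * FLOPS_CYCLE;               /* assumed GFLOP/s/core */
    printf("%.1f GFLOP/s, %.0f%% of an assumed %.0f GFLOP/s peak\n",
           gflops, 100.0 * gflops / peak, peak);
    free(A); free(B); free(C);
    return 0;
}
```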
I've packaged a lot of HPC/research software, which is almost all available for ppc64le; the only things missing are dmtcp, proot, and libxsmm (if libsmm isn't good enough).
You start with BLAS being a factor of 2 off, then go to PETSc and you're another couple of factors off, and then there's the actual app the scientist wrote, which may use all of the above and the kitchen sink, where every piece, and the pieces they use, are all a couple of factors off, and then your scientist's app is at 0.01% of peak.
If you have used Sierra since the beginning, you will have seen significant performance increases over the years, because the people using it have actually been discovering most of these issues and then either getting IBM to fix them or fixing them themselves.
Compared with Power 10, I'd say that Power 9 is "mainstream" (many clusters available), and of the Power 9 CPUs in existence, IBM's are the most mainstream of them all.
Take the Power 10 ISA, build your own CPU that significantly differs from IBM's, and good luck optimizing all the software above. It can be done, and dumping such a chip on a couple of HPC sites, where scientists and staff then won't have any choice but to use it for 4-6 years, is a good way to get that done.
But for a private company that just wants to deliver value, ARM is simply a much better deal, because it saves them from having to do any of this.
Endianness is not the only problem. You can also have issues with a different cache-coherency model, different alignment requirements, and different syscalls (which are partially arch-dependent, at least on Linux). The fact that the switch from x86 to ARM was trivial just proves the point that ARM has matured really well.
How? A different memory model, for example, isn't something you can just flip a switch to fix: someone needs to review and test application code to see whether it has latent bugs which are masked (or simply harder to reproduce) on x86. Apple went to the trouble of implementing x86-style TSO ordering in their silicon to avoid exactly that, but if you don't run on similar hardware, it's likely you'll hit some issue in this regard in any multithreaded program, and those are exactly the kinds of bugs people are going to miss in a simple cross-compile with limited testing.
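A minimal sketch (C11 atomics plus pthreads) of the kind of latent bug in question: publishing data through a plain flag happens to work under x86's strong ordering, while on ARM or Power the reader can observe the flag before the data. Release/acquire atomics make it correct on every architecture:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int data = 0;            /* payload published by the writer */
atomic_int ready = 0;    /* publication flag */

void *writer(void *arg) {
    (void)arg;
    data = 42;
    /* With a plain 'int ready' and 'ready = 1;' here, x86 (TSO) still
       orders the two stores, but ARM/Power may make the flag visible
       before 'data' -- the reader then sees ready==1 and data==0. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

void *reader(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until published */
    /* acquire pairs with the writer's release, so data==42 is guaranteed */
    printf("data = %d\n", data);
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```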
x86 is at step 10000, ARM at step 5000, power is at step 0.
Firefox "worked" before this post on power. Now somebody put enough effort to actually make it usable.
The fact that you don't see people complaining about Firefox PowerPC performance on Linux is not because performance was good - it was unusably slow - but because nobody uses Firefox on Power.
Think about what that means. Think about how many bugs in Firefox are reported _every day_ for x86 and ARM, and how many are reported for PowerPC. Is that because the PowerPC version has no bugs? No: it's because nobody uses it, nobody reports them, and nobody fixes them.
> x86 is at step 10000, ARM at step 5000, power is at step 0.
I agree with your general point, but I do believe that Power is the most "practical" ISA after x86 and ARM. It's a distant third, but it's definitely not at 0: it has the full support of a bunch of mainstream distros, public container registries have a decent amount of support for its images, and people actually run pretty serious workloads on Linux on Power.
Power does have a lot of niche backing, although it's continuously being hurt by IBM's total lack of interest in pushing it beyond the billion-dollar contracts they're milking with it. That's totally destroying any mindshare Power has. There's really no way to get a cloud shell on a modern Power machine, or physical access to one without forking over thousands of dollars for the privilege (the latter is only really possible at all thanks to Talos' amazing efforts, bless 'em).
I think I agree with what you say; however, an important detail is that any given bug (known or not) has pretty low odds of being architecture-specific.
AFAIK this article is a big deal because it's a JIT, i.e., a big chunk of architecture-specific code written to get good performance. But most code is not going to be like that.
That's not to say that architecture-specific bugs won't exist. But I think your outlook on this is a little pessimistic.
Well, ahem, somebody does try to fix them, and we do get reports which get triaged (I know this, because I've done a number of the fixes, some of which were not trivial). There are far fewer of them, which I think is your point, but there aren't none, and there isn't nobody who cares. I think you're overplaying your hand here.
If the implication is that POWER is somehow new, I first used it when it was RS/6000 and introduced FMA. There was subsequently a rather large installation at my lab. Firefox without the JIT is only a problem with the "modern" web, and I default to turning off Javascript anyway, and I guess someone uses it to make it worth porting.
Prior to this work, Firefox was also ported in the sense that it ran, but it was much slower because it had not been optimized. How much of the software which has been compiled for Power has been well-tested, much less optimized?