Your post should be voted to the top; it's the most informative. I've spent the whole day scratching my head. The POWER8 confusion in many ways reminds me of the P4 NetBurst architecture -- not architecturally in any way, but in the sense that "when they are good they are really good". But both, by and large, are no great shakes.
yeah, sorry, it was just a quick test; I didn't even run any erlang besides building it from scratch, starting it up and seeing that BEAM recognized the right number of schedulers. It would be interesting to run a more comprehensive test suite than a single timing run, for sure.
Going only by specs, it seems that in many other cases each Power8 core has about 2-4x the resources of a Haswell core:
64KB vs 32KB L1D
512KB vs 64KB L2
8MB vs 2.5MB L3
16 vs 10(?) outstanding L1 requests
8 vs 4 instructions issued per cycle
2 vs 1 stores per cycle (or 4 vs 2 loads if no stores)
3 vs 4/5 cycle L1 latency for 64-bit data
5 vs 6/7 cycle L1 latency for vector data
2048 vs 512 TLB entries (integrated with huge pages)
Specs that aren't higher than Haswell are usually the same --- skimming, I haven't found any that are lower. This makes me think that a 4xSMT or 8xSMT Power8 core will probably be about equivalent to a 2xSMT (hyperthreaded) Haswell core. Dedicated core operation (non-hyperthreaded) will vary more based on workload, but likely be 1x to 2x in favor of Power8.
But these are guesses based on written specs -- I'm eager to see more real benchmarks.
Your huge page comment also reminded me that Power has 64k page support, rather than the 4k or 2MB that amd64 has, and that is normally the page size used (e.g. in RHEL), which reduces TLB pressure a lot.
In AIX the text/data page size can be controlled independently per-process with environment variables. What drives me up a wall is that the per-process setting does not actually flow through to their POSIX layer. Have any code that calls sysconf(_SC_PAGESIZE)? The page size is 8k! No, wait...
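To make the gripe concrete, here's a minimal sketch (nothing AIX-specific in the code itself; the point is that the value printed is just whatever the POSIX layer reports, not the per-process page size you selected through the environment):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* What the POSIX layer claims; on AIX this need not match the
           text/data page size chosen per-process via environment
           variables. */
        long psize = sysconf(_SC_PAGESIZE);
        printf("sysconf(_SC_PAGESIZE) = %ld bytes\n", psize);
        printf("getpagesize()         = %d bytes\n", getpagesize());
        return 0;
    }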
Haswell-EP supports 4 DDR4-2133 channels, or 68 GB/s (theoretical). TDP on POWER8 is almost double and it's much more expensive, so we should think about it as having perhaps double the DRAM bandwidth (still excellent).
Note that POWER8 gets this bandwidth through many more channels, which favors the use of many memory streams. This has been the case historically, with POWER favoring data structures that result in many streams, while the same transformation has been catastrophic for memory performance on Blue Gene's PPC, with its limited prefetch engine and small number of outstanding memory requests.
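To illustrate what "data structures that result in many streams" can mean in practice (my own contrived sketch, not from any POWER tuning guide): the same reduction over an array-of-structures walks one interleaved stream, while the structure-of-arrays layout generates several independent streams that a core with many outstanding memory requests can overlap -- and that a weak prefetch engine may handle worse.

    #include <stddef.h>

    struct particle { double x, y, z, w; };   /* array-of-structures */

    /* One interleaved sequential stream through memory. */
    double sum_aos(const struct particle *p, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += p[i].x + p[i].y + p[i].z + p[i].w;
        return s;
    }

    /* Structure-of-arrays: the same reduction now reads four
       independent streams, which hardware prefetchers with many
       outstanding requests can track and overlap. */
    double sum_soa(const double *x, const double *y,
                   const double *z, const double *w, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i] + y[i] + z[i] + w[i];
        return s;
    }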
That is interesting. RAM bandwidth could significantly impact a lot of applications. I guess the only way is to measure and compare.
The problem I see with POWER is that it has a different endianness, and even though compilers know how to handle it, a lot of libraries and code might just assume a specific byte ordering (little endian) and thus fail unexpectedly on POWER.
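For example (a contrived sketch): code that reinterprets a byte buffer in host order happens to work on little-endian x86 and silently returns a different value on a big-endian POWER build, while the byte-by-byte version is correct everywhere.

    #include <stdint.h>
    #include <string.h>

    /* A 32-bit length field stored little-endian in a file or packet. */

    uint32_t read_len_assumes_le(const unsigned char *buf)
    {
        uint32_t v;
        memcpy(&v, buf, sizeof v);  /* host byte order: fine on x86,
                                       wrong on a big-endian POWER host */
        return v;
    }

    uint32_t read_len_portable(const unsigned char *buf)
    {
        /* Assemble the little-endian value explicitly. */
        return (uint32_t)buf[0]
             | ((uint32_t)buf[1] << 8)
             | ((uint32_t)buf[2] << 16)
             | ((uint32_t)buf[3] << 24);
    }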
A lot of those problems were solved years ago. I remember a big push in the late '90s, on Debian, to ensure proper endianness handling, because of the many architectures on which it ran.
POWER architecture can switch between BE/LE via a special purpose register (SPR). This can even be done to support mixing BE/LE threads on the core. The software complexity of managing that, though, has made them decide to enable the entire OS as BE or LE. One cool thing is that the support will be extended so that the chips can support running either BE or LE guest VMs natively.
Running very demanding single-system software. It's the sort of box you'd install DB2 or Oracle on, especially if you've already tied your business to IBM solutions in the past.
It depends on what you mean by "demanding". I think that single-threaded applications on x86 (with nothing else running) will smoke the Power8 system. But if you have lots of threads and high levels of concurrency going on, I think the results will be very much workload dependent.
Curious aside -- is there any blog / article out there where someone has tried to max out network bandwidth on an x86 box? I checked a POWER8 we have today and it has 14 PCIe3 slots (mix of 8x/16x), so theoretically one could load up 14 40GbE NICs on it. I'd be interested to see some kind of I/O shootout to see where both systems hit a wall.
I have some x86 servers with seven slots and 14 10G ports, but we only managed to get 8 ports working in netmap. Intel has done some crazy stuff with DPDK and I've seen a demo of over 300 Gbps on a 4S machine.
I've also seen some stuff on Power8 networking but it's not public.
Sounds interesting! What's needed from the user in order to use this feature? Will it be possible to use from high level languages like java or do you have to use assembly?
Can think of quite a lot of applications that would benefit A LOT from this.
When you compile C99 apps using the _Decimal{32,64,128} types, enable a flag on IBM's compiler to use the DFP instructions, and tell it you're compiling for at least a POWER6 (e.g., -qarch=pwr6 -qtune=pwr7), it will emit the optimized code. The same types are available in C++ apps as well. (I'm not sure if GCC/LLVM also support emitting the instructions -- haven't checked.)
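Roughly, the C side looks like this (a sketch with made-up values; the DFP flag on XL is -qdfp if I remember right, so check your compiler docs):

    #include <stdio.h>

    int main(void)
    {
        /* DD suffix makes a _Decimal64 literal; these values are exact
           in decimal FP, unlike their binary-double counterparts. */
        _Decimal64 price = 19.99DD;
        _Decimal64 rate  = 0.0825DD;
        _Decimal64 tax   = price * rate;

        /* printf support for decimal formats varies by runtime, so
           cast to double just for display. */
        printf("tax = %.6f\n", (double)tax);
        return 0;
    }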
If higher level languages implemented in C/C++ implement their decimal support using the built-in C/C++ types, they'll get the speed boost.
High level languages like Python, Ruby, and Go could take advantage of this without having to make any code changes, provided that the appropriate libraries are updated to use this new hardware.
Java, however, is a special case. Since Java bytecode does not have a BigDecimal type, there's no way to update the JVM to take advantage of this hardware, unfortunately.
I think I remember reading that zSeries has Decimal Floating Point, too, and that IBM's JVM does take advantage of the instructions on z/OS.
Of course, even if that is true, it is a different architecture. But still, if they made the change on one architecture, it would be silly not to make it on the other one as well.
Unless that was intentional to keep some sort of advantage for zSeries.
Yes, according to IBM's presentations, BigDecimal in their JVM uses 64-bit DFP instructions on their hardware. Minimum hardware level is POWER6 or Z10 (Z9 supported via microcode).
it's surprising how much cheaper these Power8 boxes are compared to the Power7 boxes. You can get a tyan 2U, 1 socket system for $2800 FOB HK: http://www.tyan.com/campaign/openpower/
A while ago (here's an article from 2002: http://lwn.net/Articles/6367/ ), the big new thing was going to be ibm mainframes running linux VMs. Whatever became of that? Is anyone still doing that?
While the performance seen here is nice, I'm curious to see the price/performance ratio. Running against an 8-core Xeon would not make sense if the closest Intel system price-wise is a quad 12-core Xeon... Obviously we are talking cloud here, so it might not even apply.
In my experience with Power7, the price/performance ratio is much worse on Power than on Intel systems. Maybe it has changed, but I'm not holding my breath, even if IBM seems much more aggressive on pricing with P8 than they were with P5-P7.
Quick calculation, absolutely unscientific:
Seeing that the price is $0.14/hour for the 6-core Xeon and $1.08/hour for the 176-core P8, the P8 would have to be roughly 8-10x faster to justify the cost difference (the raw price ratio is 1.08 / 0.14 ≈ 7.7), and I'm not sure that will be the case.
the thing you're getting here is primarily throughput on a single image. Even if it's more expensive per-core per-hour, you can't discount that you'd have to work a lot harder to get the equivalent 30-box distributed solution to work properly, and even then it would have certain disadvantages owing to network latency.
This is interesting and I'd like to hear more opinions on it. My impression is that distributed computing has been eating Power/Sparc/Z processors' lunch for a long time now because software has made up for the deficiencies of coordinating 30 boxes. Do you and do any others believe that we are at an inflection point where the pendulum swings back in the direction of 'high-performance' processors like Power8, or will improvements in 'scale-out' ease-of-use and economies of scale continue to win the day?
The dominant use case for the last decade or so has been web servers hitting caches to do low-CPU low-causality CRUD operations. That looks unlikely to change in the next decade, so keep your Intel stock.
That said, for a lot of interesting use cases, like that king-hell postgres database sitting in the middle of the swarm, or video processing, or streams processing, or indeed any situation in which thousands-to-millions of simultaneous actors need to work on the same shared state, this sort of system starts looking real interesting.
As a thought experiment, think of this system like a GPU, except every single processor is a fully capable 2 GHz i5 running Unix, and instead of having to deal with the CUDA or OpenCL API, you can just write erlang (or haskell; .. or whatever) code and it will run. And instead of having 2-8G of RAM, you have 48G. And instead of having arcane debug tools, you have recon and gdb and ddd.
I don't think there is a pendulum, I think there's a spectrum and has always been one; pragmatism should always rule, and your use case is not my use case. There isn't going to be an objective winner ever, no matter how close Intel may get to covering much of the sweet spot.
I would be extremely surprised if you needed 30 x86 boxes to reach the performance of a P8 box, on any type of workload. In my experience with P5-P7 they can be faster than x86 for certain workloads, but not by that much.
You can't really compare the different chip revs apples-to-apples. P6 was a completely different chip architecture with much higher clock speeds that IBM abandoned because it didn't perform well. They make a lot of changes in each chip rev.
It is not just about Erlang, but about any language+runtime designed around a well-known set of sound principles (immutability, share-nothing, message-passing).
As long as the order of evaluation does not matter (as with pure functional code), some (possibly major) parts of a program can be evaluated in parallel by the runtime without any changes to the code, especially when higher-order function composition (map/filter/reduce, etc.) is the primary pattern. So Haskell, for example, will do it too.
By "will do it" you mean "would, in theory, support a compiler that parallelized significant amounts of idiomatic code with no source changes"? Maybe. I don't think any existing Haskell compiler does that, though.
Haskell will do what? Automatic parallelization of code?
The big difference between Erlang and Haskell is that Erlang fundamentally encourages the programmer to think in terms of individual processes that communicate by message-passing. Programmers have also been taught that spawning processes is cheap, so idiomatic Erlang code typically has tons of them.
I love my haskell brethren dearly, but haskell isn't organized structurally around an essentially-mandatory fundamentally concurrent, debuggable, quasi-preemptive core like OTP. Not that it won't get there eventually (cf. Cloud Haskell) but it's more than just being able to run concurrent threads. Note: this is not a language flame, just a personal opinion/observation, and I love and welcome correction.
Haskellers like to pick and choose their abstractions. That's why we have MVar and TVar (STM) and IVar and locks/semaphores and unbounded channels and bounded channels and software transactional channels and green threads and Async and OS threads…
You're going to see what Erlang did atomized, implemented piecemeal, and reconstructed. Cloud Haskell burnt too much goodwill in pursuit of a mostly pointless feature.
That's how we ended up with slave-thread…the beginnings of supervisor trees.
Haskellers would rather pick and choose. It's worked well so far.
You are right. Haskell is an academic research vehicle while Erlang is a pragmatic "industry-strength" (I hate these buzzwords, but they are useful generalizations) functional language.
In theory, we could rewrite almost any part of OTP in Haskell. In practice, however, "runtime is hard", and the biggest selling point of Erlang is that its runtime and standard library (OTP) have evolved for almost 20 years and are based on the same functional-language principles as Haskell (except, perhaps, laziness by default).
In other words, while Haskell is a "playground for researchers of type theory in the context of programming languages and compilers", Erlang/OTP is a "telecom industry-standard toolkit".
A close analogy, in my opinion, is the pair of Scheme/Racket and CL: the first was mostly an academic teaching and research tool, while the second is a pragmatic (but unfortunately bloated) tool-set.
Otherwise the x86 comes out looking a lot better -- lower 95th percentile, lower max, lower average.
(Modulo usual complaints about benchmark porn: single run, lack of standard deviation, unknown configuration differences etc etc).