Your post should be voted to the top; it's the most informative. I've spent the whole day scratching my head. The POWER8 confusion in many ways reminds me of the P4 NetBurst architecture -- not architecturally in any way, but in the sense that "when they are good they are really good". But both, by and large, are no great shakes.
yeah, sorry, it was just a quick test; I didn't even run any erlang besides building it from scratch, starting it up and seeing that BEAM recognized the right number of schedulers. It would be interesting to run a more comprehensive test suite than a single timing run, for sure.
Going only by specs, it seems that in many other cases each Power8 core has about 2-4x the resources of a Haswell core:
64KB vs 32KB L1D
512KB vs 64KB L2
8MB vs 2.5MB L3
16 vs 10(?) outstanding L1 requests
8 vs 4 instructions issued per cycle
2 vs 1 stores per cycle (or 4 vs 2 loads if no stores)
3 vs 4/5 cycle L1 latency for 64-bit data
5 vs 6/7 cycle L1 latency for vector data
2048 vs 512 TLB entries (integrated with huge pages)
Specs that aren't higher than Haswell are usually the same --- skimming, I haven't found any that are lower. This makes me think that a 4xSMT or 8xSMT Power8 core will probably be about equivalent to a 2xSMT (hyperthreaded) Haswell core. Dedicated core operation (non-hyperthreaded) will vary more based on workload, but likely be 1x to 2x in favor of Power8.
But these are guesses based on written specs -- I'm eager to see more real benchmarks.
Your huge page comment also reminded me that Power has 64k page support, rather than the 4k or 2MB that amd64 has, and that is normally the page size used (e.g. in RHEL), which reduces TLB pressure a lot.
In AIX the text/data page size can be controlled independently per-process with environment variables. What drives me up a wall is that the per-process setting does not actually flow through to their POSIX layer. Have any code that calls sysconf(_SC_PAGESIZE)? The page size is 8k! No, wait...
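To make the gripe concrete, here's a minimal sketch (nothing AIX-specific in the code itself; the point is that the value printed is just whatever the POSIX layer reports, not the per-process page size you selected through the environment):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* What the POSIX layer claims; on AIX this need not match the
           text/data page size chosen per-process via environment
           variables. */
        long psize = sysconf(_SC_PAGESIZE);
        printf("sysconf(_SC_PAGESIZE) = %ld bytes\n", psize);
        printf("getpagesize()         = %d bytes\n", getpagesize());
        return 0;
    }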
Haswell-EP supports 4 DDR4-2133 channels, or 68 GB/s (theoretical). TDP on POWER8 is almost double and it's much more expensive, so we should think about it as having perhaps double the DRAM bandwidth (still excellent).
Note that POWER8 gets this bandwidth through many more channels, which favors the use of many memory streams. This has been the case historically, with POWER favoring data structures that result in many streams, while the same transformation has been catastrophic for memory performance on Blue Gene's PPC, with its limited prefetch engine and small number of outstanding memory requests.
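To illustrate what "data structures that result in many streams" can mean in practice (my own contrived sketch, not from any POWER tuning guide): the same reduction over an array-of-structures walks one interleaved stream, while the structure-of-arrays layout generates several independent streams that a core with many outstanding memory requests can overlap -- and that a weak prefetch engine may handle worse.

    #include <stddef.h>

    struct particle { double x, y, z, w; };   /* array-of-structures */

    /* One interleaved sequential stream through memory. */
    double sum_aos(const struct particle *p, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += p[i].x + p[i].y + p[i].z + p[i].w;
        return s;
    }

    /* Structure-of-arrays: the same reduction now reads four
       independent streams, which hardware prefetchers with many
       outstanding requests can track and overlap. */
    double sum_soa(const double *x, const double *y,
                   const double *z, const double *w, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i] + y[i] + z[i] + w[i];
        return s;
    }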
That is interesting. RAM bandwidth could significantly impact a lot of applications. I guess the only way is to measure and compare.
The problem I see with POWER is that it has a different endianness, and even though compilers know how to handle it, a lot of libraries and code might just assume a specific byte ordering (little endian) and thus fail unexpectedly on POWER.
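For example (a contrived sketch): code that reinterprets a byte buffer in host order happens to work on little-endian x86 and silently returns a different value on a big-endian POWER build, while the byte-by-byte version is correct everywhere.

    #include <stdint.h>
    #include <string.h>

    /* A 32-bit length field stored little-endian in a file or packet. */

    uint32_t read_len_assumes_le(const unsigned char *buf)
    {
        uint32_t v;
        memcpy(&v, buf, sizeof v);  /* host byte order: fine on x86,
                                       wrong on a big-endian POWER host */
        return v;
    }

    uint32_t read_len_portable(const unsigned char *buf)
    {
        /* Assemble the little-endian value explicitly. */
        return (uint32_t)buf[0]
             | ((uint32_t)buf[1] << 8)
             | ((uint32_t)buf[2] << 16)
             | ((uint32_t)buf[3] << 24);
    }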
A lot of those problems were solved years ago. I remember a big push in the late '90s, on Debian, to ensure proper endianness handling, because of the many architectures on which it ran.
POWER architecture can switch between BE/LE via a special purpose register (SPR). This can even be done to support mixing BE/LE threads on the core. The software complexity of managing that, though, has made them decide to enable the entire OS as BE or LE. One cool thing is that the support will be extended so that the chips can support running either BE or LE guest VMs natively.
Running very demanding single-system software. It's the sort of box you'd install DB2 or Oracle on, especially if you've already tied your business to IBM solutions in the past.
It depends on what you mean by "demanding". I think that single-threaded applications on x86 (with nothing else running) will smoke the Power8 system. But if you have lots of threads and high levels of concurrency going on, I think the results will be very much workload dependent.
Curious aside -- is there any blog / article out there where someone has tried to max out network bandwidth on an x86 box? I checked a POWER8 we have today and it has 14 PCIe3 slots (mix of 8x/16x), so theoretically one could load up 14 40GbE NICs on it. I'd be interested to see some kind of I/O shootout to see where both systems hit a wall.
I have some x86 servers with seven slots and 14 10G ports, but we only managed to get 8 ports working in netmap. Intel has done some crazy stuff with DPDK and I've seen a demo of over 300 Gbps on a 4S machine.
I've also seen some stuff on Power8 networking but it's not public.
Sounds interesting! What's needed from the user in order to use this feature? Will it be possible to use from high level languages like java or do you have to use assembly?
Can think of quite a lot of applications that would benefit A LOT from this.
When you compile C99 apps using the _Decimal{32,64,128} types, enable a flag on IBM's compiler to use the DFP instructions, and tell it you're compiling for at least a POWER6 (e.g., -qarch=pwr6 -qtune=pwr7), it will emit the optimized code. The same types are available in C++ apps as well. (I'm not sure if GCC/LLVM also support emitting the instructions -- haven't checked.)
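Roughly, the C side looks like this (a sketch with made-up values; the DFP flag on XL is -qdfp if I remember right, so check your compiler docs):

    #include <stdio.h>

    int main(void)
    {
        /* DD suffix makes a _Decimal64 literal; these values are exact
           in decimal FP, unlike their binary-double counterparts. */
        _Decimal64 price = 19.99DD;
        _Decimal64 rate  = 0.0825DD;
        _Decimal64 tax   = price * rate;

        /* printf support for decimal formats varies by runtime, so
           cast to double just for display. */
        printf("tax = %.6f\n", (double)tax);
        return 0;
    }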
If higher level languages implemented in C/C++ implement their decimal support using the built-in C/C++ types, they'll get the speed boost.
High level languages like Python, Ruby, and Go could take advantage of this without having to make any code changes, provided that the appropriate libraries are updated to use this new hardware.
Java, however, is a special case. Since Java bytecode does not have a BigDecimal type, there's no way to update the JVM to take advantage of this hardware, unfortunately.
I think I remember reading that zSeries has Decimal Floating Point, too, and that IBM's JVM does take advantage of the instructions on z/OS.
Of course, even if that is true, it is a different architecture. But still, if they made the change on one architecture, it would be silly not to make it on the other one as well.
Unless that was intentional to keep some sort of advantage for zSeries.
Yes, according to IBM's presentations, BigDecimal in their JVM uses 64-bit DFP instructions on their hardware. Minimum hardware level is POWER6 or Z10 (Z9 supported via microcode).
it's surprising how much cheaper these Power8 boxes are compared to the Power7 boxes. You can get a tyan 2U, 1 socket system for $2800 FOB HK: http://www.tyan.com/campaign/openpower/
A while ago (here's an article from 2002: http://lwn.net/Articles/6367/ ), the big new thing was going to be ibm mainframes running linux VMs. Whatever became of that? Is anyone still doing that?
While the performance seen here is nice, I'm curious to see the price/performance ratio. Running against an 8-core Xeon would not make sense if the closest Intel system price-wise is a quad 12-core Xeon... Obviously we are talking cloud here, so it might not even apply.
In my experience with Power7, the price/performance ratio is much worse on Power than on Intel systems. Maybe it has changed, but I'm not holding my breath, even if IBM seems much more aggressive on pricing with P8 than they were with P5-P7.
Quick calculation, absolutely unscientific:
Seeing that the price is $0.14/hour for the 6-core Xeon and $1.08/hour for the 176-core P8, the P8 would have to be roughly 8-10x faster to justify the cost difference (the raw price ratio is 1.08 / 0.14 ≈ 7.7), and I'm not sure that will be the case.
the thing you're getting here is primarily throughput on a single image. Even if it's more expensive per-core per-hour, you can't discount that you'd have to work a lot harder to get the equivalent 30-box distributed solution to work properly, and even then it would have certain disadvantages owing to network latency.
This is interesting and I'd like to hear more opinions on it. My impression is that distributed computing has been eating Power/Sparc/Z processors' lunch for a long time now because software has made up for the deficiencies of coordinating 30 boxes. Do you and do any others believe that we are at an inflection point where the pendulum swings back in the direction of 'high-performance' processors like Power8, or will improvements in 'scale-out' ease-of-use and economies of scale continue to win the day?
The dominant use case for the last decade or so has been web servers hitting caches to do low-CPU low-causality CRUD operations. That looks unlikely to change in the next decade, so keep your Intel stock.
That said, for a lot of interesting use cases, like that king-hell postgres database sitting in the middle of the swarm, or video processing, or streams processing, or indeed any situation in which thousands-to-millions of simultaneous actors need to work on the same shared state, this sort of system starts looking real interesting.
As a thought experiment, think of this system like a GPU, except every single processor is a fully capable 2 GHz i5 running Unix, and instead of having to deal with the CUDA or OpenCL API, you can just write erlang (or haskell; .. or whatever) code and it will run. And instead of having 2-8G of RAM, you have 48G. And instead of having arcane debug tools, you have recon and gdb and ddd.
I don't think there is a pendulum, I think there's a spectrum and has always been one; pragmatism should always rule, and your use case is not my use case. There isn't going to be an objective winner ever, no matter how close Intel may get to covering much of the sweet spot.
I would be extremely surprised if you needed 30 x86 boxes to reach the performance of a P8 box, on any type of workload. In my experience with P5-P7 they can be faster than x86 for certain workloads, but not by that much.
You can't really compare the different chip revs apples-to-apples. P6 was a completely different chip architecture with much higher clock speeds that IBM abandoned because it didn't perform well. They make a lot of changes in each chip rev.
It is not just about Erlang, but about any language+runtime designed around a well-known set of sound principles (immutability, share-nothing, message-passing).
As long as the order of evaluation does not matter (as with pure functional code), some (possibly major) parts of a program can be evaluated in parallel by the runtime without any changes to the code, especially when higher-order function composition (map/filter/reduce, etc.) is the primary pattern. So Haskell, for example, will do it too.
By "will do it" you mean "would, in theory, support a compiler that parallelized significant amounts of idiomatic code with no source changes"? Maybe. I don't think any existing Haskell compiler does that, though.
Haskell will do what? Automatic parallelization of code?
The big difference between Erlang and Haskell is that Erlang fundamentally encourages the programmer to think in terms of individual processes that communicate by message-passing. Programmers have also been taught that spawning processes is cheap, so idiomatic Erlang code typically has tons of them.
I love my haskell brethren dearly, but haskell isn't organized structurally around an essentially-mandatory fundamentally concurrent, debuggable, quasi-preemptive core like OTP. Not that it won't get there eventually (cf. Cloud Haskell) but it's more than just being able to run concurrent threads. Note: this is not a language flame, just a personal opinion/observation, and I love and welcome correction.
Haskellers like to pick and choose their abstractions. That's why we have MVar and TVar (STM) and IVar and locks/semaphores and unbounded channels and bounded channels and software transactional channels and green threads and Async and OS threads…
You're going to see what Erlang did atomized, implemented piecemeal, and reconstructed. Cloud Haskell burnt too much goodwill in pursuit of a mostly pointless feature.
That's how we ended up with slave-thread…the beginnings of supervisor trees.
Haskellers would rather pick and choose. It's worked well so far.
You are right. Haskell is an academic research vehicle while Erlang is a pragmatic "industry-strength" (I hate these buzzwords, but they are useful generalizations) functional language.
In theory, we could rewrite almost any part of OTP in Haskell. In practice, however, "runtime is hard", and the biggest selling point of Erlang is that its runtime and standard library (OTP) have evolved for almost 20 years and are based on the same functional-language principles as Haskell (except, perhaps, laziness by default).
In other words, while Haskell is a "playground for researchers of type theory in the context of programming languages and compilers", Erlang/OTP is a "telecom industry-standard toolkit".
A close analogy, in my opinion, is the pair of Scheme/Racket and CL: the first was mostly an academic teaching and research tool, while the second is a pragmatic (but unfortunately bloated) tool-set.
Otherwise the x86 comes out looking a lot better -- lower 95th percentile, lower max, lower average.
(Modulo usual complaints about benchmark porn: single run, lack of standard deviation, unknown configuration differences etc etc).