The Future Google Rackspace Power9 System (nextplatform.com)
136 points by jonbaer on April 10, 2016 | 85 comments



TheNextPlatform is a pretty bad site. They rehash - badly - information available elsewhere, and add a hyperactive spin on it all.

Here's the truth:

Google uses lots of compute power (insightful!)

Google isn't shifting to Power.

Google does have an active R&D program looking at Power.

TheNextPlatform misses the whole point here: that Zaius board has 32 DDR4 slots (commercially available servers from, e.g., Dell max out at 24), and it has 2 NVLink slots! (!!)

Those NVLINK slots are what Intel should be worried about, because that's where Google is prepared to pay money. They are building computers that lock themselves into NVidia and doing it gladly.

Intel better find a way to compete with NVidia on deep learning.


For anyone else wondering what NVLINK is, some links to save you a Google search:

https://blogs.nvidia.com/blog/2014/11/14/what-is-nvlink/

http://www.tomshardware.com/news/nvidia-nvlink-boosts-perfor...

I'd be curious to hear what Intel is developing to compete with this.


To add to the above, from Wikipedia [0]:

NVLink is a communications protocol developed by Nvidia. NVLink specifies a point-to-point connection between a CPU and a GPU and also between a GPU and another GPU. NVLink products introduced to date focus on the high-performance application space.

and [1]:

NVLink – a power-efficient high-speed bus between the CPU and GPU, and between multiple GPUs. Allows much higher transfer speeds than those achievable by using PCI Express; estimated to provide between 80 and 200 GB/s.

[0]: https://en.wikipedia.org/wiki/NVLink

[1]: https://en.wikipedia.org/wiki/GeForce#PASCAL


Do you have similar links for CAPI? I feel like that, along with NVLink, seems to be present in all the POWER literature.


What's the current state of Power development in the Linux kernel like? I thought it was only IBM holding the fort (via ozLabs) but this could be a big boost.


Why does Facebook's Open Rack use a nonstandard rack size? That seems like an obvious barrier for adoption of hardware that was designed to be a commodity.


Racks were designed to fit telecom hardware originally. The Open Rack size is designed around common computer hardware sizes.

It isn't a barrier for adoption because swapping racks out of a datacenter is easy, and they fit on standard datacenter floor tiles.

What is a barrier is that damned 48V.

Disclaimer: I run a hosting company.


> What is a barrier is that damned 48V.

Interesting. What is the issue with 48V? The equipment for it seemed to be overpriced: I remember pricing out some gear, and as soon as a 48V power option came into play, the price rose quite a bit.

Or is it that the voltage is not high enough to be efficient for a large data center?


The voltage is high enough. Facebook's solution works like this: you have triplets of racks. The left and right racks hold computers; the middle rack holds networking, power distribution, and the UPS.

The computers have extremely simple power supplies that basically can't fail; they just DC->DC convert from 48V down to 12V, 5V, and 3.3V. The large-scale power supplies that convert the datacenter's three-phase 240V (or whatever you're supplying it with) down to 48V are much more efficient than the ones that would have been in each server (which you would usually have fed something like single-phase 208V).

Redundancy is supplied by just hooking multiple transformers in the middle rack to the + and - terminals on each PSU, instead of a convoluted multi-module redundant PSU (which always uses a single backplane, and backplanes in redundant PSUs fail surprisingly frequently).

The total round-trip efficiency of this system is about as high as you can realistically get. An 80 Plus Titanium supply is 90-95% efficient (depending on load), but there are additional losses in rack-level distribution, which the 48V design tries to correct.
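
To make that concrete, here's a rough sketch of the comparison; the stage efficiencies are illustrative guesses on my part, not Facebook's published numbers:

    #include <stdio.h>

    int main(void) {
        /* Conventional path: double-conversion UPS (~92%, assumed)
           feeding an 80 Plus Titanium PSU in each server (~94%). */
        double conventional = 0.92 * 0.94;

        /* Open Rack path: one large facility rectifier/UPS (~96%,
           assumed) feeding simple 48V DC-DC converters (~97%). */
        double open_rack = 0.96 * 0.97;

        printf("conventional: %.1f%%\n", conventional * 100.0); /* ~86.5% */
        printf("open rack:    %.1f%%\n", open_rack * 100.0);    /* ~93.1% */
        return 0;
    }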

However, 48V DC can be very dangerous to work with, and a lot of tech workers refuse to work with it. Whether you believe it is dangerous or not (I've seen arguments stating that it is no more dangerous than single-phase 208V) is immaterial; this is the opinion of a lot of workers.

48V gear is expensive if you're not in a datacenter already set up to handle it. Facebook obviously doesn't have this problem because they build entire datacenters from scratch.

I personally don't believe in it because it doesn't buy me anything that single-phase 208V doesn't give me; I don't pay enough in electricity to justify the overhead of dealing with it.


It is not the voltage, it is the high number of amps that these DC systems carry.

While you could get fried from a single rack server's failing power supply, it is likely that it will die of some other cause before zapping you if you take simple precautions.

DC on the other hand, puts "the forces of nature" up close and personal to the back of the rack.

A wrong move will vaporize the thickest of metal screwdrivers and can easily do irreparable damage to a human.

Something as simple as not removing your wedding ring can result in an inadvertent bridge between + and - ; after which, bad things happen.


Correct, 48VDC exchanges volts for amps. Basic electrical theory states volts * amps = watts. We measure computer power usage in watts for a reason.
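
A quick back-of-the-envelope illustration (the 15 kW rack load is just an assumed figure):

    #include <stdio.h>

    int main(void) {
        double watts = 15000.0;  /* assumed rack load */
        /* I = P / V: the same power at a lower voltage means
           proportionally more current. */
        printf("at 208V: %.0f A\n", watts / 208.0); /* ~72 A  */
        printf("at 48V:  %.0f A\n", watts / 48.0);  /* ~313 A */
        return 0;
    }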

48VDC isn't sufficient to vaporize a metal screwdriver at the amperages used in most 48V datacenters. However, I know people who work with higher voltages than that, and they own expensive ceramic tools just for safety reasons. If I personally ran 48VDC and my workers requested ceramic tools, I would not hesitate to expense those.

And yes, if you're working around DC above 12V, you should seriously be using every method available to make sure you don't become a conductor, including only using one hand at a time to touch power relays (to avoid having your heart stopped).


48V danger is debatable, but it might be a good idea to start adding RCD protection to the power sources. (Yes, this could bring your rack down, or just one output, but that's better than having people stuck to wires, or tools causing short circuits.)


Is this possible? Fault currents are much more obvious at a higher voltage than lower, and I thought most RCDs would not be made to be sensitive enough to catch what might be a dangerous DC leakage.


Yes (though I'm not sure whether existing products do this), because RCDs work based on the difference between the current leaving and the current returning. (Between 48V and 110V there isn't an insurmountable difference in the detection capability needed.)

Theoretically you should have zero leakage, and you should also trip on bigger currents (but a rack wouldn't use more than 10A maybe?)
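
The core check is simple enough to sketch; the 30 mA trip threshold here is borrowed from common AC RCDs, and a real DC design would pick its own:

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Trip when the outgoing and returning currents differ by more
       than the residual threshold: the missing current is leaking
       somewhere it shouldn't be, possibly through a person. */
    bool rcd_should_trip(double i_out_amps, double i_return_amps) {
        const double threshold_amps = 0.030; /* assumed 30 mA */
        return fabs(i_out_amps - i_return_amps) > threshold_amps;
    }

    int main(void) {
        /* 9.000 A leaves, 8.950 A returns: 50 mA is going astray. */
        printf("trip: %d\n", rcd_should_trip(9.000, 8.950)); /* trip: 1 */
        return 0;
    }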


RCD? Yeah, it's usually the amps that kill you, not the voltage.


Very similar to why the automotive industry is reluctant to switch from 12V to 48V


Swapping racks may not be possible if you don't own the racks. Lots of datacenters are built with their own racks. You might be able to ask for extra deep racks but that's it.


It isn't about rack depth, but rack width. They are deep, but not unusually so.

Datacenters are not "built" with racks. They are not permanently affixed to the floor. Most datacenters do not keep empty racks on the datacenter floor, and keep the floor open.

If you are renting by the rack, no, most datacenters won't swap racks for you. You'd need to be renting entire cages for them to consider it.

Open Rack isn't really useful for small scale providers, it's more use to hyperscale companies like Facebook, Google, Amazon, and Microsoft. I don't think anyone we'd consider "medium scale" has adopted it (if I'm wrong, I'd love to see a story hit the front page about it).

Only one of those companies has adopted it; the others have considered it and, although they manufacture custom hardware and thus could take advantage of it easily, they have not done so.


I think Amazon does not want to use Open Rack; I'm not sure they've even considered it. They are in a position to negotiate good prices from the normal x86 server vendors, or to have them produce a modified version of a normal server (without the things Amazon does not need). I am curious what they are up to nowadays.


Can you explain how and why 48V is a barrier to begin with?


The Open Rack’s equipment bay has been widened from 19 inches to 21 inches, but it can still accommodate standard 19-inch equipment. A wider 21-inch bay, however, enables some “interesting configurations”, like installing three motherboards or five 3.5-inch disk drives side-by-side in one chassis. The outer width of the rack has remained a standard 24 inches to accommodate standard floor tiles.


They are going up against the coming Xeon E5 Broadwell + FPGA. Power9 does offer more memory per rack, but I don't see why Intel can't adapt with a better memory controller.

To put it simply, what is the incentive to switch over to the Power9 platform?


Having viable alternatives to Intel would be very helpful for Google when negotiating bulk rates for Xeon processors.


Yeah exactly. Google waves these around from time to time to have something over which to negotiate.


Kind of like how Google does with Google Fiber? They seem to engage in it just enough to keep carriers in check in limited markets, but not much more.


If Zen is as good as rumored, then waving an AMD Zen server processor around would be much more useful.


Why? Intel can compete with Zen more easily, and it can always rebound even if Google does go with Zen.

But if Google switches to an entirely different ecosystem, dragging it back to x86 won't be easy, because all of their platforms will have been built for a completely different architecture.


What's the Zen offering? This is the first I've heard of it. Unfortunate name, considering there's already Xen.


The next-generation microarchitecture from AMD, which promises a 40% IPC improvement. Again, it won't be as fast as Intel's, but at least it puts AMD within reach for a price war, whereas now, even if AMD is 50% cheaper, it makes little sense to use them.


So this raises all sorts of questions: Can Intel be fast enough in integrating Altera (software + hardware + corporate culture)? Which has the better FPGA development environment, with more developer share, etc.? FPGAs can be cannibalistic to Intel's business - will they have an incentive problem? Do some companies (say, in China) prefer an open processor like POWER, and will this create some ecosystem advantage? Are there any advantageous startups to buy, like Kandou Bus (faster interconnects), and who will buy them?

So it's not certain Intel will win.


Yes, and I think Intel is not certain to win, just much more likely to. The Power9 here is targeting a 2H 2017 release, which actually puts it up against Intel's Skylake/Kabylake Xeon Purley platform in a similar timeframe.

Purley Platform, Skylake Xeon offers:

- Up to 8 sockets, 28 cores per socket
- 6-channel memory controller, 12 DIMMs per socket
- Support for Intel XPoint NVDIMMs
- 48 PCIe Gen3 lanes, OmniPath 100G connection

That offers up to 1.5TB of memory on a 2S server, or 6TB of XPoint. If you push to the limit with 128GB TSV DRAM and 512GB XPoint DIMMs, that is a potential 3TB of DRAM and 12TB of XPoint on a 2S server.
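
The arithmetic behind those figures, assuming all 24 DIMM slots of a 2S box are populated with one module size (the 256GB XPoint module behind the 6TB figure is my inference):

    #include <stdio.h>

    int main(void) {
        int slots = 2 * 12; /* 2 sockets x 12 DIMMs per socket */
        printf("64GB DRAM:    %5d GB\n", slots * 64);  /*  1536 GB ~ 1.5TB */
        printf("128GB DRAM:   %5d GB\n", slots * 128); /*  3072 GB =  3TB  */
        printf("256GB XPoint: %5d GB\n", slots * 256); /*  6144 GB =  6TB  */
        printf("512GB XPoint: %5d GB\n", slots * 512); /* 12288 GB = 12TB  */
        return 0;
    }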

Not to mention Intel's network controllers. The whole Intel cloud ecosystem is actually quite amazing, both in hardware innovation and in the software and compilers they are working on. It is the same lock-in as the PC/Windows industry: unless you get a dramatically new way of doing things, you can't simply switch the mobile industry to x86, or vice versa, all by yourself, even if you are as big as Google. Then you get 10nm Intel servers in 2018/2019.

Again, I don't see the incentive to make the switch.


I don't have the link handy, but within the last month or two there was a story here on HN about a company (Facebook?) that determined it was better to go small on hardware. Which is to say that one- or two-socket servers were more of a sweet spot than 8-socket monsters, at least for generic loads.


That was Facebook, and it backs up the same reason why ARM hasn't managed to penetrate the server market yet. The story was about web servers, which need not a lot of CPU power, a decent amount of memory, and good networking. When anyone thinks of low-power CPUs, they immediately think ARM has a fighting chance here. But before ARM even got a foothold in the server market, Intel responded with an Atom server CPU. It turned out the market needed a more powerful CPU than they thought, so Intel came back with Xeon-D, which has 2 to 8 cores and integrated 10Gbps Ethernet support at low power consumption. It was an instant hit and is now selling like hotcakes; they have recently updated it to as many as 16 cores. You can stack 8-10 of these in a 2U microblade.

Again, when you take into account power, performance, and the other server components, the CPU's share of TCO is relatively small. Even if you gain a 10% TCO improvement, you have to factor in the future roadmap of the CPU, as well as the software development, compiling, and testing costs involved.


FPGAs are trash compared to ASICs.


Only to the extent that you can afford to spend $1 million+ and a year every time you change your algorithm. For bitcoin mining, encryption, or decoding popular video formats, yes, ASICs are absolutely the way to go. But there are many cases where the algorithms you're using aren't so fixed, or where you're not willing to put up with such long lead times.


Depends on what you are trying to do.

If you want to run an entire CPU on an FPGA, ASICs will clearly be better. FPGAs are mainly useful for things that need to be (re)programmable.

For things like programmable network routing/packet filtering, FPGAs could be a very effective solution.


If one has the financial option to diversify, one would be wise to use x86, ARM and POWER at the same time. There aren't many examples where monoculture has been beneficial to anyone but the artificially selected culture.


Broadwell + FPGA? Is there an Intel design that has both? Can you clarify?

Also what is the issue with the memory controller?


Yes, the first design will be out this year; a simple Google search should bring you lots of info. It will be a separate die on the same package. A true integrated single-die solution is on track for 2017.

Power8 and Power9 have a better memory controller than Intel's Xeons: more memory channels, higher bandwidth, and higher memory capacity.


Thanks!


I would be interested to hear more about this IO issue. Am I right to assume that, because of the genesis of the x86 architecture in desktop computing, it is not optimized for server-class IO, and that this permeates the design (i.e., it's difficult to catch up with a ground-up server architecture)? If that's true, then this is a big deal for Power. Certainly my big-data workflows are usually memory/IO bound, not compute bound.


I wonder if the inclusion of NVLink in Power 8+ will cause Power to excel in ML applications. It could well be quite a bit faster than x86 just due to the memory/interconnect bandwidth.


NVLink and CAPI[1] both have huge potential for machine learning. However, a lot of the benefits of NVLink for ML come from GPU-to-GPU NVLink, which doesn't require CPU support.

1. CAPI doesn't seem to get mentioned much around here, but imagine an FPGA directly accessing some shared system memory. It's neat.


Yeah, it's neat. (I work on stuff that exploits this). We open-sourced the software side of our first flash IO accelerator last year. [1]

You can do some pretty cool things from a HW designer's perspective inside the accelerator, and in the main application. Since the accelerator is cache-coherent, and able to map the same virtual addresses as a given process (and attach to multiple processes' address spaces) the device can do "simple" things like follow pointers, which used to require building a command / data packet, DMA'ing it to the device, and then waiting for a response packet. This, effectively, frees up the main CPU to do other things, rather than wrangle data. It also means that bottlenecks move.

[1] https://github.com/open-power/capiflash


So the idea is to present NAND as a memory device rather than a block device?


It does seem to require support in the PCIe host controller, which, for both modern Intel and those POWER machines, is on-die on the CPU.

So, it "requires" CPU support, just not in the way usually meant.


Correct; the CPU and the end-point accelerator both must cooperate to negotiate the CAPI link.

Disclaimer: I work on this with some very smart people @ IBM. Opinions are my own.

When a PCIe device is in CAPI mode, the PCIe protocol is used as a transport layer, but the CAPI protocol rides on top, and hardware in the CPU's PHB (the CAPP unit) and hardware in the accelerator (the PSL in this case) cooperate to present the common address space to the process and to the accelerator itself. [1] If a CAPI-capable card's plugged in to a non-CAPI-capable slot, it remains a PCI card. If a non-CAPI card's plugged in to a CAPI-capable system, it remains a PCI card. If both sides match on protocol versions and the kernel contains the cxl driver, the kernel will switch the slot into CAPI mode, and the CAPP unit and PSL effectively take over the PCI link on either side.
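
In other words, the decision boils down to something like this (the names here are mine, made up for illustration, not the cxl driver's actual internals):

    #include <stdbool.h>

    /* All names are illustrative stand-ins. */
    typedef struct {
        bool card_has_psl;    /* PSL present on the card       */
        bool slot_has_capp;   /* CAPP behind this PHB/slot     */
        bool kernel_has_cxl;  /* cxl driver loaded             */
        bool versions_match;  /* protocol versions negotiated  */
    } link_state;

    /* Any missing piece leaves the device as a plain PCIe card. */
    bool enter_capi_mode(const link_state *s) {
        return s->card_has_psl && s->slot_has_capp
            && s->kernel_has_cxl && s->versions_match;
    }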

[1] http://events.linuxfoundation.org/sites/events/files/slides/... - see page 13+ for some GPU / NVLink materials, and page 24+ for CAPI materials. (Oh, and I worked on the product whose data is quoted on page 29 [2]!)

[2] https://www.ibm.com/developerworks/community/blogs/fe313521-...


For huge datasets the GPU-CPU links might become more important with Pascal now that the GPUs are allowed to trigger page faults.


Thinking about this some more: there's no way the latency of PCIe would ever be as big as that of paging memory in from disk. So page faults can't really make this more important in Pascal.


Can you elaborate on how CAPI/NVLink would be beneficial to an ML workload?


NVLink is similar to a higher-bandwidth PCIe connection, except that multiple GPUs can be connected with it. It's primarily useful for very large convnets, which use a lot of memory and can be bandwidth-limited. It doesn't require any particular modifications to a model or framework to take advantage of it.
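
As a rough illustration of why the bandwidth matters (the link speeds are ballpark figures: ~16 GB/s for PCIe 3.0 x16, 80 GB/s from the low end of the NVLink estimate above; the 4 GB transfer size is made up):

    #include <stdio.h>

    int main(void) {
        double gigabytes = 4.0; /* assumed data to move per step */
        printf("PCIe 3.0 x16 (~16 GB/s): %.0f ms\n",
               gigabytes / 16.0 * 1000.0); /* 250 ms */
        printf("NVLink (~80 GB/s):       %.0f ms\n",
               gigabytes / 80.0 * 1000.0); /*  50 ms */
        return 0;
    }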

CAPI is much more flexible and interesting. It allows a CAPI-capable connected device access to a process's virtual memory. Essentially, you can extend the CPU's capabilities with CAPI. Usually this would be an FPGA (and the utility of FPGAs for machine learning is very much a research topic), but I could easily see a DSP being useful for voice recognition. GPUs can take advantage of it too, but ML work is usually just offloaded entirely to the GPU.

CAPI is very, very cool, and it was designed by some very smart people. I'm excited to see what people will do with CAPI and FPGAs.


Oh is CAPI an onboard FPGA with a memory controller?


CAPI allows an FPGA connected via PCIe to be treated as a coherent peer to the CPU cores that is able to hold cache lines and also use address translation. Among other things, from the application programmer's perspective, the CAPI accelerator can basically be treated as if it were another thread, since it can use the application's virtual address space - the application can set up data structures in main memory and pass unmodified pointers to the CAPI card.
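
Here's a sketch of what that programming model buys you; the afu_* helpers are hypothetical stand-ins (stubbed out so this compiles), not a real AFU or libcxl interface:

    #include <stddef.h>
    #include <stdio.h>

    struct node { int value; struct node *next; };

    /* Hypothetical AFU interface; these stubs just simulate the
       pointer chase on the CPU. A real CAPI accelerator would do
       the walk itself, using the same virtual addresses. */
    static int afu_result;
    static void afu_start_walk(const struct node *head) {
        afu_result = 0;
        for (const struct node *n = head; n != NULL; n = n->next)
            afu_result += n->value;
    }
    static int afu_wait_result(void) { return afu_result; }

    int main(void) {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        /* No command packets, no DMA staging buffers, no physical
           address translation: just hand over the unmodified pointer. */
        afu_start_walk(&a);
        printf("sum = %d\n", afu_wait_result()); /* sum = 6 */
        return 0;
    }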

http://www-304.ibm.com/webapp/set2/sas/f/capi/CAPI_POWER8.pd... is a good intro.

[Disclosure: I work on CAPI at IBM]


Thanks for the link. The paper mentions key/value stores. Would a valid use case for CAPI be something similar to a "flash cache", where the FPGA is not as fast as DRAM but still faster than NAND flash?


IBM reps love to throw around the "Google is switching to IBM" line. Can they possibly compete with Intel on price? Why isn't AMD trying to reach this market?


"Why isn't AMD trying to reach this market?"

https://en.wikipedia.org/wiki/Zen_(microarchitecture)


AMD would still have an AMD64 architecture, though? Or are you thinking they should come up with a new competing architecture?


These POWER chips are an open design; thus, anyone who wants Google as a customer and has a chip fab ready to go (like, say, AMD) could, in theory, just fab some up to sell to Google.


AMD doesn't have a chip fab - they are fabless now.


Oh wow. Is there a good reason for this?


Maintaining a fab when you can't justify enough orders for chips to keep it running around the clock is expensive. As part of the spinoff they had penalties because they weren't purchasing enough wafers from GF anyway, but it resulted in less bleeding than owning and managing it as a subsidiary.


They're broke and they need at least $750MM to $1B in the bank to operate effectively as a company.


Neither does IBM as of last year - all IBM POWER chips are fabricated by GlobalFoundries.


Are you sure? GlobalFoundries is owned by AMD, no?


No, it was spun off. AMD initially retained a minority stake, but sold their remaining shares a few years ago.


Wasn't Google at one point all about commodity/consumer level hardware for their servers? Seems a huge turnaround.


This is still largely at commodity prices / performance points. It's been quite some time since any of their hardware has looked consumer-oriented, but comparing this to what enterprises buy, it's apples and oranges.

[1] http://shop.oreilly.com/product/0636920041528.do

[2] http://research.google.com/pubs/pub35290.html


I am surprised. I thought 64-bit ARM was the newness headed to the server farms.


The ARM instruction set is pretty mature, but the system architecture for servers is less so, IMO. I think there's little commonality for bootstrapping the various SoCs.


It will be there eventually. It is definitely not there now, despite what some may have you think :)


Do you have some data to back this prediction? What is the biggest advantage over x86 server processors?


Which, that it isn't there now? Or that it will be there eventually?

The x86 server processors have too much legacy they can't get rid of, and that limits how far they can push it.


Is there enough juice in ARM chips to power servers and compete with Intel offerings like Skylake/Haswell?


Kinda amazing that they can fit 2 Power9's as well as 2 FHFL PCIEx16 slots, along with 15 drives and 2TB of memory in 1 rack unit.


Hey Google, sell these to other companies :)


Well, these are not power9, but in theory... :P

http://www.penguincomputing.com/products/rackmount-servers/o...


Anyone have any idea on the rough prices of these systems?

Or is it the "if you have to ask, it's too much for you" as seem to be the case with the IBM power systems?


Talos plans to, if demand allows, sell you a bare-bones POWER8 (CPU, heatsink, and mainboard) for $3,700 USD. https://raptorengineeringinc.com/TALOS/prerelease.php


Sadly that's $1,000 more than they were originally talking about; still low compared to IBM's prices, but much harder to justify for most applications.


Yes, it's very hard to justify when a high-end 4-core Xeon outperforms it on some benchmarks [0] and costs half as much or less. Not to mention the vastly more mature open-source floating-point, compiler, etc. support, as well as existing (invalid, but still relevant) programs that assume x86isms.

[0]: https://www.phoronix.com/scan.php?page=article&item=talos-wo...


Most POWER8 stuff does seem to be in the "contact us" category. The only reference point I know of is that Tyan has (had?) a reference platform for 2850 USD, but that's basically beta quality and not intended to be production hardware.

http://www.tyan.com/campaign/openpower/


IBM has public pricing for most of their POWER8 servers. The price is high, but public. :) http://www-03.ibm.com/systems/power/hardware/linux.html


Some IBM Power Systems machines can now be ordered online (at least for US customers) - https://www.ibm.com/marketplace/cloud/big-data-infrastructur....

[Disclosure: IBMer]



It would be great if Dell / HP / Cisco / Lenovo and others started forging some POWER gear.

For a platform to succeed, it should provide a low barrier to entry - maybe a low-capacity P9 system at a low cost, ~$1K? That would be a better strategy for OpenPOWER.

I know Power is targeting cloud computing applications, but IBM should consider low-cost, entry-level gear to gain some market share at the low end, which can then transition to higher-margin markets.



