ARM v8-A with Scalable Vector Extensions: Aiming for HPC and Data Center

dogma1138 · on Aug 22, 2016

Intel losing the HPC space to NVIDIA(GPU's) and now ARM they can't be happy.

Don't know how "profitable" the HPC market at least nation-state level supercomputing is, at least directly. But the indirect gains from it, prestige, contacts, and technology transfer during the codevelopment process are likely to be quite huge.

If AMD actually hits Intel where it hurts with Zen and we don't get a Bulldozen at the end like we got with well Bulldozer; Intel might be up for a rough ride, especially if it's 3D Memory technology which seems to be now the "next big thing" at least as far as Intel's development goes doesn't pans out that well.

semi-extrinsic · on Aug 22, 2016

Intel is in no way losing the HPC market. There are lots of algorithms (like all incompressible CFD, so all car/boat design, even sparse matrix multiply) that gets essentially zero benefit of GPU. At most, I'd say half the popular HPC use cases see significant GPU benefit.

Karl Rupp has a very good series of CPU and GPU performance comparisons, recently updated:

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-ch...

Most of the "no GPU benefit" stems from algorithms having low FLOPS per byte. Look up Collella's Seven Dwarves for more details.

And even if we assume GPUs become dominant, Intel is going to sell at least one Xeon for every two Teslas; those GPUs need host CPUs.

threeseed · on Aug 22, 2016

HPC space isn't that large but the related space of Big Data Analytics (incorporating Machine/Deep Learning) is massive. Every enterprise has a program in place to drive value from their data. And many are running programs in house due to privacy/security concerns i.e. they are purchasing lots of hardware.

Hadoop/Spark/Storm etc. which rely on spinning up lots of JVMs could work nicely on this ARM platform especially given that most jobs involving iterating on rows of data which the JVM will automatically leverage vector extensions to speed up.

dogma1138 · on Aug 22, 2016

Yeah I was more talking about the traditional HPC/Supercomputer level of engineering where CPU's still kinda rule, or at at least as important as GPU's.

In the more "smaller" enterprise level HPC market NVIDIA is kinda ruling today, there is almost no point of building a traditional CPU based HPC platform if you are an enterprise (or even a small/medium size education/research institution).

Intel is trying to get into that business with it's Xeon SOC's and Xeon Phi compute hardware but I don't know how much traction they are getting, NVIDIA probably by now pretty much controls all of that market and Intel and AMD are more or less rounding errors.

ARM can really give NVIDIA a fight if they can provide semi-custom ASIC's for HPC you can easily shove 100's of cores into a single server making it potentially comparable to NVIDIA's HPC offerings.

arcanus · on Aug 22, 2016

You seem bullish on ARM and dismissive of Intel or AMD, not sure why.

NVIDIA is certainly ruling the roost because of CUDA+deep learning, but advantages in chips don't tend to last, and I do not see why this time would be any different. Could be a very interesting few years in the space.

jdietrich · on Aug 22, 2016

As the giant in the room, Intel has further to fall. Intel's market cap is more double that of AMD, ARM and Nvidia combined.

Intel's real competitors are TSMC and GlobalFoundries, who are architecture agnostic. TSMC's market cap is ~90% of Intel's; GF are private, but they're probably in the same ballpark.

Intel is in a unique (and uniquely challenging) position as both an IP and fab company. This vertical integration has proved highly profitable for Intel, but it is also highly risky. Their recent decision to start manufacturing ARM chips points at this uncertain future - diversification is a high priority if they are to reliably recoup the costs of <10nm fabs.

The desktop market that Intel dominate is dying a slow death. Intel has failed miserably in the mobile market and are under threat on several fronts in the server and HPC market. x86 has been a tremendous asset for Intel, but it could easily become a hindrance. Intel is really two very different businesses (IP and fab) that are shackled together.

dogma1138 · on Aug 22, 2016

I'm not dismissive of Intel, just currently the smaller scale HPC market is focused on GPU's. And in the GPU market as far as compute goes NVIDIA is the champion.

Intel is great but their current HPC/X86 offerings aren't that great. Xeon Phi and the Xeon SOC's might eventually come out swinging, but currently they aren't.

AMD on the GPU side is pretty stagnant it has almost no answer to the Tesla based servers, and on the software side NVIDIA's software stack which includes CUDA is simply better. On the CPU side we need to see where Zen leads, but even then AMD will have a competitor to the traditional Xeon platform rather than HPC.

ARM is probably more scalable especially since you can easily create semi-custom SOC's for HPC and put a lot of pretty powerful and considerably smaller cores in a server. Intel may get to that point with the Xeon products but it is not there yet.

semi-extrinsic · on Aug 22, 2016

I think you're only seeing a small part of the smaller scale HPC market, probably the deep learning or computational chemistry?

If you go ask people working on car aerodynamics or geophysics or reservoir modelling for the oil industry, they're mostly all CPU cluster users still.

gnufx · on Aug 22, 2016

It's not even right for computational chemistry, let alone all the other things you run on a typical university system, say. (I don't see why the size of the system is relevant, and obviously some GPU capacity may be sensible.)

pjmlp · on Aug 22, 2016

Tooling and supported languages also play a big role in CUDA's adoption.

SixSigma · on Aug 22, 2016

Intel just announced they will be fabbing ARM chips

http://www.bloomberg.com/news/articles/2016-08-16/intel-lice...

dogma1138 · on Aug 22, 2016

Intel is fabbing a lot of things that aren't necessarily their "IP", they like to keep their Fabs busy and running at full capacity even the older ones. Arguably Intel is still making the best Fabs out there, their process tends to be the best in even when compared to TSMC/GoFo and IBM and they are very good at optimizing their entire fabrication process (the non-lithography side) so their Fabs just tend outperform the competition.

Tho that said Intel doesn't want to morph into a Fab as a service company, they just don't want to leave money on the table if they can't fill their 14nm and 10nm Fabs with their own IP/silicon they build it for some one else. It's even more so important now since Intel's dies are pretty darn big even as it comes to 14/10nm optimizing processes on smaller dies can be quite valuable to them. And considering they are dropping their "Tick Tock" methodology or at least altering it somewhat their Fabs can outrun their product line which is never good.

mtgx · on Aug 22, 2016

Which only means ARM chips will become even more competitive with x86 chips. Intel seems to want to be in the "dumb chip manufacturer" market (kind of like how ISPs are "dumb pipes") - because if this business pans out for them, then the x86 business will struggle even more as a direct consequence of it. x86 was barely competitive with ARM at the same performance and power consumption level when Intel was using its manufacturing edge against ARM chip makers. If that gets equalized, then ARM chips should win big against Intel's own chips.

However, I expect Intel to try and shoot itself in its own foot in this business, for the same reason they killed Xscale previously. Because Intel is at a cross-point now (pun intended). They have to make a choice - either they "go big" with ARM chip manufacturing, or they try to protect the interests of the x86 division. They may put all sorts of caveats on ARM chip customers that want to buy Intel's manufacturing, so their chips can't compete directly with Intel's own x86 chips. But that will significantly hurt the potential of its ARM chip manufacturing business - so then may have to ask themselves, why are they even bothering with ARM chip manufacturing, if they're not going to go all-in on it?

So either Intel repeats its Xscale mistake by protecting the x86 chip business, and risks being completely left out of the mobile/IoT markets forever (in any capacity), while also losing a few more billions of dollars with this "ARM manufacturing experiment". Or it goes all-in with ARM chip manufacturing, manufacturing anything from IoT ARM chips to Xeon ARM competitors for anyone from MediaTek to Qualcomm, and then its x86 risks a severe decline over the next decade.

Either way, Intel has to make a choice. They can't have their cake and eat it, too, no matter how much they'd wish that to happen right now.

dogma1138 · on Aug 22, 2016

Intel "dumped" Xscale on Marvell but it didn't stop manufacturing them, Intel is still manufacturing for Marvel and for many other companies. Besides Marvel Intel has been manufacturing FPGAs for Altera, and at least 5 other companies are using Intel Fabs to make their own silicon (Achronix, Tabula, Netronome, Microsemi, Panasonic).

http://blogs.wsj.com/digits/2013/05/01/microsemi-emerges-as-...

In fact one can argue about the brilliant move Intel made here, Intel opens up it's Fabs to 3rd parties, various companies including the leading FPGA designers jump onto it. 3-4 Years later Intel starts exploring integrating FPGAs into it's Xeon CPU's and SOCs, that's not coincidence when you fabricate something you have technology transfer even if it's completely "unintentional", furthermore Intel gains a lot of experience in fabricating and optimizing FPGAs silicon and they do it on some one else's dime while being paid heftily for it.

So I don't really think your conclusion of Intel's strategy is correct, Intel can very well keep the cake and eat it, Intel makes ARM CPUs and SOCs Intel gains experience and knowledge in producing and testing new silicon, Intel gains access to technologies that it could later implement in its own line of X86 CPUs to further optimize their microarchitecture for multiple cores, asymmetrical CPU designs, power efficiency etc.

So I don't see Intel being at a crossroads, they don't have to go big with ARM or go home, they just need to get enough ARM chips out the door to be profitable and considering Intel's fab efficiency it's not hard for them to do, and in the meantime gain enough knowledge and experience to improve their own IP and designs. Intel has made the bet once that I would take ARM just as long to get to X86 performance levels as it would take Intel to get X86 down to ARM's power levels and they made the right bet at the time, and I'm pretty sure they also made the right bet this time around also.

pjmlp · on Aug 22, 2016

Still I would like for them to keep improving the x86, otherwise we will eventually just switch our chip overlord.

pcwalton · on Aug 22, 2016

AArch64 is a very clean ISA. From an architectural point of view, I wouldn't be upset if AArch64 eroded away x86-64.

Of course, I don't want to sacrifice competition. It'd be awesome to see Intel and ARM competing in features, performance, power, and price on a simple target for compilers and programmers.

(Yes, I know as a practical matter x86-64 will never go away; it's simply that I like the direction the market is going.)

mtgx · on Aug 22, 2016

XPoint is not an Intel-only technology anyway. At the very least Micron will be selling it, too (as QuantX memory), so AMD or others could buy that, if needed. Micron also tends to price things significantly cheaper than Intel does, so that ought to help as well.

pella · on Aug 22, 2016

with code examples:

"Little ARMs pump 2,048-bit muscles in training for Fujitsu's Post-K exascale mega-brain"

http://www.theregister.co.uk/2016/08/22/armv8_scalable_vecto...

monocasa · on Aug 22, 2016

Those code examples are all NEON or no vector ops. No SVE unfortunately.

0xFFC · on Aug 22, 2016

I would love see more attack from ARM toward Intel in different sector. I would love to see ARM SoC for desktop users.

Intel is kinda company which is used to ripping off people without serious competitor.I really like the idea of competition in CompArch area.

Lets be honest, before smartphone raise, Intel was socking people's blood and in most area, they didn't have any serious competitor.

I know ARM was selling small chips, but AFAIK (I might be wrong) those market is not even close to servers,desktops, etc market's which Intel was king for long time.

SixSigma · on Aug 22, 2016

Maybe this is part of the reason Intel decided to re-license ARM and get fabbing

http://www.bloomberg.com/news/articles/2016-08-16/intel-lice...

cesarb · on Aug 22, 2016

Does anyone know how this compares with Hwacha or the still unreleased RISC-V Vector extension?

pella · on Aug 22, 2016

https://community.arm.com/groups/processors/blog/2016/08/22/...

ericvh · on Aug 24, 2016

More details in the upstreamed binutils support: https://sourceware.org/ml/binutils/2016-08/msg00166.html

g0xA52A2A · on Aug 22, 2016

Anyone have details on the cache design for this? 2048 bits is much larger than most cache lines in chips today.

beautifulpeople · on Aug 22, 2016

Looking at the article, I'm not sure 2048-bit lines are quite needed (not to say that that wouldn't be interesting), from (http://www.theregister.co.uk/2016/08/22/armv8_scalable_vecto...): "And once a program has been built for SVE, it will run comfortably on any SVE-capable processor without recompilation, whether the CPU has support for 512, 1,024 or the full 2,048 bits. The SVE unit can automatically break a 2,048-bit vector into, say, four 512-bit vectors if its silicon implementation doesn't support the full length." This paragraph from The Register implies that you could have smaller chunks or larger depending on the silicon implementation. If you look at a 64-byte cache line (what most architectures have today, power & itanium are notable exceptions) that would mean 512-bits per line (assuming you can use the whole line, i.e. packed). For 2048-bit that means 4 cache lines worth of data could potentially be operated on at once.

beautifulpeople · on Aug 22, 2016

Also, remember that we're all DDR burst length limited. Unless you change the DDR model, most are 8 bytes x N-length so you'll only get 64-bytes unless you can get the standard changed. If you think about re-use of cache lines though, for most apps part of the data will be in cache and part of it will not be. so likely 64-byte lines will still work well for most applications, it'll be interesting to see what ARM and their partners do going forward. Intel it seems has gone the HMC route, wonder how many others will follow the same route.

gpderetta · on Aug 22, 2016

Sometimes these large vector architectures are fed directly from L2 or even LLC, whose line size does not necessarily need to match L1.