Intel Announces Knights Mill: A Xeon Phi for Deep Learning (anandtech.com)
94 points by scaz on Aug 17, 2016 | 56 comments



We badly need an alternative to Nvidia/CUDA for deep learning... but realistically, if Intel wants to make headway in the deep learning market, it must offer hardware that can not only compete on performance with Nvidia, but also work out-of-the-box (that is, without requiring lots of one-off tinkering and tweaking) with popular deep/machine learning frameworks like TensorFlow, Caffe, Torch, and Theano.

There is a lot of software infrastructure being built atop these frameworks, and switching costs are getting higher by the day. No one wants to use some kind of 'non-standard' fork of [name your DL framework of choice] customized for Intel hardware, because such a fork can quickly get stale in comparison to the upstream project.

Intel needs to be both better/faster and drop-in compatible with the popular frameworks.


TensorFlow has a mode that lets it run on CPUs. I'm sure other frameworks are the same. Isn't the whole point of Xeon Phi that it looks basically like an x86 CPU with a ton of cores? If so, there is almost nothing to port, just the kernel launching.
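For instance, with the 1.x-era TensorFlow Python API you can hide the GPUs entirely and let the whole graph run on CPU cores (a rough, untested sketch with a toy matmul):

    import tensorflow as tf

    # Hide any GPUs so the whole graph executes on CPU cores.
    config = tf.ConfigProto(device_count={'GPU': 0})

    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    c = tf.matmul(a, b)

    with tf.Session(config=config) as sess:
        print(sess.run(c))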

Granted, you do have to bother to make a fast x86 / AVX-512 port of your code. But because the shape of GPUs is so different from that of CPUs -- GPUs have a more complicated memory hierarchy, for one thing -- I kind of doubt that "just run your CUDA code on the Phi" is going to work well for nontrivial examples.

(Disclaimer, I work at Google on CUDA support in clang. Which is awesome, you should try it out. :) Google "cuda clang" for instructions.)


To elaborate: TensorFlow is clever because it takes your flow graph and cuts it into pieces, which it then executes on computational units. Any computational unit will do, really, as long as you have a backend for it. This choice makes adaptation to new systems much easier.
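Roughly, in the 1.x Python API that partitioning is exposed as device placement; a sketch (the device names are illustrative, and allow_soft_placement papers over a missing device):

    import tensorflow as tf

    # Pin different pieces of the graph to different computational units;
    # TensorFlow cuts the graph at the boundaries and inserts the transfers.
    with tf.device('/cpu:0'):
        a = tf.random_normal([512, 512])
        b = tf.random_normal([512, 512])

    with tf.device('/gpu:0'):   # any device with a registered backend would do
        c = tf.matmul(a, b)

    config = tf.ConfigProto(allow_soft_placement=True)  # fall back if '/gpu:0' is absent
    with tf.Session(config=config) as sess:
        print(sess.run(c))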

I believe Caffe and Theano use the same model, but I haven't studied them closely. There is also some similarity to what OCaml does with the Incremental library, though that is not for machine learning.


"Granted, you do have to bother to make a fast x86 / AVX-512 port of your code."

That is exactly the problem: no one has bothered to make a fast x86 / AVX-512 port for any of the most popular frameworks (at the upstream level, not in some fork), and no one has an incentive to bother, other than Intel. For example, as far as I know, none of the popular frameworks take advantage of Intel's MKL out of the box.

Right now, if you want out-of-the-box high performance, Nvidia hardware is your only practical choice.


This is also on upstream projects not to lock themselves into CUDA. Yes, it's great, but everyone suffers when there's only one supported API. Even more so when it's closed and locked to a specific vendor, as CUDA is.


I'd love an alternative to CUDA.

The problem is that, as far as I can see, OpenCL is in no way that. Basically, OpenCL gives me the impression that the oceans of boilerplate required both make development hard and effectively lock you into a specific vendor anyway, since the boilerplate ends up setting things up for one specific vendor.


Nope. I have some substantial simulation code written against OpenCL that runs on Intel OpenCL and NVIDIA without modification, and it is rather performant on both. The only vendor-specific part of the code is the platform selection, which is one line of code.
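For what it's worth, in PyOpenCL the vendor-specific part really can be about one line; a sketch (matching on the 'Intel' substring is just an example):

    import pyopencl as cl

    # The one vendor-specific line: pick the platform by name.
    platform = [p for p in cl.get_platforms() if 'Intel' in p.name][0]

    # Everything from here on is identical across vendors.
    device = platform.get_devices()[0]
    ctx = cl.Context([device])
    queue = cl.CommandQueue(ctx)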

OpenCL falls down in terms of standard libraries such as cu{dnn,sparse,blas}, but if you're writing everything from scratch it's fine.


I'm an independent developer in the process of choosing a GPGPU library.

I can find simple, comprehensible 20-50-line CUDA sample code that does most simple tasks. With OpenCL, I get references to versions, boilerplate, and modes, with nothing that boils down to simple code.

If you have a simple sample, you should post it here or blog about it.


Look at PyOpenCL; it makes it quite easy to prototype OpenCL apps in tens of lines of code.

Later, it's straightforward to port to C or C++ if that's your thing, though I find having NumPy et al. handy even in production code.
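To give a flavor, a minimal vector add in PyOpenCL looks roughly like this (an untested sketch, but it shows the entire shape of a program):

    import numpy as np
    import pyopencl as cl

    a = np.random.rand(50000).astype(np.float32)
    b = np.random.rand(50000).astype(np.float32)

    ctx = cl.create_some_context()      # picks a platform/device for you
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg = cl.Program(ctx, """
        __kernel void add(__global const float *a,
                          __global const float *b,
                          __global float *out) {
            int gid = get_global_id(0);
            out[gid] = a[gid] + b[gid];
        }
    """).build()

    prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)

    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)
    assert np.allclose(result, a + b)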


Boilerplate is one thing—this is what transpilers are for, among other solutions—but can OpenCL match CUDA on a performance level?


CL is comparable to CUDA where the CUDA code doesn't employ NVIDIA-specific primitives.


Interesting...so do you use the vector subset designed for CPUs or the wavefront subset designed for GPUs?


I write code that maximizes coalesced memory access and uses plain old 32-bit floats... nothing special. The respective drivers do a good job of mapping that onto the hardware, given sufficient work-group sizes.
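Concretely, the only thing that matters in the kernel is which index each work-item touches; a sketch (not the commenter's actual code):

    # OpenCL C kernel source as you'd pass to pyopencl; the point is the indexing.
    kernel_src = """
    __kernel void scale(__global const float *in, __global float *out, float k) {
        int gid = get_global_id(0);
        // Coalesced: adjacent work-items read adjacent 32-bit floats.
        out[gid] = k * in[gid];
        // A strided pattern such as in[gid * 17] would break coalescing
        // and throw away much of the memory bandwidth on a GPU.
    }
    """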


Only with OpenCL 2.x have they come around and started to support C++ for writing kernels, as well as a standard bytecode for other languages to target, which most vendors still don't support.

CUDA, by contrast, supported C++ and Fortran from day one, with PTX support added a few versions later.

Also the debugging tools, from the presentations I have seen, are much more developer friendly on CUDA.

Of course developers would rather use APIs that offer a more modern experience than ones still stuck in pure C, with a compiler at the driver level forcing each programmer to write the boilerplate to compile and link.

Now it might already be too late for OpenCL in spite of the latest improvements.


Can anyone tell me what exactly is missing from OpenCL to be able to run the primitives of deep learning frameworks? Like, does it not have some kind of operation that is essential for matrix manipulation?


Up to OpenCL 2.1 there was no support for C++ or Fortran, forcing everyone to either use C or generate C code from their compilers.


Can companies that don't have agreements/court rulings with Intel implement Xeon Phi ISAs?


I think it's fair to assume any significant HPC solution will have to be recompiled to target its runtime environment. Squeezing every GFLOPS out of a part is normal when you get an expensive specialized computer.

In that context, x86 compatibility is less of an issue than the quality of the compilers. As long as you are easier to program than a GPU, you are good.


I wish this were true, but it's not. Multi-million-dollar contracts have been scuttled because a given customer runs codes that they receive as closed-source binaries from a vendor, and Company A bid Processor X, but the codes only run on Processor Y...


That's interesting. I'd assume a market so performance-driven would compete such vendors out of existence.


The market is performance-driven because it is task-driven. Performance as a requirement, in other words, is a byproduct of the mission. If your task requires software that someone else holds the keys to, you have no choice but to mold the rest of your environment to fit that software.

Thankfully this sort of situation is becoming less common over time, but we're not all the way done yet.


It's a big relief to know my assumptions will, eventually, be true... :-)


I agree. AMD is working on making it easy to port CUDA [1], which still doesn't provide a strict alternative, but it's something.

http://wccftech.com/amd-cuda-compilercompatibility-layer-ann...


Google also released GPUCC (http://research.google.com/pubs/pub45226.html), and got CUDA support into upstream clang.

LLVM also has an AMD GPU backend, and AMD says this thing is built on clang/LLVM.

So I suspect it's based on that support. :)


There is an Nvidia Maxwell assembler: https://github.com/NervanaSystems/maxas

But yeah, this isn't production-ready.


Except that, up to now at least, CUDA IMO remains the best abstraction for programming multi-core: subsuming multiple threads, SIMD width, and multiple cores into the language definition. AMBER (http://www.ambermd.org) literally "recompiled and ran" with each succeeding GPU generation since the GTX 280 in 2009. 3-5 days of subsequent refactoring then unlocked 80% of the attainable performance gains of each new GPU. DSSTNE (https://github.com/amznlabs/amazon-dsstne) just ran as well, but it only targets Kepler and up because the code relies heavily on the __shfl instruction.

So I honestly don't get the Google clang CUDA compiler right now. It's really really cool work, but I don't get why they didn't just lobby NVDA heavily to improve nvcc. With the number of GPUs they buy, I suspect they could have anything they want from the CUDA software teams.

However, if it could compile CUDA for other architectures, sign me up, you'd be my heroes.

I'd love to see CUDA on Xeon Phi and on AMD GPUs (I know, they're trying). And if Intel poured the same passion and budget into building that as they are pouring into fake^H^H^H^Hdeceptive benchmark data and magical PowerPoint processors we won't see for at least a year or two (and which IMO will probably disappoint just like the first two), they'd be quite the competitor to NVIDIA, no?

That said, the Intel marketing machine seems to have succeeded in punching NVDA stock in the nose the past few days and in grabbing coverage in Forbes (http://www.forbes.com/sites/aarontilley/2016/08/17/intel-tak...) so maybe they know a thing or two I don't.


How many people are actually deploying to production, though? It seems like it's mostly research papers and enthusiasts out there in the wild so far.

Are we talking startups? A lot of startups know Python, so that would make sense... I'd love to see some actual stories, though.


But being drop-in compatible means supporting CUDA or the CPU interface.


They could also contribute code to the most popular frameworks, instead of releasing forks like "Intel Caffe."[1]

[1] https://github.com/intelcaffe/caffe


Yes, but unless the framework authors take over maintenance of the contribution (which may be unlikely, depending on the situation), Intel will have to keep maintaining those contributions.


Is there a problem with a system that's CUDA compatible?

CUDA seems to me like the only simple SIMD-type computing system that's fairly straightforward to program and understand at this point.

Drop-in CUDA compatibility seems like a good thing.


> We badly need an alternative to Nvidia/CUDA for deep learning

The post I am replying to.


Since Intel's product is just a bunch of CPUs, it should work with OpenCL out of the box.


It does... Badly... And you used to have to pay for it...


This earlier thread on HN might be of interest:

Why didn't Larrabee fail? https://news.ycombinator.com/item?id=12293308


Not a very good one; I feel it glosses over too much of Larrabee's past. I think it was well established that it would be a GPU, but it suffered from Intel getting confused about what it should be.

If anyone would like to know more about Intel, I think this AMA is much better:

https://www.reddit.com/r/IAmA/comments/15iaet/iama_cpu_archi...


Will there ever be an ARM coprocessor with several hundred nodes available?


Nice, but how is this going to fit with Nervana?

Also, beyond the hardware, how do Intel's libraries compare with cuDNN?

At the end of the day ease of use and software support matter along with the hardware.


Dude, they bought Nervana like yesterday.

Nervana has its own silicon, but I doubt they will tape out.


Intel has thousands of employees and a compiler. Get 200 of them in a room and implement CUDA.


They don't even have to implement the latest flavor of "Inception module"; they only need to implement matrix and vector operations and some math primitives like exponential, log, tangent, and such. Why is it so hard to port to Intel? I would have liked to make use of my MacBook's Intel Iris GPU for deep learning, but it's not supported by anything.
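For what it's worth, those primitives are easy to express in OpenCL; a hedged PyOpenCL sketch (whether the OpenCL driver actually exposes the Iris GPU on a given MacBook is a separate question):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.elementwise import ElementwiseKernel

    ctx = cl.create_some_context()   # would pick the Iris GPU if the driver exposes it
    queue = cl.CommandQueue(ctx)

    x = cl_array.to_device(queue, np.random.rand(1 << 20).astype(np.float32))
    y = cl_array.empty_like(x)

    # One of the "math primitives": an elementwise exp.
    exp_kernel = ElementwiseKernel(ctx, "float *x, float *y",
                                   "y[i] = exp(x[i])", "exp_k")
    exp_kernel(x, y)

    assert np.allclose(y.get(), np.exp(x.get()), rtol=1e-5)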


OpenCL supports Iris.


I wish they had a range that was affordable for a hobbyist. You can buy a cheap Nvidia card to "get your feet wet".

I would like to play with these things.


A recent multicore i7 (e.g. a 4-core Haswell with 8-wide single-precision SIMD = 32 lanes in flight) is enough to prototype OpenCL code that you can then run on larger CPUs or GPUs.
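For example, you can ask the runtime what it sees on such a CPU (a sketch; the exact numbers reported depend on the driver):

    import pyopencl as cl

    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            if dev.type & cl.device_type.CPU:
                # Compute units roughly track hardware threads; the preferred
                # float vector width reflects the SIMD lanes the driver targets.
                print(dev.name, dev.max_compute_units,
                      dev.preferred_vector_width_float)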


Intel was selling the Xeon Phi 31S1P for under $200 for a limited time (it's back to $500 now). They will likely have cheap versions and promotions this time around too.


Keep an eye on Colfax - they've had some nice deals in the past:

http://www.colfax-intl.com/nd/


Why not use Google's platform, which runs on their new custom chips (Tensor Processing Unit)?


Because it's not made to accelerate training, just inference. The TPU is an 8-bit fixed-point processor that is less power-hungry than GPUs, so it won't help research, only deployment of large projects running in the cloud.


Why use it, if it locks you in?


If you're a hobbyist, why would you care about lock in?


I care about being able to carry on doing my hobby stuff 5 or 10 years from now. With NVidia I can at least be confident that as long as my graphics card keeps working (which feels like something under my control, unlike Google shutting down their products) I can keep running my code on it.


Because you don't have control over what Google does. They may kill TPU altogether leaving your work irrelevant.


You don't have control over what Nvidia, Intel, or the rest of them do, either.

If you want to get your 'feet wet', then why bother?


Because they have client-facing products, and backward compatibility is important to clients.


Hobbies come, hobbies go.


How much?



