Intel Announces Knights Mill: A Xeon Phi for Deep Learning (anandtech.com)
94 points by scaz on Aug 17, 2016 | 56 comments



We badly need an alternative to Nvidia/CUDA for deep learning... but realistically, if Intel wants to make headway in the deep learning market, it must offer hardware that can not only compete on performance with Nvidia, but also work out-of-the-box (that is, without requiring lots of one-off tinkering and tweaking) with popular deep/machine learning frameworks like TensorFlow, Caffe, Torch, and Theano.

There is a lot of software infrastructure being built atop these frameworks, and switching costs are getting higher by the day. No one wants to use some kind of 'non-standard' fork of [name your DL framework of choice] customized for Intel hardware, because such a fork can quickly get stale in comparison to the upstream project.

Intel needs to be both better/faster and drop-in compatible with the popular frameworks.


TensorFlow has a mode that lets it run on CPUs. I'm sure other frameworks are the same. Isn't the whole point of Xeon Phi that it looks basically like an x86 CPU with a ton of cores? If so, there is almost nothing to port, just the kernel launching.
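For instance, with the 1.x-era TensorFlow Python API you can hide the GPUs entirely and let the whole graph run on CPU cores (a rough, untested sketch with a toy matmul):

    import tensorflow as tf

    # Hide any GPUs so the whole graph executes on CPU cores.
    config = tf.ConfigProto(device_count={'GPU': 0})

    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    c = tf.matmul(a, b)

    with tf.Session(config=config) as sess:
        print(sess.run(c))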

Granted, you do have to bother to make a fast x86 / AVX-512 port of your code. But because the shape of GPUs is so different from that of CPUs -- GPUs have a more complicated memory hierarchy, for one thing -- I kind of doubt that "just run your CUDA code on the Phi" is going to work well for nontrivial examples.

(Disclaimer, I work at Google on CUDA support in clang. Which is awesome, you should try it out. :) Google "cuda clang" for instructions.)


To elaborate: TensorFlow is clever because it takes your flow graph and cuts it into pieces, which it then executes on computational units. Any computational unit will do, really, as long as you have a backend for it. This choice makes adaptation to new systems much easier.
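Roughly, in the 1.x Python API that partitioning is exposed as device placement; a sketch (the device names are illustrative, and allow_soft_placement papers over a missing device):

    import tensorflow as tf

    # Pin different pieces of the graph to different computational units;
    # TensorFlow cuts the graph at the boundaries and inserts the transfers.
    with tf.device('/cpu:0'):
        a = tf.random_normal([512, 512])
        b = tf.random_normal([512, 512])

    with tf.device('/gpu:0'):   # any device with a registered backend would do
        c = tf.matmul(a, b)

    config = tf.ConfigProto(allow_soft_placement=True)  # fall back if '/gpu:0' is absent
    with tf.Session(config=config) as sess:
        print(sess.run(c))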

I believe Caffe and Theano use the same model, but I haven't studied them closely. There is also some similarity to what OCaml does with the Incremental library, though that is not for machine learning.


"Granted, you do have to bother to make a fast x86 / AVX-512 port of your code."

That is exactly the problem: no one has bothered to make a fast x86 / AVX-512 port for any of the most popular frameworks (at the upstream level, not in some fork), and no one has an incentive to bother, other than Intel. For example, as far as I know, none of the popular frameworks take advantage of Intel's MKL out of the box.

Right now, if you want out-of-the-box high performance, Nvidia hardware is your only practical choice.


This is also on upstream projects not to lock themselves into CUDA. Yes, it's great, but everyone suffers when there's only one supported API. Even more so when it's closed and locked to a specific vendor, as CUDA is.


I'd love an alternative to CUDA.

The problem is that, as far as I can see, OpenCL is in no way that. Basically, OpenCL gives me the impression that the oceans of boilerplate required both make development hard and effectively lock you into a specific vendor anyway, since the boilerplate ends up setting things up for one specific vendor.


Nope. I have some substantial simulation code written against OpenCL that runs on Intel OpenCL and NVIDIA without modification, and it is rather performant on both. The only vendor-specific part of the code is the platform selection, which is one line of code.
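For what it's worth, in PyOpenCL the vendor-specific part really can be about one line; a sketch (matching on the 'Intel' substring is just an example):

    import pyopencl as cl

    # The one vendor-specific line: pick the platform by name.
    platform = [p for p in cl.get_platforms() if 'Intel' in p.name][0]

    # Everything from here on is identical across vendors.
    device = platform.get_devices()[0]
    ctx = cl.Context([device])
    queue = cl.CommandQueue(ctx)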

OpenCL falls down in terms of standard libraries such as cu{dnn,sparse,blas}, but if you're writing everything from scratch it's fine.


I'm an independent developer in the process of choosing a GPGPU library.

I can find simple, comprehensible 20-50-line CUDA sample code that does most simple tasks. With OpenCL, I get references to versions, boilerplate, and modes, with nothing that boils down to simple code.

If you have a simple sample, you should post it here or blog about it.


Look at PyOpenCL; it makes it quite easy to prototype OpenCL apps in tens of lines of code.

Later, it's straightforward to port to C or C++ if that's your thing, though I find having NumPy et al. handy even in production code.
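To give a flavor, a minimal vector add in PyOpenCL looks roughly like this (an untested sketch, but it shows the entire shape of a program):

    import numpy as np
    import pyopencl as cl

    a = np.random.rand(50000).astype(np.float32)
    b = np.random.rand(50000).astype(np.float32)

    ctx = cl.create_some_context()      # picks a platform/device for you
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg = cl.Program(ctx, """
        __kernel void add(__global const float *a,
                          __global const float *b,
                          __global float *out) {
            int gid = get_global_id(0);
            out[gid] = a[gid] + b[gid];
        }
    """).build()

    prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)

    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)
    assert np.allclose(result, a + b)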


Boilerplate is one thing—this is what transpilers are for, among other solutions—but can OpenCL match CUDA on a performance level?


CL is comparable to CUDA where the CUDA code doesn't employ NVIDIA-specific primitives.


Interesting...so do you use the vector subset designed for CPUs or the wavefront subset designed for GPUs?


I write code that maximizes coalesced memory access and uses plain old 32-bit floats... nothing special. The respective drivers do a good job of mapping that onto the hardware, given sufficient work-group sizes.
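Concretely, the only thing that matters in the kernel is which index each work-item touches; a sketch (not the commenter's actual code):

    # OpenCL C kernel source as you'd pass to pyopencl; the point is the indexing.
    kernel_src = """
    __kernel void scale(__global const float *in, __global float *out, float k) {
        int gid = get_global_id(0);
        // Coalesced: adjacent work-items read adjacent 32-bit floats.
        out[gid] = k * in[gid];
        // A strided pattern such as in[gid * 17] would break coalescing
        // and throw away much of the memory bandwidth on a GPU.
    }
    """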


Only with OpenCL 2.x have they come around and started to support C++ for writing kernels, as well as a standard bytecode for other languages to target, which most vendors still don't support.

CUDA, by contrast, supported C++ and Fortran from day one, with PTX support added a few versions later.

Also the debugging tools, from the presentations I have seen, are much more developer friendly on CUDA.

Of course developers would rather use APIs that offer a more modern experience than ones still stuck in pure C, with a compiler at the driver level forcing each programmer to write the boilerplate to compile and link.

Now it might already be too late for OpenCL in spite of the latest improvements.


Can anyone tell me what exactly is missing from OpenCL to be able to run the primitives of deep learning frameworks? Like, does it not have some kind of operation that is essential for matrix manipulation?


Up to OpenCL 2.1 there was no support for C++ or Fortran, forcing everyone to either use C or generate C code from their compilers.


Can companies that don't have agreements/court rulings with Intel implement Xeon Phi ISAs?


I think it's fair to assume any significant HPC solution will have to be recompiled to target its runtime environment. Squeezing every GFLOPS out of a part is normal when you get an expensive specialized computer.

In that context, x86 compatibility is less of an issue than the quality of the compilers. As long as you are easier to program than a GPU, you are good.


I wish this were true, but it's not. Multi-million-dollar contracts have been scuttled because a given customer runs codes that they receive as closed-source binaries from a vendor, and Company A bid Processor X, but the codes only run on Processor Y...


That's interesting. I'd assume a market so performance-driven would compete such vendors out of existence.


The market is performance-driven because it is task-driven. Performance as a requirement, in other words, is a byproduct of the mission. If your task requires software that someone else holds the keys to, you have no choice but to mold the rest of your environment to fit that software.

Thankfully this sort of situation is becoming less common over time, but we're not all the way done yet.


It's a big relief to know my assumptions will, eventually, be true... :-)


I agree. AMD is working on making it easy to port CUDA [1], which still doesn't provide a strict alternative, but it's something.

http://wccftech.com/amd-cuda-compilercompatibility-layer-ann...


Google also released GPUCC (http://research.google.com/pubs/pub45226.html), and got CUDA support into upstream clang.

LLVM also has an AMD GPU backend, and AMD says this thing is built on clang/LLVM.

So I suspect it's based on that support. :)


There is an Nvidia Maxwell assembler: https://github.com/NervanaSystems/maxas

But yeah, this isn't production-ready.


Except that, up to now at least, CUDA IMO remains the best abstraction for programming multi-core: subsuming multiple threads, SIMD width, and multiple cores into the language definition. AMBER (http://www.ambermd.org) literally "recompiled and ran" with each succeeding GPU generation since the GTX 280 in 2009. 3-5 days of subsequent refactoring then unlocked 80% of the attainable performance gains of each new GPU. DSSTNE (https://github.com/amznlabs/amazon-dsstne) just ran as well, but it only targets Kepler and up because the code relies heavily on the __shfl instruction.

So I honestly don't get the Google clang CUDA compiler right now. It's really really cool work, but I don't get why they didn't just lobby NVDA heavily to improve nvcc. With the number of GPUs they buy, I suspect they could have anything they want from the CUDA software teams.

However, if it could compile CUDA for other architectures, sign me up, you'd be my heroes.

I'd love to see CUDA on Xeon Phi and on AMD GPUs (I know, they're trying). And if Intel poured the same passion and budget into building that as they are pouring into fake^H^H^H^Hdeceptive benchmark data and magical PowerPoint processors we won't see for at least a year or two (and which IMO will probably disappoint just like the first two), they'd be quite the competitor to NVIDIA, no?

That said, the Intel marketing machine seems to have succeeded in punching NVDA stock in the nose the past few days and in grabbing coverage in Forbes (http://www.forbes.com/sites/aarontilley/2016/08/17/intel-tak...) so maybe they know a thing or two I don't.


How many people are actually deploying to production, though? It seems like it's mostly research papers and enthusiasts out there in the wild so far.

Are we talking startups? A lot of startups know Python, so that would make sense... I'd love to see some actual stories, though.


But being drop-in compatible means supporting CUDA or the CPU interface.


They could also contribute code to the most popular frameworks, instead of releasing forks like "Intel Caffe."[1]

[1] https://github.com/intelcaffe/caffe


Yes, but unless the framework authors take over maintenance of the contribution (which may be unlikely, depending on the situation), Intel will have to keep maintaining those contributions.


Is there a problem with a system that's CUDA compatible?

CUDA seems to me like the only simple SIMD-type computing system that's fairly straightforward to program and understand at this point.

Drop-in CUDA compatibility seems like a good thing.


> We badly need an alternative to Nvidia/CUDA for deep learning

The post I am replying to.


Since Intel's product is just a bunch of CPUs, it should work with OpenCL out of the box.


It does... Badly... And you used to have to pay for it...


This earlier thread on HN might be of interest:

Why didn't Larrabee fail? https://news.ycombinator.com/item?id=12293308


Not a very good one; I feel it glosses over too much of Larrabee's past. I think it was well established that it would be a GPU, but it suffered from Intel getting confused about what it should be.

If anyone would like to know more about Intel, I think this AMA is much better:

https://www.reddit.com/r/IAmA/comments/15iaet/iama_cpu_archi...


Will there ever be an ARM coprocessor with several hundred nodes available?


Nice, but how is this going to fit with Nervana?

Also, beyond the hardware, how do Intel's libraries compare with cuDNN?

At the end of the day ease of use and software support matter along with the hardware.


Dude, they bought Nervana like yesterday.

Nervana has its own silicon, but I doubt they will tape out.


Intel has thousands of employees and a compiler. Get 200 of them in a room and implement CUDA.


They don't even have to implement the latest flavor of "Inception module"; they only need to implement matrix and vector operations and some math primitives like exponential, log, tangent, and such. Why is it so hard to port to Intel? I would have liked to make use of my MacBook's Intel Iris GPU for deep learning, but it's not supported by anything.
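For what it's worth, those primitives are easy to express in OpenCL; a hedged PyOpenCL sketch (whether the OpenCL driver actually exposes the Iris GPU on a given MacBook is a separate question):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.elementwise import ElementwiseKernel

    ctx = cl.create_some_context()   # would pick the Iris GPU if the driver exposes it
    queue = cl.CommandQueue(ctx)

    x = cl_array.to_device(queue, np.random.rand(1 << 20).astype(np.float32))
    y = cl_array.empty_like(x)

    # One of the "math primitives": an elementwise exp.
    exp_kernel = ElementwiseKernel(ctx, "float *x, float *y",
                                   "y[i] = exp(x[i])", "exp_k")
    exp_kernel(x, y)

    assert np.allclose(y.get(), np.exp(x.get()), rtol=1e-5)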


OpenCL supports Iris.


I wish they had a range that was affordable for a hobbyist. You can buy a cheap Nvidia card to "get your feet wet".

I would like to play with these things.


A recent multicore i7 (e.g. a 4-core Haswell with 8-wide single-precision SIMD = 32 lanes in flight) is enough to prototype OpenCL code that you can then run on larger CPUs or GPUs.
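For example, you can ask the runtime what it sees on such a CPU (a sketch; the exact numbers reported depend on the driver):

    import pyopencl as cl

    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            if dev.type & cl.device_type.CPU:
                # Compute units roughly track hardware threads; the preferred
                # float vector width reflects the SIMD lanes the driver targets.
                print(dev.name, dev.max_compute_units,
                      dev.preferred_vector_width_float)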


Intel was selling the Xeon Phi 31S1P for under $200 for a limited time (it's back to $500 now). They will likely have cheap versions and promotions this time around too.


Keep an eye on Colfax - they've had some nice deals in the past:

http://www.colfax-intl.com/nd/


Why not use Google's platform, which runs on their new custom chips (Tensor Processing Unit)?


Because it's not made to accelerate training, just inference. The TPU is an 8-bit fixed-point processor that is less power-hungry than GPUs, so it won't help research, only deployment of large projects running in the cloud.


Why use it, if it locks you in?


If you're a hobbyist, why would you care about lock in?


I care about being able to carry on doing my hobby stuff 5 or 10 years from now. With NVidia I can at least be confident that as long as my graphics card keeps working (which feels like something under my control, unlike Google shutting down their products) I can keep running my code on it.


Because you don't have control over what Google does. They may kill TPU altogether leaving your work irrelevant.


You don't have control over what Nvidia, Intel, or the rest of them do, either.

If you want to get your 'feet wet', then why bother?


Because they have client-facing products, and backward compatibility is important to clients.


Hobbies come, hobbies go.


How much?



