Intel Gets Serious About Neuromorphic, Cognitive Computing Future

paulsutter · on Feb 11, 2017

The article misses the core issue: neural network architectures are still in flux. "Neuromorphic" chips are hardwired to one architecture, which makes them power efficient but less flexible. When designs are more stable, such chips could be more practical.

Meanwhile, upcoming (non-neuromorphic) AI processors are taking two directions: larger numbers of simplified GPU-type cores (such as NVIDIA Xavier and Intel's Lake Crest/Nervana chips), and FPGAs.

Simplifying cores means lower precision, as fp32 and fp64 are overkill for neural networks and take up lots of silicon. The current NVIDIA Pascal added fp16 and byte operations such as the DP4A convolution instruction[1]. Even smaller precision is practical (down to 1 bit with XNORnet[2], and the DoReFa paper[3] gives an excellent summary of the falloff in accuracy through 32-8-4-2-1 bits for weights, activations, and gradients).

[1] https://devblogs.nvidia.com/parallelforall/mixed-precision-p...

[2] XNORnet, https://arxiv.org/abs/1603.05279

[3] DoReFa, https://arxiv.org/abs/1606.06160

deepnotderp · on Feb 11, 2017

This is a common fallacy, although the architectures themselves may be changing (quite a bit!), the basic computationally intensive operations, such as convolutions and matrix multiplies aren't. The simple way is to switch from FP32 to 16-bit fixed point, and you're good to go, and you just saved almost 10X power. This is the strategy Nervana/Intel, even Nvidia, and other startups such as Wave are pursuing.

paulsutter · on Feb 11, 2017

Exactly - evolving network architectures suggest using a non-neuromorphic design (such as Xavier, Lake Crest), which trades less precision for more cores given the same power/real estate.

Neuromorphic designs like IBM's True North are more hardwired, and that's the limitation towards general purpose use. Yann LeCun's remarks on True North:

https://www.facebook.com/yann.lecun/posts/10152184295832143

emcq · on Feb 11, 2017

The reality is that TrueNorth can use CNNs and be trained with backprop to achieve good performance with networks like ImageNet despite LeCun's comments [0, 1]

LeCun's primary critique is that binary wont work; "to get good results on a task like ImageNet you need about 8 bit of precision on the neuron states". There was no evidence for his claim then, and now this is clearly false [2, 3].

LeCun's post is based more in pride than reason; he spends most of the time talking about NeuFlow which at one point was a competitor to True North for funding. In the end, NeuFlow never became a chip, but True North did.

[0] https://papers.nips.cc/paper/5862-backpropagation-for-energy...

[1] https://arxiv.org/pdf/1603.08270.pdf

[2] https://arxiv.org/abs/1603.05279

[3] https://arxiv.org/abs/1602.02830

deepnotderp · on Feb 11, 2017

Yeah, this all sounds nice in practice, but look at the actual numbers: https://arxiv.org/abs/1603.08270

1) Nothing on imagenet

2) They already fall to 83% accuracy on CIFAR-10! Imagine how bad imagenet will be! If they string many chips together (exploding their power consumption, since here comes the Von Neumann Bottleneck of data movement), they get a paltry 89%.....

Meanwhile even squeezenet achieves better results.

emcq · on Feb 11, 2017

I'm sure today we could create an _even better_ chip focused on the advancements in neural networks and chip design, but it's awesome that a chip from 2012 can still take us so far! AlexNet was just a baby then. I doubt any CPU, GPU, FPGA, or DSP from 2012 would hold up as well.

I dont think you understand their architecture, and neither did LeCun. The Von Neumann Bottleneck is a specific term referring to limited throughput between data in memory and compute in the CPU. TrueNorth is not a Von Neumann architecture and does not have this bottleneck. Memory is located adjacent to compute elements in True North. For comparison, GPU's have very tiny amounts of on chip memory, and have to spend lots of energy copying data back and forth to off chip memory, which is why they are investing heavily in approaches like HBM. FPGAs also dont have as much memory as an ASIC because they need to dedicate space to reprogrammable logic, integrated ARM cores/DSPs, etc.

The chips can be laid out in flexible topologies such as a grid. While it's true that communication between chips is more power intensive than within a chip, this cost is only occurred for the relatively small amount of traffic sent between chips versus computed locally. Hierarchy and small world nature of neural networks can mean that there is more local computation than you would expect naively, and a grid can mean a spike routed from one core to the furthest core would be O(sqrt(N)) instead of O(N).

deepnotderp · on Feb 11, 2017

...

Pretty sure LeCun and I understand what the Von Neumann Bottleneck is thank you very much.

The thing is though that TrueNorth isn't doing anything special by pouring a ton of memory on die, and even in GPUs, CNN runtimes and energy consumption is dominated by compute.

emcq · on Feb 12, 2017

I appreciate healthy discussion of technical topics. However, I'm not sure you're having this discussion in good faith. I wrote this response in case you are.

LeCun never said anything about the Von Neumann Bottleneck. TrueNorth is not a Von Neumann architecture; it does not have a memory bus; it does not have the Von Neumann Bottleneck [0,2,3]. From wikipedia [1]:

"TrueNorth circumvents the von-Neumann-architecture bottlenecks and is very energy-efficient, consuming 70 milliwatts, about 1/10,000th the power density of conventional microprocessors"

If you disagree, please explain how you think the Von Neumann Bottleneck applies here.

With regards to energy consumption keep in mind the smallest GPUs (TX1) are ~10W, typical FGPAs ~1W, versus 70mW for TrueNorth! It's popular to hate on TrueNorth but you could throw 10 of them together and still be fantastically more efficient than anything else today - that's super cool to me! It required lots of special engineering effort to get right, such as building a lot of on chip memory.

On chip memory is one of the most difficult components to get right, minimizing transistors while not breaking physics. It's not as simple as "pouring tons of memory on a die" and requires specialized engineers that hand-layout these components. The event driven asynchronous nature of TrueNorth is fairly unique and undoubtedly added complexity to the memory design.

Do you have any references or evidence for CNN runtimes being mostly dominated by compute? The operations performed in a CNN are more than just convolution; for every input you multiply by a weight, you now have a memory bound problem, which is much more expensive than ALU operations. Don't just take my word for it, listen to Bill Dally (Chief Scientist at NVIDIA, Stanford CS prof, and general computer architecture badass) [4]:

"State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power."

This is what TrueNorth got right, and made its bet completing design before AlexNet was published. That was a time where Hinton was viewed by the ML community as a heretic talking about RBMs and backprop and hardly anyone believed him. TrueNorth, like NNs at the time, gets some shade by doing things differently that over time we're seeing validated by other researchers and architectures incorporating.

I recommend reading [4] if you haven't already, as it is rich in insights for building efficient NN architectures.

[0] https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_N...

[1] https://en.wikipedia.org/wiki/TrueNorth

[2] http://ieeexplore.ieee.org/document/7229264/?reload=true&arn...

[3] http://www.research.ibm.com/articles/brain-chip.shtml

[4] https://arxiv.org/pdf/1602.01528.pdf

p1esk · on Feb 12, 2017

GPUs (TX1) are ~10W, typical FGPAs ~1W, versus 70mW for TrueNorth!

These numbers are meaningless. If you want to compare power consumption for different chips, you need to make sure they:

1. Perform the same task: running the same algorithm on the same data

2. Use the same precision (number of bits) in both data storage, and computation.

3. Achieve the same accuracy on the benchmark.

4. Run at the same speed (finish the benchmark at the same time). In other words, look at energy per task, not per time.

If even a single one of these conditions is not met, you're comparing apples to oranges. No valid comparisons have been made so far, that I know of.

p.s. The numbers you provided are off even ignoring my main point: typical power consumption of an FPGA chip is 10-40W, and I don't know where you got 70mW for TrueNorth, and what it represents.

deepnotderp · on Feb 12, 2017

Also those are teeny 32x32 images.

deepnotderp · on Feb 12, 2017

I do mean to have a good technical talk, apologies if I sounded derisive, I was kind of annoyed you would assume LeCun and I don't know what the Von Neumann Bottleneck is.

Anyways, as for showing that convolution runtimes are dominated by compute, not lookup, as much as I tried, I couldn't find the goddamn chart that shows the breakdown, but I did see that chart somewhere and my own experiments show that to be true. It IS true however, that in general memory access is far more expensive than the operations, but deep nets are basically "take the data in and chew on it for a long time". Besides, the way to beat the Von Neumann Bottleneck likely lies in fabrication technology, not design (like HBM2 and TSVs). What makes you think their SRAM cell is custom? It appears to be a standard SRAM cell. That's what I meant by "pouring memory on die".

And the von neumann bottleneck is primarily caused by memory access (aka data movement) being expensive. What happens if you have to move data between multiple truenorth chips?

emcq · on Feb 12, 2017

Apologies if I'm being too academic here but not all memory bottlenecks or communication bottlenecks are the "Von Neumann Bottleneck".

The term was originally focused on both data and program memory being on the other side of a shared bus from the CPU, which meant you could only do one at a time. If you were trying to figure out your next instruction, you wait. If you then need to data, you wait. It wasn't a problem back with slowly executing EDVAC code [0]. Based on this definition most architectures today do not have the Von Neumann Bottleneck as they are not Von Neumann architectures [1, 2, 3].

A slightly looser definition of the Von Neumann Bottleneck refers to the separation between CPU and memory with a single bus. This likely originated because fully Von Neumann architectures are so rare but that the general problem is similar enough to share the name. GPUs dont have this issue because they employ parallelism thru multiple memory ports talking to off chip RAM. TrueNorth also doesn't have this issue because it has 4096 parallel cores with their own localized memory and no off chip memory. There could certainly be other bottlenecks in the system, even with the memory system, but those wouldn't be the Von Neumann Bottleneck [0].

[0] https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_N...

[1] https://news.ycombinator.com/item?id=2645652

[2] http://ithare.com/modified-harvard-architecture-clarifying-c...

[3] http://cs.stackexchange.com/questions/24599/which-architectu...

deepnotderp · on Feb 12, 2017

Yeah, this is true, but the point I wanted to make is that physics doesn't care what it's called, just that you're moving data incredibly long distances with massive decoders. This is, in essence, what costs the huge amount of energy. By contrast, "pouring memory on die" solves this problem almost completely, but your compute (which was your major problem anyways)is still your biggest issue, and it's gotten worse!

By "pour memory on die" I mean that the memory is on die, clearly there are some special techniques being used to manage that memory, but physically, this is what's saving power.

deepnotderp · on Feb 12, 2017

Here's at least a start: http://www.slideshare.net/embeddedvision/tradeoffs-in-implem...

As you can see, ~10-1000X (the scale is logarithmic) more is spent on compute rather than data movement, and that's with DDR, not even HBM2, let alone on-chip!

deepnotderp · on Feb 11, 2017

Yeah, TrueNorth is pointless.

deepnotderp · on Feb 11, 2017

Also, just a sidenote, Xavier is based on Volta and is a while back.

pizza · on Feb 12, 2017

see: discussion for "Why are Eight Bits Enough for Deep Neural Networks?" 511 days ago

https://news.ycombinator.com/item?id=10244398

emcq · on Feb 11, 2017

For training networks, designing en efficient chip is a scary proposition.

However for inference tasks, low precision GEMM (as you said) goes a long way and better than what you often get. That's why chips like Movidius' Myriad are getting popular and are more similar to DSPs than neuromorphic designs.

I agree that Intel's neuromorphic group doesn't get it, but other groups have taken neuromophic design principles that lead to efficient designs. For example, TrueNorth is very low precision, has great data locality, and though it was designed over 5 years ago can still use modern convolutional networks only imagined afterwards [0]. But its silicon implementation is not very brain like.

[0] https://arxiv.org/pdf/1603.08270.pdf

russdill · on Feb 11, 2017

I think the Intel neuromorphic computing will be a combination of 3D XPoint and technologies gained by purchasing Altera, including routing, DSPs, and reconfigurablity.

anigbrowl · on Feb 11, 2017

Seems like the sort of thing FPGAs could be good for. I have a few synthesizers that use them; elaborate feedback loops are a staple of sound design.

deepnotderp · on Feb 11, 2017

XNOR and DoReFa take pretty big hits though, and they can't be easily trained without full precision, or at least "shadow weights".

paulsutter · on Feb 11, 2017

For XNORnet, yes, but DoReFa has training with low bitwidth activations[1]. I don't cite these papers as proof that few bits are needed, but as evidence that precision is an open question.

[1] "We propose DoReFa-Net, a method to train convolutional neural networks that have low bitwidth weights and activations using low bitwidth parameter gradients. In particular, during backward pass, parameter gradients are stochastically quantized to low bitwidth numbers before being propagated to convolutional layers" (from the abstract)

p1esk · on Feb 11, 2017

DoReFa needs a custom scaling factor for each image when quantizing gradients.

deepnotderp · on Feb 11, 2017

What's wrong with that?

p1esk · on Feb 12, 2017

Their formula for quantizing gradients involves quite a bit of extra computation. It's not clear to me how much it complicates an actual implementation (in software, or hardware). To me, the most interesting question is if we can do training using just 8 bits (weights, activations, and gradients), without all that acrobatics. If so, then we can get another significant (and free!) speed up from GPUs.

emcq · on Feb 11, 2017

What's wrong with training in full precision? There's a huge market for efficient inference in robotics, devices, and autonomous vehicles.

justinpombrio · on Feb 11, 2017

Could someone explain what a neuromorphic chip is?

The article assumes we know, but I haven't heard of it. And the wikipedia article on "neuromorphic engineering" talks about stuff like analog circuits, copying how neurons work, and memristors, none of which seem that related.

mljoe · on Feb 11, 2017

The best definition I can come up with is hardware that implements neural network architecture directly, especially McCulloch–Pitts spiking neurons (which have a temporal component). In neuromorphic chips, neurons are an actual component in the hardware, you can ask questions like "how many neurons does this chip have?". Contrast to how neural nets as we use them today which are actually implemented as a computation graph on tensors. It turns out that a special kind of neural network can be abstracted well as a series of tensor ops (dense feedforward layered networks[1]), but this is not necessarily the case for any neural network. So neuromorphic chips have a possibility of being far more general.

[1]: Which are wired something like this: http://neuralnetworksanddeeplearning.com/images/tikz40.png - Notice that dense connections and layered architecture. For all intents and purposes, this what neural nets look like today because of how easily it is to treat a NN with this specific wiring as a chain of tensor computations and thus execute on more conventional hardware.

nomailing · on Feb 12, 2017

What I remember from my neuromorphic engineering course (or analog VLSI course) is that we designed the silicon layout (with n and p doping regions) in a way that the transistors are operating in the subthreshold regime in the IV characteristics. If I remember correctly the IV characteristic is linear in the subthreshold region? In contast in normal digital chips only the super-threshold region is used (voltage above a certain saturation threshold switches the transistor completely on). Using the subthreshold region it is possible to implement spiking neuons with only very few transistors. It works completely different than digital circuits. The connections between the transistors don't transmit just 0's and 1's. Instead all wires transmit analogue signals where the exact voltage matters. This makes these chips extremely energy and space efficient. These chips can also work much faster even in comparison to biological neurons (obviously using some assumptions and simplifications, such as neglecting certain special kinds of ion channels found in real neurons).

petra · on Feb 12, 2017

Theoretically , Analog is by far the best for neural networks. But why aren't we starting to see chips offered ? Heck even an old process like 130nm could have some practical uses .

alexmlamb2 · on Feb 12, 2017

Could you implement an approximate matrix multiplication in a direct analogue way? If so, I wonder why it hasn't been used for graphics cards.

p1esk · on Feb 12, 2017

Yes: https://arxiv.org/abs/1610.02091

return0 · on Feb 11, 2017

Loosely defined, it's various chips that implement simplified computational models of neurons, and some plasticity functions. They are usually simplified because going into the greatest detail (modeling hodgkin-huxley type channels) would require too much computation. In the neuroscience community there is not yet an acceptable model for a simplified neuron so everyone picks some spiking neuron model or they make up one. Even less is known for plasticity functions.

I believe the idea is that if you simulate too many of them, something useful will happen.

mattlevan · on Feb 11, 2017

All of the subjects you mention at the end of your comment, especially memristors, are indeed the focus of neuromorphic computing. It's a nascent technology field, but there are plenty of papers available on IEEExplore or ACM Digital Library to satisfy curiosity! I'll take a look at the survey paper I wrote a couple summers back-- if it's decent I'll edit this post with a link.

Edit, links: Paper: https://docs.google.com/file/d/0B7QHR9a8j1iiU3RxSHZSNFh2cEdv... Slides: https://docs.google.com/file/d/0B7QHR9a8j1iiSE1ET2ZTb09aNFBP...

bobsil1 · on Feb 12, 2017

Asynchronous, analog, spike-based, co-located memory. Closer NN sim than von Neumann architecture, orders of magnitude more power-efficient.

One human brain-equivalent NN on classic architecture costs ~$70M and uses ~100 houses worth of power.

alexmlamb2 · on Feb 12, 2017

The rigor of this claim is very questionable. I assume it's based on neuron == hidden unit.

In any case though my guess is that a human neuron accomplishes way more than a hidden unit in a NN so it may be fair to view that as a lower-bound.

emcq · on Feb 11, 2017

The near future is not in putting learning inside a chip. We're a long ways off from the one shot learning needed to make a device actually interesting with localized learning.

Instead, the future is recording and uploading your observations to the cloud, with data scientist and neural net wizards training over this dataset on a cluster with tons of GPUs, and then deploying an optimized model to scrappy low precision inference chips.

This is why FPGA based designs will fail to be stunning. Specialized low precision ASICs more similar to DSPs like Movidius' Myriad (in the Phantom drones and Google's Project Tango devices), Google's TPU, upcoming Qualcomm chips, or Nervana's will become increasingly popular.

espeed · on Feb 11, 2017

Are these the Neuromorphic chips [1] Jeff Hawkins of Numenta [2] has been talking about?

[1] Neuromorphic Chips https://www.technologyreview.com/s/526506/neuromorphic-chips...

[2] Numenta papers/videos http://numenta.com/papers-videos-and-more/

It would also be good to see a major chip manufacturer or cloud provider that makes its own chips (Google/IBM) get serious about graph processing chips [3,4] and moving beyond floating point [5].

[3] Novel Graph Processor Architecture https://www.ll.mit.edu/publications/journal/pdf/vol20_no1/20...

[4] Novel Graph Processor Architecture, Prototype System, and Results https://arxiv.org/pdf/1607.06541.pdf

[5] Stanford Seminar: Beyond Floating Point: Next Generation Computer Arithmetic https://www.youtube.com/watch?v=aP0Y1uAA-2Y

m3kw9 · on Feb 11, 2017

Seem like a fancy name for parallel computing chip that specialize in efficient parallel computing

adevine · on Feb 11, 2017

It's that what it actually is? Not familiar with the field, so is "neuromorphic chip design" really "we wired a bunch of GPUs together?"

Don't mean to minimize the work involved, just trying to decipher the marketing speak.

mrstone · on Feb 11, 2017

They are a chip architecture that communicates similarly to how neurons do - i.e. spiking behaviours. Makes it easy to approximate some neural systems (such as a neuromorphic retina).

DesiLurker · on Feb 11, 2017

neuromorphic chips try to mimic human brain's neuronal structure for computing, entirely new paradigm. checkout IBM truNorth architecture.

pzone · on Feb 12, 2017

I think the point is that even GPUs are suboptimal for neural network computation and we need something more specialized.

return0 · on Feb 11, 2017

Do 'Neuromorphic' chips do anything useful? At least neural networks have well known utility, but afaik, neuromorphic (i.e. heavily simplified models of neurons that one only hopes - but cannot prove are correct) have no useful applications - or even theoretical functions.

bobsil1 · on Feb 12, 2017

Low-power video object recognition for military drones, security cams, etc.

mrfusion · on Feb 11, 2017

Would it be possible to build an analog neural network with hardware?

emcq · on Feb 11, 2017

Yes, this is traditionally a big effort in the neuromorphic community originating in Carver Meads lab at Caltech.

Perhaps the best success of such analog computation efforts are with Neurogrid (full disclosure I worked with this group): https://web.stanford.edu/group/brainsinsilicon/

One thing to keep in mind is that communication is often still in digital spikes, which some may argue is more neuron like than an analog encoding.

deepnotderp · on Feb 11, 2017

Yes, I work at a startup that's doing this, but it's very dangerous with the amounts of noise, which seem fine on MNIST and even CIFAR, but you die on imagenet.

The key to circumventing this is very complicated and our "secret sauce".

38kkdiu · on Feb 12, 2017

Can you say a bit more about what you mean by "it's very dangerous with the amounts of noise"?

deepnotderp · on Feb 12, 2017

Network accuracy crashes

adamnemecek · on Feb 12, 2017

I'm hoping that we could see photonic computers.

aj7 · on Feb 12, 2017

For laughs, I searched this document for the term Nvidia.