
...

Pretty sure LeCun and I understand what the Von Neumann Bottleneck is, thank you very much.

The thing is, though, that TrueNorth isn't doing anything special by pouring a ton of memory on die, and even on GPUs, CNN runtime and energy consumption are dominated by compute.




I appreciate healthy discussion of technical topics. However, I'm not sure you're having this discussion in good faith. I wrote this response in case you are.

LeCun never said anything about the Von Neumann Bottleneck. TrueNorth is not a Von Neumann architecture; it does not have a memory bus; it does not have the Von Neumann Bottleneck [0, 2, 3]. From Wikipedia [1]:

"TrueNorth circumvents the von-Neumann-architecture bottlenecks and is very energy-efficient, consuming 70 milliwatts, about 1/10,000th the power density of conventional microprocessors"

If you disagree, please explain how you think the Von Neumann Bottleneck applies here.

With regard to energy consumption, keep in mind that the smallest GPUs (TX1) are ~10W and typical FPGAs ~1W, versus 70mW for TrueNorth! It's popular to hate on TrueNorth, but you could throw 10 of them together and still be fantastically more efficient than anything else today - that's super cool to me! It required lots of special engineering effort to get right, such as building a lot of on-chip memory.

On-chip memory is one of the most difficult components to get right, minimizing transistors while not breaking physics. It's not as simple as "pouring tons of memory on a die" and requires specialized engineers who lay out these components by hand. The event-driven, asynchronous nature of TrueNorth is fairly unique and undoubtedly added complexity to the memory design.

Do you have any references or evidence for CNN runtimes being mostly dominated by compute? The operations performed in a CNN are more than just convolution: for every input you multiply by a weight that has to be fetched, so you have a memory-bound problem, and memory access is much more expensive than ALU operations. Don't just take my word for it, listen to Bill Dally (Chief Scientist at NVIDIA, Stanford CS prof, and general computer architecture badass) [4]:

"State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power."

This is what TrueNorth got right, and it made that bet, completing its design, before AlexNet was even published. That was a time when Hinton was viewed by much of the ML community as a heretic talking about RBMs and backprop, and hardly anyone believed him. TrueNorth, like NNs at the time, gets some shade for doing things differently, but over time we're seeing those choices validated by other researchers and incorporated into other architectures.

I recommend reading [4] if you haven't already, as it is rich in insights for building efficient NN architectures.

[0] https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_N...

[1] https://en.wikipedia.org/wiki/TrueNorth

[2] http://ieeexplore.ieee.org/document/7229264/?reload=true&arn...

[3] http://www.research.ibm.com/articles/brain-chip.shtml

[4] https://arxiv.org/pdf/1602.01528.pdf


> GPUs (TX1) are ~10W, typical FPGAs ~1W, versus 70mW for TrueNorth!

These numbers are meaningless. If you want to compare power consumption for different chips, you need to make sure they:

1. Perform the same task: running the same algorithm on the same data

2. Use the same precision (number of bits) in both data storage and computation.

3. Achieve the same accuracy on the benchmark.

4. Run at the same speed (finish the benchmark at the same time). In other words, look at energy per task, not per time.

If even a single one of these conditions is not met, you're comparing apples to oranges. No valid comparison that I know of has been made so far.
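To make point 4 concrete, here's a tiny Python sketch with made-up power and latency numbers (not measurements of any of the chips above) showing why raw wattage alone tells you nothing:

    # Energy per task = power (W) * time per task (s).
    # The numbers below are hypothetical, purely to illustrate the point.
    chips = {
        # name: (power_watts, seconds_per_inference)
        "chip_A": (10.0,  0.001),   # 10 W but fast
        "chip_B": (0.070, 1.0),     # 70 mW but slow
    }

    for name, (watts, secs) in chips.items():
        print("%s: %.4f J per inference" % (name, watts * secs))

    # chip_A: 0.0100 J per inference
    # chip_B: 0.0700 J per inference  <- the "low power" chip loses per task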

P.S. The numbers you provided are off even ignoring my main point: typical power consumption of an FPGA chip is 10-40W, and I don't know where you got 70mW for TrueNorth or what it represents.


Also, those are teeny 32x32 images.


I do mean to have a good technical discussion; apologies if I sounded derisive. I was kind of annoyed that you would assume LeCun and I don't know what the Von Neumann Bottleneck is.

Anyways, as for showing that convolution runtimes are dominated by compute, not lookup: as much as I tried, I couldn't find the goddamn chart that shows the breakdown, but I did see that chart somewhere, and my own experiments show it to be true. It IS true, however, that in general memory access is far more expensive than the operations, but deep nets are basically "take the data in and chew on it for a long time". Besides, the way to beat the Von Neumann Bottleneck likely lies in fabrication technology (like HBM2 and TSVs), not design. What makes you think their SRAM cell is custom? It appears to be a standard SRAM cell. That's what I meant by "pouring memory on die".
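Since I can't find that chart, here's a rough sketch of the argument instead (Python; the layer shape is a hypothetical AlexNet-like conv layer I picked for illustration, not taken from any of the references above): a conv layer reuses each weight and activation many times, so the MAC count dwarfs the unique data that has to be moved.

    # Arithmetic intensity of one conv layer: MACs per byte of unique data touched.
    # Layer shape is a hypothetical AlexNet-like example (assumption).
    C_in, C_out = 96, 256        # input / output channels
    K = 5                        # kernel size (KxK)
    H, W = 27, 27                # output spatial dims
    BYTES = 4                    # fp32

    macs    = C_out * H * W * C_in * K * K
    weights = C_out * C_in * K * K * BYTES
    ifmap   = C_in * 31 * 31 * BYTES          # padded input, roughly
    ofmap   = C_out * H * W * BYTES
    bytes_moved = weights + ifmap + ofmap     # each tensor touched once, best case

    print("MACs: %.0fM, unique data: %.1f MB, MACs per byte: %.0f"
          % (macs / 1e6, bytes_moved / 1e6, macs / bytes_moved))
    # ~448M MACs vs ~3.6 MB of data -> each byte is reused ~125 times,
    # which is why conv layers tend to be compute-bound (fully connected
    # layers, where each weight is used once, are the memory-bound ones).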

And the Von Neumann Bottleneck is primarily caused by memory access (a.k.a. data movement) being expensive. What happens if you have to move data between multiple TrueNorth chips?


Apologies if I'm being too academic here, but not all memory bottlenecks or communication bottlenecks are the "Von Neumann Bottleneck".

The term originally referred to both data and program memory sitting on the other side of a shared bus from the CPU, which meant you could only access one at a time. If you were trying to fetch your next instruction, you wait. If you then need data, you wait. It wasn't a problem back with slowly executing EDVAC code [0]. Based on this definition, most architectures today do not have the Von Neumann Bottleneck, as they are not Von Neumann architectures [1, 2, 3].

A slightly looser definition of the Von Neumann Bottleneck refers to the separation between CPU and memory over a single bus. This usage likely originated because fully Von Neumann architectures are so rare, yet the general problem is similar enough to share the name. GPUs don't have this issue because they employ parallelism through multiple memory ports talking to off-chip RAM. TrueNorth also doesn't have this issue because it has 4096 parallel cores, each with its own localized memory and no off-chip memory. There could certainly be other bottlenecks in the system, even within the memory system, but those wouldn't be the Von Neumann Bottleneck [0].

[0] https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_N...

[1] https://news.ycombinator.com/item?id=2645652

[2] http://ithare.com/modified-harvard-architecture-clarifying-c...

[3] http://cs.stackexchange.com/questions/24599/which-architectu...


Yeah, this is true, but the point I wanted to make is that physics doesn't care what it's called, just that you're moving data incredibly long distances through massive decoders. This is, in essence, what costs the huge amount of energy. By contrast, "pouring memory on die" solves this problem almost completely, but your compute (which was your major problem anyway) is still your biggest issue, and it's gotten worse!

By "pour memory on die" I mean that the memory is on die, clearly there are some special techniques being used to manage that memory, but physically, this is what's saving power.


Here's at least a start: http://www.slideshare.net/embeddedvision/tradeoffs-in-implem...

As you can see, ~10-1000X more (the scale is logarithmic) is spent on compute than on data movement, and that's with DDR, not even HBM2, let alone on-chip memory!



