XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks (arxiv.org)
74 points by jarmitage on March 19, 2016 | 17 comments



Fantastic work. They do rely on floating point numbers for the weight updates (I was secretly hoping that they would have binarized gradient descent).
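For anyone curious what that looks like concretely, here's a minimal NumPy sketch of the pattern (my own toy code, not taken from the paper; `grad_fn` and the names are placeholders): real-valued weights are kept around for the float update, while the forward/backward pass sees the binarized copies.

  import numpy as np

  # Toy sketch of "binary weights, float updates" (illustrative only):
  def binarize(w):
      alpha = np.abs(w).mean()        # scaling factor, as in XNOR-Net
      return alpha * np.sign(w)

  def train_step(w_real, x, y, grad_fn, lr=0.01):
      w_bin = binarize(w_real)        # forward/backward use the binary weights
      g = grad_fn(w_bin, x, y)        # ...but the gradient is still a float tensor
      return w_real - lr * g          # ...applied to the real-valued weights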


Seems that CNTK already implemented this.

https://github.com/Microsoft/CNTK/wiki/Enabling-1bit-SGD


That quantizes the gradient for compression of the communication between nodes, which is cool, but each node must still calculate a 32 bit floating point gradient locally. What GP is asking for is a way to avoid having any floating point math at all.
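For concreteness, a rough sketch of how I understand that scheme (simplified; see the wiki page for the real details): each node still computes a full float gradient, quantises it to one bit per value only for the exchange, and carries the quantisation error into the next minibatch.

  import numpy as np

  def one_bit_exchange(grad, residual):
      """Toy 1-bit gradient compression with error feedback (simplified).
      grad: full-precision local gradient (still computed in float!).
      residual: quantisation error carried over from the previous minibatch."""
      corrected = grad + residual
      # one bit per value; this scale is a simplification of the real scheme
      sent = np.where(corrected >= 0, 1.0, -1.0) * np.abs(corrected).mean()
      return sent, corrected - sent    # what gets communicated, new residual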

If you could implement training with only single-bit operations rather than floating point math, a hardware implementation could be several orders of magnitude faster and more efficient than current CPUs/GPUs. That would certainly usher in a revolution in computer architecture.
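To make the hardware argument concrete: once both operands are ±1, a dot product collapses to an XNOR plus a popcount, which is where the paper gets its name. A toy version (bit-twiddling in pure Python just for illustration; real implementations pack 64 values per machine word):

  def binary_dot(a_bits, b_bits, n):
      """Dot product of two length-n sign vectors (+1 encoded as bit 1, -1 as 0)."""
      xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # 1 wherever the signs agree
      matches = bin(xnor).count("1")               # popcount
      return 2 * matches - n                       # +1 per match, -1 per mismatch

  # two 4-element sign patterns: two positions agree, two disagree -> dot = 0
  print(binary_dot(0b1011, 0b1110, 4))             # -> 0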


It can't work with gradient descent, which requires about 16 bits of precision. The problem is that the gradients are small and get rounded down to zero, so even when many of them are added together the sum stays at zero when the true value should be higher.

However, it's been shown this can be solved by using stochastic rounding. But I believe that would require specialized hardware to implement; I'm not sure.
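A toy NumPy version of the stochastic rounding idea, just to show why it helps with the vanishing-sum problem (illustrative only, not a claim about any particular hardware):

  import numpy as np

  def stochastic_round(x, step):
      """Round x to a multiple of `step`, rounding up with probability equal
      to the fractional part, so the result is unbiased in expectation."""
      scaled = x / step
      floor = np.floor(scaled)
      frac = scaled - floor
      return (floor + (np.random.rand(*np.shape(x)) < frac)) * step

  g = np.full(1000, 0.001)                         # 1000 tiny gradient values
  print((np.round(g / 0.01) * 0.01).sum())         # nearest rounding: 0.0
  print(stochastic_round(g, 0.01).sum())           # stochastic: ~1.0 on average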


Can’t you just store the logarithm of the value instead of the actual value?

Then rounding should be irrelevant.


Logarithms are expensive to compute, and they don't magically add any more bits of accuracy.


Well, (a) base-2 logarithms are cheap to compute, and (b) they add quite a few bits of accuracy if the alternative is rounding the value to 0.


They are not cheap to compute; I believe the best way to approximate them is a polynomial approximation, which costs several cycles and extra operations.

All they do is map the numbers into a different range. A logarithmic value can represent more numbers near 0, but in order to add two values, their logarithms need to be converted back to normal form, where the small numbers are rounded down to zero again.

There's no way around that, as adding a very small number to a very large number will always require many bits of precision to do accurately, regardless of what transforms you use on the numbers.
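A quick NumPy illustration of that point, using fp16 as the stand-in for "low precision":

  import numpy as np

  big, small = 1.0, 1e-4

  # plain low-precision addition: the small value is absorbed
  print(np.float16(big) + np.float16(small))                 # -> 1.0

  # log domain: both values are easy to *store*, but to add them you still
  # need log(exp(a) + exp(b)), and converting the result back to 16 bits
  # loses the small contribution all over again
  total = np.exp(np.logaddexp(np.log(big), np.log(small)))   # 1.0001 in fp64
  print(np.float16(total))                                   # -> 1.0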


> It can't work with gradient descent, which requires about 16 bits of precision.

Has this been mathematically proven? Can you provide a source?


It's more an empirically-verified thing than a mathematical fact; there's nothing magic about 16 bits AFAIK. Empirically 16 bits seems to work well enough for some tasks, taking it down to 8 bits is usually taking it too far, and performance-wise there's not a lot of point playing with values in between, e.g. 12 bits.

(Half-float arithmetic is implemented natively in recent CUDA CC 5.x architectures and is quite convenient; in particular it halves memory bandwidth, which is often the bottleneck.)

Stochastic gradient descent is fairly robust to noisy gradients -- any numerical or quantisation error that you can model approximately as independent zero-mean noise can be 'rolled into the noise term' for SGD without affecting the theory around convergence [0]. It will increase the variance of course, which when taken too far could in practice mean divergence or slow convergence under a reduced learning rate, perhaps to a poorer local minimum.

With extreme quantisation (like binarisation), the error can't really be modelled as independent zero-mean noise, UNLESS you do the kind of stochastic quantisation mentioned (toy sketch after the footnote below). From what I hear this works well enough to allow convergence, but accuracy can take quite a hit. I don't think it has to be 'implemented natively', although no doubt that would speed it up; a large part of the benefit of quantisation during training is not so much to speed up arithmetic as to reduce memory bandwidth and communication latency.

[0] https://en.wikipedia.org/wiki/Stochastic_approximation#Robbi...
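A toy example of that bias/variance trade-off (my own illustration, not from the paper): deterministically taking sign(g) is biased, while quantising stochastically so the expectation matches g is unbiased but noisier, which is exactly the kind of noise SGD can tolerate.

  import numpy as np

  rng = np.random.default_rng(0)
  g = 0.1                                   # one small true gradient component

  deterministic = np.sign(g)                # always +1: E = 1.0, heavily biased
  p_up = (g + 1) / 2                        # choose P(+1) so that E = g
  stochastic = np.where(rng.random(100_000) < p_up, 1.0, -1.0)

  print(deterministic, stochastic.mean())   # 1.0 vs ~0.1 (unbiased, high variance)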


The Nervana 16-bit floating point implementations are probably worth noting in this context.

They perform (speed-wise) pretty well: https://github.com/soumith/convnet-benchmarks


What would be the advantage of binarized gradient descent?


For one, more compression during parameter transfer in data parallelism scenarios.


And maybe the possibility of building more efficient hardware. Floating point units take up a lot of transistors (in the ~100K transistor range AFAIK), whereas boolean logic is tiny.


Going to 1-bit precision is probably overkill, but papers have shown how neural networks with 8 bits of precision and simple ALUs can give results equivalent to full FPUs.
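A rough sketch of what the 8-bit version looks like (a generic linear-quantisation scheme of my own, not the exact one from any particular paper): everything inside the dot product is integer arithmetic, with one float scale per tensor applied at the end.

  import numpy as np

  def quantize(x, bits=8):
      scale = np.abs(x).max() / (2 ** (bits - 1) - 1)   # map values to [-127, 127]
      return np.round(x / scale).astype(np.int8), scale

  w = np.random.randn(256).astype(np.float32)
  a = np.random.randn(256).astype(np.float32)
  wq, w_scale = quantize(w)
  aq, a_scale = quantize(a)

  # all multiply-accumulates in integers; int32 accumulator avoids overflow
  approx = (wq.astype(np.int32) @ aq.astype(np.int32)) * w_scale * a_scale
  print(float(w @ a), float(approx))        # usually agree quite closely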


A 58x speedup sounds great. The question left unanswered here is whether using more layers can recover the lost accuracy, and what kind of speedup could then still be achieved compared to the original version.


Sounds interesting :-) But is there any GitHub repo or Dockerfile cookbook available? Last year I noticed these two repos: https://github.com/kevinlin311tw/caffe-cvprw15 and https://github.com/kevinlin311tw/Caffe-DeepBinaryCode

How different are they?



