Announcing TensorFlow 0.8 – now with distributed computing support (googleresearch.blogspot.com)
214 points by mrry on April 13, 2016 | 40 comments



Is OpenCL anywhere on the roadmap? I now make laptop and desktop purchasing decisions almost entirely based on whether an Nvidia card is present. It's one reason I didn't get the latest MacBook Pro.


It's on the page titled "Roadmap", so ... yes?

https://github.com/tensorflow/tensorflow/blob/master/tensorf...


AMD is so woefully behind the curve in GPGPU, and especially deep learning, that their management should be replaced. They think they're still competing with Intel (they're not). Nvidia has a wide-open field for the foreseeable future, and this will end up being a bad thing for consumers.


Couldn't agree more. They need to diversify and really push GPU compute and OpenCL. Where's the equivalent of CuDNN for OpenCL? How many AMD engineers would it take to build that, and what would the impact be?


I don't think Google is working on it. Intel and AMD have some comments on the relevant issue[1].

As far as I can see there seems to be a lot of noise ("How do I run Caffe on OpenCL") and not huge amounts of progress.

The truth is that you are better off waiting for external NVidia GPUs[2] to become widely available[3] than waiting for a decent OpenCL implementation.

[1] https://github.com/tensorflow/tensorflow/issues/22

[2] http://udibr.github.io/using-external-gtx-980-with-macbook-p...

[3] http://www.pcworld.com/article/3019369/hardware/the-razer-co...


The OpenCL branch of Caffe has had a lot of work recently and is usable now. It's still a bit slower than with CUDA/CuDNN, but it works.

https://github.com/BVLC/caffe/tree/opencl


Can anybody offer a TLDR of how this works (or point me to one)? It seems particularly well-suited for convolutional nets with many layers if I understand correctly, but I am curious as to whether e.g. recurrent nets may receive the same speed-ups from parallelization.


People at Google use multiple replicas to train RNNs and LSTMs to very good effect.

At heart, the most common distributed training mechanism creates multiple "replicas" of the model -- each replica has a full copy. It splits the training data among the replicas, and then at the end of every batch, synchronizes the updates to the model weights between the replicas. (A simple way to think of it is taking the average of the gradients produced at each replica and having all replicas apply the average gradient. Equivalently, just reduce the per-replica learning rate, apply all of the gradients, and then propagate the new state back to everyone.)
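
For anyone who wants the mechanics spelled out, here's a minimal NumPy sketch of that synchronous scheme (toy linear model, made-up shapes and data; not TensorFlow's actual implementation):

    import numpy as np

    # Toy setup: linear regression, squared loss, 4 simulated replicas.
    num_replicas, batch_size, dim = 4, 32, 10
    w = np.zeros(dim)   # shared model weights
    lr = 0.1

    def gradient(w, X, y):
        # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n w.r.t. w.
        return X.T.dot(X.dot(w) - y) / len(y)

    for step in range(100):
        # Each replica gets its own shard of the global batch (random data here).
        X = np.random.randn(num_replicas, batch_size, dim)
        y = np.random.randn(num_replicas, batch_size)

        # Every replica computes gradients on its shard (sequential here,
        # but independent and parallel on a real cluster).
        grads = [gradient(w, X[r], y[r]) for r in range(num_replicas)]

        # Synchronize: average the replica gradients, apply the update once,
        # and (conceptually) broadcast the new weights back to all replicas.
        w -= lr * np.mean(grads, axis=0)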


Ah, great, thanks for the explanation.


Since TensorFlow was reported to be slower than most other frameworks (http://arxiv.org/abs/1511.06435), it'll be nice to see how this affects perceived performance.


We've made good progress on single-machine performance and memory usage in the latest release, especially for convolutional models (such as adding support for NCHW data layouts), and we'll continue to aim for best-in-class performance in that setting.

The cool thing about distributed TensorFlow is that it supports efficient synchronous optimizers, so you can scale up the effective batch size by using multiple GPUs, to get increased throughput without losing accuracy.
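
To make "effective batch size" concrete, here is a hand-rolled sketch of in-graph replication in the 0.x graph API (toy linear model; this is the idea, not TensorFlow's built-in synchronous training path): each GPU tower computes gradients on its own shard, and the averaged gradient is applied once per step.

    import tensorflow as tf

    NUM_GPUS, PER_GPU_BATCH, DIM = 2, 64, 10   # effective batch = 2 * 64 = 128

    w = tf.Variable(tf.zeros([DIM, 1]))        # shared weights of a toy linear model
    opt = tf.train.GradientDescentOptimizer(0.1)

    tower_grads, tower_inputs = [], []
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i):
            x = tf.placeholder(tf.float32, [PER_GPU_BATCH, DIM])
            y = tf.placeholder(tf.float32, [PER_GPU_BATCH, 1])
            loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
            # Each tower differentiates its own shard's loss w.r.t. the shared weights.
            tower_grads.append(tf.gradients(loss, [w])[0])
            tower_inputs.append((x, y))        # keep handles so each shard can be fed

    # Average the per-tower gradients and apply them once per step:
    # numerically this behaves like a single step on a batch of 128 examples.
    avg_grad = tf.add_n(tower_grads) / NUM_GPUS
    train_op = opt.apply_gradients([(avg_grad, w)])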


That study's way out of date - it benchmarked the CuDNNv2 version. Soumith's convnet-benchmarks is much more up-to-date: https://github.com/soumith/convnet-benchmarks

But it hasn't yet been updated to reflect the latest performance improvements in 0.8. We've continued to push on both single-machine and distributed performance, and the next update to Soumith's benchmarks should continue to show that improvement.


>> That study's way out of date

I don't know about 'way' out of date - it was first published just a few months ago (November), and the authors pushed a revised version just a few weeks ago (March 30th) - but I definitely agree that it's not using the most current implementations.

>> Soumith's convnet-benchmarks is much more up-to-date

I'll definitely check these out, thanks for the link


And even those numbers on the front page are out of date :) (we're even faster now: https://github.com/soumith/convnet-benchmarks/pull/96, which is from a few weeks ago.)

The field is moving quickly enough that many published benchmarks are stale within 3 months, and it's a lot of hard work to maintain up to date benchmarks, given how many frameworks there are. Also there are performance/memory/scalability/flexibility tradeoffs everywhere, so it's hard to capture everything in one number without a tremendous number of caveats.


vrv addressed why I called it "way" out of date - in the time since the study was done with CuDNNv2, we've moved TensorFlow to CuDNN v4, and NVidia released the CuDNNv5 release candidate a week ago. Each of those releases provides a pretty big speed bump for specific types of DNNs, and we've been pushing out some very significant speed bumps for TensorFlow at the same time.

My conclusion from this is that Soumith's approach to having a living repository is the way to go. It's harder to call it a "publication", but it's providing something of more lasting value than a static performance snapshot in a field where the engineering is moving so quickly.


I wonder how effective this would be on a fleet of Raspberry Pis. With things like Resin.io, Weave, and Kubernetes, I wonder if it would be possible to create something like SETI@home for crowdsourced machine learning for all kinds of different applications. Many of us have spare Raspberry Pis lying around that could be utilized in a global network.


You'd probably have to scale to hundreds or thousands of Pis to achieve the performance you could see from a single $100-200 GPU.


One person has managed to successfully build a non-accelerated version of TensorFlow for the Raspberry Pi. It can run a network, but training will be painfully, PAINFULLY slow (as in months or years of wall-clock time).

Maybe at some point it will be viable, but not with the hardware and software as it is at the moment.


Notice that while 8 GPUs are 8x as effective as 1, 16 GPUs are only about 15x, and a hundred GPUs don't even get you a 70x speedup.

I doubt your idea would prove efficient.


At the very least, computation could be distributed at the hyperparameter tuning stage. Each node would be responsible for training on a different set of hyperparameters. The master node would coordinate the selection and distribution of new hyperparameter sets, amortizing the data-set distribution time.

It would also be possible to distribute computation of batches across nodes. Each node would compute the gradients on its batch, and the master would combine gradients and distribute new weights.

High-speed interconnects (e.g. Infiniband) are not needed in this scenario, and the bandwidth usage scales according to the size of the weights and/or gradients, not the data-set size.
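
For the hyperparameter case, the coordination is embarrassingly parallel. A toy single-machine sketch, with multiprocessing standing in for real nodes and train_and_score standing in for an actual training run (both names are made up for illustration):

    import itertools
    from multiprocessing import Pool

    def train_and_score(params):
        # Stand-in for a full training run on one node; returns a
        # validation score for this hyperparameter set.
        lr, hidden_units = params
        return -abs(lr - 0.01) - abs(hidden_units - 128) / 1000.0   # fake score

    if __name__ == '__main__':
        grid = list(itertools.product([0.1, 0.01, 0.001], [64, 128, 256]))
        pool = Pool(processes=4)                  # the "worker nodes"
        scores = pool.map(train_and_score, grid)  # master farms out configs
        pool.close()
        best_score, best_params = max(zip(scores, grid))
        print('best params:', best_params, 'score:', best_score)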


Moving data would be a bottleneck for sure. Distributing the model itself, along with the state of the calculation and the required samples, is just too much compared to the CPU power a Raspberry Pi can provide.


I would have liked more detail about the cluster. You would still need high speed interconnects (like Infiniband) between the nodes/machines so I don't think crowdsourcing would work.

This could be interesting if ported to an FPGA, though. That could give you a better power/performance tradeoff.


It depends on what kind of algorithms you are using for classification/ml. Some algorithms can be distributed easily, like recommendation engines, etc. Others, like say SVMs, are much harder to distribute.

If you check out Apache Mahout you can get an idea of what is possible and what is not.


I'm getting 404s for some of the tutorial sections when selecting r0.8 (e.g. https://www.tensorflow.org/versions/r0.8/tutorials/mnist/tf/...). master works. Seems like some of the documentation is only built for master and for r0.7, not for r0.8.


(Do you have an example link that doesn't work? I clicked a bunch of links there and they were all working. Feel free to file a bug at github.com/tensorflow/tensorflow)


The link I gave, and others, repeatedly didn't work when I tried them, but now they seem to work!


Very cool! Any progress on Windows support?



I doubt it is a priority. But I can certainly recommend Amazon GPU-enabled instances (~$0.60/hour per GPU - not that much, actually).


Linux is great for training; however, I would like to deploy my models to run locally on user machines running Windows. Theano supports Windows, but TensorFlow doesn't.


Can models be ported between systems easily?


g2 instances have a GPU that is not compatible with stock TensorFlow; you must rebuild it from source. Do you have a workaround for that?


I believe our published wheels now include the code for CUDA compute capability 3.0, so it should work out of the box now.

(as long as the images have cudnn v4 and cuda 7.5 installed, I think :)


Great news, I'll try today!


I find the Amazon GPU prices pretty high in the long run. The g2.2xlarge is around 3x slower than a GTX 980.


It should be a priority. The main reason I picked Theano (and still use it) is Windows support.


I predict that in 10 years we will see the rise of computer psychotherapists.


Nice; it only took them 7 months to catch up to Amazon:

http://www.nikkostrom.com/publications/interspeech2015/strom...


For others who may be interested in the details despite the uninformative tone of this comment: The Amazon paper is about a specific tweak to learning rates for better scalability when doing distributed training. The core principles of distributed DNN training are much older - for example, Dean et al. 2012: https://papers.nips.cc/paper/4687-large-scale-distributed-de... trained a model for Imagenet on 2000 cores, using the DistBelief framework that is the predecessor to TensorFlow.

The question of how to improve the multiple-replica scaling of distributed DNN training is very important, as is the question of creating usable, flexible, and high-performance abstractions in which to implement one. They're also fairly orthogonal. TensorFlow as an architecture focuses on the latter. One could imagine implementing the Amazon tweak within either TF or any other framework.


Funny comment! Except this is actually usable by developers with non-expert knowledge of neural nets



