Announcing TensorFlow 0.8 – now with distributed computing support (googleresearch.blogspot.com)
214 points by mrry on April 13, 2016 | 40 comments



Is OpenCL anywhere on the roadmap? I now make laptop and desktop purchasing decisions almost entirely based on whether an Nvidia card is present. It's one reason I didn't get the latest MacBook Pro.


It's on the page titled "Roadmap", so ... yes?

https://github.com/tensorflow/tensorflow/blob/master/tensorf...


AMD is so woefully behind the curve in GPGPU, and especially deep learning, that their management should be replaced. They think they're still competing with Intel (they're not). Nvidia has a wide-open field for the foreseeable future, and this will end up being a bad thing for consumers.


Couldn't agree more. They need to diversify and really push GPU compute and OpenCL. Where's the equivalent of CuDNN for OpenCL? How many AMD engineers would it take to build that, and what would the impact be?


I don't think Google is working on it. Intel and AMD have some comments on the relevant issue[1].

As far as I can see there seems to be a lot of noise ("How do I run Caffe on OpenCL") and not huge amounts of progress.

The truth is that you are better off waiting for external NVidia GPUs[2] to become widely available[3] than waiting for a decent OpenCL implementation.

[1] https://github.com/tensorflow/tensorflow/issues/22

[2] http://udibr.github.io/using-external-gtx-980-with-macbook-p...

[3] http://www.pcworld.com/article/3019369/hardware/the-razer-co...


The OpenCL branch of Caffe has had a lot of work recently and is usable now. It's still a bit slower than with CUDA/CuDNN, but it works.

https://github.com/BVLC/caffe/tree/opencl


Can anybody offer a TLDR of how this works (or point me to one)? It seems particularly well-suited for convolutional nets with many layers if I understand correctly, but I am curious as to whether e.g. recurrent nets may receive the same speed-ups from parallelization.


People at Google use multiple replicas to train RNNs and LSTMs to very good effect.

At heart, the most common distributed training mechanism creates multiple "replicas" of the model -- each replica has a full copy. It splits the training data among the replicas, and then at the end of every batch, synchronizes the updates to the model weights between the replicas. (A simple way to think of it is taking the average of the gradients produced at each replica and having all replicas apply the average gradient. Equivalently, just reduce the per-replica learning rate, apply all of the gradients, and then propagate the new state back to everyone.)
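
For anyone who wants the mechanics spelled out, here's a minimal NumPy sketch of that synchronous scheme (toy linear model, made-up shapes and data; not TensorFlow's actual implementation):

    import numpy as np

    # Toy setup: linear regression, squared loss, 4 simulated replicas.
    num_replicas, batch_size, dim = 4, 32, 10
    w = np.zeros(dim)   # shared model weights
    lr = 0.1

    def gradient(w, X, y):
        # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n w.r.t. w.
        return X.T.dot(X.dot(w) - y) / len(y)

    for step in range(100):
        # Each replica gets its own shard of the global batch (random data here).
        X = np.random.randn(num_replicas, batch_size, dim)
        y = np.random.randn(num_replicas, batch_size)

        # Every replica computes gradients on its shard (sequential here,
        # but independent and parallel on a real cluster).
        grads = [gradient(w, X[r], y[r]) for r in range(num_replicas)]

        # Synchronize: average the replica gradients, apply the update once,
        # and (conceptually) broadcast the new weights back to all replicas.
        w -= lr * np.mean(grads, axis=0)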


Ah, great, thanks for the explanation.


Since TensorFlow was reported to be slower than most other frameworks (http://arxiv.org/abs/1511.06435), it'll be nice to see how this affects perceived performance.


We've made good progress on single-machine performance and memory usage in the latest release, especially for convolutional models (such as adding support for NCHW data layouts), and we'll continue to aim for best-in-class performance in that setting.

The cool thing about distributed TensorFlow is that it supports efficient synchronous optimizers, so you can scale up the effective batch size by using multiple GPUs, to get increased throughput without losing accuracy.
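
To make "effective batch size" concrete, here is a hand-rolled sketch of in-graph replication in the 0.x graph API (toy linear model; this is the idea, not TensorFlow's built-in synchronous training path): each GPU tower computes gradients on its own shard, and the averaged gradient is applied once per step.

    import tensorflow as tf

    NUM_GPUS, PER_GPU_BATCH, DIM = 2, 64, 10   # effective batch = 2 * 64 = 128

    w = tf.Variable(tf.zeros([DIM, 1]))        # shared weights of a toy linear model
    opt = tf.train.GradientDescentOptimizer(0.1)

    tower_grads, tower_inputs = [], []
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i):
            x = tf.placeholder(tf.float32, [PER_GPU_BATCH, DIM])
            y = tf.placeholder(tf.float32, [PER_GPU_BATCH, 1])
            loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
            # Each tower differentiates its own shard's loss w.r.t. the shared weights.
            tower_grads.append(tf.gradients(loss, [w])[0])
            tower_inputs.append((x, y))        # keep handles so each shard can be fed

    # Average the per-tower gradients and apply them once per step:
    # numerically this behaves like a single step on a batch of 128 examples.
    avg_grad = tf.add_n(tower_grads) / NUM_GPUS
    train_op = opt.apply_gradients([(avg_grad, w)])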


That study's way out of date - it benchmarked the CuDNNv2 version. Soumith's convnet-benchmarks is much more up-to-date: https://github.com/soumith/convnet-benchmarks

But it hasn't yet been updated to reflect the latest performance improvements in 0.8. We've continued to push on both single-machine and distributed performance, and the next update to Soumith's benchmarks should continue to show that improvement.


>> That study's way out of date

I don't know about 'way' out of date - it was first published just a few months ago (November), and the authors pushed a revised version just a few weeks ago (March 30th) - but I definitely agree that it's not using the most current implementations.

>> Soumith's convnet-benchmarks is much more up-to-date

I'll definitely check these out, thanks for the link


And even those numbers on the front page are out of date :) (we're even faster now: https://github.com/soumith/convnet-benchmarks/pull/96, which is from a few weeks ago.)

The field is moving quickly enough that many published benchmarks are stale within 3 months, and it's a lot of hard work to maintain up to date benchmarks, given how many frameworks there are. Also there are performance/memory/scalability/flexibility tradeoffs everywhere, so it's hard to capture everything in one number without a tremendous number of caveats.


vrv addressed why I called it "way" out of date - in the time since the study was done with CuDNNv2, we've moved TensorFlow to CuDNN v4, and NVidia released the CuDNNv5 release candidate a week ago. Each of those releases provides a pretty big speed bump for specific types of DNNs, and we've been pushing out some very significant speed bumps for TensorFlow at the same time.

My conclusion from this is that Soumith's approach to having a living repository is the way to go. It's harder to call it a "publication", but it's providing something of more lasting value than a static performance snapshot in a field where the engineering is moving so quickly.


I wonder how effective this would be on a fleet of Raspberry Pis. With things like Resin.io, Weave, and Kubernetes, I wonder if it would be possible to create something like SETI@home for crowdsourced machine learning for all kinds of different applications. Many of us have spare Raspberry Pis lying around that could be utilized in a global network.


You'd probably have to scale to hundreds or thousands of Pis to achieve the performance you could see from a single $100-200 GPU.


One person has managed to successfully build a non-accelerated version of TensorFlow for the Raspberry Pi. It can run a network, but training will be painfully, PAINFULLY slow (as in months or years of wall-clock time).

Maybe at some point it will be viable, but not with the hardware and software as it is at the moment.


Notice that while 8 GPUs are 8x as effective as 1, 16 GPUs are only about 15x, and a hundred GPUs don't even get you a 70x speedup.

I doubt your idea would prove efficient.


At the very least, computation could be distributed at the hyperparameter tuning stage. Each node would be responsible for training on a different set of hyperparameters. The master node would coordinate the selection and distribution of new hyperparameter sets, amortizing the data-set distribution time.

It would also be possible to distribute computation of batches across nodes. Each node would compute the gradients on its batch, and the master would combine gradients and distribute new weights.

High-speed interconnects (e.g. Infiniband) are not needed in this scenario, and the bandwidth usage scales according to the size of the weights and/or gradients, not the data-set size.
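
For the hyperparameter case, the coordination is embarrassingly parallel. A toy single-machine sketch, with multiprocessing standing in for real nodes and train_and_score standing in for an actual training run (both names are made up for illustration):

    import itertools
    from multiprocessing import Pool

    def train_and_score(params):
        # Stand-in for a full training run on one node; returns a
        # validation score for this hyperparameter set.
        lr, hidden_units = params
        return -abs(lr - 0.01) - abs(hidden_units - 128) / 1000.0   # fake score

    if __name__ == '__main__':
        grid = list(itertools.product([0.1, 0.01, 0.001], [64, 128, 256]))
        pool = Pool(processes=4)                  # the "worker nodes"
        scores = pool.map(train_and_score, grid)  # master farms out configs
        pool.close()
        best_score, best_params = max(zip(scores, grid))
        print('best params:', best_params, 'score:', best_score)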


Moving data would be a bottleneck for sure. Distributing the model itself, along with the state of the calculation and the required samples, is just too much compared to the CPU power a Raspberry Pi can provide.


I would have liked more detail about the cluster. You would still need high speed interconnects (like Infiniband) between the nodes/machines so I don't think crowdsourcing would work.

This could be interesting if ported to an FPGA, though. That could give you a better power/performance tradeoff.


It depends on what kind of algorithms you are using for classification/ml. Some algorithms can be distributed easily, like recommendation engines, etc. Others, like say SVMs, are much harder to distribute.

If you check out Apache Mahout you can get an idea of what is possible and what is not.


I'm getting 404s for some of the tutorial sections when selecting r0.8 (e.g. https://www.tensorflow.org/versions/r0.8/tutorials/mnist/tf/...). master works. Seems like some of the documentation is only built for master and for r0.7, not for r0.8.


(Do you have an example link that doesn't work? I clicked a bunch of links there and they were all working. Feel free to file a bug at github.com/tensorflow/tensorflow)


The link I gave, and others, repeatedly didn't work when I tried them, but now they seem to work!


Very cool! Any progress on Windows support?



I doubt it is a priority. But I can certainly recommend Amazon GPU-enabled instances (~$0.60/hour per GPU - not that much, actually).


Linux is great for training; however, I would like to deploy my models to run locally on user machines running Windows. Theano supports Windows, but TensorFlow doesn't.


Can models be ported between systems easily?


g2 instances have a GPU that is not compatible with stock TensorFlow; you must rebuild it from source. Do you have a workaround for that?


I believe our published wheels now include the code for CUDA compute capability 3.0, so it should work out of the box now.

(as long as the images have cudnn v4 and cuda 7.5 installed, I think :)


Great news, I'll try today!


I find the Amazon GPU prices pretty high in the long run. The g2.2xlarge is around 3x slower than a GTX 980.


It should be a priority. The main reason I picked Theano (and still use it) is Windows support.


I predict that in 10 years we will see the rise of computer psychotherapists.


Nice; it only took them 7 months to catch up to Amazon:

http://www.nikkostrom.com/publications/interspeech2015/strom...


For others who may be interested in the details despite the uninformative tone of this comment: The Amazon paper is about a specific tweak to learning rates for better scalability when doing distributed training. The core principles of distributed DNN training are much older - for example, Dean et al. 2012: https://papers.nips.cc/paper/4687-large-scale-distributed-de... trained a model for Imagenet on 2000 cores, using the DistBelief framework that is the predecessor to TensorFlow.

The question of how to improve the multiple-replica scaling of distributed DNN training is very important, as is the question of creating usable, flexible, and high-performance abstractions in which to implement one. They're also fairly orthogonal. TensorFlow as an architecture focuses on the latter. One could imagine implementing the Amazon tweak within either TF or any other framework.


Funny comment! Except this is actually usable by developers with non-expert knowledge of neural nets



