Is OpenCL anywhere on the roadmap? I now make laptop and desktop purchasing decisions almost entirely on Nvidia card presence. One reason I didn't get the latest MacBook Pro.
AMD is so woefully behind the curve in gpgpu and especially deep learning that their management should be replaced. They think they're still competing with Intel (they're not). Nvidia has a wide open field for the foreseeable future and this will end up being a bad thing for consumers.
Couldn't agree more. They need to diversify and really push GPU compute and OpenCL. Where's the equivalent of CuDNN for OpenCL? How many AMD engineers would it take to build that, and what would the impact be?
I don't think Google is working on it. Intel and AMD have some comments on the relevant issue[1].
As far as I can see there seems to be a lot of noise ("How do I run Caffe on OpenCL") and not huge amounts of progress.
The truth is that you are better off waiting for external NVidia GPUs[2] to become widely available[3] than waiting for a decent OpenCL implementation.
Can anybody offer a TLDR of how this works (or point me to one)? It seems particularly well-suited for convolutional nets with many layers if I understand correctly, but I am curious as to whether e.g. recurrent nets may receive the same speed-ups from parallelization.
People at Google use multiple replicas to train RNNs and LSTMs to very good effect.
At heart, the most common distributed training mechanism creates multiple "replicas" of the model -- each replica has a full copy. It splits the training data among the replicas, and then at the end of every batch, synchronizes the updates to the model weights between the replicas. (A simple way to think of it: take the average of the gradients produced at each replica, and have all replicas apply the average gradient. Equivalently, just reduce the per-replica learning rate, apply all of the gradients, and then propagate the new state back to everyone.)
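Roughly, in plain Python/NumPy (just the idea, not how TensorFlow actually implements it; compute_gradient_on_shard here is a made-up stand-in for a real forward/backward pass):

    import numpy as np

    num_replicas = 4
    weights = np.zeros(10)        # toy model: a single weight vector
    learning_rate = 0.1

    def compute_gradient_on_shard(weights, shard_id):
        # Stand-in for a real forward/backward pass on that replica's data shard.
        rng = np.random.RandomState(shard_id)
        return rng.randn(*weights.shape)

    for step in range(100):
        # Each replica computes a gradient on its own shard of the batch.
        grads = [compute_gradient_on_shard(weights, r) for r in range(num_replicas)]

        # Synchronize: average the per-replica gradients, apply one update everywhere.
        avg_grad = np.mean(grads, axis=0)
        weights -= learning_rate * avg_grad

        # Equivalent view: apply every gradient at learning_rate / num_replicas.
        # weights -= (learning_rate / num_replicas) * np.sum(grads, axis=0)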
We've made good progress on single-machine performance and memory usage in the latest release, especially for convolutional models (such as adding support for NCHW data layouts), and we'll continue to aim for best-in-class performance in that setting.
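For anyone who hasn't seen the layout option: conv ops take a data_format argument, so opting into NCHW looks roughly like the sketch below (the shapes are made up for illustration, and as far as I know the NCHW path is GPU-only):

    import tensorflow as tf

    # Batch of 32 images, 3 channels, 224x224, laid out as [N, C, H, W].
    images_nchw = tf.placeholder(tf.float32, [32, 3, 224, 224])
    filters = tf.Variable(tf.truncated_normal([3, 3, 3, 64], stddev=0.1))

    # Strides are also specified in NCHW order when data_format='NCHW'.
    conv = tf.nn.conv2d(images_nchw, filters,
                        strides=[1, 1, 1, 1],
                        padding='SAME',
                        data_format='NCHW')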
The cool thing about distributed TensorFlow is that it supports efficient synchronous optimizers, so you can scale up the effective batch size by using multiple GPUs, to get increased throughput without losing accuracy.
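The usual in-graph pattern for that looks something like this condensed sketch (tower_loss and the random input are placeholders for a real model and input pipeline):

    import tensorflow as tf

    num_gpus = 2
    opt = tf.train.GradientDescentOptimizer(0.01)

    def tower_loss(batch):
        # Placeholder model: swap in a real network and loss.
        w = tf.get_variable('w', [100, 10])
        return tf.reduce_mean(tf.square(tf.matmul(batch, w)))

    tower_grads = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            batch = tf.random_normal([64, 100])   # stand-in for an input pipeline
            tower_grads.append(opt.compute_gradients(tower_loss(batch)))
            tf.get_variable_scope().reuse_variables()  # share weights across towers

    # Average each variable's gradients across towers, then apply one synchronous
    # update -- that's what makes the effective batch size num_gpus * 64.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        averaged.append((tf.add_n(grads) / float(len(grads)), grads_and_vars[0][1]))

    train_op = opt.apply_gradients(averaged)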
Soumith's convnet-benchmarks is much more up-to-date, but it hasn't yet been updated to reflect the latest performance improvements in 0.8. We've continued to push on both single-machine and distributed performance, and the next update to Soumith's benchmarks should continue to show that improvement.
I don't know about 'way' out of date; it was first published just a few months ago (November), and the authors pushed a revised version just a few weeks ago (March 30th). But I definitely agree that it's not using the most current implementations.
>> Soumith's convnet-benchmarks is much more up-to-date
I'll definitely check these out, thanks for the link
The field is moving quickly enough that many published benchmarks are stale within 3 months, and it's a lot of hard work to maintain up to date benchmarks, given how many frameworks there are. Also there are performance/memory/scalability/flexibility tradeoffs everywhere, so it's hard to capture everything in one number without a tremendous number of caveats.
vrv addressed why I called it "way" out of date - in the time since the study was done with CuDNN v2, we've moved TensorFlow to CuDNN v4, and NVidia released the CuDNN v5 release candidate a week ago. Each of those releases provides a pretty big speed bump for specific types of DNNs, and we've been pushing out some very significant speed bumps for TensorFlow at the same time.
My conclusion from this is that Soumith's approach to having a living repository is the way to go. It's harder to call it a "publication", but it's providing something of more lasting value than a static performance snapshot in a field where the engineering is moving so quickly.
I wonder how effective this would be on a fleet of raspberry pis. With things like Resin.io, Weave, and Kubernetes, I wonder if it would be possible to create something like Seti@home for crowdsourced machine learning for all kinds of different applications. Many of us have spare raspberry pis laying around that could be utilized in a global network.
One person has managed to successfully build a non-accelerated version of TensorFlow for the Raspberry Pi. It can run a network, but training will be painfully, PAINFULLY slow (as in months or years of wall-clock time).
Maybe at some point it will be viable, but not with the hardware and software as it is at the moment.
At the very least, computation could be distributed at the hyperparameter tuning stage. Each node would be responsible for training on a different set of hyperparameters. The master node would coordinate the selection and distribution of new hyperparameter sets, amortizing the data-set distribution time.
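Something like the sketch below, where dispatch_to_node stands in for whatever RPC/job mechanism the cluster actually uses, and the grid is just an example search space:

    import itertools
    import random

    # Example search space; random search over continuous ranges would work just as well.
    search_space = {
        'learning_rate': [0.1, 0.01, 0.001],
        'batch_size': [32, 64, 128],
        'dropout': [0.3, 0.5],
    }

    def candidate_configs(space):
        keys = sorted(space)
        for values in itertools.product(*(space[k] for k in keys)):
            yield dict(zip(keys, values))

    def dispatch_to_node(node, config):
        # Stand-in for a real RPC: the node trains on its local copy of the
        # data with `config` and reports back a validation accuracy.
        return random.random()

    nodes = ['node-%02d' % i for i in range(8)]
    results = []
    # In practice these dispatches would run concurrently, one per node.
    for node, config in zip(itertools.cycle(nodes), candidate_configs(search_space)):
        results.append((dispatch_to_node(node, config), config))

    best_accuracy, best_config = max(results, key=lambda r: r[0])
    print(best_accuracy, best_config)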
It would also be possible to distribute computation of batches across nodes. Each node would compute the gradients on its batch, and the master would combine gradients and distribute new weights.
High-speed interconnects (e.g. Infiniband) are not needed in this scenario, and the bandwidth usage scales according to the size of the weights and/or gradients, not the data-set size.
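Back-of-the-envelope, assuming a hypothetical 10M-parameter model stored as float32:

    num_params = 10 * 10**6        # hypothetical 10M-parameter model
    bytes_per_param = 4            # float32

    gradients_up = num_params * bytes_per_param    # worker -> master, per batch
    weights_down = num_params * bytes_per_param    # master -> worker, per batch

    # Roughly 80 MB of traffic per worker per batch, regardless of data-set size.
    print((gradients_up + weights_down) / 1e6, "MB per worker per batch")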
Moving data would be a bottleneck for sure. Distributing the model itself with the state of the calculation and the required samples is just too much compared to the CPU power a raspberry pi can provide.
I would have liked more detail about the cluster. You would still need high speed interconnects (like Infiniband) between the nodes/machines so I don't think crowdsourcing would work.
This could be interesting if ported to an FPGA though. That could give you that power/performance tradeoff.
It depends on what kind of algorithms you are using for classification/ml. Some algorithms can be distributed easily, like recommendation engines, etc. Others, like say SVMs, are much harder to distribute.
If you check out Apache Mahout you can get an idea of what is possible and what is not.
(Do you have an example link that doesn't work? I clicked a bunch of links there and they were all working. Feel free to file a bug at github.com/tensorflow/tensorflow)
Linux is great for training, however I would like to deploy my models to run locally on user machines which are running Windows. Theano supports Windows but TensorFlow doesn't.
For others who may be interested in the details despite the uninformative tone of this comment: The Amazon paper is about a specific tweak to learning rates for better scalability when doing distributed training. The core principles of distributed DNN training are much older - for example, Dean et al. 2012: https://papers.nips.cc/paper/4687-large-scale-distributed-de... trained a model for Imagenet on 2000 cores, using the DistBelief framework that is the predecessor to TensorFlow.
The question of how to improve the multiple-replica scaling of distributed DNN training is very important, as is the question of creating usable, flexible, and high-performance abstractions in which to implement such training. They're also fairly orthogonal. TensorFlow as an architecture focuses on the latter. One could imagine implementing the Amazon tweak within either TF or any other framework.