Thanks, great to see TF recognizing the issue with pipelines/queues and simplifying it. I wish they did NCHW conversion automatically by detecting the GPU.
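For what it's worth, here's a rough sketch of what I mean, assuming the TF 1.x Python API (the shapes and the conv layer are just illustrative):

    import tensorflow as tf

    # Pick the layout once, based on whether a GPU is visible:
    # NCHW ("channels_first") is usually faster with cuDNN, NHWC is the CPU-friendly default.
    use_gpu = tf.test.is_gpu_available()
    data_format = 'channels_first' if use_gpu else 'channels_last'

    # Illustrative input placeholder, shaped to match the chosen layout.
    shape = [None, 3, 224, 224] if use_gpu else [None, 224, 224, 3]
    images = tf.placeholder(tf.float32, shape)

    net = tf.layers.conv2d(images, filters=64, kernel_size=3,
                           data_format=data_format)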
I think TensorFlow is developed in a very principled manner, focusing on the ability to add more ops and platforms in the future. While this has hurt TF usability in the short term, I am bullish on its future.
From the linked article, to help others like me who mostly scan the comments:
> PyTorch's LSTM network is faster because, by default, it uses cuDNN's LSTM implementation, which fuses layers, steps and point-wise operations. See the blog post on this here.
> Tensorflow’s RNNs (in r1.2), by default, do not use cuDNN’s RNN, and their ‘call’ function describes only one time-step of computation, hence a lot of optimization opportunities are lost. On the flip side, though, this gives the user much more flexibility, provided that the user knows what he is doing.
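To make the difference concrete, here's a minimal sketch of the default per-time-step path, assuming the TF 1.x API (the sizes are made up):

    import tensorflow as tf

    batch, time_steps, input_dim, num_units = 32, 100, 128, 256
    inputs = tf.placeholder(tf.float32, [batch, time_steps, input_dim])

    # Default path: a generic LSTM cell whose `call` describes one time-step;
    # dynamic_rnn loops it over the sequence, so cuDNN's fused kernels are not used.
    cell = tf.contrib.rnn.BasicLSTMCell(num_units)
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

    # The fused alternative lives in tf.contrib.cudnn_rnn (CudnnLSTM); its exact
    # constructor and call signature changed across 1.x releases, so check the
    # docs for your version before copying this.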
How much of this is solely due to the 1-bit gradient descent? It's a genuine breakthrough and MSR (or whoever came up with it) deserves all the credit, but assuming it's not patented I imagine Google will be adding it to TF sooner or later and that will close most of the gap.
EDIT: My bad, I did not see at the end of the article that 1-bit SGD is not enabled on Keras yet, so the performance wins are coming from somewhere else. Neato.
Assessing accuracy? If you're running the same architecture with the same training regime, why would you expect differing accuracy numbers (other than a bug of some sort)? Seems a bit strange to include.
> TensorFlow shared the training script for Inception V3, and offered pre-trained models to download. However, it is difficult to retrain the model and achieve the same accuracy, because that requires additional understanding of details such as data pre-processing and augmentation. The best accuracy that was achieved by a third party (Keras in this case) is about 0.6% worse than what the original paper reported. Researchers in the CNTK team worked hard and were able to train a CNTK Inception V3 model with 5.972% top 5 error, even better than the original paper reported!
This suggests that an improvement in accuracy is possible by switching to CNTK, which is why I included an accuracy metric from both frameworks (and also for sanity checking, as you note).
No, if you re-read the first sentence there, it says that the different results are attributed to differences in "pre-processing and augmentation". The choice of NN framework is essentially irrelevant.
Note, though, that the preprocessing and augmentation is (at least in TF) done within the framework itself. I helped debug the pure-TensorFlow version of the Inception input pipeline, and getting it to match the earlier DistBelief version was agonizing -- it really shows all of the differences (and bugs) in the image processing ops. And there can be subtle effects -- differences in which image resizing algorithm you use, for example.
But it's worth noting that this code is all released:
It may be hard to replicate that across all platforms, though -- as an example, the distortions include using four different image resizing algorithms.
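To make that concrete, here's a hedged sketch (TF 1.x image ops; the function name and the `method_id` argument are mine, not the actual Inception preprocessing code):

    import tensorflow as tf

    def resize_with_varied_method(image, height, width, method_id):
        """Resize with one of TF's four resize algorithms, chosen by method_id."""
        # Varying the resize method per example adds augmentation diversity,
        # but it is exactly the kind of detail that makes results hard to replicate.
        methods = [
            tf.image.ResizeMethod.BILINEAR,
            tf.image.ResizeMethod.NEAREST_NEIGHBOR,
            tf.image.ResizeMethod.BICUBIC,
            tf.image.ResizeMethod.AREA,
        ]
        return tf.image.resize_images(image, [height, width],
                                      method=methods[method_id % len(methods)])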
That's not really related to the network. When you build a NN, you shouldn't expect different accuracy using different frameworks (given that you're using the same hyperparameters and training/evaluating on the same data).
I mean this in the nicest possible way (so please don't take offense), but I think you're missing what is being said there. The framework itself should have no impact, positive or negative, on model accuracy. That being said, it can be extremely challenging to reproduce results given the stochasticity of random batches and asynchronous updates. Furthermore, precisely specifying the methods of data augmentation can be tedious, thus the protocol is often only partially detailed in published work which further exacerbates the challenges of reproducing results.
We could introduce a simple rule to get rid of this kind of problem: if it can't be reproduced with the data supplied, then it isn't true, so you can't publish.
The reasons for this are many: different order of operations, failure to propagate random seeds, different order of processing files, etc. I don't think there is a structured study of this, so that would be a great thing to do!
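Even just pinning the seeds is less trivial than it sounds; here's a minimal sketch of the usual suspects (assuming TF 1.x naming):

    import random
    import numpy as np
    import tensorflow as tf

    SEED = 42
    random.seed(SEED)          # Python-level randomness (e.g. file shuffling order)
    np.random.seed(SEED)       # NumPy-based augmentation
    tf.set_random_seed(SEED)   # graph-level seed in TF 1.x

    # Even with all of these set, op-level seeds (dropout, shuffle queues),
    # file ordering, and non-deterministic GPU kernels can still make runs diverge.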
From my testing before starting the benchmarks, no. I also searched around the internet beforehand, and others mention that there is no noticeable overhead.
I did have to go through the rigmarole of setting up TensorFlow for GPU training in GCE (building TF, installing NVIDIA drivers, cuDNN, etc.), and Docker would definitely have been a boon! Or it would be nice if GCE had an image marketplace like AWS.
[1] https://groups.google.com/a/tensorflow.org/d/msg/discuss/Dhy...