TensorFlow – Consise Examples for Beginners (github.com/aymericdamien)
340 points by aymericdamien on May 31, 2016 | hide | past | favorite | 30 comments



I find it so aggravating that nearly every last ML framework documents their CNN libraries in terms of canned MNIST datasets imported from the library in a preprocessed form.

It's always left as a useless exercise for the reader to divine how to generate such a dataset from his/her own data.

Examples should be way more general. The starting point shouldn't be:

  from tensorflow.examples.tutorials.mnist import input_data
  mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
It should instead start with "here is a directory of images and their classes" and end with a CNN model.
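For what it's worth, that "directory of images" starting point can be sketched in a few lines of NumPy. This toy version assumes raw 8-bit grayscale files so it stays dependency-free; a real loader would decode PNG/JPEG with PIL or scikit-image instead of np.fromfile:

```python
import os
import tempfile
import numpy as np

def load_image_dir(root, h=28, w=28):
    """Build (images, one_hot_labels) from a directory laid out as
    root/<class_name>/<image files>. Assumes each file is raw 8-bit
    grayscale of size h*w; swap np.fromfile for a PIL/scikit-image
    decode to handle real image formats."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    images, labels = [], []
    for idx, cls in enumerate(classes):
        cls_dir = os.path.join(root, cls)
        for fname in sorted(os.listdir(cls_dir)):
            pixels = np.fromfile(os.path.join(cls_dir, fname), dtype=np.uint8)
            images.append(pixels.reshape(h, w).astype(np.float32) / 255.0)
            one_hot = np.zeros(len(classes), dtype=np.float32)
            one_hot[idx] = 1.0
            labels.append(one_hot)
    return np.stack(images), np.stack(labels), classes

# tiny demo: two classes, one fake "image" each
root = tempfile.mkdtemp()
for cls in ("cat", "dog"):
    os.makedirs(os.path.join(root, cls))
    np.full(28 * 28, 128, dtype=np.uint8).tofile(os.path.join(root, cls, "0.raw"))

X, y, classes = load_image_dir(root)
print(X.shape, y.shape, classes)  # (2, 28, 28) (2, 2) ['cat', 'dog']
```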

EDIT: Should anyone have any insight as to where I might get such a tutorial (or have the desire to write one), I know a herd of ML pre-initiates that would be grateful.


The cifar_10 example code is a good starting point: https://github.com/tensorflow/tensorflow/tree/r0.8/tensorflo...

Read the how to on reading data from files: https://www.tensorflow.org/versions/r0.8/how_tos/reading_dat...

and check out this useful Stack Overflow answer: http://stackoverflow.com/questions/33648322/tensorflow-image...


A few of the lessons on Udacity's Dada Science course cover finding/sifting through datasets and formatting them in a way to work with:

https://classroom.udacity.com/courses/ud359


"Dada Science"… fantastic!


That's typically going to be problem-specific. A simple cat/dog tutorial would be reasonably easy with a lot of the existing libraries like scikit-image/PIL.

The other problem here, though, is corpus layout. ImageNet and the academic datasets typically require special readers, but even then, for real-world datasets there are a few ways to lay out the corpus. In our experience, two layouts have worked well: a folder per label, or the label encoded in the file name.

Then there's still balanced minibatches to worry about: if images are segmented by folder, reading them in order doesn't give you balanced minibatches out.
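Balancing can be done at sampling time rather than on disk. A sketch of a balanced minibatch sampler over integer class labels (hypothetical helper, not any framework's API):

```python
import numpy as np

def balanced_minibatch(labels, batch_size, rng=None):
    """Sample one minibatch with equal counts per class, regardless of
    how the corpus is laid out on disk. `labels` is a 1-D array of
    integer class ids; returns an index array into the corpus."""
    rng = rng or np.random.default_rng(0)
    classes = np.unique(labels)
    per_class = batch_size // len(classes)
    idx = np.concatenate([
        # sample with replacement so rare classes can still fill their share
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)
    return idx

# a skewed corpus: 9 examples of class 0, only 1 of class 1
labels = np.array([0] * 9 + [1])
batch = balanced_minibatch(labels, batch_size=8)
counts = np.bincount(labels[batch], minlength=2)
print(counts)  # [4 4]
```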

Then there's disk I/O to think about: do I really want to re-run the same preprocessing every time I train? And then: how do I explain that to a beginner?

So I'm probably going to want a corpus generator that produces pre-saved, balanced minibatches for training... which leads us back to what you see now.

A good middle ground here might be a corpus generator that takes all the minor stuff like that into consideration... but data is still messy.

I would suggest looking at the wealth of imaging libraries out there in Python and building something based on a "from scratch" image corpus.

Minor plug: We thought about that a lot in building deeplearning4j. http://deeplearning4j.org/canova

You may not use Java, but the idea of "vectorization" is still a good one; I think any ML practitioner who's touched pandas could appreciate it. We built an abstraction called a DataSetIterator that automagically returns the batches, so people don't have to think about the details while still having access to "real" data. I'm not sure what the Python equivalent of this would be, though.
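A rough Python analogue of that iterator idea (a sketch, not deeplearning4j's actual API): wrap the arrays once, and training code only ever sees fixed-size batches.

```python
import numpy as np

class DataSetIterator:
    """Hands out fixed-size (features, labels) batches so training code
    never touches the indexing details itself."""
    def __init__(self, X, y, batch_size):
        assert len(X) == len(y)
        self.X, self.y, self.batch_size = X, y, batch_size

    def __iter__(self):
        for start in range(0, len(self.X), self.batch_size):
            yield (self.X[start:start + self.batch_size],
                   self.y[start:start + self.batch_size])

X = np.arange(10).reshape(10, 1)
y = np.arange(10)
batches = list(DataSetIterator(X, y, batch_size=4))
print([len(b[0]) for b in batches])  # [4, 4, 2]
```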


The format for the dataset is here: http://yann.lecun.com/exdb/mnist/

A good exercise would be to figure out how to extract the data and put it into a numpy array. Then you can test on most - if not all - of the frameworks.
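The image files can be unpacked with a few lines of struct/NumPy. A sketch (the files on that page are also gzipped, so in practice you'd read them through gzip.open first):

```python
import struct
import numpy as np

def parse_idx_images(buf):
    """Parse the IDX image format from the page above: a big-endian
    header (magic 0x00000803, image count, rows, cols) followed by one
    unsigned byte per pixel."""
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    assert magic == 0x00000803, "not an IDX image file"
    return np.frombuffer(buf, dtype=np.uint8, offset=16).reshape(n, rows, cols)

# demo on a synthetic file: header for two 28x28 images plus zero pixels
fake = struct.pack(">IIII", 0x00000803, 2, 28, 28) + bytes(2 * 28 * 28)
images = parse_idx_images(fake)
print(images.shape)  # (2, 28, 28)
```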


I recently parsed that in C#. It's simple enough to do, but the format is weird. It would be much simpler if the digits were stored as individual raw files in folders corresponding to each label.


Larger problem: MNIST, images, and speech are probably low-hanging fruit, with rich features and large, balanced datasets, for which off-the-shelf packages work very well today.


Data cleaning is always a problem in statistics, so much so that most data scientists say that most of their time is spent on it.


I'm totally new to TensorFlow and ML in general, but I've been curious about how this could fit into a system.

Say you need a CNN text classifier algorithm to categorize simple single page documents. So you set one up via TensorFlow, train it with a big dataset, and get it outputting categories with decent accuracy. Could you then use some type of API and query it from a (low traffic) production web app?

Or is it more for the research phase rather than real-time interaction?



Yes, you could totally do that. Consider that the Google Translate app packages its trained network into the app itself, so it can do offline translation directly on the phone.
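To make the serving side concrete, here is a minimal sketch using only the Python standard library. `classify` is a hypothetical rule-based stub standing in for a real model; in production it would feed the text through a TensorFlow graph restored from a checkpoint:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify(text):
    """Stand-in for the trained model (illustration only)."""
    return "invoice" if "total due" in text.lower() else "other"

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.dumps(
            {"category": classify(body.decode("utf-8"))}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

# serve on an ephemeral port and query it once from the same process
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
req = urllib.request.Request("http://127.0.0.1:%d/" % server.server_port,
                             data=b"Total due: $42", method="POST")
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)  # {'category': 'invoice'}
```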


I am looking for a TensorFlow resource that shows how to classify images of a certain kind. For example: given 100 images, find the ones that might contain a soccer ball. I haven't come across such a learning resource with TensorFlow. Has anyone else?


Why don't you just start with the CIFAR-10 example and go from there? https://www.tensorflow.org/versions/r0.8/tutorials/deep_cnn/...


Can we get a mod fix on the title?


This is not for beginners. Without simple examples and example datasets, the concepts remain largely inaccessible to a real beginner.


Irrelevant comment: Title 'consise' -> 'concise'


Thanks for this! I recently ported something over from Theano to TensorFlow (I wrote it up at [1]) and I have to admit I generally enjoyed the experience a great deal, even if the single-GPU performance wasn't good enough to make me switch. The TFLearn library (especially [2]) looks very compelling for prototyping, however, so I'm very excited to see how the project develops.

[1] https://medium.com/@sentimentron/faceoff-theano-vs-tensorflo...

[2] https://github.com/tflearn/tflearn/blob/master/examples/nlp/...


This is missing a FizzBuzz example.


That's an advanced use case.


Can someone point me in the right direction for using a neural network for prediction (the input will be time-series data)? All the beginner's tutorials I have seen deal with classification rather than prediction.


LSTM
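Whatever the model, a forecasting setup needs the series reshaped into supervised (input window, next value) pairs first. A minimal sketch of that step, assuming a plain 1-D NumPy series and a hypothetical `make_windows` helper:

```python
import numpy as np

def make_windows(series, window, horizon=1):
    """Turn a 1-D series into (inputs, targets) pairs for a predictive
    model such as an LSTM: each input is `window` consecutive values,
    each target is the value `horizon` steps past the window."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=np.float32)
X, y = make_windows(series, window=3)
print(X.shape, y[:3])  # (7, 3) [3. 4. 5.]
```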


Thank you!


curated list of dedicated resources: https://github.com/jtoy/awesome-tensorflow


That looks helpful. In case you want something shorter to look through here's a very concise set of notes I took for myself when I first played with TF:

https://gist.github.com/d136o/4c68d010ecb0abfc7c52


Great examples! I find them more accessible than the official tutorial.


Anyone know of any benchmarks for TensorFlow that provide a baseline estimate of the expected performance on some off-the-shelf hardware?


These are the benchmarks maintained by Soumith Chintala (Facebook AI Research):

https://github.com/soumith/convnet-benchmarks

They place TensorFlow performance on par with Torch (within 10%).


Typo: it should be concise with two c's and one s.


In the most compact way possible:

TensorFlow is like NumPy, except that it can work symbolically and can use your GPU in a highly parallel manner. For those two reasons, it is vastly superior to NumPy for tasks like deep learning / machine learning.
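A toy illustration of that eager-vs-symbolic distinction (pure Python, not the actual TensorFlow API):

```python
import numpy as np

# Eager, NumPy style: every line computes a result immediately.
a = np.array([1.0, 2.0])
eager = (a * 3.0).sum()

# Symbolic style: first build a description of the computation, then
# run it later with concrete inputs. TensorFlow's graph works on this
# principle, which is what lets it schedule the work onto a GPU.
graph = lambda x: (x * 3.0).sum()        # nothing computed yet
symbolic = graph(np.array([1.0, 2.0]))   # the "session run" step

print(eager, symbolic)  # 9.0 9.0
```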

Theano is another symbolic numerical library, one that predates TensorFlow, though TensorFlow has seemingly gained more popularity and broader trust.



