TensorFlow – Consise Examples for Beginners (github.com/aymericdamien)
340 points by aymericdamien on May 31, 2016 | hide | past | favorite | 30 comments



I find it so aggravating that nearly every last ML framework documents their CNN libraries in terms of canned MNIST datasets imported from the library in a preprocessed form.

It's always left as a useless exercise for the reader to divine how to generate such a dataset from his/her own data.

Examples should be way more general. The starting point shouldn't be:

  from tensorflow.examples.tutorials.mnist import input_data
  mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
It should instead start with "here is a directory of images and their classes" and end with a CNN model.
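For what it's worth, that "directory of images" starting point can be sketched in a few lines of NumPy. This toy version assumes raw 8-bit grayscale files so it stays dependency-free; a real loader would decode PNG/JPEG with PIL or scikit-image instead of np.fromfile:

```python
import os
import tempfile
import numpy as np

def load_image_dir(root, h=28, w=28):
    """Build (images, one_hot_labels) from a directory laid out as
    root/<class_name>/<image files>. Assumes each file is raw 8-bit
    grayscale of size h*w; swap np.fromfile for a PIL/scikit-image
    decode to handle real image formats."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    images, labels = [], []
    for idx, cls in enumerate(classes):
        cls_dir = os.path.join(root, cls)
        for fname in sorted(os.listdir(cls_dir)):
            pixels = np.fromfile(os.path.join(cls_dir, fname), dtype=np.uint8)
            images.append(pixels.reshape(h, w).astype(np.float32) / 255.0)
            one_hot = np.zeros(len(classes), dtype=np.float32)
            one_hot[idx] = 1.0
            labels.append(one_hot)
    return np.stack(images), np.stack(labels), classes

# tiny demo: two classes, one fake "image" each
root = tempfile.mkdtemp()
for cls in ("cat", "dog"):
    os.makedirs(os.path.join(root, cls))
    np.full(28 * 28, 128, dtype=np.uint8).tofile(os.path.join(root, cls, "0.raw"))

X, y, classes = load_image_dir(root)
print(X.shape, y.shape, classes)  # (2, 28, 28) (2, 2) ['cat', 'dog']
```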

EDIT: Should anyone have any insight as to where I might get such a tutorial (or have the desire to write one), I know a herd of ML pre-initiates that would be grateful.


The cifar_10 example code is a good starting point: https://github.com/tensorflow/tensorflow/tree/r0.8/tensorflo...

Read the how to on reading data from files: https://www.tensorflow.org/versions/r0.8/how_tos/reading_dat...

and check out this useful Stack Overflow answer: http://stackoverflow.com/questions/33648322/tensorflow-image...


A few of the lessons on Udacity's Dada Science course cover finding/sifting through datasets and formatting them in a way to work with:

https://classroom.udacity.com/courses/ud359


"Dada Science"… fantastic!


That's typically going to be problem-specific. A simple cat/dog tutorial would be reasonably easy with a lot of the existing libraries like scikit-image/PIL.

The other problem here, though, is corpus layout. ImageNet and the academic datasets typically require special readers, but even then, for real-world datasets there are a few ways to lay out the corpus. In our experience, two layouts have worked well: a folder per label, or the label encoded in the file name.

Then there's still balanced minibatches to worry about: if images are segmented by folder, reading them in order doesn't give you balanced minibatches out.
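Balancing can be done at sampling time rather than on disk. A sketch of a balanced minibatch sampler over integer class labels (hypothetical helper, not any framework's API):

```python
import numpy as np

def balanced_minibatch(labels, batch_size, rng=None):
    """Sample one minibatch with equal counts per class, regardless of
    how the corpus is laid out on disk. `labels` is a 1-D array of
    integer class ids; returns an index array into the corpus."""
    rng = rng or np.random.default_rng(0)
    classes = np.unique(labels)
    per_class = batch_size // len(classes)
    idx = np.concatenate([
        # sample with replacement so rare classes can still fill their share
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)
    return idx

# a skewed corpus: 9 examples of class 0, only 1 of class 1
labels = np.array([0] * 9 + [1])
batch = balanced_minibatch(labels, batch_size=8)
counts = np.bincount(labels[batch], minlength=2)
print(counts)  # [4 4]
```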

Then there's disk I/O to think about: do I really want to re-run the same preprocessing every time I train? And then: how do I explain that to a beginner?

So I'm probably going to want a corpus generator that produces pre-saved, balanced minibatches for training... which leads us back to what you see now.

A good middle ground here might be a corpus generator that takes all the minor stuff like that into consideration... but data is still messy.

I would suggest looking at the wealth of imaging libraries out there in Python and building something based on a "from scratch" image corpus.

Minor plug: We thought about that a lot in building deeplearning4j. http://deeplearning4j.org/canova

You may not use Java, but the idea of "vectorization" is still a good one; I think any ML practitioner who's touched pandas could appreciate it. We built an abstraction called a DataSetIterator that automagically returns the batches, so people don't have to think about the details while still having access to "real" data. I'm not sure what the Python equivalent of this would be, though.
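A rough Python analogue of that iterator idea (a sketch, not deeplearning4j's actual API): wrap the arrays once, and training code only ever sees fixed-size batches.

```python
import numpy as np

class DataSetIterator:
    """Hands out fixed-size (features, labels) batches so training code
    never touches the indexing details itself."""
    def __init__(self, X, y, batch_size):
        assert len(X) == len(y)
        self.X, self.y, self.batch_size = X, y, batch_size

    def __iter__(self):
        for start in range(0, len(self.X), self.batch_size):
            yield (self.X[start:start + self.batch_size],
                   self.y[start:start + self.batch_size])

X = np.arange(10).reshape(10, 1)
y = np.arange(10)
batches = list(DataSetIterator(X, y, batch_size=4))
print([len(b[0]) for b in batches])  # [4, 4, 2]
```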


The format for the dataset is here: http://yann.lecun.com/exdb/mnist/

A good exercise would be to figure out how to extract the data and put it into a numpy array. Then you can test on most - if not all - of the frameworks.
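The image files can be unpacked with a few lines of struct/NumPy. A sketch (the files on that page are also gzipped, so in practice you'd read them through gzip.open first):

```python
import struct
import numpy as np

def parse_idx_images(buf):
    """Parse the IDX image format from the page above: a big-endian
    header (magic 0x00000803, image count, rows, cols) followed by one
    unsigned byte per pixel."""
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    assert magic == 0x00000803, "not an IDX image file"
    return np.frombuffer(buf, dtype=np.uint8, offset=16).reshape(n, rows, cols)

# demo on a synthetic file: header for two 28x28 images plus zero pixels
fake = struct.pack(">IIII", 0x00000803, 2, 28, 28) + bytes(2 * 28 * 28)
images = parse_idx_images(fake)
print(images.shape)  # (2, 28, 28)
```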


I recently parsed that in C#. It's simple enough to do, but the format is weird. It would be much simpler if the digits were stored as individual raw files in folders corresponding to each label.


Larger problem: MNIST, images, and speech are probably low-hanging fruit, with rich features and large, balanced datasets, for which off-the-shelf packages work very well today.


Data cleaning is always a problem in statistics, so much so that most data scientists say that most of their time is spent on it.


I'm totally new to TensorFlow and ML in general, but I've been curious about how this could fit into a system.

Say you need a CNN text classifier algorithm to categorize simple single page documents. So you set one up via TensorFlow, train it with a big dataset, and get it outputting categories with decent accuracy. Could you then use some type of API and query it from a (low traffic) production web app?

Or is it more for the research phase rather than real-time interaction?



Yes, you could totally do that. Consider that the Google Translate app packages its trained network into the app itself, so it can do offline translation directly on the phone.
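To make the serving side concrete, here is a minimal sketch using only the Python standard library. `classify` is a hypothetical rule-based stub standing in for a real model; in production it would feed the text through a TensorFlow graph restored from a checkpoint:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify(text):
    """Stand-in for the trained model (illustration only)."""
    return "invoice" if "total due" in text.lower() else "other"

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.dumps(
            {"category": classify(body.decode("utf-8"))}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

# serve on an ephemeral port and query it once from the same process
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
req = urllib.request.Request("http://127.0.0.1:%d/" % server.server_port,
                             data=b"Total due: $42", method="POST")
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)  # {'category': 'invoice'}
```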


I am looking for a TensorFlow resource that shows how to classify images of a certain kind. For example: given 100 images, find the ones that might contain a soccer ball. I haven't come across such a learning resource with TensorFlow. Has anyone else?


Why don't you just start with the CIFAR-10 example and go from there? https://www.tensorflow.org/versions/r0.8/tutorials/deep_cnn/...


Can we get a mod fix on the title?


This is not for beginners. Without simple examples and example datasets, the concepts remain largely inaccessible to a real beginner.


Irrelevant comment: Title 'consise' -> 'concise'


Thanks for this! I recently ported something over from Theano to TensorFlow (I wrote it up at [1]) and I have to admit I generally enjoyed the experience a great deal, even if the single-GPU performance wasn't good enough to make me switch. The TFLearn library (especially [2]) looks very compelling for prototyping, however, so I'm very excited to see how the project develops.

[1] https://medium.com/@sentimentron/faceoff-theano-vs-tensorflo...

[2] https://github.com/tflearn/tflearn/blob/master/examples/nlp/...


This is missing a FizzBuzz example.


That's an advanced use case.


Can someone point me in the right direction for using a neural network for prediction (the input will be time-series data)? All the beginner's tutorials I have seen deal with classification rather than prediction.


LSTM
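Whatever the model, a forecasting setup needs the series reshaped into supervised (input window, next value) pairs first. A minimal sketch of that step, assuming a plain 1-D NumPy series and a hypothetical `make_windows` helper:

```python
import numpy as np

def make_windows(series, window, horizon=1):
    """Turn a 1-D series into (inputs, targets) pairs for a predictive
    model such as an LSTM: each input is `window` consecutive values,
    each target is the value `horizon` steps past the window."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=np.float32)
X, y = make_windows(series, window=3)
print(X.shape, y[:3])  # (7, 3) [3. 4. 5.]
```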


Thank you!


curated list of dedicated resources: https://github.com/jtoy/awesome-tensorflow


That looks helpful. In case you want something shorter to look through here's a very concise set of notes I took for myself when I first played with TF:

https://gist.github.com/d136o/4c68d010ecb0abfc7c52


Great examples! I find them more accessible than the official tutorial.


Anyone know of any benchmarks for TensorFlow that provide a baseline estimate of the expected performance on some off-the-shelf hardware?


These are the benchmarks maintained by Soumith Chintala (Facebook AI Research):

https://github.com/soumith/convnet-benchmarks

They place TensorFlow performance on par with Torch (within 10%).


Typo: it should be concise with two c's and one s.


In the most compact way possible:

TensorFlow is like NumPy, except that it can work symbolically and can use your GPU in a highly parallel manner. For those two reasons, it is vastly superior to NumPy for tasks like deep learning / machine learning.
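A toy illustration of that eager-vs-symbolic distinction (pure Python, not the actual TensorFlow API):

```python
import numpy as np

# Eager, NumPy style: every line computes a result immediately.
a = np.array([1.0, 2.0])
eager = (a * 3.0).sum()

# Symbolic style: first build a description of the computation, then
# run it later with concrete inputs. TensorFlow's graph works on this
# principle, which is what lets it schedule the work onto a GPU.
graph = lambda x: (x * 3.0).sum()        # nothing computed yet
symbolic = graph(np.array([1.0, 2.0]))   # the "session run" step

print(eager, symbolic)  # 9.0 9.0
```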

Theano is another symbolic numerical library, one that predates TensorFlow, though TensorFlow has seemingly gained more popularity and broader trust.



