I find it so aggravating that nearly every last ML framework documents their CNN libraries in terms of canned MNIST datasets imported from the library in a preprocessed form.
It's always left as a useless exercise for the reader to divine how to generate such a dataset from his/her own data.
Examples should be way more general. The starting point shouldn't be:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
Instead, it should start with "here is a directory of images and their classes" and end with a CNN model.
EDIT: Should anyone have any insight as to where I might get such a tutorial (or have the desire to write one), I know a herd of ML pre-initiates that would be grateful.
That's typically going to be problem-specific. A simple cat/dog tutorial would be reasonably easy with many of the existing libraries like scikit-image/PIL.
The other problem here, though, is corpus layout. ImageNet and the academic datasets typically require special readers, but even then, for real datasets there are a few ways to lay out the corpus. In our experience, two layouts have worked well: a folder per label, or labels encoded in the filename.
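For the folder-per-label layout, loading a corpus into numpy arrays is only a few lines with PIL. A minimal sketch (the root path, image size, and 0-1 scaling here are just illustrative assumptions):

import os
import numpy as np
from PIL import Image

def load_folder_per_label(root, size=(64, 64)):
    # Assumes a layout like root/cat/*.jpg, root/dog/*.jpg, ...
    images, labels = [], []
    classes = sorted(os.listdir(root))
    for idx, cls in enumerate(classes):
        cls_dir = os.path.join(root, cls)
        for fname in sorted(os.listdir(cls_dir)):
            img = Image.open(os.path.join(cls_dir, fname)).convert('RGB')
            images.append(np.asarray(img.resize(size), dtype=np.float32) / 255.0)
            labels.append(idx)
    return np.stack(images), np.array(labels), classes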
Then there's still the problem of balanced minibatches: images segmented by folder don't naturally yield balanced minibatches when you read them in order.
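One way around that is to sample per class rather than read the folders in order. A rough sketch (assumes integer class labels; sampling with replacement papers over small classes):

import numpy as np

def balanced_minibatch(images, labels, batch_size, rng=np.random):
    # Draw (roughly) batch_size / n_classes examples from each class.
    classes = np.unique(labels)
    per_class = batch_size // len(classes)
    idx = np.concatenate([
        rng.choice(np.where(labels == c)[0], per_class, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)
    return images[idx], labels[idx]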
Then there's disk to think about: do I really want to re-run the same preprocessing every time I train? And how do I explain that to a beginner?
So I'm probably going to want a corpus generator that ends up with pre-saved, balanced minibatches for training... which leads us back to what you see now.
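Concretely, that corpus generator could just be a one-time pass that writes the balanced minibatches out as npz files, so the training loop loads them back instead of redoing the work. A sketch reusing the balanced_minibatch helper above (the file naming is arbitrary):

import numpy as np

def save_batches(images, labels, batch_size, n_batches, out_prefix):
    # Preprocess once, write each minibatch to disk.
    for i in range(n_batches):
        x, y = balanced_minibatch(images, labels, batch_size)
        np.savez('{}_{:05d}.npz'.format(out_prefix, i), x=x, y=y)

def load_batch(path):
    # The training loop just reads these back, no preprocessing needed.
    with np.load(path) as f:
        return f['x'], f['y']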
A good middle ground here might be a corpus generator that takes all the minor stuff like that into consideration... but still, data is messy.
I would suggest looking at the wealth of imaging libraries out there in python and building something based on a "from scratch" image corpus.
You may not use Java, but the idea of "vectorization" is still a good one; I think any ML practitioner who's touched pandas could appreciate it. We built an abstraction called a DataSetIterator which automagically returns the batches for people, so they don't have to think about the details while still having access to "real" data. I'm not sure what the Python equivalent of this would be, though.
A good exercise would be to figure out how to extract the data and put it into a numpy array. Then you can test on most - if not all - of the frameworks.
I recently parsed that in C#. It's simple enough to do, but the format is weird. It would be much simpler if the digits were stored as individual raw files in folders corresponding to each label.
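For anyone who wants to do the same in Python: the IDX format is just a big-endian header followed by raw bytes, so struct and numpy cover it. A sketch, assuming the gzipped files as distributed:

import gzip
import struct
import numpy as np

def read_idx_images(path):
    with gzip.open(path, 'rb') as f:
        # Header: magic number (2051), count, rows, cols, all big-endian uint32.
        magic, n, rows, cols = struct.unpack('>IIII', f.read(16))
        assert magic == 2051, 'not an IDX image file'
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

def read_idx_labels(path):
    with gzip.open(path, 'rb') as f:
        # Header: magic number (2049), count.
        magic, n = struct.unpack('>II', f.read(8))
        assert magic == 2049, 'not an IDX label file'
        return np.frombuffer(f.read(), dtype=np.uint8)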
Larger problem: MNIST, images, and speech are probably low-hanging fruit, with rich features, large balanced datasets, etc., for which off-the-shelf packages work very well today.
I'm totally new to TensorFlow and ML in general, but I've been curious about how this could fit into a system.
Say you need a CNN text classifier to categorize simple single-page documents. So you set one up with TensorFlow, train it on a big dataset, and get it outputting categories with decent accuracy. Could you then expose it behind some kind of API and query it from a (low-traffic) production web app?
Or is it more for the research phase rather than real-time interaction?
Yes, you could totally do that. Consider that the Google Translate app packages its trained network into the app itself, so it can do offline translation directly on the phone.
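For the web-app case specifically, a small HTTP wrapper is usually enough at low traffic. A hedged sketch with Flask (the tf.keras save format, model path, label names, and the assumption that the client sends pre-vectorized features are all placeholders; in practice you'd apply the same preprocessing used in training):

import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained model once at startup, not per request.
model = tf.keras.models.load_model('my_text_classifier.h5')
labels = ['invoice', 'contract', 'letter']  # hypothetical categories

@app.route('/classify', methods=['POST'])
def classify():
    features = np.array(request.get_json()['features'], ndmin=2)
    probs = model.predict(features)[0]
    return jsonify({'category': labels[int(np.argmax(probs))],
                    'confidence': float(np.max(probs))})

if __name__ == '__main__':
    app.run(port=5000)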
I am looking for a TensorFlow resource that shows how to classify images of a certain kind. For example: given 100 images, find the ones that might contain a soccer ball. I haven't come across such a learning resource for TensorFlow. Has anyone else found one?
Thanks for this! I recently ported something over from Theano to TensorFlow (I wrote it up at [1]) and I have to admit I generally enjoyed the experience a great deal, even if the single-GPU performance wasn't good enough to make me switch. The TFLearn library (especially [2]) looks very compelling for prototyping, however, so I'm very excited to see how the project develops.
Can someone point me in the right direction for using a neural network for prediction? (The input will be time series data.) All the beginner tutorials I have seen deal with classification, not prediction.
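In case it helps, the usual trick is to reframe prediction as supervised learning by windowing the series: predict the next value from the previous n. A tiny sketch (the lag count is arbitrary):

import numpy as np

def make_windows(series, n_lags):
    # Predict series[t] from the previous n_lags values.
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.asarray(series[n_lags:])
    return X, y

# e.g. with n_lags=3: X[0] = series[0:3], y[0] = series[3]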
That looks helpful. In case you want something shorter to look through here's a very concise set of notes I took for myself when I first played with TF:
TensorFlow is like NumPy, only it is capable of working symbolically and can work more easily with your GPU in a highly parallel manner. For those two reasons, it is vastly superior to NumPy for tasks like deep learning / machine learning.
Theano is another symbolic numerical library that predates TensorFlow, though TensorFlow has seemingly gained more popularity and broader trust.
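To make the "symbolic" point concrete, here's the contrast in a few lines (assumes the TF 1.x graph/session API):

import numpy as np
import tensorflow as tf

# NumPy evaluates eagerly:
a = np.array([1.0, 2.0])
print(a * 2 + 1)  # computed immediately

# TensorFlow builds a symbolic graph first, then runs it (possibly on a GPU):
x = tf.placeholder(tf.float32, shape=[2])
y = x * 2 + 1  # no computation yet, just a graph node
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [1.0, 2.0]}))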