Machine learning for the impatient: algorithms tuning algorithms (aelag.com)
155 points by aelaguiz on July 23, 2012 | 31 comments



As another commenter pointed out, the accuracy really needs to be evaluated using a validation set, not the test set--the approach described in the post is training with the testing data. In the field, we call this "cheating".

The basic idea of automatically tuning hyperparameters (the things this post discusses tuning with genetic algorithms) is cool, though, and is becoming a popular subject in machine learning research. A couple recent research papers on the topic are pretty readable:

Algorithms for Hyper-Parameter Optimization:

http://books.nips.cc/papers/files/nips24/NIPS2011_1385.pdf

Practical Bayesian Optimization of Machine Learning Algorithms:

http://arxiv.org/abs/1206.2944


Also, for people who are interested in the application of predicting college basketball with machine learning, there's a Google group that is worth joining:

https://groups.google.com/group/machine-march-madness


Thanks for the information! I've updated the article to reflect this.

Here's a question: where does "the field" hang out? Is there a cohesive community of any sort?


I'd say the closest thing to a cohesive community would be the MetaOptimize Q&A forum, but maybe others have other suggestions:

http://metaoptimize.com/qa


The Kaggle forums can be a good resource, and competing there is a good way to polish your skills.

Also, you might check out Random Forest algorithms--they're high performance but still very beginner friendly, as there aren't many parameters to tune. There's a nice implementation in the excellent scikits-learn python library.
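
For the curious, a minimal sketch of what that looks like with the scikit-learn API (X and y here are placeholder arrays of features and labels, not anything from the article):

    # Minimal sketch: a random forest with scikit-learn.
    # X is an (n_samples, n_features) array, y the labels; both are placeholders.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    clf = RandomForestClassifier(n_estimators=100)  # very few knobs to worry about
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))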


Hmm, so gaussian distributions are easy to use and ubiquitous and all (they're the basis functions used in SVM), except that I don't see any reason for them to be priors here. But since 2012 > 2008 I feel like I'm obligated (and I'm semi-trolling) to point out the obvious about lazy assumptions based on "flexibility and tractability", which is that they can implode hilariously in your face. C.f. the financial crisis.

[1] http://econometricsense.blogspot.com/2011/03/copula-function...


I think you're confusing the Gaussian "process" used in Bayesian optimization with a standard Gaussian distribution. They are very different things - as are Gaussian copulas and what is referred to as the 'Gaussian kernel' (which is not actually a distribution at all) in the SVM. The Gaussian process is a distribution over functions, the properties of which are governed by the covariance function - so the prior over the function, or the assumption about its complexity and form, is determined by the choice of covariance function. Of course it is very important to choose a prior that corresponds to the functional form you are interested in, which is actually discussed and empirically validated in the literature referred to in that post. It's a bit ironic that you are claiming to point out the dangers of making lazy assumptions by doing exactly that.


Such aggressive usage of the test data set in determining the tuning parameters in effect makes your test data set part of your training data set.

The more times you go back to your test data set to evaluate the effectiveness of a model, the more optimistic your error estimates will be and the greater your chance of overfitting. Several iterations of his loop will probably improve the model, but if you keep repeating it, eventually the true model performance will start to degrade.


Question: Why is the data for machine learning split into a training dataset and a test dataset? Wouldn't using the entire dataset to build the model result in greater accuracy of the predictions?


When you develop a model, for instance when implementing a classifier, you presumably want to apply it to other data, i.e., data you don't have available during development.

In many situations, it doesn't make sense to wait until your model is making or influencing decisions in the real world before testing it (although you have to test in the real world too). You'll want to test its predictions on data you already have the actual results for. To do that, you split your data into data you know and will let the model see (the training dataset), and data you know but the model can't see (the test dataset). That way you can use the data the model doesn't know about to run controlled experiments and compare models (and, if your data is really representative of the real world, your model comparison and the performance of your chosen model will hold).

The moral of the story is: if you don't split your data, you won't have any idea of how your model performs in the real world; you'll only know how it performs on data it already knew about.
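
To make that concrete, here's a minimal sketch of the split itself, assuming numpy arrays X (features) and y (known outcomes) and some scikit-learn-style model object - all placeholders:

    # Hold out part of the data the model never gets to see during training.
    import numpy as np

    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))      # shuffle so the split is random
    cut = int(0.8 * len(X))            # e.g. 80% training, 20% test
    train_idx, test_idx = idx[:cut], idx[cut:]

    model.fit(X[train_idx], y[train_idx])             # the model only sees the training data
    accuracy = model.score(X[test_idx], y[test_idx])  # evaluated on data it has never seen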


I understand this is the way it is done, however, it all feels a bit hand wavy and not how my gut tells me to do it.

I'd get the entire dataset, set columns as variables, assign a weight to each of these variables and step each weight in 0.1 increments (so the final number of passes is 11^n where n is the number of variables). I'd have something at the end of each row to mark what was predicted right (+1) and what was predicted wrong (-1), and sum this column. Hit run until optimal weights for each variable are found. I'd use the entire dataset to do this.

Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?


> I understand this is the way it is done, however, it all feels a bit hand wavy and not how my gut tells me to do it.

I know what you mean, I felt the same way when I first learned about training and test sets. However...

> Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?

Indeed there is - the proper name for the statistical technique is "cross-validation" and much work has been done on the matter. The most common way to do it is "k-fold cross-validation", which, when k=2, simply means splitting your data in half randomly and using one half for training and the other for test - basically what the author did. Generalized k-fold cross-validation means that you slice up your data into k subsets, pick one to use as a test set and use the rest as a training set. Then repeat the process ("folding") k times, picking a different test set every time and averaging the results. This, of course, takes longer, but it has a better gut feel because your algorithm is "learning" from all the data rather than just whatever slice you originally chose as your training set. Wikipedia lists a few more common ways to do it:

http://en.wikipedia.org/wiki/Cross-validation_(statistics)#C...
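
For illustration, a minimal k-fold sketch using scikit-learn (X, y and model are placeholders):

    # k-fold cross-validation: train on k-1 folds, test on the held-out fold,
    # rotate through all k folds and average the scores.
    import numpy as np
    from sklearn.model_selection import KFold

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    print("mean accuracy:", np.mean(scores))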

However, since the author's process involves tuning parameters after testing the algorithm against the test set, I'm not really sure how cross-validation would work... Maybe standard k-fold with three sets instead of two - a training set, an intermediate test set, and a true test set?


> Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?

In my own field, Natural Language Processing (NLP), it is either up to the original creators of a dataset or you do your own split if there isn't one established already. I'll go with what I have learned for supervised learning.

In an ideal world all three sets - training, development (the Machine Learning people sometimes call this one verification if I remember correctly) and test - should be infinitely large. Also, you should preferably not stratify or try to make the assignment anything but random (there are cases where this could be justified, but let's not go there just yet).

I personally go for 3/6 train, 1/6 development and 2/6 test, but I have just as well seen 2/4, 1/4 and 1/4, etc. Training is essential, so it gets the biggest cut; testing is important too, so it also gets a large chunk; development is the least important of the three, so it gets the smallest one. In short, train is for making sure your algorithm can learn something, development is there to guide your development so you don't fool yourself, and test is there so you can back up the claims you make to other people (thus, it is pretty darn important).

I then use the train and development sets when constructing the model, do the write-up of most of my results, and then generate the final results by running on the test set only once with the model that performed best on development. What you usually see is a drop in performance, but this is expected since you have most likely overfitted the development set. Since the hyperparameters need to be adjusted as well, I commonly do ten-fold cross-validation on the test set and use some variant of grid search (I read yesterday that this approach to hyperparameters is coming under fire as being naive, so I need to have a look at what has been going on in ML for the last two years).
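
To make that kind of workflow concrete, here's a rough sketch using scikit-learn's GridSearchCV (X and y are placeholders), with the cross-validated grid search running on the training data and the held-out test set touched exactly once; the SVM and the parameter grid are just illustrative choices:

    # Grid search over hyperparameters with 10-fold CV on the training data,
    # then a single final evaluation on the held-out test set.
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(), param_grid, cv=10)
    search.fit(X_train, y_train)   # cross-validation happens inside, on training data only

    print("best params:", search.best_params_)
    print("final test accuracy:", search.score(X_test, y_test))  # touched exactly once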


> I commonly do ten-fold cross-validation on the test set...

Um, darn, edit period ran out, test set should obviously be train set in the above quote. Otherwise my PI would probably smack me in the face for overfitting the test set.


Another way to say what you're proposing is to use a linear predictor, and to train it by globally optimizing 0-1 loss (just the number of mistakes that you make) on the full set of data that you have. Even ignoring computational issues (this can be shown to be NP-hard), you seem to be making several mistaken assumptions. I'd really recommend reading some basic stuff on generalization, but a couple of them are as follows:

1. That the particular input features that you've chosen are somehow the only possible choice. But who's to say that you shouldn't add new features which are the square of each original feature? Or maybe some cross product terms, like the product of the ith feature times the jth feature. Or maybe some good features to add would be the distance from each point you've seen so far. Etc. Continuing down this path, you basically get to the question discussed in the OP about choosing a kernel for SVMs. This is just one example where hyperparameters come into play, and you need some method for choosing them.

2. That a linear predictor is impervious to overfitting. Consider the extreme case (which comes up often) where you have millions or billions of features and far fewer examples (e.g., if features are n-gram occurrences in text, or gene expression data). Then it's likely that there are many settings of weights that fit the data perfectly, but there's no way to tell if you're just picking up on statistical noise, or if you've learned something that will make good predictions on new data that you encounter. In both theory and practice, you need some form of regularization, and along with this comes more hyperparameters, which need to be chosen.

Finally, by your reasoning, it seems like you would always choose a 1-nearest neighbors classifier [1] (because it will always end up with 0 error under the setting you propose). But there's no reason why this is in general a good idea.

[1] http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
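
To illustrate points 1 and 2 above, a minimal scikit-learn sketch of expanding the feature set and fitting a regularized linear classifier (X and y are placeholders; note that the regularization strength is itself a hyperparameter that has to be chosen somehow, e.g. by cross-validation):

    # Feature expansion: add squared terms and cross-product terms x_i * x_j.
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    X_poly = PolynomialFeatures(degree=2).fit_transform(X)

    # Regularized linear classifier; C is the inverse regularization strength,
    # i.e. yet another hyperparameter to pick.
    clf = LogisticRegression(C=1.0, max_iter=1000)
    clf.fit(X_poly, y)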


> Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?

I think you're thinking about this wrong. You need to set aside some of your data for testing/cross-validation, otherwise you have no way of knowing even if you have enough data to train your model! i.e. you want to get to the point where additional training data isn't getting you a better model, and if you're not at that point you should probably collect more data.


It seems I'm thinking of a prediction algorithm and being discussed is a genetic algorithm, for which I can understand the necessity of a training set and a test set.

Either way, my knowledge on machine learning is lacking, time to hit up Google.


You might be interested in Andrew Ng's ML course via Coursera.


Solomonoff induction is a principled but uncomputable way to do without a test set. There might be a useful approximation to it for your problem.


For this I actually had sufficient data available that adding more didn't do much.

You need a testing dataset in order to validate the performance of the model. If you validate against the training set really what you're doing is measuring the model's ability to fit the training data - which it will be able to do with high accuracy. That will, however, result in a much diminished ability to predict any new games - as it isn't "learning" the features of college basketball as much as it is memorizing the contents of the training set.

As others pointed out, it'd probably even be better to add a third grouping which is tested against only after the algorithm has finished - as an objective validation against as of yet unseen data.


To prevent overfitting. A common technique is to use cross-validation, where you train on some percentage of the data (e.g. 90%) and test on the remaining (10%), and cycle through which 10% you use as test (each train/test split is called a "fold"). Once you've identified a model that resists overfitting, you can train on all of the data.

edit: As noted in the other reply, having a truly blind validation set is still ideal.


Suppose that I have 2 possible models that I might try to build. I can train them both on my data set, but how do I get a sense as to which is better?

When you have the split, you can then tell which idea seems to work better.


Yes. Training on the test set overestimates performance, cross validation underestimates it.


Yep, I've definitely seen that. I bet if I were to reserve a portion of the dataset for validation after all tuning, I'd get a much better measure of its actual ability at prediction. Using it in the way that I did, I wouldn't be surprised if there was significant overfitting. In fact I'd be surprised if there wasn't.


You can split it into training, validation, and test sets if you have enough data. Then, the validation set would be used for the training. Presumably by using a genetic algorithm, he isn't going for a perfect fit, which may help avoid overfitting.


I had a few questions about the actual implementation of this stuff. I took the coursera ML course, so much of the terminology and techniques are familiar after that, but Professor Ng structured the exercises in the course around Matlab/Octave and suggested using one of these tools for a first-pass solution when implementing machine learning problems.

Have you used Matlab much in your work? How does the performance (and libraries available) compare with Python?

Also, does anyone know a good resource for finding good, high performance ML libraries for other languages (Ruby, C++, etc.)?


Matlab and Python + NumPy + SciPy + matplotlib give the same kind of performance for the same level of features and the same ability to do quick exploratory prototyping with high-level, vector-based data structures.

However, one is free (beer and freedom) and multipurpose (you can preprocess string data and expose your model with an HTTP API) while the other is much less so.

In Python + NumPy you should always profile, and if the bottleneck is in interpreted Python code (rather than a low-level NumPy call) you can always rewrite the offending Python loop in Cython.
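
For example, a minimal profiling sketch (train_model, X and y are placeholders for whatever code is actually slow):

    # Profile to see whether time goes into interpreted Python loops or into
    # low-level numpy calls, before reaching for Cython.
    import cProfile
    cProfile.run("train_model(X, y)", sort="cumulative")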


Scikit-learn for Python, great tutorials/user guide covering various ML techniques, makes prototyping very easy: http://scikit-learn.org/stable/


I used to prototype with Octave but I actually find I'm much faster in Python. I push my tests to EC2 and just use cProfile to make sure I'm not doing anything silly beforehand. I parallelize with celery, but I've also used pp with some limited success. Performance is sufficient for my needs currently but not for production or real-time.


One of the best-known frameworks for developing and using (traditional) machine learning algorithms is Weka [1]. It is written in Java and actively developed by the machine learning group at the University of Waikato. They also have a forum for discussion, which should be a good resource given that a lot of their users are actually researchers.

The very same group is also developing another framework (also in Java), MOA [2], that is oriented towards "big data": its algorithms are specifically tailored for huge datasets and/or real-time (streaming) machine learning.

[1] http://www.cs.waikato.ac.nz/ml/weka/

[2] http://moa.cs.waikato.ac.nz/


This article is very interesting but I am really confused about the edit:

"A few people have pointed out that using the testing set for tuning demands that final measure of effectiveness be doing using a validation test set which is not part of either the training or testing datasets. This is due for the very real potential of over fitting. Also – apparently this technique is called “Hyper-Parameter Optimization.” A helpful commenter over at Hacker News supplied the following resources"

Does that mean there is definitely overfitting going on, but it is acceptable for the purposes of the article? That 82.5% accuracy rate has overfitting written all over it.

Good stuff regardless!



