Oldie paper but still a goodie. Some other algorithms to consider... Random Forests, Gradient Boosted Machines (similar to AdaBoost), Logistic Regression, Neural Networks.
Random Forests are interesting because they are designed to make many, many bad decisions that cancel each other out, leaving only the good decisions behind.
To oversimplify, each of the many, many decision trees is going to do a poor job building a classifier. Each tree only uses a subset of the data and a subset of the features. So, the results of each tree won't generalize well (and may not even do well classifying the data it was trained on). But, for each tree that misclassifies in one direction, another tree will misclassify in the other direction. The theory (which appears to be validated in practice) is that poor decisions cancel each other out but good decisions don't get canceled out.
Although, it isn't quite as simple as a "good" tree and a "bad" tree. All of the decision trees are bad, but there should randomly be good parts in the bad trees. Each tree is going to be "mostly crap" (as Jeremy Howard puts it). But "mostly crap" means "a little good." The crap in one tree cancels out the crap in another tree. The good parts are left behind... and... like magic... you have a good classifier that is fast, although somewhat of a black box.
Jeremy Howard from Kaggle is a big believer, and he's where most of my knowledge of decision tree ensemble algorithms comes from: https://www.kaggle.com/wiki/RandomForests
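To make that concrete, here's a rough sketch of the idea (not from Jeremy's material or the paper, just an illustration with scikit-learn on a made-up dataset): each tree gets a bootstrap sample of the rows and considers only a random subset of features at each split, and the majority vote ends up noticeably better than the average individual tree.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    trees = []
    for _ in range(200):
        # each tree sees a bootstrap sample of the rows...
        idx = rng.integers(0, len(X_train), len(X_train))
        # ...and only considers a random subset of features at each split
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31 - 1)))
        tree.fit(X_train[idx], y_train[idx])
        trees.append(tree)

    # individually "mostly crap", collectively decent
    avg_single = np.mean([t.score(X_test, y_test) for t in trees])
    votes = np.mean([t.predict(X_test) for t in trees], axis=0) > 0.5
    print("average single-tree accuracy:", round(avg_single, 3))
    print("majority-vote accuracy:      ", round(float(np.mean(votes == y_test)), 3))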
They were one of the first widespread ensemble methods (based on bagged decision trees), and share with other bagging/boosting methods fairly good empirical performance on a wide range of problems.
You will find no understanding in the grown trees -- by definition, each tree sees a different angle of the data, and each node considers a different subset of the available features.
You can, however, calculate importance scores for the features used. Breiman's original paper gives a good algorithm for doing this (in short: for each tree, permute the values of one feature in its out-of-bag sample and see how much worse the tree does).
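For what it's worth, here's a quick sketch of that idea. It uses scikit-learn's permutation_importance helper on a held-out set rather than Breiman's per-tree out-of-bag version, and assumes the toy X_train/X_test data from the sketch above.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    forest.fit(X_train, y_train)

    # shuffle one feature at a time and measure how much the score drops
    result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
    for i in np.argsort(result.importances_mean)[::-1][:5]:
        print(f"feature {i:2d}: mean importance {result.importances_mean[i]:.3f}")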
Depends on the number of trees and their depth. Even with 20 trees it could, with some effort, be written out on paper as rules, for example. And it helps if most of the trees are only depth 3 or so.
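As a sketch of what that looks like in practice (scikit-learn's export_text, toy data from the earlier sketch, names illustrative): a small, shallow forest can literally be dumped as if/else rules.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_text

    small_forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
    small_forest.fit(X_train, y_train)

    feature_names = [f"x{j}" for j in range(X_train.shape[1])]
    # print the first few trees as human-readable rules
    for i, tree in enumerate(small_forest.estimators_[:3]):
        print(f"--- tree {i} ---")
        print(export_text(tree, feature_names=feature_names))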
But then, some models are pretty much impossible to understand or interpret.
A random forest builds a diversified portfolio of decision trees.
To the extent that the errors of the individual trees are uncorrelated, the random forest keeps roughly the same expected error as a single tree while reducing the single tree's high error variance. This is the same reason a mutual fund is better than a single stock -- on average they have the same return, but the single stock has far more variance.
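To put a rough formula behind that analogy (a standard textbook decomposition, not something from the thread): if each tree's error has variance sigma^2 and the trees' errors have pairwise correlation rho, then the variance of the average of B trees is

    \operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\right)
        = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

Adding more trees drives the second term toward zero, and the bootstrap sampling plus random feature subsets exist precisely to keep rho small so the first term stays small too.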