
Oldie paper but still a goodie. Some other algorithms to consider: Random Forests, Gradient Boosted Machines (similar to AdaBoost), Logistic Regression, Neural Networks.



What's so good about random forests? I seem to be hearing it mentioned a lot lately.


Random Forests are interesting because they are designed to make many, many bad decisions that cancel each other out, leaving only the good decisions behind.

To oversimplify, each of the many, many decision trees is going to do a poor job building a classifier. Each tree only uses a subset of the data and a subset of the features. So, the results of each tree won't generalize well (and may not even do well classifying the data it was trained on). But, for each tree that misclassifies in one direction, another tree will misclassify in the other direction. The theory (which appears to be validated in practice) is that poor decisions cancel each other out but good decisions don't get canceled out.

Although, it isn't quite as simple as a "good" tree and a "bad" tree. All decision trees are bad, but there should randomly be good parts in the bad trees. Each tree is going to be "mostly crap" (as Jeremy Howard puts it). But "mostly crap" means "a little good." The crap in one tree cancels out the crap in another tree. The good parts are left behind... and... like magic... you have a good classifier that is fast, although somewhat of a black box.
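A minimal sketch of that idea, assuming scikit-learn and a synthetic dataset (the per-tree feature subsets here are a simplification; real random forests pick a random feature subset at every split):

  # Toy bagging: each shallow tree sees a bootstrap sample of rows and a
  # random subset of columns, then the "mostly crap" trees vote.
  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.tree import DecisionTreeClassifier

  X, y = make_classification(n_samples=500, n_features=20, random_state=0)
  rng = np.random.default_rng(0)

  n_trees, n_feats = 100, 5
  trees, subsets = [], []
  for _ in range(n_trees):
      rows = rng.integers(0, len(X), len(X))                  # bootstrap sample of rows
      cols = rng.choice(X.shape[1], n_feats, replace=False)   # random feature subset
      t = DecisionTreeClassifier(max_depth=3).fit(X[np.ix_(rows, cols)], y[rows])
      trees.append(t)
      subsets.append(cols)

  # Majority vote: each tree alone is weak, the ensemble usually isn't.
  votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, subsets)])
  print("ensemble accuracy:", ((votes.mean(axis=0) > 0.5).astype(int) == y).mean())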

Jeremy Howard from Kaggle is a big believer, and his material is where most of my knowledge of Decision Tree Ensemble algorithms comes from: https://www.kaggle.com/wiki/RandomForests


They were one of the first widespread ensemble methods (based on bagged decision trees), and share with other bagging/boosting methods fairly good empirical performance on a wide range of problems.


Are they easy to interpret? Is it ok if I have a lot of columns in my data?


You will find no understanding in the grown trees-- by definition, they're each seeing a different angle of the data, and each node a different subset of available features.

You can, however, calculate importance scores for the features used. Brennan's original paper gives a good algorithm for doing this (in short: for each tree, permute the values of one feature in its out-of-bag sample and measure how much worse the tree does.)
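Rough sketch of that importance calculation, simplified to use a single held-out set instead of each tree's out-of-bag sample (scikit-learn assumed; the dataset is synthetic):

  # Permutation importance: shuffle one feature at a time and measure how
  # much accuracy drops relative to the unshuffled baseline.
  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

  rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
  baseline = rf.score(X_te, y_te)

  rng = np.random.default_rng(0)
  for j in range(X_te.shape[1]):
      X_perm = X_te.copy()
      X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the feature/label link
      print(f"feature {j}: importance ~ {baseline - rf.score(X_perm, y_te):.3f}")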


(Pardon me/blame autocorrect: Breiman.)


Depends on the number of trees and their depth. Even with 20 trees you could, with effort, write them out on paper as rules (see the sketch below), and it helps if most of the trees are only depth 3 or so.

But then, some models are pretty much impossible to understand or interpret.
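
For what it's worth, dumping a small, shallow forest as rules is straightforward with scikit-learn's export_text (toy data here, just to show the shape of the output):

  # Dump each tree of a 20-tree, depth-3 forest as if/else rules.
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.tree import export_text

  data = load_iris()
  rf = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
  rf.fit(data.data, data.target)

  for i, tree in enumerate(rf.estimators_):
      print(f"--- tree {i} ---")
      print(export_text(tree, feature_names=list(data.feature_names)))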


They handle high-dimensional spaces fairly well, which is one of their strengths. They are not all that interpretable, though.


Crudely: they combine the speed of classification trees with the generally low error rates (flexibility) of an SVM.


There is a good implementation in Mahout and it scales well on Hadoop, so it's a pretty good out-of-the-box whole-data-set learning algorithm.

No advantage over taking a decent, statistically valid subset and running C4.5 (or a variant with floating-point splits), apart from the fact that it's less work.


A random forest builds a diversified portfolio of decision trees.

To the extent that the errors of the individual trees are uncorrelated, the random forest keeps the same low expected error of a single tree while reducing the high error variance of a tree. This is the same reason a mutual fund is better than a single stock -- on average they have the same return, but the latter has far more variance.
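
A quick numerical check of that variance argument (a pure numpy toy, not a real model): average n uncorrelated noisy estimates of the same quantity and the mean stays put while the variance shrinks by roughly 1/n.

  import numpy as np

  rng = np.random.default_rng(0)
  n_estimators, n_trials = 100, 10_000

  # Each row is one "ensemble" of 100 independent noisy estimates of 1.0.
  estimates = 1.0 + rng.normal(0, 1, size=(n_trials, n_estimators))

  single = estimates[:, 0]            # one tree / one stock
  averaged = estimates.mean(axis=1)   # the forest / the mutual fund

  print("means:    ", single.mean(), averaged.mean())   # both ~1.0
  print("variances:", single.var(), averaged.var())     # ~1.0 vs ~0.01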



