Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? [pdf]

benhamner · on May 22, 2015

Random Forests are great on many tasks, but this analysis is heavily biased: it only includes the small, simple datasets in the UCI repository. Many real-world tasks are far more complex than that, especially those involving text, speech, images, video, and large-scale web data.

ely-s · on May 22, 2015

This is interesting experimental evidence in spite of the NFL theorem, which refutes the notion of a generally superior algorithm.

https://en.wikipedia.org/wiki/No_free_lunch_theorem

I would reconcile it by saying that the UCI repository contains a biased subset of all theoretically possible classification tasks.

Houshalter · on May 22, 2015

This should not surprise anyone: the no free lunch theorem assumes that every possible dataset is equally likely, i.e. that labels are effectively random. In reality, real-world problems are probably drawn from a much more structured distribution, and datasets are not totally random.
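To make that concrete, here is a toy sketch of the averaging argument. Everything in it is an assumption for illustration: an 8-point input space, a fixed 5/3 train/test split, and two hypothetical learners (`majority_learner`, `always_zero_learner`). It enumerates every possible labeling of the inputs and averages accuracy on the held-out points only.

```python
from fractions import Fraction
from itertools import product

POINTS = list(range(8))   # toy input space: 8 distinct inputs (assumption)
TRAIN = POINTS[:5]        # inputs whose labels the learner gets to see
TEST = POINTS[5:]         # held-out inputs, never seen during training


def majority_learner(train_labels):
    """Predict the most common training label for every input."""
    guess = 1 if 2 * sum(train_labels) > len(train_labels) else 0
    return lambda x: guess


def always_zero_learner(train_labels):
    """Ignore the training data and always predict 0."""
    return lambda x: 0


def mean_off_training_accuracy(learner):
    """Average held-out accuracy over all 2**8 possible labelings."""
    labelings = list(product([0, 1], repeat=len(POINTS)))
    total = Fraction(0)
    for labels in labelings:
        model = learner([labels[i] for i in TRAIN])
        correct = sum(model(x) == labels[x] for x in TEST)
        total += Fraction(correct, len(TEST))
    return total / len(labelings)


for learner in (majority_learner, always_zero_learner):
    print(learner.__name__, mean_off_training_accuracy(learner))
```

Both learners come out at exactly 1/2, because when every labeling counts equally the training labels say nothing about the held-out ones. Any benchmark where one method reliably wins is therefore evidence that the tasks are not uniformly distributed.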

gwern · on May 22, 2015

It makes its point well, but I'd like to see a followup paper addressing neural networks: given the extreme complexity of successful deep neural networks, which outperform anything he considers there on real world problems he doesn't consider, what implications can we draw?

Houshalter · on May 22, 2015

Neural networks tend to overfit very easily, which is the main reason other methods usually outperform them. They are mainly successful where they can exploit the structure of a problem, in ways other methods can't.

E.g. convolutional neural networks take advantage of local structure within images and the fact that nearby pixels are related to each other.
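A minimal pure-Python sketch of that idea, assuming a tiny hand-made image and a hypothetical two-weight edge kernel: the same small kernel is slid over every position, so each output depends only on neighbouring pixels and the number of weights does not grow with image size. (Like most deep learning libraries, this computes cross-correlation without flipping the kernel.)

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Each output value only looks at a kh x kw neighbourhood.
            sum(
                kernel[di][dj] * image[i + di][j + dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Two shared weights: respond where the right pixel is brighter than the left.
edge_kernel = [[-1, 1]]

print(conv2d_valid(image, edge_kernel))
```

The output is `[[0, 1, 0], [0, 1, 0], [0, 1, 0]]`: the kernel fires only at the edge, in every row, using the same two weights throughout.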

However, I'd really like to see the reemergence of Bayesian neural networks, which can address the overfitting problem. Also, methods like dropout are relatively new, and they alleviate overfitting far more than was possible in the past.
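For readers who haven't seen dropout, a rough sketch of the common "inverted dropout" formulation, in plain Python. The function name, signature, and the choice of the inverted variant are assumptions for illustration, not any particular library's API.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop during training.

    Survivors are scaled by 1 / (1 - p_drop) so the expected value of
    each unit is unchanged, and nothing needs rescaling at test time.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)  # evaluation mode: pass through unchanged
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)            # seeded only to make the demo repeatable
hidden = [1.0] * 10
print(dropout(hidden, 0.5, rng))                  # mix of 0.0 and 2.0
print(dropout(hidden, 0.5, rng, training=False))  # unchanged
```

Because a different random subset of units is silenced on every training step, no unit can rely on specific other units being present, which is the usual intuition for why this reduces overfitting.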

minthd · on May 22, 2015

So basically, to get 93% (on average) of the value of machine learning, you can use BigML's extremely easy interface [1], even without writing code?

[1] http://blog.bigml.com/2013/07/01/you-dont-need-coursera-to-g...

sjtrny · on May 21, 2015

Do we need hundreds of reposts?