I got into machine learning through an article off HN stating that random forests would get you 80% of the way (I think they were right!). For my purposes, rotation forest increased my accuracy considerably.
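For anyone following along, here is roughly what that 80%-of-the-way baseline looks like. This is a minimal sketch using scikit-learn with synthetic data; the dataset and parameters are illustrative, not from my actual project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An off-the-shelf random forest: little tuning needed for a decent baseline.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy
```

Rotation forest isn't in scikit-learn, but the workflow is the same: fit, score on held-out data, compare.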
I have a few questions:
1. I have found that data manipulation and feature creation from a SQL database are harder than actually using an algorithm, and knowing how to extract and aggregate data felt more like "throw something at the wall and see what sticks." Do you have any suggestions or information on how to extract the best data?
2. After getting a random forest going, I had a hard time figuring out which algorithm to try next, or how to tell what would work best for my dataset. Any suggestions on how to take the next step?
1. Use what correlates best with the outcomes. Look into feature selection and principal component analysis for this. Smaller feature vectors mean less noise and more digestible outcomes. I would also highly recommend visualization. Weka is great if you want plug and play; otherwise there's the more traditional R/MATLAB. It really depends on what you're comfortable with.
2. Depends what kind of learning you're doing. For supervised classification with more than one class, I would look into multinomial logistic regression for most applications. Then there's also k-means if you're looking to understand trends in your data. Keep in mind this is my off-the-shelf/simple recommendation.
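To make point 1 concrete, here's a minimal sketch of both techniques in scikit-learn, on synthetic data (the dataset sizes are arbitrary, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# 30 raw features, only 5 of which actually carry signal.
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

# Feature selection: keep the 5 features that correlate best
# with the outcome (univariate F-test).
selected = SelectKBest(f_classif, k=5).fit_transform(X, y)

# PCA: project onto the top 5 principal components instead.
reduced = PCA(n_components=5).fit_transform(X)

print(selected.shape, reduced.shape)  # both shrink 30 features down to 5
```

Either way you end up with a smaller, less noisy feature vector to feed the classifier.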
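And for point 2, the two off-the-shelf recommendations side by side, again a hedged sketch on synthetic blobs rather than a real dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Three well-separated clusters standing in for three classes.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: logistic regression over 3 classes
# (recent scikit-learn handles the multinomial case by default).
logreg = LogisticRegression().fit(X, y)

# Unsupervised: k-means to surface trends/groupings without labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(logreg.score(X, y), len(set(km.labels_)))
```

Same data, two questions: "which class is this?" versus "what structure is in here at all?"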
I would love input on a plug-and-play machine learning CLI. I planned on building my current project into a full-blown command line app. Since it can handle most features, including automatic visualization/debugging via matplotlib, I figure that with some documentation it might be a neat tool for people who don't want to deal with feature selection but still want things simple. It's definitely a problem that there's no clear way to build simple models. Domain knowledge is also an expensive problem.
That aside, you really have to take it in bits. Ignoring the math or fundamentals behind it is by far the worst mistake you can make.
Once you get decent at understanding it, the points I emphasized (feature vector building) become much less of a problem with deep learning (http://deeplearning.net/).
Auto-learned feature vectors are going to be among the best ways to do things in the coming years. More than happy to answer questions.