I worked with this a bit at UMass; it's not bad at all. Also from UMass, be sure to check out Factorie, the probabilistic factor graph framework in Scala.
I am currently using this toolkit and I must say that I really like it.
The main advantages of Mallet over Weka (the main Java toolkit used in academic machine learning) for Natural Language Processing are:
- No need to map words and features to positions in a feature vector yourself.
- Instance preprocessing can be defined in pipes that are saved along with the models, so there is no need to remember the pre-processing steps for each experiment (see the sketch after this list).
- Contains algorithms for structured learning (CRFs, HMMs, and general graphical models).
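To make the pipes point concrete, here is a rough sketch of what importing labeled text looks like with Mallet's pipe API (written from memory, so treat the class names, constructors, file name, and regex as approximate placeholders rather than gospel):

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.types.InstanceList;
    import java.io.*;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    public class ImportDemo {
        public static void main(String[] args) throws Exception {
            // Each pipe is one preprocessing step; the word->index alphabet
            // is built automatically as instances flow through the chain.
            ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
            pipeList.add(new Target2Label());                  // class string -> label index
            pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
            pipeList.add(new TokenSequenceLowercase());
            pipeList.add(new TokenSequence2FeatureSequence());
            pipeList.add(new FeatureSequence2FeatureVector());
            Pipe pipe = new SerialPipes(pipeList);

            // Lines of the form "name label text...", one instance per line.
            InstanceList instances = new InstanceList(pipe);
            instances.addThruPipe(new CsvIterator(
                    new FileReader("train.txt"),
                    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                    3, 2, 1));                                 // data, target, name groups

            // The pipe (and its alphabets) is serialized together with the
            // instances, so the exact same preprocessing can be reused later.
            instances.save(new File("train.mallet"));
        }
    }

A model trained on those instances can then be applied to new text through the same serialized pipe, which is what the "no need to remember the pre-processing" point is about.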
On the other hand, Mallet implements fewer algorithms (e.g. no Support Vector Machines, to my knowledge).
In short, it is a nice toolkit to be aware of if you are planning to do Natural Language Processing.
Definitely depends on your math background (knowledge of analysis and linear algebra seems to be particularly helpful).
Witten's Data Mining is a very good beginner's book (has very little math, but lots of good explanations and discussion of real life issues).
Bishop's book is excellent, but it's easy to get lost if you don't have the mathematical background.
Duda, Hart & Stork's Pattern Recognition book is also very well organized and has one of the best first chapters in any machine learning book. But it too requires mathematical background to be fully appreciated.
Hastie & Tibshirani's book is written by people from a statistical background, and is very very mathematical. I haven't progressed beyond Chapter 2, and I'm working on improving my math skills before I get back into it.
--
For NLP, a very good intro is the NLTK book.
Jurafsky & Martin's book covers more NLP topics, but Manning and Schutze cover the statistical portions in more depth. I think you should just read both :D.
Simple things, like actually spelling out the steps of a derivation instead of "here's equation a, from which we trivially derive equation b", where it might take you a while to figure out the steps yourself if you haven't done a lot of calculus problems recently.
And Mitchell just seems really good at explaining things (having heard him speak in person a few times).
If you later want to drill down more into reinforcement learning specifically, check out "Reinforcement Learning: An Introduction" by Sutton & Barto, full version available online in HTML form:
I'm currently reading 'Machine Learning: An Algorithmic Perspective' by Marsland. He uses Python and assumes no higher math knowledge. I would recommend it for getting a basic, intuitive understanding. He explains things in English, then presents the math notation, then explains the math notation in English again. Then he provides Python code, and explains the Python code in English. Radical, I know.
The primary difference is that web frameworks are much better documented, with real-world examples, because they are written by people in industry trying to make their real jobs easier, not by university grad programs. The examples users write rarely make it onto the search engines, if they are released at all (see below).
Most ML frameworks out there are stuck in academic-land and assume their users are experts, when the exact opposite is usually true; they seem to use the most opaque language possible when describing usage.
ML is still a consulting gold mine because it's so difficult to wade past the jargon and bullshit to actually do something useful/profitable with these frameworks.
http://alias-i.com/lingpipe/index.html
http://gate.ac.uk/
http://rapid-i.com/content/view/181/190/
http://elefant.developer.nicta.com.au/
(tanagra, weka, orange, depending on what you're looking for)