
The truth is nltk is basically crap for real work, but there's so little NLP software that's put proper effort into documentation that nltk still gets a lot of use.

You can work your way down the vast number of nltk modules and you'll find almost none of them are useful for real work, and those that are useful ship a host of alternative algorithms that are all much worse than the current state of the art.

nltk makes most sense as a teaching tool, but even then it's mostly out of date. The chapter on "Parsing" in the nltk book doesn't even really deal with statistical parsing. The dependency parsing work referenced in this post is almost all 1-3 years old, so obviously it isn't covered either.

As an integration layer, nltk is so much more trouble than it's worth. You can use it to compute some of your scoring metrics, or read in a corpus, but...why bother?

I'm slowly putting together an alternative, where you get exactly one tokeniser, exactly one tagger, etc. All these algorithms have the same i/o, so we shouldn't ask a user to choose one. We should just provide the best one.
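To make the idea concrete, here's a rough sketch of the shape of API I mean (the names and bodies are placeholders, not an actual package): one callable per task, all with plain I/O, so there's nothing for the user to pick.

    # Hypothetical API shape: one function per task, exactly one implementation each.
    def tokenize(text):
        """Text in, list of token strings out."""
        return text.split()  # placeholder; the real library ships its single best tokeniser

    def tag(tokens):
        """Tokens in, list of (token, pos) pairs out."""
        return [(tok, 'NN') for tok in tokens]  # placeholder tagger

    print(tag(tokenize('nltk ships many taggers; this ships one.')))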

My previous post, on POS tagging, shows that nltk's POS tagger is incredibly slow and not very accurate: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-spe... . nltk scores 94% in 3m56s; my 200-line implementation scores 96.8% in 12s.
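(For the curious: the tagger in that post is an averaged perceptron. Below is a stripped-down sketch of the update rule it's built around; the class and variable names are mine, not the post's code, and feature extraction — word shape, suffixes, previous tags and so on — is omitted.)

    # Minimal averaged perceptron: the learning rule behind the 200-line tagger.
    from collections import defaultdict

    class AveragedPerceptron(object):
        def __init__(self):
            self.weights = defaultdict(lambda: defaultdict(float))   # feature -> class -> weight
            self._totals = defaultdict(lambda: defaultdict(float))   # accumulated weights, for averaging
            self._stamps = defaultdict(lambda: defaultdict(int))     # step when a weight last changed
            self.i = 0                                               # number of updates seen

        def predict(self, features, classes):
            scores = defaultdict(float)
            for f in features:
                for c, w in self.weights[f].items():
                    scores[c] += w
            return max(classes, key=lambda c: (scores[c], c))

        def update(self, truth, guess, features):
            self.i += 1
            if truth == guess:
                return
            for f in features:
                for c, delta in ((truth, 1.0), (guess, -1.0)):
                    # bank the old weight for however many steps it was in force
                    self._totals[f][c] += (self.i - self._stamps[f][c]) * self.weights[f][c]
                    self._stamps[f][c] = self.i
                    self.weights[f][c] += delta

        def average(self):
            # after training, replace each weight with its average over all steps
            for f, cw in self.weights.items():
                for c, w in cw.items():
                    cw[c] = (self._totals[f][c] + (self.i - self._stamps[f][c]) * w) / max(self.i, 1)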

I used to use nltk for tokenisation and sentence-boundary detection, but this library seems better for that: https://code.google.com/p/splitta/




As an aside, OpenCV seems to have the same problem. Widely used, and at least a standard deviation away from OK.


Totally. OpenCV was my first experience of encountering, and then having to work around, bugs in something that's supposed to "just work". Not many other options, though.


The ones I know are: http://simplecv.org and http://libccv.org/post/an-elephant-in-the-room/ (introduced on HN 2 yrs ago: https://news.ycombinator.com/item?id=4180979 )

Anyone have thoughts on how they or other libs compare to OpenCV?


OpenCV is the OpenSSL of vision libraries.


Since NLTK does enjoy some popularity, would it make sense to try to get some of your code into the project? I'm sure a lot of people could benefit from using/learning about more current algorithms.


My code is BSD licensed, so nltk can take and adapt it in any way they choose. However, I won't do any of the work to integrate it into their class hierarchies, or to write the tests and documentation in the way that they require.

I expect a successful patch request would take a lot of work. That's not a criticism; it's just how it is, given the aims, history and status of their project.


I was thinking about working through the NLTK book once I'm finished with Bishop's Pattern Recognition. Would you be able to recommend an alternative?


Check out Michael Collins's NLP course: https://www.coursera.org/course/nlangp

and his notes: http://www.cs.columbia.edu/~mcollins/

He talks about the averaged perceptron at the end (lectures on generalized log-linear models / GLMs: http://www.cs.columbia.edu/~cs4705/).

The perceptron tagger code (the hw4 solution) can be found here: https://github.com/emmadoraruth/Perceptron_Tagger.git


I think the book "Speech and Language Processing, 2nd Edition" by Daniel Jurafsky and James H. Martin is very good: http://www.amazon.com/Speech-Language-Processing-Daniel-Jura...

You can also check out the great online NLP course taught by the author and Prof. Chris Manning from Stanford: https://www.youtube.com/watch?v=nfoudtpBV68&list=PL6397E4B26...


"Dependency Parsing" by Nivre et al. was a good source for catching up from an NLP course to the state of the art: http://www.amazon.com/Dependency-Synthesis-Lectures-Language...


It's still tough to recommend that, imo. If you could choose to beam it straight into your head? Yeah, go ahead. But working through a book takes a lot of time... and it only covers dependency parsing, after which you still have to catch up on the last five years of dependency parsing work.


There's no alternative NLP book I can really recommend, no --- sorry. The field is moving too fast; the books go out of date too quickly.

You could try Hal Daumé's blog and Bob Carpenter's blog.


Here is a review [0] by Bob Carpenter of Foundations of Statistical Natural Language Processing [1].

[0] http://www.amazon.com/review/R2FUAZHGUOERHV

[1] http://www.amazon.com/Foundations-Statistical-Natural-Langua...


I know that book well. It's too dated.


Are there any feasible production-ready NLP solutions? My default approach has been to browse the codebases of different computational linguistics labs, but something more central would be very handy. As it stands, I just say something to the tune of "please don't use nltk; use OpenNLP, ClearNLP, Redshift, etc."


If you prefer C, you can try SENNA: http://ml.nec-labs.com/senna/. It includes POS tagging, chunking, NER, constituency parsing and semantic role labeling, but dependency parsing is not there yet.

It's super fast (thanks to C) and very accurate (thanks to its deep learning approach). The license doesn't allow commercial usage, though.

SENNA can be used with NLTK: http://pydoc.net/Python/nltk/2.0.2/nltk.tag.senna/
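Rough usage sketch, based on my memory of the NLTK wrapper's interface; the class name, module path and install directory may differ between NLTK versions, so treat this as illustrative and check your version's docs:

    # Illustrative only: wrap a local SENNA install (path is hypothetical)
    # and tag a pre-tokenised sentence.
    from nltk.tag.senna import SennaTagger

    tagger = SennaTagger('/usr/share/senna')   # directory where SENNA is unpacked
    print(tagger.tag('The parser is fast but the license is restrictive .'.split()))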


ClearNLP looks the best imo, especially if you're already using Java. If you're using Python... well, Redshift isn't production-ready, but if you needed to, it'd be the best thing to base your work on.

ClearNLP has a lot of nifty bells and whistles that would make a big difference. In particular, it selects the model for you based on your text and its similarity to various subsets of the training data. So you get a model better matched to your data, which will improve real-world accuracy a lot.


If that overhead comes from Python then have you tried running either of the implementations on PyPy?


Thanks, that was enlightening.



