
The truth is nltk is basically crap for real work, but there's so little NLP software that's put proper effort into documentation that nltk still gets a lot of use.

You can work your way down the vast number of nltk modules and you'll find almost none of them are useful for real work, and those that are useful ship a host of alternative algorithms that are all much worse than the current state of the art.

nltk makes most sense as a teaching tool, but even then it's mostly out of date. The chapter on "Parsing" in the nltk book doesn't even really deal with statistical parsing. The dependency parsing work referenced in this post is almost all 1-3 years old, so obviously it isn't covered either.

As an integration layer, nltk is so much more trouble than it's worth. You can use it to compute some of your scoring metrics, or read in a corpus, but...why bother?

I'm slowly putting together an alternative, where you get exactly one tokeniser, exactly one tagger, etc. All these algorithms have the same i/o, so we shouldn't ask a user to choose one. We should just provide the best one.
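To make the idea concrete, here's a rough sketch of the shape of API I mean (the names and bodies are placeholders, not an actual package): one callable per task, all with plain I/O, so there's nothing for the user to pick.

    # Hypothetical API shape: one function per task, exactly one implementation each.
    def tokenize(text):
        """Text in, list of token strings out."""
        return text.split()  # placeholder; the real library ships its single best tokeniser

    def tag(tokens):
        """Tokens in, list of (token, pos) pairs out."""
        return [(tok, 'NN') for tok in tokens]  # placeholder tagger

    print(tag(tokenize('nltk ships many taggers; this ships one.')))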

My previous post, on POS tagging, shows that nltk's POS tagger is incredibly slow and not very accurate: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-spe... . nltk scores 94% in 3m56s; my 200-line implementation scores 96.8% in 12s.
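(For the curious: the tagger in that post is an averaged perceptron. Below is a stripped-down sketch of the update rule it's built around; the class and variable names are mine, not the post's code, and feature extraction — word shape, suffixes, previous tags and so on — is omitted.)

    # Minimal averaged perceptron: the learning rule behind the 200-line tagger.
    from collections import defaultdict

    class AveragedPerceptron(object):
        def __init__(self):
            self.weights = defaultdict(lambda: defaultdict(float))   # feature -> class -> weight
            self._totals = defaultdict(lambda: defaultdict(float))   # accumulated weights, for averaging
            self._stamps = defaultdict(lambda: defaultdict(int))     # step when a weight last changed
            self.i = 0                                               # number of updates seen

        def predict(self, features, classes):
            scores = defaultdict(float)
            for f in features:
                for c, w in self.weights[f].items():
                    scores[c] += w
            return max(classes, key=lambda c: (scores[c], c))

        def update(self, truth, guess, features):
            self.i += 1
            if truth == guess:
                return
            for f in features:
                for c, delta in ((truth, 1.0), (guess, -1.0)):
                    # bank the old weight for however many steps it was in force
                    self._totals[f][c] += (self.i - self._stamps[f][c]) * self.weights[f][c]
                    self._stamps[f][c] = self.i
                    self.weights[f][c] += delta

        def average(self):
            # after training, replace each weight with its average over all steps
            for f, cw in self.weights.items():
                for c, w in cw.items():
                    cw[c] = (self._totals[f][c] + (self.i - self._stamps[f][c]) * w) / max(self.i, 1)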

I used to use nltk for tokenisation and sentence-boundary detection, but this library seems better for that: https://code.google.com/p/splitta/




As an aside, OpenCV seems to have the same problem. Widely used, and at least a standard deviation away from OK.


Totally. OpenCV was my first experience of encountering, and then having to work around, bugs in something that's supposed to "just work". Not many other options, though.


The ones I know are: http://simplecv.org and http://libccv.org/post/an-elephant-in-the-room/ (introduced on HN 2 yrs ago: https://news.ycombinator.com/item?id=4180979 )

Anyone have thoughts on how they or other libs compare to OpenCV?


OpenCV is the OpenSSL of vision libraries.


Since NLTK does enjoy some popularity, would it make sense to try to get some of your code into the project? I'm sure a lot of people could benefit from using/learning about more current algorithms.


My code is BSD licensed, so nltk can take and adapt it in any way they choose. However, I won't do any of the work to integrate it into their class hierarchies, or to write the tests and documentation in the way that they require.

I expect a successful patch request would take a lot of work. That's not a criticism; it's just how it is, given the aims, history and status of their project.


I was thinking about working through the NLTK book once I'm finished with Bishop's Pattern Recognition. Would you be able to recommend an alternative?


Check out Michael Collins's NLP course: https://www.coursera.org/course/nlangp

and his notes: http://www.cs.columbia.edu/~mcollins/

He talks about the averaged perceptron at the end (lectures on generalized log-linear models / GLMs: http://www.cs.columbia.edu/~cs4705/).

The perceptron tagger code (the hw4 solution) can be found here: https://github.com/emmadoraruth/Perceptron_Tagger.git


I think the book "Speech and Language Processing, 2nd Edition" by Daniel Jurafsky and James H. Martin is very good: http://www.amazon.com/Speech-Language-Processing-Daniel-Jura...

You can also check out the great online NLP course taught by the author and Prof. Chris Manning from Stanford: https://www.youtube.com/watch?v=nfoudtpBV68&list=PL6397E4B26...


"Dependency Parsing" by Nivre et al. was a good source for catching up from an NLP course to the state of the art: http://www.amazon.com/Dependency-Synthesis-Lectures-Language...


It's still tough to recommend that, imo. If you could choose to beam it straight into your head? Yeah, go ahead. But working through a book takes a lot of time... and it only covers dependency parsing, after which you still have to catch up on the last five years of dependency parsing work.


There's no alternative NLP book I can really recommend, no --- sorry. The field is moving too fast; the books go out of date too quickly.

You could try Hal Daumé's blog and Bob Carpenter's blog.


Here is a review [0] by Bob Carpenter of Foundations of Statistical Natural Language Processing [1].

[0] http://www.amazon.com/review/R2FUAZHGUOERHV

[1] http://www.amazon.com/Foundations-Statistical-Natural-Langua...


I know that book well. It's too dated.


Are there any feasible production-ready NLP solutions? My default approach has been to browse the codebases of different computational linguistics labs, but something more central would be very handy. As it stands, I just say something to the tune of "please don't use nltk; use OpenNLP, ClearNLP, Redshift, etc."


If you prefer C, you can try SENNA: http://ml.nec-labs.com/senna/. It includes POS tagging, chunking, NER, constituency parsing and semantic role labeling, but dependency parsing is not there yet.

It's super fast (thanks to C) and very accurate (thanks to its deep learning approach). The license doesn't allow commercial usage, though.

SENNA can be used with NLTK: http://pydoc.net/Python/nltk/2.0.2/nltk.tag.senna/
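Rough usage sketch, based on my memory of the NLTK wrapper's interface; the class name, module path and install directory may differ between NLTK versions, so treat this as illustrative and check your version's docs:

    # Illustrative only: wrap a local SENNA install (path is hypothetical)
    # and tag a pre-tokenised sentence.
    from nltk.tag.senna import SennaTagger

    tagger = SennaTagger('/usr/share/senna')   # directory where SENNA is unpacked
    print(tagger.tag('The parser is fast but the license is restrictive .'.split()))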


ClearNLP looks the best imo, especially if you're already using Java. If you're using Python... well, Redshift isn't production-ready, but if you needed to, it'd be the best thing to base your work on.

ClearNLP has a lot of nifty bells and whistles that would make a big difference. In particular, it selects the model for you based on your text and its similarity to various subsets of the training data. So you get a model better matched to your data, which will improve real-world accuracy a lot.


If that overhead comes from Python then have you tried running either of the implementations on PyPy?


Thanks, that was enlightening.



