The truth is nltk is basically crap for real work, but so little NLP software has put proper effort into documentation that nltk still gets a lot of use.
You can work your way down the vast number of nltk modules, and you'll find almost none of them are useful for real work. For the tasks that are covered, nltk ships a host of alternative implementations, all of them much worse than the current state of the art.
nltk makes most sense as a teaching tool, but even then it's mostly out of date. The chapter on "Parsing" in the nltk book doesn't even really deal with statistical parsing. The dependency parsing work referenced in this post is almost all 1-3 years old, so obviously it isn't covered either.
As an integration layer, nltk is so much more trouble than it's worth. You can use it to compute some of your scoring metrics, or read in a corpus, but...why bother?
I'm slowly putting together an alternative, where you get exactly one tokeniser, exactly one tagger, etc. All these algorithms have the same i/o, so we shouldn't ask a user to choose one. We should just provide the best one.
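To make the design concrete, here's a minimal sketch of what "exactly one implementation per task, all with the same i/o" could look like. The names (`tokenize`, `tag`, `pipeline`) and the trivial bodies are purely illustrative assumptions, not any real library's API:

```python
import re

def tokenize(text):
    """The one tokeniser. A crude regex split stands in for whatever
    the current best algorithm is; the point is the user never picks."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """The one tagger. A placeholder that marks everything NN, standing
    in for a single state-of-the-art model."""
    return [(tok, "NN") for tok in tokens]

def pipeline(text):
    # Since every tokeniser/tagger shares the same i/o, the library can
    # just chain its single best implementation of each.
    return tag(tokenize(text))

print(pipeline("A simple example."))
```

The user-facing surface is one function per task; swapping in a better algorithm later changes nothing for callers.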
Totally. OpenCV was my first experience of hitting bugs in something that's supposed to "just work" and then having to engineer a way around them. Not many other options, though.
Since NLTK does enjoy some popularity, would it make sense to try to get some of your code into the project? I'm sure a lot of people could benefit from using/learning about more current algorithms.
My code is BSD licensed, so nltk can take and adapt it in any way they choose. However, I won't do any of the work to integrate it into their class hierarchies, or to write the tests and documentation in the way that they require.
I expect a successful patch request would take a lot of work. That's not a criticism; it's just how it is, given the aims, history and status of their project.
It's still tough to recommend that, imo. If you could choose to beam it straight into your head? Yeah, go ahead. But working through a book takes a lot of time... and it only gives you dependency parsing, after which you still have to catch up on the last five years of dependency parsing work.
Are there any feasible production-ready NLP solutions? My default approach has been to browse the codebases of different computational linguistics labs, but something more central would be very handy. As it stands I just say something to the tune of, "please don't use nltk; use OpenNLP, ClearNLP, Redshift, etc."
If you prefer C, you can try SENNA: http://ml.nec-labs.com/senna/. It includes POS tagging, chunking, NER, constituency parsing and semantic role labeling, but no dependency parsing yet.
It's super fast (thanks to C) and very accurate (thanks to its deep learning approach). The license doesn't permit commercial use, though.
ClearNLP looks the best imo, especially if you're already using Java. If you're using Python... well, Redshift isn't production-ready, but if you needed to, it'd be the best thing to base your work on.
ClearNLP has a lot of nifty bells and whistles that make a big difference. In particular, it selects the model for you, based on your text and its similarity to various subsets of the training data. So you get a model better matched to your data, which improves real-world accuracy a lot.
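The idea behind that kind of model selection can be sketched very simply. This is not ClearNLP's actual algorithm, just an illustration of the principle: score the input text against each training subset (here, by crude vocabulary overlap) and pick the model trained on the closest one:

```python
def jaccard(a, b):
    # Vocabulary overlap between two token lists, as a stand-in for
    # whatever similarity measure a real system would use.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical training subsets, keyed by the model trained on each.
training_subsets = {
    "news_model": "the president said the economy grew".split(),
    "web_model": "click here to follow us on social media".split(),
}

def select_model(text):
    words = text.lower().split()
    return max(training_subsets, key=lambda m: jaccard(words, training_subsets[m]))

print(select_model("The president said taxes would fall"))  # -> news_model
```

A real system would compare distributions rather than raw vocabularies, but the payoff is the same: the tagger or parser you actually run was trained on data that looks like yours.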
My previous post, on POS tagging, shows that nltk's POS tagger is incredibly slow and not very accurate: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-spe... . nltk scores 94% in 3m56s; my 200-line implementation scores 96.8% in 12s.
I used to use nltk for tokenisation and sentence-boundary detection, but this library seems better for that: https://code.google.com/p/splitta/
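To see why sentence-boundary detection deserves a dedicated tool, here's a naive regex splitter (purely illustrative; it has nothing to do with splitta's actual approach): split after `.`, `!` or `?` followed by whitespace and a capital letter. Abbreviations immediately break it, which is the gap a trained system fills:

```python
import re

def naive_split(text):
    # Split after sentence-final punctuation when the next word is
    # capitalised. Works on easy cases, fails on abbreviations.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive_split("It works. Mostly."))        # ['It works.', 'Mostly.']
print(naive_split("Dr. Smith arrived late."))  # wrongly splits after "Dr."
```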