Are there any libraries out there that implement some/most of these best practices and approaches to NLP? From what I've seen, the existing ones (Stanford NLP, OpenNLP) are getting somewhat dated. Many non-PhD people (including me) would find such a library incredibly useful.
If you have the data, can spend a few days experimenting, and want something that can be orders of magnitude faster to train than deep learning, there's Vowpal Wabbit. Prediction speed is blazing, and results can be nearly state of the art at a fraction of the cost. It's C++, with bindings for many languages, but it's very poorly documented.
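For what it's worth, here's a minimal sketch of what using it looks like through the Python bindings (assuming the vowpalwabbit package; the exact binding API varies between versions, and the features and labels here are made up):

    from vowpalwabbit import pyvw

    # VW's native input format is "label | feature:value feature ...".
    # Train a logistic-loss binary classifier on two toy examples.
    model = pyvw.vw("--loss_function logistic --quiet")
    model.learn("1 | great fast useful")
    model.learn("-1 | slow buggy")

    # Predict on unseen features; the output is a raw score (margin)
    # by default, not a probability.
    score = model.predict("| great useful")

The same thing works from the command line on larger files, e.g. training with "vw -d train.txt -f model.vw".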
I wrote a whole neural network library to get this done, because TensorFlow and Theano are terrible for the type of models NLP needs. The joke was sort of on me, because PyTorch came out just as I was finishing :). But it's actually been very good to own that dependency anyway, since it's such an important part of the library. Having our own NN library has made it easy to make lots of small innovations along the way. The best one is hash-kernel powered embeddings, which have just been published as "bloom embeddings". I've been using these for the last six months, with great results.
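The trick, roughly: instead of one table row per word, hash each word ID a few times into a much smaller table and sum the rows. A toy numpy sketch of the idea (sizes and function names are illustrative, not spaCy's actual implementation, and a real version would use a stable hash like MurmurHash):

    import numpy as np

    def bloom_embed(word_id, table, seeds=(0, 1, 2, 3)):
        # Hash the word ID with several seeds into rows of a small table.
        # Individual rows collide between words, but the *sum* of k rows
        # is almost certainly unique per word, like a Bloom filter's
        # pattern of bits.
        rows = [hash((word_id, seed)) % table.shape[0] for seed in seeds]
        return table[rows].sum(axis=0)

    n_rows, width = 5000, 64   # far fewer rows than the vocabulary size
    table = np.random.randn(n_rows, width) * 0.1
    vector = bloom_embed(123456, table)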
You can read my thoughts about NLP best practices here:
Like Sebastian (and pretty much anyone else), I think the two improvements to emphasise in NLP are sequence models like LSTMs, and transfer learning. That's what's better now with deep learning: what we used to call semi-supervised approaches and domain adaptation now work much better than they did before. Incidentally, the sluice networks paper by Ruder et al. (2017) is an important recent contribution on this: https://arxiv.org/abs/1705.08142
Going forward I think it's important that we get past just using word2vec to pre-train vectors, and start making it easier to use pre-trained LSTM and CNN models. Side-objectives in multi-task learning are also very important.
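To make that concrete, here's a hedged PyTorch sketch of the difference (not spaCy's code; the file names are hypothetical). Word vectors only initialise the embedding table; re-using a pre-trained LSTM transfers the encoder weights too:

    import numpy as np
    import torch
    import torch.nn as nn

    # Pre-trained word vectors initialise only the embedding table...
    vectors = np.load("word2vec_vectors.npy")   # (vocab, dim), hypothetical file
    embed = nn.Embedding.from_pretrained(
        torch.tensor(vectors, dtype=torch.float))

    # ...whereas re-using a pre-trained LSTM also transfers the encoder.
    encoder = nn.LSTM(input_size=vectors.shape[1], hidden_size=128,
                      batch_first=True)
    encoder.load_state_dict(torch.load("pretrained_lstm.pt"))  # hypothetical file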
I don't think the APIs in spaCy around these things are quite right yet. There are also lots of trade-offs in sharing weights that make things complicated for people. Sometimes weight-sharing gets in the way: you just want to train one part, and it's disconcerting when your updates affect other models you didn't think you were touching.
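A sketch of the trade-off in PyTorch terms (the module names and sizes are made up): two task heads share one encoder, so a gradient step for one task silently moves the other task's features unless you freeze the shared part:

    import torch.nn as nn

    shared = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
    tagger_head = nn.Linear(64, 17)   # e.g. POS tags
    parser_head = nn.Linear(64, 40)   # e.g. parser actions

    # Any update through `shared` for the tagger also changes what the
    # parser sees. To "just train this one part", freeze the shared weights:
    for p in shared.parameters():
        p.requires_grad = False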
Basically I think the main problem people are having with NLP is that they don't want to commit to a problem and create training and evaluation data for it. Teams that don't bite the bullet and commit to their problem thrash around and don't get anything done. Even if you're using unsupervised techniques, you need repeatable evaluations.
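Concretely, the minimum bar is something like this sketch: a held-out set created once and never changed, plus one fixed metric (model.predict here stands in for whatever system you're testing):

    def evaluate(model, examples):
        # `examples` is a frozen list of (text, gold_label) pairs, created
        # once and never touched again, so every run is comparable.
        correct = sum(model.predict(text) == gold for text, gold in examples)
        return correct / len(examples)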
We're preparing to launch an evaluation tool to address this problem. You can subscribe to our mailing list, RSS or Twitter to get the announcement when the beta is ready: https://twitter.com/explosion_ai
I've used spaCy for a customer project without any prior NLP or deep learning experience. It was really easy to use, and it's fast enough for our use case. So thanks for providing such an awesome library.
I have found CoreNLP to be the best library for POS tagging, NER, etc., and Gensim for word vectors and summarization. Others have recommended spaCy, but I have found it to be inferior to CoreNLP.
"Best practices" are pretty much always according to the author. From the article:
> Disclaimer: Treating something as best practice is notoriously difficult: Best according to what? What if there are better alternatives? This post is based on my (necessarily incomplete) understanding and experience. In the following, I will only discuss practices that have been reported to be beneficial independently by at least two different groups. I will try to give at least two references for each best practice.