Deep Learning for NLP Best Practices

peteratt · on July 26, 2017

Are there any libraries out there that implement some/most of these best practices and approaches to NLP? From what I've seen, the existing ones (Stanford NLP, OpenNLP) are getting somewhat dated. Many non-PhD people (including me) would find such a library incredibly useful.

Cybiote · on July 26, 2017

As with almost all things machine learning, Python libraries are where you will find most of the action.

I can recommend https://spacy.io for a low-fuss solution to get you up and running quickly.

Oversimplifying quite a bit, if spaCy focuses on syntax, then Gensim focuses on semantics (https://en.wikipedia.org/wiki/Distributional_semantics). Gensim has an active community and is well documented.

https://radimrehurek.com/gensim/
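To make "distributional semantics" concrete, here is a minimal sketch in plain Python. It is not Gensim's API or algorithm; the toy corpus, the window size, and the function names are all assumptions for illustration. Words are represented by counts of their neighbouring words, so words used in similar contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words that appear within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus, invented for this example.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell on monday",
]
vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])
print(sim_cat_dog > sim_cat_stocks)
```

"cat" and "dog" share most of their contexts here, while "cat" and "stocks" share none, so the first similarity is higher. Real systems learn dense vectors from far larger corpora, but the underlying intuition is the same.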

If you have the data, can spend a few days experimenting, and want something that can be orders of magnitude faster to train than deep learning, there's Vowpal Wabbit. Prediction speed is blazing. Results can be nearly state of the art at a much lower cost. It's written in C++, with bindings for many languages. It's very poorly documented.

https://github.com/JohnLangford/vowpal_wabbit/wiki/Learning%...

I've never taken a gander at Facebook's fastText.

syllogism · on July 26, 2017

(I'm the lead author of spaCy)

spaCy 2 is reconfigured for deep learning. It's still in alpha, but there's already a lot there:

http://alpha.spacy.io/docs/

I wrote a whole neural network library to get this done, because TensorFlow and Theano are terrible for the type of models NLP needs. Joke was sort of on me, because PyTorch came out just as I was finishing :). But it's actually very good to own the dependencies of the library anyway, since it's such an important part of things. Having our own NN library has made it easy to make lots of small innovations along the way. The best one is hash-kernel powered embeddings, which have just been published as "bloom embeddings". I've been using these for the last six months, with great results.
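A rough sketch of the general hash-kernel idea, for readers who have not seen it. This is not spaCy's implementation: the class name, table size, number of hashes, and the choice of SHA-256 are all assumptions for illustration. Each key is hashed several times into a small table and the selected rows are summed, so a fixed-size table can serve an unbounded vocabulary.

```python
import hashlib
import random


class HashEmbedding:
    """Toy hashed embedding table; keys share rows through hashing."""

    def __init__(self, n_rows=1000, width=8, n_hashes=3, seed=0):
        rng = random.Random(seed)
        self.n_rows = n_rows
        self.width = width
        self.n_hashes = n_hashes
        # A small, fixed-size table instead of one row per vocabulary item.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)]
            for _ in range(n_rows)
        ]

    def rows_for(self, key):
        """Map a string key to n_hashes row indices using salted hashes."""
        indices = []
        for salt in range(self.n_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
            indices.append(int.from_bytes(digest[:8], "big") % self.n_rows)
        return indices

    def embed(self, key):
        """Sum the selected rows to form the key's vector."""
        vector = [0.0] * self.width
        for index in self.rows_for(key):
            for i, value in enumerate(self.table[index]):
                vector[i] += value
        return vector


emb = HashEmbedding()
vec_a = emb.embed("apple")
vec_b = emb.embed("a-word-never-seen-in-training")
print(len(vec_a), len(vec_b))
```

Two keys may collide on one row, but they are unlikely to collide on all of them, which is why several hashes are used. Unseen words still get a vector without any special out-of-vocabulary handling.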

You can read my thoughts about NLP best-practices here:

https://explosion.ai/blog/deep-learning-formula-nlp

Like Sebastian (and pretty much anyone else), I think the two improvements to emphasise in NLP are sequence models like LSTM, and transfer learning. That's what's better now with deep learning: what we used to call semi-supervised approaches and domain adaptation now work much better than they did before. Incidentally Ruder et al. (2017)'s sluice networks are an important recent paper on this: https://arxiv.org/abs/1705.08142

Going forward I think it's important that we get past just using word2vec to pre-train vectors, and start making it easier to use pre-trained LSTM and CNN models. Side-objectives in multi-task learning are also very important.

I don't think the APIs in spaCy around these things are quite right yet. There are also lots of trade-offs in sharing weights that make things complicated for people. Sometimes weight-sharing gets in the way, because you want to just train this one part, and it's really weird that your updates are affecting other models you don't think you're touching.
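The weight-sharing surprise can be shown with a toy sketch. Nothing here reflects spaCy's API; `ToyModel`, the numbers, and the "tagger"/"parser" names are hypothetical. Two models hold a reference to the same embedding, so a gradient step taken through one changes the other's predictions.

```python
class ToyModel:
    """Toy model: prediction is its own weight times the sum of an embedding."""

    def __init__(self, embedding, own_weight):
        self.embedding = embedding  # a reference to the list, not a copy
        self.own_weight = own_weight

    def predict(self):
        return self.own_weight * sum(self.embedding)

    def update(self, gradient, learning_rate=0.1):
        # In-place gradient step on the (possibly shared) embedding.
        for i, g in enumerate(gradient):
            self.embedding[i] -= learning_rate * g


shared_embedding = [0.5, -0.2, 0.1]
tagger = ToyModel(shared_embedding, own_weight=2.0)
parser = ToyModel(shared_embedding, own_weight=-1.0)
# A model with its own copy of the embedding, for comparison.
isolated = ToyModel(list(shared_embedding), own_weight=-1.0)

parser_before = parser.predict()
isolated_before = isolated.predict()

tagger.update([1.0, 1.0, 1.0])  # we only meant to train the tagger

parser_after = parser.predict()
isolated_after = isolated.predict()
print(parser_before != parser_after, isolated_before == isolated_after)
```

The parser's output moves even though only the tagger was updated, while the model with its own copy is unaffected. That is the trade-off: sharing lets tasks help each other, but copying (or freezing) is what you want when you mean to train just one part.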

For a better idea of where Ines and I are going with all of this, you can read this: https://explosion.ai/blog/supervised-learning-data-collectio...

Basically I think the main problem people are having with NLP is that they don't want to commit to a problem and create training and evaluation data for it. Teams that don't bite the bullet and commit to their problem thrash around and don't get anything done. Even if you're using unsupervised techniques, you need repeatable evaluations.
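As a minimal sketch of what a repeatable evaluation can look like (the data, the seed, and the keyword baseline are all invented for illustration): fix the dev split with a seed so every experiment is scored on the same examples, and score everything with the same metric.

```python
import random


def split_fixed(examples, dev_fraction=0.2, seed=13):
    """Shuffle with a fixed seed so the dev set is identical on every run."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def accuracy(predict, dev_set):
    """Fraction of dev examples where the prediction matches the label."""
    correct = sum(1 for text, label in dev_set if predict(text) == label)
    return correct / len(dev_set)


# Hypothetical labelled data; in practice this is the part you have to create.
examples = [
    ("great library", "pos"),
    ("really easy to use", "pos"),
    ("fast enough for us", "pos"),
    ("works well", "pos"),
    ("awesome docs", "pos"),
    ("poorly documented", "neg"),
    ("pretty mediocre", "neg"),
    ("getting dated", "neg"),
    ("terrible for nlp", "neg"),
    ("too slow to train", "neg"),
]

NEGATIVE_WORDS = {"poorly", "mediocre", "dated", "terrible", "slow"}


def keyword_baseline(text):
    """Trivial baseline: negative if any known negative word is present."""
    return "neg" if NEGATIVE_WORDS & set(text.split()) else "pos"


train, dev = split_fixed(examples)
score = accuracy(keyword_baseline, dev)
print(len(train), len(dev), score)
```

Any new model, supervised or not, then gets compared against the same `dev` set and the same baseline, which is what makes the numbers comparable from one run to the next.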

We're preparing to launch an evaluation tool to address this problem. You can subscribe to our mailing list, RSS or Twitter to get the announcement when the beta is ready: https://twitter.com/explosion_ai

arrmn · on July 26, 2017

I've used spaCy for a customer project without having any NLP or deep learning experience and it was really easy to use and it's fast enough for our use case. So thanks for providing such an awesome library.

rpedela · on July 26, 2017

I have found CoreNLP to be the best library for POS tagging, NER, etc., and Gensim for word vectors and summarization. Others have recommended spaCy, but I have found it to be inferior to CoreNLP.

nl · on July 26, 2017

Just noting that CoreNLP is StanfordNLP (for those who are unaware).

I think it is widely accepted that CoreNLP outperforms spaCy in terms of accuracy for POS and NER.

spaCy is very convenient and works well enough for most tasks. It can also load Gensim vectors.

nl · on July 26, 2017

OpenNLP is pretty mediocre.

StanfordNLP is pretty close to the state of the art for things like POS tagging (in English, anyway; I'm not familiar with benchmarks in other languages).

As mentioned in the other reply, spaCy, Gensim, fastText and VW are great for specific things.

mtthwmtthw · on July 26, 2017

What are your criteria for calling OpenNLP mediocre?

physicsyogi · on July 27, 2017

Stanford NLP is currently state of the art for single-document coreference resolution. https://github.com/clarkkev/deep-coref

orthoganol · on July 26, 2017

This feels a little hit or miss, for example:

> One way to decrease the risk of vanishing gradients is to clip their maximum value

But probably helpful for a general picture if you're new to this stuff.
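Presumably the objection to the quoted sentence is that clipping caps large gradients, so it guards against exploding gradients rather than vanishing ones. A minimal sketch of clipping by norm in plain Python (the function name and the numbers are assumptions for illustration, not taken from the article):

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        # Small gradients pass through untouched: clipping never enlarges them.
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


exploding = clip_by_norm([30.0, 40.0], max_norm=5.0)  # norm 50, rescaled to norm 5
vanishing = clip_by_norm([1e-8, 2e-8], max_norm=5.0)  # far below the cap, unchanged
print(exploding, vanishing)
```

The large gradient is scaled down while keeping its direction, and the tiny one comes back exactly as it went in, which is why clipping does nothing for vanishing gradients.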

avarun · on July 26, 2017

Looks cool, but how do I know that this list of best practices is the definitive list of best practices, given how many lists claim that title?

cleansy · on July 26, 2017

"Best practices" are pretty much always according to the author. From the article:

> Disclaimer: Treating something as best practice is notoriously difficult: Best according to what? What if there are better alternatives? This post is based on my (necessarily incomplete) understanding and experience. In the following, I will only discuss practices that have been reported to be beneficial independently by at least two different groups. I will try to give at least two references for each best practice.

fageyogurtspoon · on July 26, 2017

The comprehensive bibliography of 67 papers might be a start.