That's a nice article in that it manages to get excitement across without forgetting to balance it somewhat. Kind of a rare thing these days.
I read the full post, thanks for writing it. It is very clear, but I do have a couple of questions:
1. In Step (2), Bidirectional RNN: what are you making the forward/backward passes over? How do the tokens get turned into a "matrix"? What is the dimensionality of this matrix?
2. Step 3 is a bit unclear. Where do Parikh et al. get their 2 matrices from?
It would be nice to bring in some concreteness: talk about sentences, documents, etc. and how they map into this scheme.
1) Input: (ids1, ids2). These are integer-typed arrays of length len1 and len2
2) sent1 = embed(ids1); sent2 = embed(ids2). Data is now real-value arrays of shape (len1, vector_dim) and (len2, vector_dim) respectively. 300 is a common value for vector_dim, e.g. from the GloVe common crawl model.
3) sent1 = encode(sent1); sent2 = encode(sent2). Data is now real-valued arrays of shape (len1, fwd_dim+bwd_dim), (len2, fwd_dim+bwd_dim).
4a) attention = create_attention_matrix(sent1, sent2). This is a real-valued array of shape (len1, len2)
4b) align1 = soft_align(sent1, transpose(attention)); align2 = soft_align(sent2, attention). Each row of align1 is an attention-weighted summary of sent1 at one of sent2's token positions, and vice versa for align2. These are real-valued arrays of shape (len2, compare_dim) and (len1, compare_dim) respectively
4c) feats1 = sum(map(compare(sent1, align2))); feats2 = sum(map(compare(sent2, align1))). These are real-valued arrays of shape (predict_dim,), (predict_dim,)
5) class_id = predict(feats1, feats2)
The post describes steps 4a, 4b and 4c as a single operation that takes the two 2-dimensional sentence representations as input and outputs a single vector (obtained by concatenating the representations feats1 and feats2 in this description).
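To make the shapes concrete, here is a rough NumPy sketch of steps 1-5. Random matrices stand in for the trained embedding table, BiRNN, compare and predict layers, and soft_align is just a softmax-weighted sum, so the names mirror the pseudocode above rather than any real library API; the predicted class is meaningless, the point is only the data flow.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, vector_dim = 10_000, 300     # e.g. GloVe common-crawl vectors
    encode_dim = 128                         # fwd_dim + bwd_dim after the BiRNN
    compare_dim, n_classes = 100, 3

    E = rng.normal(size=(vocab_size, vector_dim))         # embedding table
    W_enc = rng.normal(size=(vector_dim, encode_dim))     # stand-in for the BiRNN
    W_cmp = rng.normal(size=(encode_dim * 2, compare_dim))
    W_out = rng.normal(size=(compare_dim * 2, n_classes))

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    ids1 = rng.integers(0, vocab_size, size=7)    # 1) two integer id sequences
    ids2 = rng.integers(0, vocab_size, size=5)

    sent1, sent2 = E[ids1], E[ids2]               # 2) (len1, 300), (len2, 300)
    sent1, sent2 = sent1 @ W_enc, sent2 @ W_enc   # 3) (len1, 128), (len2, 128)

    attention = sent1 @ sent2.T                   # 4a) (len1, len2)
    align2 = softmax(attention, axis=1) @ sent2   # 4b) sent2 summarised at sent1's positions
    align1 = softmax(attention.T, axis=1) @ sent1 #     sent1 summarised at sent2's positions

    feats1 = np.maximum(np.hstack([sent1, align2]) @ W_cmp, 0).sum(axis=0)  # 4c) (compare_dim,)
    feats2 = np.maximum(np.hstack([sent2, align1]) @ W_cmp, 0).sum(axis=0)

    class_id = int(np.argmax(np.hstack([feats1, feats2]) @ W_out))          # 5)
    print(class_id)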
I really like this post. So much NLP research is 'locked away' in academic papers, and making the knowledge more accessible through posts like this is very important for large-scale adoption by non-academics.
Also, really well done on the site design. Love the graphics, font, layout and 'progress bar' animation at the top. Very nice UX overall.
I really loved reading this article but it's always so hard to figure out exactly how these things work out in detail.
I understand matrix multiplication, but it seems that (some of) these matrix-to-vector calculations are actually trained as part of the neural net, and exactly how that works I can't figure out from articles like this.
Thanks! I'm planning to make two follow-up posts, one on each of the systems, that go through those details. I blurred them out in this post because I wanted to get across this more abstract story about the data types and transformations.
There are lots of good posts about attention mechanisms. The WildML post is good, as is Chris Olah's post. Bidirectional RNNs are a little bit less well covered, but the idea is not too difficult to understand given a single RNN (or LSTM, GRU etc).
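If it helps, the bidirectional idea really is just two RNNs run in opposite directions over the same sentence, with their per-token outputs concatenated. A toy NumPy sketch (a plain tanh RNN rather than an LSTM/GRU, random untrained weights, so only the shapes are meaningful):

    import numpy as np

    rng = np.random.default_rng(0)
    vector_dim, hidden_dim = 300, 64
    # Each direction gets its own weights.
    Wx_f, Wh_f = rng.normal(size=(vector_dim, hidden_dim)) * 0.01, rng.normal(size=(hidden_dim, hidden_dim)) * 0.01
    Wx_b, Wh_b = rng.normal(size=(vector_dim, hidden_dim)) * 0.01, rng.normal(size=(hidden_dim, hidden_dim)) * 0.01

    def rnn(sent, Wx, Wh):
        # sent: (length, vector_dim) -> per-token states of shape (length, hidden_dim)
        h, outputs = np.zeros(hidden_dim), []
        for x in sent:
            h = np.tanh(x @ Wx + h @ Wh)
            outputs.append(h)
        return np.stack(outputs)

    def birnn(sent):
        fwd = rnn(sent, Wx_f, Wh_f)              # reads left to right
        bwd = rnn(sent[::-1], Wx_b, Wh_b)[::-1]  # reads right to left, then re-align
        return np.concatenate([fwd, bwd], axis=-1)   # (length, fwd_dim + bwd_dim)

    sent = rng.normal(size=(7, vector_dim))      # a 7-token "sentence" of word vectors
    print(birnn(sent).shape)                     # (7, 128)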
You should also read the papers :). This is how most people doing ML are staying up to date, including the people building practical things, not just researchers. Academia is competitive and writing is cheap relative to experimentation, so the deep learning literature is really pretty easy to follow.
One of the main issues with character-level CNNs (irrespective of convolution type, IIRC) is the inability of the model to handle unknown words, which is something that word-level models do well. So if you look at applications of NLP in domains that need this to work well, you won't get much from purely char-level models, in my experience.
One other cool part of attention is that you can attend to the m-dimensional rows of an n-by-m matrix just as well as those of a k-by-m matrix. Objects (sentences) of varying size can be treated the same in a really nice, principled way.
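As a toy illustration (plain NumPy, random vectors, not code from the post): the same dot-product-plus-softmax attention works unchanged whether the matrix has 5 rows or 40, because the softmax normalises over however many rows it gets.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 128                                   # the shared vector dimension

    def attend(query, matrix):
        # matrix: (n, m) for ANY n; query: (m,) -> one (m,) summary vector
        scores = matrix @ query               # (n,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax over whatever length n happens to be
        return weights @ matrix               # (m,)

    query = rng.normal(size=m)
    short_sent = rng.normal(size=(5, m))      # 5 tokens
    long_sent = rng.normal(size=(40, m))      # 40 tokens, same code path
    print(attend(query, short_sent).shape, attend(query, long_sent).shape)   # (128,) (128,)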
Could you use such a model to do sequence labeling? For example, I have a text document such as a financial document and I want to detect where in that document it states that a "stock split" will occur, or "share repurchase" and how much. This seems like a good approach given that it learns context. I know there are NER methods, but this is slightly different. I want to train a model to recognize specific events. The best I can do right now is a regex.
If you want to tag sequential spans of text, then you've basically got the same "shape" of problem as named entity recognition, just with different labels and data. BiLSTMs work well for this.
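If you want something more concrete than a regex to start from, a minimal BiLSTM tagger in tf.keras looks roughly like the sketch below. The vocabulary size, the BIO-style label set and all the layer sizes are made up for illustration; the real work is annotating documents with the spans you care about ("stock split", "share repurchase", amounts), exactly as you would for NER.

    import numpy as np
    import tensorflow as tf

    vocab_size, n_labels, max_len = 20_000, 5, 200   # e.g. O, B/I-STOCK_SPLIT, B/I-SHARE_REPURCHASE

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 100, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_labels, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Fake data just to show the expected shapes: token ids in, one label id per token out.
    X = np.random.randint(1, vocab_size, size=(32, max_len))
    y = np.random.randint(0, n_labels, size=(32, max_len))
    model.fit(X, y, epochs=1, verbose=0)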
Btw, I'm interested to hear how well training with large one-hot encoded vectors scales. A paper someone pointed me to recently on HN suggested that it doesn't scale very well:
One-shot Learning with Memory-Augmented Neural Networks [https://arxiv.org/abs/1605.06065]