That's a nice article in that it manages to get excitement across without forgetting to balance it somewhat. Kind of a rare thing these days.
I read the full post, thanks for writing it. It is very clear, but I do have a couple of questions:
1. In Step (2), Bidirectional RNN: what are you making the forward/backward passes over? How do the tokens get turned into a "matrix"? What is the dimensionality of this matrix?
2. Step 3 is a bit unclear. Where do Parikh et al. get their 2 matrices from?
It would be nice to bring in some concreteness: talk about sentences, documents, etc. and how they map into this scheme.
1) Input: (ids1, ids2). These are integer-typed arrays of length len1 and len2
2) sent1 = embed(ids1); sent2 = embed(ids2). Data is now real-value arrays of shape (len1, vector_dim) and (len2, vector_dim) respectively. 300 is a common value for vector_dim, e.g. from the GloVe common crawl model.
3) sent1 = encode(sent1); sent2 = encode(sent2). Data is now real-valued arrays of shape (len1, fwd_dim+bwd_dim), (len2, fwd_dim+bwd_dim).
4a) attention = create_attention_matrix(sent1, sent2). This is a real-valued array of shape (len1, len2)
4b) align1 = soft_align(sent1, transpose(attention)); align2 = soft_align(sent2, attention). Each row of align1 is an attention-weighted summary of sent1 at one of sent2's token positions, and vice versa for align2. These are real-valued arrays of shape (len2, compare_dim) and (len1, compare_dim) respectively
4c) feats1 = sum(map(compare(sent1, align2))); feats2 = sum(map(compare(sent2, align1))). These are real-valued arrays of shape (predict_dim,), (predict_dim,)
5) class_id = predict(feats1, feats2)
The post describes steps 4a, 4b and 4c as a single operation that takes the two 2-dimensional sentence representations as input and outputs a single vector (obtained by concatenating the representations feats1 and feats2 in this description).
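To make the shapes concrete, here is a rough NumPy sketch of steps 1-5. Random matrices stand in for the trained embedding table, BiRNN, compare and predict layers, and soft_align is just a softmax-weighted sum, so the names mirror the pseudocode above rather than any real library API; the predicted class is meaningless, the point is only the data flow.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, vector_dim = 10_000, 300     # e.g. GloVe common-crawl vectors
    encode_dim = 128                         # fwd_dim + bwd_dim after the BiRNN
    compare_dim, n_classes = 100, 3

    E = rng.normal(size=(vocab_size, vector_dim))         # embedding table
    W_enc = rng.normal(size=(vector_dim, encode_dim))     # stand-in for the BiRNN
    W_cmp = rng.normal(size=(encode_dim * 2, compare_dim))
    W_out = rng.normal(size=(compare_dim * 2, n_classes))

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    ids1 = rng.integers(0, vocab_size, size=7)    # 1) two integer id sequences
    ids2 = rng.integers(0, vocab_size, size=5)

    sent1, sent2 = E[ids1], E[ids2]               # 2) (len1, 300), (len2, 300)
    sent1, sent2 = sent1 @ W_enc, sent2 @ W_enc   # 3) (len1, 128), (len2, 128)

    attention = sent1 @ sent2.T                   # 4a) (len1, len2)
    align2 = softmax(attention, axis=1) @ sent2   # 4b) sent2 summarised at sent1's positions
    align1 = softmax(attention.T, axis=1) @ sent1 #     sent1 summarised at sent2's positions

    feats1 = np.maximum(np.hstack([sent1, align2]) @ W_cmp, 0).sum(axis=0)  # 4c) (compare_dim,)
    feats2 = np.maximum(np.hstack([sent2, align1]) @ W_cmp, 0).sum(axis=0)

    class_id = int(np.argmax(np.hstack([feats1, feats2]) @ W_out))          # 5)
    print(class_id)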
I really like this post. So much NLP research is 'locked away' in academic papers, and making the knowledge more accessible through posts like this is very important for large-scale adoption by non-academics.
Also, really well done on the site design. Love the graphics, font, layout and 'progress bar' animation at the top. Very nice UX overall.
I really loved reading this article but it's always so hard to figure out exactly how these things work out in detail.
I understand matrix multiplication, but it seems that (some of) these matrix-to-vector calculations are actually trained as part of the neural net, and exactly how that works I can't figure out from articles like this.
Thanks! I'm planning to make two follow-up posts, one on each of the systems, that go through those details. I blurred them out in this post because I wanted to get across this more abstract story about the data types and transformations.
There are lots of good posts about attention mechanisms. The WildML post is good, as is Chris Olah's post. Bidirectional RNNs are a little bit less well covered, but the idea is not too difficult to understand given a single RNN (or LSTM, GRU etc).
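If it helps, the bidirectional idea really is just two RNNs run in opposite directions over the same sentence, with their per-token outputs concatenated. A toy NumPy sketch (a plain tanh RNN rather than an LSTM/GRU, random untrained weights, so only the shapes are meaningful):

    import numpy as np

    rng = np.random.default_rng(0)
    vector_dim, hidden_dim = 300, 64
    # Each direction gets its own weights.
    Wx_f, Wh_f = rng.normal(size=(vector_dim, hidden_dim)) * 0.01, rng.normal(size=(hidden_dim, hidden_dim)) * 0.01
    Wx_b, Wh_b = rng.normal(size=(vector_dim, hidden_dim)) * 0.01, rng.normal(size=(hidden_dim, hidden_dim)) * 0.01

    def rnn(sent, Wx, Wh):
        # sent: (length, vector_dim) -> per-token states of shape (length, hidden_dim)
        h, outputs = np.zeros(hidden_dim), []
        for x in sent:
            h = np.tanh(x @ Wx + h @ Wh)
            outputs.append(h)
        return np.stack(outputs)

    def birnn(sent):
        fwd = rnn(sent, Wx_f, Wh_f)              # reads left to right
        bwd = rnn(sent[::-1], Wx_b, Wh_b)[::-1]  # reads right to left, then re-align
        return np.concatenate([fwd, bwd], axis=-1)   # (length, fwd_dim + bwd_dim)

    sent = rng.normal(size=(7, vector_dim))      # a 7-token "sentence" of word vectors
    print(birnn(sent).shape)                     # (7, 128)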
You should also read the papers :). This is how most people doing ML are staying up to date, including the people building practical things, not just researchers. Academia is competitive and writing is cheap relative to experimentation, so the deep learning literature is really pretty easy to follow.
One of the main issues with character-level CNNs (irrespective of convolution type, IIRC) is the inability of the model to handle unknown words, which is something that word-level models do well. So if you look at applications of NLP in domains that need this to work well, you won't get much from purely char-level models, in my experience.
One other cool part of attention is that you can attend to the m-dimensional rows of an n-by-m matrix just as well as those of a k-by-m matrix. Objects (sentences) of varying size can be treated the same in a really nice, principled way.
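As a toy illustration (plain NumPy, random vectors, not code from the post): the same dot-product-plus-softmax attention works unchanged whether the matrix has 5 rows or 40, because the softmax normalises over however many rows it gets.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 128                                   # the shared vector dimension

    def attend(query, matrix):
        # matrix: (n, m) for ANY n; query: (m,) -> one (m,) summary vector
        scores = matrix @ query               # (n,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax over whatever length n happens to be
        return weights @ matrix               # (m,)

    query = rng.normal(size=m)
    short_sent = rng.normal(size=(5, m))      # 5 tokens
    long_sent = rng.normal(size=(40, m))      # 40 tokens, same code path
    print(attend(query, short_sent).shape, attend(query, long_sent).shape)   # (128,) (128,)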
Could you use such a model to do sequence labeling? For example, I have a text document such as a financial document and I want to detect where in that document it states that a "stock split" will occur, or "share repurchase" and how much. This seems like a good approach given that it learns context. I know there are NER methods, but this is slightly different. I want to train a model to recognize specific events. The best I can do right now is a regex.
If you want to tag sequential spans of text, then you've basically got the same "shape" of problem as named entity recognition, just with different labels and data. BiLSTMs work well for this.
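If you want something more concrete than a regex to start from, a minimal BiLSTM tagger in tf.keras looks roughly like the sketch below. The vocabulary size, the BIO-style label set and all the layer sizes are made up for illustration; the real work is annotating documents with the spans you care about ("stock split", "share repurchase", amounts), exactly as you would for NER.

    import numpy as np
    import tensorflow as tf

    vocab_size, n_labels, max_len = 20_000, 5, 200   # e.g. O, B/I-STOCK_SPLIT, B/I-SHARE_REPURCHASE

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 100, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_labels, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Fake data just to show the expected shapes: token ids in, one label id per token out.
    X = np.random.randint(1, vocab_size, size=(32, max_len))
    y = np.random.randint(0, n_labels, size=(32, max_len))
    model.fit(X, y, epochs=1, verbose=0)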
Btw, I'm interested to hear how well training with large one-hot encoded vectors scales. A paper someone pointed me to recently on HN suggested that it doesn't scale very well:
One-shot Learning with Memory-Augmented Neural Networks [https://arxiv.org/abs/1605.06065]