The article seems a little rough (draft?), but the money shot is the algo in Section 5. Looks pretty neat and straightforward!
TL;DR:
1. Build a special distance(word1, word2) metric using a set of word vectors trained elsewhere (such as GloVe). This distance works better than cosine distance.
2. Given a document (a sentence, a paragraph… basically, a sequence of words), calculate the "importance" of each word as a sigmoid over distances(word, avg(all_words)).
3. To embed a document, simply do a weighted average of its word vectors, where the weight of each word equals the importance above.
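The three steps above can be sketched in a few lines of NumPy. The paper's learned distance metric isn't spelled out in this summary, so plain cosine distance stands in for it here, and the sigmoid's scale/offset are assumed; treat this as a rough sketch of the idea, not the authors' exact method:

```python
import numpy as np

def embed_document(word_vectors):
    """Weighted bag-of-words embedding, per the TL;DR above.

    word_vectors: (n_words, dim) array of pretrained vectors (e.g. GloVe).
    NOTE: cosine distance is an assumption standing in for the paper's
    learned metric, and the sigmoid parameters are guesses.
    """
    # Step 2 prerequisite: the document "context" is avg(all_words).
    context = word_vectors.mean(axis=0)

    # Stand-in distance: cosine distance of each word to the average.
    norms = np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(context)
    cos_sim = word_vectors @ context / norms
    dist = 1.0 - cos_sim

    # Step 2: importance of each word via a sigmoid over the distances.
    importance = 1.0 / (1.0 + np.exp(dist))

    # Step 3: weighted average of the word vectors.
    weights = importance / importance.sum()
    return weights @ word_vectors
```

With a (5, 50) matrix of GloVe rows in, you get one 50-dim document vector out; uninformative words near the document average get smoothly downweighted by the sigmoid.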
A small suggestion: I think the way you're describing this as "context" might be misleading. You'll probably want to try the same thing on the output of a BiLSTM or CNN model instead of just the word vectors. I think this is more like an attention mechanism --- like tf-idf, you're learning to downweight words which are less informative. Your method is also very similar in effect to the LexRank line of work, which is more computationally expensive and complicated.
‘Attention is all you need’ was the first thought I had when I read the title of your paper. It might be worthwhile to have a small bit explaining the distinction between attention and context as you view them. If I understand correctly, you are describing how to build a highly useful context representation. The application of that context in a neural net would be to focus the attention of the network on the parts of the input that are more important. From my point of view, your paper here helps answer the question: what directs the attention?
Apologies if all of this is blindingly obvious. I’m a fascinated amateur.
Yep, exactly. The current paper presents the contextual model and how to interact with it, and then shows that even simple models (such as the weighted bag of words in the first parent comment) based on it do quite well. I really appreciate the interest and please let me know if you have any questions! :)
I just think that titling a paper about a weighted bag-of-words approach "Context is everything" really starts the reader off on the wrong foot. In fact there's almost no context in your model, in contrast to most other sentence learning methods!
Was this a 224n project? I'm hoping to try doing the class online on my own. Do you think the 2017 videos are still pretty in-line with the topics on the current website?
Hi! This started off as a 224n project, but it's gone through quite a bit of work since then. The slides from the most recent offering are pretty complete. I think they put a lot of time into keeping it updated, especially towards the end.
Very cool. Any interesting uses for this? I want to develop a 'cognitive debiasing' ML system. It would take real-time spoken language and parse it syntactically to output a debiased version of the input. Let me know if you have any resources or insight on this. Thanks!
It sounds like you want to be able to generate language with the same basic meaning as the original but without whatever bias was there.
I'd recommend starting by trying to detect the presence of the bias first. Language generation is a very difficult problem in general and this is an especially difficult instance of it. Even detecting bias will be challenging because of how subtle a phenomenon it is, but it's much more tractable than rewriting what was said to not include the detected bias.
Not sure what the technical term is, but by debias I meant the ability to detect cognitive biases [1] in real-time spoken input and remove them from the output. The goal would be to help humans understand their biased thinking and, over time, make us less biased in our decision-making process.