Hacker News
LexVec, a word embedding model written in Go that outperforms word2vec (github.com/alexandres)
131 points by atrudeau on July 27, 2016 | 43 comments



As pre-built word vectors go, Conceptnet Numberbatch [1], introduced less flippantly as the ConceptNet Vector Ensemble [2], already outperforms this on all the measures evaluated in its paper: Rare Words, MEN-3000, and WordSim-353.

This fact is hard to publicize because somehow the luminaries of the field decided that they didn't care about these evaluations anymore, back when RW performance was around 0.4. I have had reviewers dismiss improving Rare Words from 0.4 to 0.6, and bringing MEN-3000 up to a high estimate of inter-annotator agreement, as "incremental improvements".

It is possible to do much, much better than Google News skip-grams ("word2vec"), and one thing that helps get there is lexical knowledge of the kind that's in ConceptNet.

[1] https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch...

[2] https://blog.luminoso.com/2016/04/06/an-introduction-to-the-...
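
(If anyone wants to see what these relatedness benchmarks actually measure, here is a minimal sketch of the usual scoring, not any project's exact evaluation code: cosine-similarity each human-rated word pair and report the Spearman correlation with the human ratings. The file names and vector format are placeholders.)

    # Sketch: scoring word vectors on a relatedness benchmark such as WordSim-353.
    # Assumes a word2vec-format vectors file and a TSV of "word1<TAB>word2<TAB>rating".
    from gensim.models import KeyedVectors
    from scipy.stats import spearmanr

    vectors = KeyedVectors.load_word2vec_format("vectors.txt")  # placeholder path

    model_scores, human_scores = [], []
    with open("wordsim353.tsv") as f:  # placeholder benchmark file
        for line in f:
            w1, w2, rating = line.strip().split("\t")
            if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
                model_scores.append(vectors.similarity(w1, w2))
                human_scores.append(float(rating))

    rho, _ = spearmanr(model_scores, human_scores)
    print("Spearman correlation:", rho)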


That said: LexVec gives quite good results on word-relatedness for using only distributional knowledge, and only from Wikipedia at that. Adding ConceptNet might give something that is more likely to be state-of-the-art.


...And using just distributional knowledge makes it easy to train new models on domain-specific corpora, or for new languages. Is it possible to do the same with ConceptNet?

I generally find that expert-derived ontologies suffer from poor coverage of low-frequency items and rigidly discrete relationships, and are usually limited to a single language. That said, they're vastly better than nothing for a lot of tasks (same goes for WordNet).


You can retrain your distributional knowledge and keep your lexical knowledge. Moving to a new domain shouldn't mean you have to forget everything about what words mean and hope you manage to learn it again.

The whole idea of Numberbatch is that a combination of distributional and lexical knowledge is much better than either one alone.

BTW, ConceptNet is only partially expert-derived (much of it is crowd-sourced), aims not to be rigid like WordNet is, and is in a whole lot of languages.

"Retraining" ConceptNet itself is a bit of a chore, but you can do it. That is, you can get the source [1], add or remove sources of data, and rebuild it. Meanwhile, if you wanted to retrain word2vec's Google News skip-gram vectors, you would have to get a machine learning job at Google.

[1] https://github.com/commonsense/conceptnet5
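
(Not the Numberbatch recipe itself, just a toy illustration of the ensembling idea: take two embedding spaces over their shared vocabulary, L2-normalize, and concatenate. The dict-of-arrays format is assumed for the example.)

    # Toy ensemble of two embedding spaces (e.g. a distributional one and a
    # lexical-knowledge one) over the words they have in common.
    import numpy as np

    def ensemble(dist_vecs, lex_vecs):
        """dist_vecs / lex_vecs: dicts mapping word -> 1-D numpy array."""
        def unit(v):
            return v / (np.linalg.norm(v) + 1e-8)
        shared = set(dist_vecs) & set(lex_vecs)
        return {w: np.concatenate([unit(dist_vecs[w]), unit(lex_vecs[w])])
                for w in shared}

A real ensemble would also reduce the concatenated vectors back to a smaller dimensionality and handle words missing from one of the spaces, which is where most of the work is.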


Cool stuff! It wasn't described in the paper (afaict), but did you have ConceptNet embeddings as well? I can certainly see a way of creating embeddings by using all the links as context. (e.g. http://conceptnet5.media.mit.edu/web/c/en/knowledge)

edit: actually it looks like you "retrofit" existing word embeddings by re-weighting based on the strengths of the links.
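
(For anyone following along, a rough sketch of that kind of graph-based retrofitting, in the spirit of Faruqui et al. 2015: nudge each vector toward the average of its graph neighbours while keeping it close to the original distributional vector. The uniform edge weights and iteration count here are illustrative defaults, not Numberbatch's exact settings.)

    # Sketch of retrofitting word vectors to a relation graph.
    # vectors: dict {word: numpy array}; graph: dict {word: list of related words}.
    import numpy as np

    def retrofit(vectors, graph, iterations=10, alpha=1.0):
        new = {w: v.copy() for w, v in vectors.items()}
        for _ in range(iterations):
            for word, neighbors in graph.items():
                neighbors = [n for n in neighbors if n in new]
                if word not in new or not neighbors:
                    continue
                beta = 1.0 / len(neighbors)  # uniform edge weights for the sketch
                neighbor_avg = sum(beta * new[n] for n in neighbors)
                # Stay close to the original vector, pulled toward graph neighbours.
                new[word] = (alpha * vectors[word] + neighbor_avg) / (alpha + 1.0)
        return new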


There kind of are ConceptNet embeddings. They weren't called "embeddings" when they were released, because it was 2008 [1]. Here's an example: http://conceptnet5.media.mit.edu/data/5.4/assoc/c/en/knowled...

We dropped working on those embeddings for a while because retrofitting was more effective. But we may have found a way to use the ConceptNet embeddings again, to improve languages besides English beyond what retrofitting is doing.

[1] http://dspace.mit.edu/handle/1721.1/51870


Thanks for bringing these tools to my attention! Awesome stuff!


Not to be too harsh, but if all you did was ensemble several different embeddings using methods that were mostly from other papers, it would be pretty obvious you'd get state of the art performance. But that is not very interesting from a novel theoretical, scientific, or even engineering perspective. It is certainly useful to the community, just not academic. Most people would consider it obvious. I could achieve state of the art performance on ImageNet if I created a larger ensemble of Inception v3 networks than the ensemble the paper used and fiddled around with different data augmentation tricks they were too lazy to use, but that's not very interesting.


Your claim of "I could do that" rings false.

You didn't do it. You think that you could have done it because I wrote a clear enough paper telling you how. And even if you were curious and determined enough to do it, the thing that would make it possible for you to do it is the years that I have spent making ConceptNet.

And when the "academic" things consistently do 10% or more worse than the "obvious" thing, maybe it's not that obvious.


I was referring to ImageNet as something I could do (because my experience is in computer vision). But more broadly, ensembling in general is really, really obvious. I mean, for heaven's sake, it's the trick everyone uses on Kaggle.

You will find that is a very common reason why academics dismiss work for being too incremental - it is too obvious. It doesn't matter how many years you spent working on it.

Whether or not your work is novel depends on what you did, not strictly on whether you managed to squeeze out more performance on a benchmark.


Ensemble methods have been around for decades because they are a good idea. Watson -- the Jeopardy bot, not the marketing brand -- was an ensemble method.

My impression is that people avoid ensembles because they are fighting to give the impression that their work is the only work that matters. It's "not invented here" syndrome. I'm glad I'm not trying to get tenure.


People avoid ensembles, unless they're trying to squeeze out higher performance on a benchmark (or they're doing research on the technique of ensembling in and of itself). I would have thought this would be obvious to you, since you're a researcher.


My company, Luminoso, uses Conceptnet Numberbatch as one component for interpreting the meaning of text. When the benchmarks went up, the understandability of its results did too. You get better search results, better topics, clearer visualizations.

I'm not just trying to squeeze out extra performance, I'm trying to make computers understand text better. The benchmarks are the evidence that it's better. I do consider myself a researcher despite leaving academia, and having some respect for evidence is part of that identity.

When academia decides to disregard evidence because evidence is for stupid Kagglers (I don't use Kaggle but I respect a good result when I see one), that's how they end up lagging behind open source software.

I understand that it's not worthwhile to chase every evaluation. For example, some evaluations are unrepresentative. Some evaluations, like parsing according to the Penn Treebank, were once useful but have been squeezed dry in a way that doesn't reflect real-world performance. And some tasks chase these unhelpful evaluations.

But I would credit Kaggle with making academics realize, slowly, that they should use random forests as a baseline when evaluating a classification method. People were content to publish classifiers that were worse than random forests until Kaggle presented overwhelming evidence that random forests worked better than many techniques.

In text understanding, the fact that seems not to have taken hold -- one that I think should be obvious, even -- is "computers understand text better when they know more facts about words". This is what ConceptNet (not the whole ensemble, but ConceptNet itself) is about.


This is all good. It justifies a technical report, a blog post, a workshop paper, or publication in venues looking for this kind of work.

It doesn't necessarily justify publication in a venue looking for totally new ideas.


It feels weird how word embedding models have come to refer to both the underlying model and the implementation. word2vec is the implementation of two models, the continuous bag-of-words and skip-gram models by Mikolov, while LexVec implements a version of the PPMI-weighted count matrix as referenced in the README file. But the papers also discuss implementation details of LexVec that have no bearing on the final accuracy. I feel like we should make more effort to keep the models and reference implementations separate.
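
(For readers who haven't met it: the PPMI-weighted count matrix starts from raw word/context co-occurrence counts. A minimal sketch of the weighting step alone, leaving out LexVec's sampling and factorization details:)

    # Sketch: turning a word/context co-occurrence count matrix into a PPMI matrix.
    # counts[i, j] = number of times word i appeared with context j.
    import numpy as np

    def ppmi(counts):
        counts = np.asarray(counts, dtype=float)
        total = counts.sum()
        word_totals = counts.sum(axis=1, keepdims=True)
        context_totals = counts.sum(axis=0, keepdims=True)
        # PMI = log( P(w, c) / (P(w) * P(c)) ); zero counts give -inf, cleaned below.
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(counts * total / (word_totals * context_totals))
        pmi[~np.isfinite(pmi)] = 0.0
        return np.maximum(pmi, 0.0)  # positive PMI: clip negative associations to zero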


Aren't skip-grams (with negative sampling) equivalent to a factorization of the (shifted) PPMI matrix?

https://papers.nips.cc/paper/5477-neural-word-embedding-as-i....
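
(The paper's result, roughly: skip-gram with negative sampling implicitly factorizes a PMI matrix shifted by log k, where k is the number of negative samples. An explicit version of that factorization is just a truncated SVD of the (P)PMI matrix; a sketch, assuming a PPMI matrix like the one sketched above:)

    # Sketch: explicit SVD factorization of a (P)PMI matrix into word vectors,
    # along the lines of Levy & Goldberg's SPPMI-SVD baseline.
    import numpy as np

    def svd_embeddings(ppmi_matrix, dim=100):
        u, s, _ = np.linalg.svd(ppmi_matrix, full_matrices=False)
        # Keep the top `dim` singular dimensions; weighting by sqrt(s) is one
        # common choice, not the only one.
        return u[:, :dim] * np.sqrt(s[:dim])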


If anyone else is wondering what the heck "word embedding" means, it's a natural language processing technique.

Here's a nice blog post about it: http://sebastianruder.com/word-embeddings-1/

It can process something like this: king - man + woman = queen

Neat-o.
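
(In code, that analogy is just vector arithmetic plus a nearest-neighbour search over the learned embeddings; a toy sketch with gensim, where the vectors file is a placeholder:)

    # Toy sketch: the "king - man + woman" analogy over pretrained vectors
    # in word2vec text format.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("vectors.txt")  # placeholder path
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # With good vectors this typically prints something like [('queen', 0.7...)]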


Post starts off kind of dense. Finally we get to the section "Word embedding models" and I say "ah ha! here we'll get a concise definition." Cut to ...

Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.

Naturally.

(Thanks, I believe that it is a great blog post, but I might look elsewhere for an intro ... :)


Would love that kind of intro too, especially since I'm working on a product that could greatly benefit from NLP and neural networks.

Could a kind person provide a good reference for something we could learn from?

Or (as I fear) are we past that short window at the beginning of a technique/science (I'm thinking of computing here) when you can learn it without going through academic studies?


This is a fairly readable (high-level) post on word embeddings: http://colah.github.io/posts/2014-07-NLP-RNNs-Representation...


And this is one that explains everything OP needs to understand the confusing sentence from their intro:

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip...


Thanks :) I feel like some key concepts are a bit beyond my reach, but there are few enough of them that this can be a good starting point for googling around. Thanks for your help!


Great read, thanks!

I now understand what word vectors are and how they are useful (which, btw, allows me to make way more sense of the output of the program in this HN post :D ), plus more generalities about the field.

Anyone in the same position as me (wanting to learn the basics) should read this.


While it's dense and not necessarily for the mathematically faint of heart, I've learned a ton about NLP and ANNs through http://cs224d.stanford.edu/.


It's quite an investment of time, but this is a good answer to "do we have to do academic studies to learn this stuff?"

=> Maybe, but nowadays you don't have to quit your job and move close to a university to learn; you can just take online courses.

Thanks :)


Excellent blog post. Thanks for the link, I was going to ask about it.


Are there IP considerations? Word2vec is patented.


System and method for generating a relationship network - K Franks, CA Myers, RM Podowski - US Patent 7,987,191, 2011 - http://www.google.com/patents/US7987191


Would this really be usable in court? It seems super general to me, using a lot of common techniques. Silly question: is it infringement to use any part of the patent?


It is only infringement if you do something matching every part of some claim. There may be lots of stuff in the description, and that doesn't matter. That is, if a claim is a system comprising A, B, C, and D, and you do just A, B, and C, then you're fine.


Slightly off-topic, but I thought this would be a good place to ask.

Are there any word embedding tools which take a Lucene/Solr/ES index as input and output a synonyms file which can be used to improve search recall?


There are a few projects that use ES/Lucene as a backend/datastore once the feature engineering is done, but I don't see models operating on the native indexes directly; maybe the format is too different from one-hot (after turning off stemming/stopwords and other info-losing steps).

http://lucene.472066.n3.nabble.com/Where-Search-Meets-Machin...

https://news.ycombinator.com/item?id=11876542
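
One workaround is to export the analysed text from the index, train embeddings on it, and then write a synonyms file from nearest neighbours. A rough sketch (file names, parameters and the similarity cutoff are made up for the example):

    # Rough sketch: train word2vec on text exported from an index, then emit a
    # Solr/ES-style synonyms.txt from nearest neighbours above a threshold.
    from gensim.models import Word2Vec

    sentences = [line.split() for line in open("docs.txt")]  # pre-tokenized export
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)

    with open("synonyms.txt", "w") as out:
        for word in model.wv.index_to_key:
            neighbors = [w for w, score in model.wv.most_similar(word, topn=5)
                         if score > 0.7]  # arbitrary cutoff for the sketch
            if neighbors:
                out.write(word + " => " + ", ".join(neighbors) + "\n")

Whether the neighbours are actually usable as search synonyms (rather than just related terms) is the part that needs human review.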


Not quite about creating synonyms, but in the same area there is Semantic Vectors https://github.com/semanticvectors/semanticvectors.

It processes a Lucene index and creates an embedded representation of it. Then you can search over that representation for "semantic" matches.

Last time I checked, about a year ago, the embedded collection of documents was kept in memory and the search was implemented as a linear scan, so I suspect it can be slow on very large collections of documents.


Has anyone done any work on handling words that have overloaded meanings? Something like 'lead' has two really distinct uses. It's really multiple words that happen to be spelt the same.


Google "word sense induction" or "word sense disambiguation". Intuitively, distributional information of the same sort that is used to derive representations for different word types in W2V or LexVec is useful for distinguishing word senses. Two (noun) senses of lead, two senses of bat, etc. are pretty easy to distinguish on the basis of a bag of words (or syntactic features) around them. Other words are polysemous: they have multiple related senses (across the language, names for materials can be used as containers; animal name for the corresponding food--but with exceptions). For some high frequency words it's a crazy gradient combination of polysemy and homonymy: 'home' for can refer to 1) a place someone lives 2) the corresponding physical structure 3) where something resides (a more 'metaphorical' sense), among other things. Obviously an individual use of a word has a gradient relationship to these senses, and speakers differ regarding what they think the substructure is (polysemous or homonymous, hierarchical or not, etc.). I've been working in my PhD on a technique to figure this out, but people clearly use a lot of information that isn't available in language corpora alone (e.g. intuitive physics).


It's tricky because we don't have good ground truth on what different word senses there are. (WordNet is not the final answer, especially as it separates every metaphorical use of the same word into its own sense.)

My experience is that you can distinguish word senses, but it seems the data isn't good enough to improve anything but a task that specifically evaluates that same vocabulary of word senses.

I see a sibling comment with link to spaCy's sense2vec, which uses the coarsest possible senses -- one sense for nouns, one sense for verbs, one sense for proper nouns, etc. It's a start.
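
(The preprocessing behind that coarse approach is simple enough to sketch: tag each token and glue the tag onto the surface form before training, so "lead|NOUN" and "lead|VERB" become separate vocabulary items. This uses spaCy for tagging; the model name and separator are just illustrative.)

    # Sketch: sense2vec-style preprocessing - append a coarse POS tag to each
    # token so homographs with different tags get separate embeddings.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # illustrative model name

    def tag_corpus(lines):
        for doc in nlp.pipe(lines):
            yield [t.text.lower() + "|" + t.pos_ for t in doc if not t.is_space]

    # The tagged sentences can then be fed to any embedding trainer
    # (e.g. gensim's Word2Vec) exactly as plain tokens would be.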


Actually "lead" has many, many more than two distinct senses.

https://en.wiktionary.org/wiki/lead


Well, there is Sense2Vec: https://github.com/spacy-io/sense2vec


Sense2Vec can solve this one, but what if both meanings of the word have the same POS tag?


Reminds me of Chord [1], word2vec written in Chapel.

[1] https://github.com/briangu/chord


Well done, that's probably the least relevant use of "written in Go" in an HN headline I've seen. And there's some stiff competition for that title.


From the viewpoint of commercial applications I find this profoundly depressing.

When the state of the art for accuracy is 0.6 on some task, you are always going to be a bridesmaid and never a bride, but hey, you can get bragging rights because you did well on Kaggle.


That depends, to be honest. Depending on the task, 60% accuracy can be far better than guessing at random, and human performance may not be that great either. Combined with controls, heuristics and validation, these "weak" models can still be of great use in commercial settings.



