
How did you select words to compare? Did you have to try many poor combinations before selecting a "good" set?



I've played with word2vec, and it took nothing to get really interesting combinations.

The first thing I tried was computer : server :: phone : ? I didn't really have a great answer for that in my head before I ran it. Word2vec decided the closest match was "voicemail". It breaks down when you feed it total nonsense, but what would you expect it to do? :P
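For anyone who wants to poke at this themselves, the query is just vector arithmetic over pretrained vectors. A rough sketch with gensim (the vector file name is only a placeholder, any word2vec-format binary works):

    # assumes gensim and some pretrained word2vec binary; filename is a placeholder
    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # computer : server :: phone : ?
    # i.e. vector("server") - vector("computer") + vector("phone")
    print(vecs.most_similar(positive=["server", "phone"],
                            negative=["computer"], topn=3))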

I'm constantly super impressed by properties of the vectors.


Mikolov et al. (2013) [1] do a proper evaluation of this. E.g. they found that the skip-gram model has 50.0% accuracy on semantic analogy queries and 55.9% accuracy on syntactic queries.

word2vec comes with a data set that you can use to evaluate language models.

[1] http://arxiv.org/pdf/1301.3781.pdf
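If you want to reproduce numbers like these, the analogy test file that ships with word2vec (questions-words.txt) can be run against any set of vectors. A rough sketch with recent gensim versions, file paths being placeholders:

    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # overall accuracy plus per-section results
    # (capital-common-countries, gram1-adjective-to-adverb, ...)
    accuracy, sections = vecs.evaluate_word_analogies("questions-words.txt")
    print(accuracy)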


I would insist on a better dataset before really calling these "semantic analogies" (and don't just take my word for it: Chris Manning complained about exactly this in his recent NAACL talk).

The only semantics that it tests are "can you flip a gendered word to the other gender", which is so embedded in language that it's nearly syntax; and "can you remember factoids from Wikipedia infoboxes", a problem that you could solve exactly using DBPedia. Every single semantic analogy in the dataset is one of those two types.

The syntactic analogies are quite solid, though.


and "can you remember factoids from Wikipedia infoboxes",

That's a simplification. E.g. I have trained vectors on Wikipedia dumps without infoboxes, and queries such as Berlin - Deutschland + Frankreich work fine.

Of course, even without the infoboxes, Wikipedia is nice text in that it contains sentences such as 'Berlin is the capital of Germany'. So, indeed, that makes the typical factoid analogies easier.
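For reference, the training itself is not much code with recent gensim either; something along these lines, where the corpus path and parameters are only illustrative, not the exact setup I used:

    # sketch: corpus file is plain text, one sentence per line; parameters are illustrative
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("dewiki-plaintext.txt")
    model = Word2Vec(sentences, vector_size=300, window=5,
                     min_count=5, sg=1, workers=4)  # sg=1 -> skip-gram

    print(model.wv.most_similar(positive=["Berlin", "Frankreich"],
                                negative=["Deutschland"], topn=3))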

That said -- I am more interested in the syntactic properties :).


I didn't mean that you have to learn the data from Wikipedia infoboxes, just that that's a prominent place to find factoids.

It's a data source that you could consult to pass 99% of the "semantic analogy" evaluation with no machine learning at all, which is an indication that a stronger evaluation is needed.



