
How did you select words to compare? Did you have to try many poor combinations before selecting a "good" set?



I've played with word2vec, and it took nothing to get really interesting combinations.

The first thing I tried was computer : server :: phone : ? I didn't really have a great answer for that in my head before I ran it. Word2vec decided the closest match was "voicemail". It breaks down when you feed it total nonsense, but what would you expect it to do? :P
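For anyone who wants to poke at this themselves, the query is just vector arithmetic over pretrained vectors. A rough sketch with gensim (the vector file name is only a placeholder, any word2vec-format binary works):

    # assumes gensim and some pretrained word2vec binary; filename is a placeholder
    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # computer : server :: phone : ?
    # i.e. vector("server") - vector("computer") + vector("phone")
    print(vecs.most_similar(positive=["server", "phone"],
                            negative=["computer"], topn=3))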

I'm constantly super impressed by properties of the vectors.


Mikolov et al. (2013) [1] do a proper evaluation of this. E.g. they found that the skip-gram model has 50.0% accuracy on semantic analogy queries and 55.9% accuracy on syntactic queries.

word2vec comes with a data set that you can use to evaluate language models.

[1] http://arxiv.org/pdf/1301.3781.pdf
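If you want to reproduce numbers like these, the analogy test file that ships with word2vec (questions-words.txt) can be run against any set of vectors. A rough sketch with recent gensim versions, file paths being placeholders:

    from gensim.models import KeyedVectors

    vecs = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # overall accuracy plus per-section results
    # (capital-common-countries, gram1-adjective-to-adverb, ...)
    accuracy, sections = vecs.evaluate_word_analogies("questions-words.txt")
    print(accuracy)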


I would insist on a better dataset before really calling these "semantic analogies" (and don't just take my word for it: Chris Manning complained about exactly this in his recent NAACL talk).

The only semantics that it tests are "can you flip a gendered word to the other gender", which is so embedded in language that it's nearly syntax; and "can you remember factoids from Wikipedia infoboxes", a problem that you could solve exactly using DBPedia. Every single semantic analogy in the dataset is one of those two types.

The syntactic analogies are quite solid, though.


and "can you remember factoids from Wikipedia infoboxes",

That's a simplification. E.g. I have trained vectors on Wikipedia dumps without infoboxes, and queries such as Berlin - Deutschland + Frankreich work fine.

Of course, even without the infoboxes, Wikipedia is nice text in that it contains sentences such as 'Berlin is the capital of Germany'. So, indeed, that makes the typical factoid analogies easier.
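For reference, the training itself is not much code with recent gensim either; something along these lines, where the corpus path and parameters are only illustrative, not the exact setup I used:

    # sketch: corpus file is plain text, one sentence per line; parameters are illustrative
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("dewiki-plaintext.txt")
    model = Word2Vec(sentences, vector_size=300, window=5,
                     min_count=5, sg=1, workers=4)  # sg=1 -> skip-gram

    print(model.wv.most_similar(positive=["Berlin", "Frankreich"],
                                negative=["Deutschland"], topn=3))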

That said -- I am more interested in the syntactic properties :).


I didn't mean that you have to learn the data from Wikipedia infoboxes, just that that's a prominent place to find factoids.

It's a data source that you could consult to pass 99% of the "semantic analogy" evaluation with no machine learning at all, which is an indication that a stronger evaluation is needed.



