Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: A search engine that lets you 'add' words as vectors (insightdatascience.com)
148 points by jakek on Nov 12, 2013 | hide | past | favorite | 64 comments



Hmm. It's a lovely idea but I find the results uninspiring. (Which isn't surprising; it's a very difficult problem. But I was hoping to be amazed.) Here are the examples I tried (all of them, no cherrypicking):

daughter + male - female -> { The Eldest (book), Songwriter, Granddaughter }

(Hopeless; should have had "son" in there)

pc - microsoft + apple -> { Olynssis The Silver Color (Japanese book), Burger Time (arcade game), Phantasy Star (series of games) }

(Hopeless; should have had "Mac" in there)

violin - string + woodwind -> { clarinet, oboe, flute }

(OK)

mccartney - beatles + stones -> { Rolling Stone (magazine), carvedilol (pharmaceutical), stone (geological term) }

(Poor; should have had Jagger or Richards or something of the kind in the top few results)

sofa - large + small -> { relaxing, asleep, cupboard }

(Poor; I'd have hoped for "armchair" or something of the kind)


Some of this is that underlying model is insufficiently trained, but some of this is disambiguation. Disambiguation in text is a very, very, hard problem.

So some of your examples, when clarified, are a bit clearer: Paul McCartney - Beatles + Rolling Stones = Mick Jagger is in the 3rd spot. (http://www.thisplusthat.me/search/Paul%20McCartney%20-%20Bea...) Change stone -> Rolling Stones.

Thank you for the comment!


I agree disambiguation is a tough problem, but I'm curious to see whether the vector representation learned by word2vec can help in that regard. Have you considered using clustering to attach a few possible cluster memberships to each word (which would hopefully each capture information about a different usage context), then selecting the meanings for each word that give the maximum overlap?

The idea being that the words used to search on your website are likely to be conceptually similar (one doesn't add apples to oranges), so you can assume that "queen (sovereign) + man (noun) - woman" makes more sense that "queen (band) + man (unix command) - woman" because "man (noun)" and "woman" are likely to belong to similar clusters and not "man (unix command)".


Japan - Tokyo + History = Kyoto was what I was expecting but I guess the corpus isn't quite there yet. This was the answer .. http://en.wikipedia.org/wiki/Atpara_Upazila At least it's geographic!


Hitler -German +Italian = Sofiene Zaaboub apparently. Was hoping for Mussolini.


maybe just let the user choose the meaning from a drop down list? You can (probably) integrate freebase' suggest easily and use the FB api to get to the WP id.


Lisp + JVM = Clojure

I'm sold. This is really cool! (Though it's worth noting that a Google search with the same terms returns the exact same result...)


I tried Lisp + Java and got Objective C instead.


I was really excited by this writeup, so I tried it. Four test queries returned nothing that seemed useful or even relevant:

fluid dynamics + electromagnetism : expected magnetrohydrodynamics, got Maxwell’s equations and classical mechanics (not useful);

verse + 5 - rhyme : expected blank iambic pentameter, Shakespeare, etc.: got nonsense;

writer + American + Russian + Great - Nobel Prize : expected Nabokov, got Meirkhaim Gavrielov + 1 nonsense result;

plant + illegal - addictive : expected cannabis, chronic, etc; got “Plants” (thanks) and “Nuclear Weapon” (?!? ) and some Hungarian village.

EDIT: I thought maybe I wasn't being sufficiently imaginative, so I tried "Nixon + Clinton - JFK" and got nothing that looked interesting. Then I noticed that the "Nixon" part of my query was "disambiguated" to something like "non_film", and the word "Nixon" was just stripped out. I think this thing is just broken.


It's not perfect; the real limiting factor is the volume of text. For the research paper behind word2vec, to get accurate associations for common words (King, man, woman, etc.) required a news sources training text of order a billion words. This is roughly the size of all of Wikipedia, but the text in Wikipedia has many more rare words, which spreads the number of training examples per word rather thin. To address that, I'm thinking of expanding the project to use Common Crawl data (http://commoncrawl.org/) to dramatically expand the available supply of data.


I had the same experience. I haven't really gotten any convincing results except for the example queries verbatim.

Some queries I tried:

medicine - science + superstition

aquarium - wet + dry

mario - nintendo + sega

luigi - green + red

pocahontas - native americans + aliens


Hey juxtaposicion, fascinating work. I have many questions so I am just going to shoot them rapid fire.

What is the dimensionality of each word vector and what does a words position in this space "mean"? What is this dimensionality determined by? Have your tried any dimensionality reduction algorithms like PCA or Isomap? It would be interesting to find the word vectors that contain the most variation across all of wikipedia. Have you tried any other nearest neighbor search methods other than a simple dot product, such as locality sensitive hashing?

I guess most of those questions are about the word2vec algorithm, but you are probably in a good place to answer them after working with it. Anyways, cool work, and I am glad you did it in python so I can really dig in and understand it.


I'm not associated with the OP, but I have played with word2vec.

Word2vec can be viewed as dimensionality reduction. In some sense, the dimensionality of words is the same as the vocabulary size, if a one-hot encoding is used (which is most common). For word2vec specifically, the output vector dimension is a parameter specified during training, so you can choose it however you want. I suspect the OP chose something fairly large (>256) because of the performance boost.

Nearest neighbor methods work well with word2vec words: https://code.google.com/p/word2vec/#Word_clustering

The paper associated with word2vec is fairly easy to read and understand, even if you don't have a background in NLP or neural networks: http://arxiv.org/pdf/1301.3781.pdf


Hey doctoboggan, awesome questions!

>> What is the dimensionality of each word vector and what does a words position in this space "mean"? What is this dimensionality determined by?

Each dimension roughly is a new way that words can be similar or dissimilar. So I've got 1000-dimensional vectors, so words can be similar or dissimilar in only one thousand 'ways'. So associations like 'luxury', 'thoughtful', 'person', 'place' or 'object' are learned (roughly speaking). Of course, real words are far more diverse, so this is an approximation. The 1000 dimensions is configurable, and in theory more dimensions means more contrast is captured, but you need more training data. In practice, the number 1000 is chosen because that maxes out the size of my large memory machine. That said, the word2vec paper shows good results with 1000D, so it doesn't seem to be a bad choice.

>> Have your tried any dimensionality reduction algorithms like PCA or Isomap?

Yes! I've tried out PCA, and some spectral biclustering using the off-the-shelf algorithms in SciKits Learn. I only played around with this for an hour or so but got discouraging results. Nevertheless, the word2vec papers actually show that this works really well for projecting France, USA, Paris, DC, London, etc. on a two-dimensional plane where the axis roughly correspond to countried & capitals -- exactly what you'd hope for! I wasn't able to replicate that, but Tomas Mikolov was!

>> It would be interesting to find the word vectors that contain the most variation across all of wikipedia.

Hmm, interesting indeed! I'm not sure how I'd got about measuring 'variation' -- would this amount to isolating word clusters and finding the most dense ones? Something like finding a cluster with a hundred variations of the word 'snow' (if you're Inuit)? I'd be willing to part with the raw vector database if there's interest.

>> Have you tried any other nearest neighbor search methods other than a simple dot product, such as locality sensitive hashing?

Only a little bit, although I'm very interested in finding a faster approach than finding the whole damn dot product (see: https://news.ycombinator.com/item?id=6720359). I worry that traditional location sensitive hashes, kd-trees, and the like work well for 3D locations, but miserably for 1000D data like I have here.

I should reiterate out that most of the hard work revolves around the word2vec algorithm which I used but didn't write. It's awesome, check it and the papers out here: https://code.google.com/p/word2vec/

Whoa, that was alot. Thanks!



Build your kd-tree on the vectors expressed in the eigenvector basis. If the eigenvalues decrease fast enough, you can get bounds on the dot product while going only a few levels deep.


Don't most search engines use an inverted index to find the similarity between the query vector and the document vectors? (instead of doing the dot product with every document)

Maybe something like this could help: http://radimrehurek.com/gensim/similarities/docsim.html

Cool stuff, thanks.


I'd love access to the raw vector database if you feel like sending it my way.




I did 'Facebook - users' and got Myspace as the second result :)


Facebook - Exploitation + Privacy = Blog

Kinda makes sense.


I saw what you wrote about your dot product speed issue. Did you try using NumPy's einsum function? http://docs.scipy.org/doc/numpy/reference/generated/numpy.ei...

It's really fast for this kind of stuff. Happy to give details about how to use it if you need.


Hey SandB0x, thanks for the advice! I actually started off by using numpy.dot, which is precisely what's needed. The problem is that I need it to go even faster (currently takes a few seconds) but this function is already heavily optimized and uses Intel MKL to accelerate the math. In fact, my cython implementation would be slower than numpy.dot were it not for some embedded logic that breaks out of the dot product half way through calculation. As I'm computing the row * row, going element by element, if the running sum of those products gets to be really negative (indicating that in the dimensions multiplied so far, the two rows are highly dissimilar) I stop tryin to calculate the rest of the row, thereby saving me from calculating the full dot product. This is cheating obvioisly, but it's x3 or x4 faster numpy.dot. So, because there's branching logic in my implementation of the dot product, I can't express it interns of Einstein summations.


Interesting. I've played around with words as vectors (with values) and the cosign similarity algorithm (http://en.wikipedia.org/wiki/Cosine_similarity). This is very cool stuff. I wonder how they're doing it in real-time, it is heavy number crunching


Thanks! The number crunching is indeed very intense. The full body of vectors has a million rows and thousand columns, which fills up all of the 10GB of available memory. When you punch in a query, it adds or subtracts the requested vectors and takes an approximate dot product between every row in the table and the search query. This is about 10^9 operations. I'm only interested in the most similar (high cosine similarity) dot products, so to speed things up I wrote a Cython dot product implementation. This aborts the calculation if the sum starts to look like it'll be very dissimilar, essentially skipping lots of bad guesses. This speeds things up by a factor of ~5 or so. I'm debating offloading this computation to the GPU, which would be perfect for this.

Edit. In case you're interested in the source: https://github.com/cemoody/wizlang


Since the aim is accuracy rather than throughput, would more memory help?


would you mind pm me @gostermag


Interesting concept, but how will it work with more dynamic content? You can train the model on a fairly static corpus such as Wikipedia, but what if you content changes with a greater frequency?

Since MapReduce is used, perhaps the model is already being trained on small batches making incremental updates possible.


Hi, creator here (Chris Moody). Great question. The underlying algorithm, word2vec, (https://code.google.com/p/word2vec/) isn't built for streaming data which means that at the moment it assumes a fixed number of words from the beginning of the calculation. Unfortunately, until the state-of-the-art advances to accepting streaming data, the whole corpus will have to be rescanned to accept dynamic content. Furthermore, word2vec doesn't scale past OpenMP, single node, shared-memory resources. So while I used MapReduce, it's just for cleaning and preprocessing the text, not training the vectors, which is the hard part.

So there's some exciting work to be done in parallelizing and streaming the word2vec algorithm!


daft punk - repetitive + lyrics == La Roux

nice work!


I guess we all just need a little more LeAnn Rimes. http://www.thisplusthat.me/search/the%20world%20-%20violence...


Sounds like this paper from Google

http://www.technologyreview.com/view/519581

For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.



Does this relate to Latent Semantic Indexing? http://en.wikipedia.org/wiki/Latent_semantic_indexing


Kind of. LSI is a dimensionality-reduction method, which means it takes really sparse "bag of words" vectors and compresses them in a way which approximates the original structure. Word2vec gives each word a non sparse vector based on the hidden units of a neural network language model. The model is trained to predict the middle word from it's surroundings, which makes it learn how words are used in context.


Sounds like another go-around at 1990s (& early 2000s) concept search -- Excite, Northern Light, etc.

And it sounds really close to what I was trying at Elucidate.


Hey, nice work! Can you explain the "comma delimited list" functionality any more? It seems (awesomely) similar to a hack I did a while back with Word2Vec which would pick out the word which didn't belong in a list.

My hack: https://github.com/dhammack/Word2VecExample


Fun bug: handheld - sony + nintendo = {Wii, Wii, Snes}

I was hoping for the DS or Gameboy but expecting at least something handheld.


I would guess that the "handheld" keyword is big with the Wii because of the new control scheme it introduced which was the main (sometimes only) thing talked about with regard to the system


Try: psp - sony + nintendo

Result:

1. Gameboy Advance

2. Nintendo DS

3. Nintendo Gamecube


Interesting. Currently it generates garbage for lot of queries but, some stuff is kinda fun. Forrest Gump - comedy + romance gives pulp fiction (!), as good as it gets (match) and polar express (?) Avatar - action + comedy gives The Office (haha!)


I know people like to keep things positive, but this is completely useless. Apart from a few cherry picked examples, subtracting words makes no sense most of the time, and there is no clear advantage for their method when it comes to adding words.


This is neat, and I found a few queries that added interesting results. However, I tried

    Slavoj Žižek - Jacques Lacan - Hegel
which yielded an internal server error, probably due to the diacritics not being encoded properly.


Bug report: using some non-ascii characters crashes the server (for example é or É).


Thanks! I'll have to spend more time sanitizing the input.


Albert Einstein - Smart = Niels Bohr, Werner Heisenberg, Wolfgang Pauli

Ouch, that's cold


Neat stuff juxtaposicion.

Seems like this is how Numenta's AI works: http://www.youtube.com/watch?v=iNMbsvK8Q8Y


Stanford - American + Canadian = University of Toronto

I think it should be Waterloo.


Maybe because you thought CS and the algorithm thought some other field, say math.



Try this slight refinement...

The results were... Interesting.

http://www.thisplusthat.me/search/Dick%20Cheney%20-evil%20%2...


Fantastic work and is relevant to something we are working on in this space. Thanks.

On a lighter note I tried "sarah palin + sexy" and I got John Mccain, Hillary Clinton and Mitt Romney.


Also interesting to try something like:

sleep - sleep

http://www.thisplusthat.me/search/sleep%20-%20sleep



Hey this is pretty cool!

superman - male + female:

  - Lex Luthor (hmm..)
  - Superman's pal Jimmy Olsen (haha, what?)
  - Wonder Woman (That'll do it!)


Great writeup. Curious, are there clear advantages that the vector representation has over graph models (FB graph search, Google Knowledge graph)?


The default example "justin bieber - man + women" was ok, but I have found a better one - "justin bieber - women + man "


What are the use cases for this fancy feature? I'm thinking of e-advisor for fun, but what are the real life serious use cases?


ThisPlusThat.me - fast + slow...

Just kidding! :)

You could also say...

ThisPlusThat.me - another rant + something cool

Thanks for posting this, very interesting work!


I thought this one was good "Stanford - Red + Smart" = Berkeley


Server apparently wasn't ready for HN frontpage load


Pretty impressive for my first try.

iPad - cool -> Windows Phone


reddit - dumb

Expected: HN, Got: Digg




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: