After reading this I'm left wondering why anyone should stop using word2vec. The article makes the point that you can produce word vectors using other techniques, in this case by computing probabilities of unigrams and skip-grams and running SVD.
This is all well and good, but from an industry practitioner standpoint this doesn't explain why one would avoid using, or actually stop using word2vec.
1. Several known good word2vec implementations exist, the complexity of the technique doesn't really matter as you can just pick one of these and use it.
2. Pretrained word vectors produced from word2vec and newer algorithms exist for many languages.
Why should someone stop using these and instead spend time implementing a simpler method that produces maybe good enough vectors? Being a simpler method isn't a reason in and of itself.
I have the impression from some comments that the term "complexity" in a machine learning context is not well understood. It doesn't mean "complicated" from a human perspective at all, i.e. how difficult it is to grok the SVD algorithm vs. a neural network. It actually refers to the complexity of the hypothesis set, which is, intuitively, the number of independent parameters to be learned. What matters in ML is the relative complexity of the hypothesis set (usually very high in deep learning) versus the data complexity (sample size). If this relative complexity is too high, overfitting is likely (i.e. the noise is learned too), generalization is poor, and regularization or a simpler model is a must.
He's arguing that if you want to develop some custom word embeddings (trained on whatever data set you have that's specific to the task at hand), the SVD approach is often much faster and almost as effective. If you're willing to make that trade, the argument goes, go with SVD. If you're willing to use a pre-built library, that's certainly fastest of all!
> He's arguing that if you want to develop some custom word embeddings (trained on whatever data set you have that's specific to the task at hand), the SVD approach is often much faster
That depends on your corpus and vocabulary size. word2vec is O(n) where n is the length of the training corpus. Vanilla SVD is O(mn^2) for an m x n co-occurrence matrix. I have been training word2vec and GloVe embeddings on 'web-scale' corpora and training time is usually not a problem.
> the SVD approach is often much faster and almost as effective
SVD-derived embeddings do poorly on some tasks, such as analogy tasks (see Levy & Goldberg, 2014).
I agree with your parent poster that the author does not really provide good arguments against the use of word2vec et al.
Moreover,
> But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer
Which advances in our understanding? SVD on word-word co-occurrence matrices was proposed by Schütze in 1992 ;). There have been many works since then exploring various co-occurrence measures (including PMI) in combination with SVD.
Chris Moody usually delivers great blog posts; this big-worded rehash of an ancient 2014 technique (see
https://rare-technologies.com/making-sense-of-word2vec/ for a more comprehensive eval of SVD vs word2vec vs GloVe) is weird.
But on twitter, Chris promised a sequel post, with extra tricks and tips. So I see this as a warm up :)
Looking forward to part 2.
I thought exactly the same. That was a weird post from Chris, and he has already gone over SQL-based SVD as a w2v replacement many times in various talks.
Agreed. I actually think the SVD algorithm as it is implemented in most software packages is more complicated than the word2vec algorithm, which is actually quite simple.
Also, GloVe (which is one of the newer word2vec alternatives you mentioned) works pretty similarly to the approach in the blog post, but has some tweaks that make it perform better than this.
One other benefit of using word2vec-style training is that you can also control the learning rate, and gracefully handle new training data.
SVD must be done all at once, and you need to use sparse matrix abstractions for the raw word vectors. The implementation and abstractions you use actually make it more complex than word2vec, imho.
Word2vec can train off pretty much any type of sequence. You can adjust the learning rate on the fly (to emphasize earlier/later events), stop or start incremental training, and with Doc2Vec you can train embeddings for more abstract tokens in a much more straightforward manner (doc ids, user ids, etc.)
While word2vec embeddings are not always reproducible, they are much more stable under the addition of new training data. This is key if you want some stability in a production system over time.
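For what it's worth, a rough sketch of that incremental workflow in gensim (4.x argument names; the toy corpora and hyperparameters are placeholders, not recommendations):

    from gensim.models import Word2Vec

    initial_corpus = [["the", "cat", "sat", "on", "the", "mat"]] * 100
    new_corpus = [["a", "dog", "sat", "on", "the", "rug"]] * 100

    model = Word2Vec(sentences=initial_corpus, vector_size=50, window=5, min_count=1)

    # When new data arrives: extend the vocabulary and keep training,
    # optionally with a lower learning rate so existing vectors drift less.
    model.build_vocab(new_corpus, update=True)
    model.train(new_corpus, total_examples=len(new_corpus), epochs=5,
                start_alpha=0.01, end_alpha=0.0001)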
Also, somebody edited the title of the article, thanks! The original title of "Stop using word2vec" is click-bait FUD rubbish. I think in this case we're trying too hard to wring a good discussion out of a bad article.
Well, the fact that word2vec can (and should) be understood in terms of word co-occurrences (and pointwise mutual information) is important, but hardly new. I tried to explain it here: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html.
There is a temptation to use just the word pair counts, skipping SVD, but it won't yield the best results. Creating vectors not only compresses data, but also finds general patterns. This compression is super important for less frequent words (otherwise we get a lot of overfitting). See "Why do low dimensional embeddings work better than high-dimensional ones?" from http://www.offconvex.org/2016/02/14/word-embeddings-2/.
I read a blog post maybe a year ago that explained word embeddings under an 'atom interpretation', in which word embeddings are really sparse combinations of 'atoms' (I don't really remember more than that). It was very interesting, but then I forgot about it. Trying to find the post again I only came up with the above paper, which is probably the same idea, but it wasn't what I originally read. Wish I could find it.
Word embeddings are not just useful for text; they can be applied whenever you have relations between "tokens". You can use them to identify nodes in graphs that belong to the same group [0]. Another, in my opinion, really interesting idea is to apply them to relational databases [1]; you can simply ask for similar rows.
It's an interesting article, but the author didn't really provide good arguments for why I should stop using w2v.
You may be interested in Facebook's recent StarSpace[1] paper, which also shows how this simple "entity embeddings" approach can be used for different tasks.
> Word vectors are awesome but you don’t need a neural network – and definitely don’t need deep learning – to find them
Word2vec is not deep learning (the skip-gram algorithm is basically one matrix multiplication followed by a softmax; there isn't even a place for an activation function, so why would this be deep learning?), and it is simple and efficient. And most of all, there is no overhead to using word2vec, just a difference between pre-trained vectors or trained ones.
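To make that concrete, here is a toy sketch (my own illustration, not Mikolov's code) of the skip-gram forward pass without negative sampling; the sizes are arbitrary:

    import numpy as np

    vocab_size, dim = 1000, 100
    W_in = np.random.randn(vocab_size, dim) * 0.01    # input (word) embeddings
    W_out = np.random.randn(dim, vocab_size) * 0.01   # output (context) embeddings

    def skipgram_forward(center_word_id: int) -> np.ndarray:
        h = W_in[center_word_id]          # "hidden layer" is just a row lookup
        scores = h @ W_out                # one matrix multiplication
        e = np.exp(scores - scores.max())
        return e / e.sum()                # softmax over context words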
I don't understand what this article tries to say.
So, where do I get premade versions of this that include all the words of the 28 largest languages? This is one of the most valuable properties of word2vec and co: prebuilt versions exist for many languages, with every word of the dictionary in them.
Once you have that, we can talk about actually replacing word2vec and similar solutions.
Look, let's talk about replacing word2vec because it's old and not that good on its own. Everyone making word vectors, including this article, compares them to word2vec as a baseline because beating word2vec on evaluations is so easy. It's from 2013 and machine learning moves fast these days.
You can replace pre-trained word2vec in 12 languages (with aligned vectors!) with ConceptNet Numberbatch [1]. You can be sure it's better because of the SemEval 2017 results where it came out on top in 4 out of 5 languages and 15 out of 15 aligned language pairs [2]. (You will not find word2vec in this evaluation because it would have done poorly.)
If you want to bring your own corpus, at least update your training method to something like fastText [3], though I recommend looking at how to improve it using ConceptNet anyway, because distributional semantics alone will not get you to the state of the art.
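For reference, a minimal sketch with the fastText Python bindings; the corpus file name and hyperparameters are placeholders, not recommendations:

    import fasttext

    # Train skip-gram vectors with subword information on your own corpus.
    model = fasttext.train_unsupervised("my_corpus.txt", model="skipgram", dim=300)

    vec = model.get_word_vector("example")              # works even for OOV words via subwords
    neighbours = model.get_nearest_neighbors("example")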
Also: what pre-built word2vec are you using that actually contains valid word associations in many languages? Something trained on just Wikipedia? Al-Rfou's Polyglot? Have you ever actually tested it?
One thing that pre-trained word2vec and GloVe have going for them (and which is incorporated in ConceptNet Numberbatch) is that they're trained on broader corpora, Google News and the Common Crawl respectively. Of course that's what limits them to English, if you're not willing to use a knowledge graph to align things across languages.
Training on just Wikipedia is not a representative model of any language, because not all text is written like an encyclopedia.
If you read my comment carefully, I'm referring to "word2vec and co", meaning similar projects from other sources that use the same approach, and for which pre-trained vectors built from similarly large corpora exist.
My general criticism is that it’s easy to invent a new method, or popularize an existing method that’s superior. But that’s not the hard task. Building the algorithm is pretty easy.
Getting a corpus of colloquial, professional, and slang texts in dozens of languages, and enough computational power to actually train them all, is usually the hard part for us doing independent research, or working on open source projects without major corporate sponsors. The reason we use word2vec and similar solutions is not because of the approach it uses, but just because it provides a well-working complete package.
Hey, we tried (your) Numberbatch vectors (great work, BTW) for a multilingual task, and we had issues because the "words" are ConceptNet node names (e.g. http://conceptnet.io/c/en/natural_language). I understand why it is like this, but is there an approach others are using if they want to use it more like traditional word2vec vectors?
Don't know if you tried it on 16.09 when this was harder, but now you can just strip off the /c/en/ and the rest has exactly the same normalization as word2vec. Even including the weird thing with changing multiple digits to # signs.
There is also a Python function included that does the normalization. But it's pretty simple and if you're already using word2vec you're already doing it.
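For anyone else hitting this, something like the following does the trick (just a sketch of the idea, not the bundled ConceptNet function):

    def strip_conceptnet_prefix(uri, lang="en"):
        # Turn "/c/en/natural_language" into "natural_language" so the labels
        # line up with word2vec-style vocabularies.
        prefix = "/c/{}/".format(lang)
        return uri[len(prefix):] if uri.startswith(prefix) else uri

    assert strip_conceptnet_prefix("/c/en/natural_language") == "natural_language"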
Really? I'll check which version we were using, but the problem I'm referring to was vectors for concepts with underscores in them. We did work out removing /c/<lang>/.
word2vec yields better representations than PMI-SVD. If you want a better explicit PMI matrix factorization, have a look at https://github.com/alexandres/lexvec and the original paper, which explains why SVD performs poorly.
If you are looking for word embeddings for production use, check out fastText, LexVec, GloVe, or word2vec. Don't use the approach described in this article.
SVD scales with the number of items cubed, w2v scales linearly. Typical real-world vocabularies are 1-10M, not 10-100k. This article is FUD at best and, IMO, just plain BS.
In fact, if the author had actually read Mikolov's w2v NIPS paper (n.b., not the arXiv preprints!) he would have found interesting, insightful, and, particularly, sound arguments about when neural embeddings work better than count-based PMI - and when not! As with everything in the world, it's not black-and-white...
My understanding was that you can get some savings from keeping the sparse matrix and running sparse SVD via scipy.sparse.linalg.svds(PMI, k=256). I am not certain about the exact time complexity however.
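For illustration, the call looks roughly like this on a sparse PPMI matrix (the random matrix below is just a stand-in, and k=256 is arbitrary):

    import numpy as np
    from scipy import sparse
    from scipy.sparse.linalg import svds

    vocab_size, k = 5000, 256
    # Stand-in for a real sparse PPMI matrix (vocab x vocab).
    ppmi = sparse.random(vocab_size, vocab_size, density=0.01, format="csr", random_state=0)

    # Truncated SVD: only the top-k factors are computed, never a full decomposition.
    U, S, Vt = svds(ppmi, k=k)
    word_vectors = U * np.sqrt(S)   # a common convention: split the singular values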
Some minor space savings, maybe. But SVD runtime still scales with the cube of your vocabulary size. Good luck with SVD on a vocabulary from Wikipedia or Common Crawl. If anything, using traditional count-based approaches is good when you can only use a small corpus with a tiny vocabulary (<100k) to develop your word embeddings. But that's not what this article is proclaiming. Oh, and yeah, do use fastText, not the good old word2vec.
I don't think you are correct here. The advantage of using sparse storage and sparse matrix multiplication is that you can get savings in both storage and runtime. There would be no point in using sparse storage if runtime still scaled with the cube of vocab size. It would be that way if the best way of getting sparse SVD is by materializing a dense matrix product, but people have discovered smarter ways using sparse matvec. The time complexity for obtaining k eigenvalues seems to be O(dkn) where d is the average number of nonzeroes per row, and n is the vocab size [1]. Therefore, one can assert that sparse SVD too is linear in vocab size just like word2vec.
This is corroborated by the link elsewhere in this thread that shows SVD enjoying lower wall clock time on the 1.9B word Wikipedia dataset [2].
As to [1]: Yes, I was not honest in the sense that non-standard SVD implementations for generating your PMIs will scale with the square of |V|, not the cube. But as I will go on to show, that is not good enough to make count-based approaches competitive to predictive ones.
Re. [2], these measurements by Radim have several issues; First, word2vec is a poor implementation, CPU-usage wise, as can be seen by profiling word2vec (fastText is much better at using your CPUs). Second, even Radim states there that his SVD-based results are significantly poorer than the w2v embeddings ("the quality of both SPPMI and SPPMI-SVD models is atrocious"). Third, Radim's conclusion there is: "TL;DR: the word2vec implementation is still fine and state-of-the-art, you can continue using it :-)".
So I don't really get your points. Instead of referencing websites and blogs, let's take a deeper look at a "proponent" of count-based methods, in a peer-reviewed setting. In Levy & Goldberg's SPPMI model [1,2] they use truncated SVD. (FYI, that proposed model, SPPMI, is what got used in Radim's blog above.) So even if you wanted to use SPPMI instead of the sub-optimal SVD (alone), you would first have to find a really good implementation of that, i.e., something that is competitive with fastText.

Also note that Levy & Goldberg only used 2-word windows for SGNS in most comparisons, which makes the results for neural embeddings a bit dubious. You would typically use 5-10, and as shown in Table 5 of [2], SGNS is pretty much the winner in all cases as it "approaches" a 10-word window. Next, I would only trust Hill's SimLex as a proper evaluation target for word similarity - simply look at the raw data of the various evaluation datasets yourself and read Hill's explanations of why he created SimLex, and I am sure you will agree. "Coincidentally", it also is - by a huge margin - the most difficult dataset to get right (i.e., all approaches perform worst on SimLex). However, SGNS nearly always outperforms SVD/SPPMI on precisely that set.

Finally, even Levy et al. had to conclude: "Applying the traditional count-based methods to this setting [=large-scale corpora] proved technically challenging, as they consumed too much memory to be efficiently manipulated." So even if they "wanted" to conclude that SVD is just as good as neural embeddings, their own results (Table 5) and this statement lead us to a clearly different conclusion: if you use enough window size, you are better off with neural embeddings, particularly for large corpora. And this work only compares W2V & GloVe to SVD & SPPMI, while fastText in turn works a lot better than "vanilla" SGNS and GloVe. What I do agree with is that properly tuning neural embeddings is a bit of a black art, much like anything with the "neural" tag on it...
QED; This article is horseradish. Neural embeddings work significantly better than SVD, and SVD is significantly harder to scale to large corpora. Even if you use SPPMI or other tricks.
I don't understand what you mean by "non-standard SVD implementations." No SVD implementation is going to compute more singular vectors than you ask it to. It is neither cubic time, as you first said, nor quadratic time, as you now say. The dimension-dependence is linear.
ADDENDUM: To which I should add, to avoid more discussions, that parallel methods on dense matrices exist that essentially use a prefix sum approach and double the work, but thereby decrease the absolute running time [2]. However, as that exploits parallelism and requires dense matrices, that does not apply to this discussion.
Finally, to tie this discussion off, two truly official references that explicitly address the issue of runtime complexity.
In the best case, as determined by Halko et al., your rank-k approximation of an n x m term-document matrix is O(nmk), and randomized approximations get that down to O(nm log(k)) [1]. And, according to Rehurek's own investigations [2], those approximated eigenvectors are typically good enough. I.e., in both cases, the decomposition scales with the product of documents and words, not their sum. Therefore, this is clearly not a linear problem.
On top of that, when these inverted indices grow too large to be computed on a single machine, earlier methods required k passes over the data. These newer approaches [1,2] can make do with a single pass, meaning that the thing that indeed scales linearly here is the performance gain from distributing your SVD across a cluster with these newer approaches. Maybe this is the source of confusion for some commenters here.
To clarify the cubic vs. squared runtime complexity confusion I caused (sorry!): low-rank (to k ranks) SVD of an n x m matrix indeed scales as O(nmk), while the full SVD would be O(min(n^2 m, n m^2)), i.e., squared and cubed runtime, respectively, as per the references to the papers linked elsewhere in this comment branch replying to the OP.
This is a great, simple explanation of word vectors. However, I think the argument would have been stronger with numbers showing that this simplified method and word2vec are similarly accurate, as the author claims.
Asking as someone who barely has any clue in this field: is there a way to use this for full-text search, e.g. Lucene? I know from experience that for some languages (e.g. Hebrew) there are no good stemmers available out of the box, so can you easily build a stemmer/lemmatizer (or even something more powerful? [1]) on top of word2vec or fastText?
[1] E.g., for each word in a document or a search string, it would generate not just its base form, but also a list of top 3 base forms that are different, but similar in meaning to this word's base form (where the meaning is inferred based on context).
You can do all that and more: for example, to find lexical variations of a word, just compute word vectors for the corpus and then search for the vectors most similar to a root word that also contain the first letters (first 3 or 4 letters) of the root. It's almost perfect at finding not only legitimate variations, but also misspellings.
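A hedged sketch of that recipe with gensim (the vectors file and parameter values are placeholders):

    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def lexical_variants(root, topn=50, prefix_len=4):
        # Nearest neighbours that also share the root's first few letters:
        # inflections, derived forms, and common misspellings tend to surface here.
        prefix = root[:prefix_len]
        return [(w, score) for w, score in kv.most_similar(root, topn=topn)
                if w.startswith(prefix)]

    print(lexical_variants("running"))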
In general, if you want to search over millions of documents, use Annoy from Spotify. It can index millions of vectors (document vectors for this application) and find similar documents in logarithmic time, so you can search in large tables by fuzzy meaning.
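Roughly, the Annoy workflow looks like this (vector values, sizes, and the tree count are made up):

    import random
    from annoy import AnnoyIndex

    dim = 300
    index = AnnoyIndex(dim, "angular")               # angular ~ cosine distance
    for i in range(10000):                           # one vector per document
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
    index.build(10)                                  # more trees -> better recall, bigger index

    query = [random.gauss(0, 1) for _ in range(dim)]
    nearest_ids = index.get_nns_by_vector(query, 10) # approximate nearest neighbours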
One argument for SVD is the low reliability (as in results fluctuate with repeated experiments) of word2vec embeddings, which hampers (qualitative) interpretation of the resulting embedding spaces, see: http://www.aclweb.org/anthology/C/C16/C16-1262.pdf
Random projection methods are cheaper alternatives to SVD. For example, you can bin contexts with a hash function and count co-occurrences between words and binned contexts the same way this article does (see the sketch after this comment). Then apply weighting and SVD if you really want the top n principal components.
What's nice with counting methods is that you can simply add matrices from different collections of documents.
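A rough sketch of that hashing idea (bucket count, window size, and helper names are my own choices):

    import zlib
    import numpy as np
    from scipy import sparse

    def hashed_cooccurrence(docs, vocab, n_buckets=2**16, window=2):
        """Count (word, hashed-context) co-occurrences as a sparse matrix."""
        rows, cols = [], []
        for doc in docs:
            for i, w in enumerate(doc):
                if w not in vocab:
                    continue
                for c in doc[max(0, i - window):i] + doc[i + 1:i + 1 + window]:
                    rows.append(vocab[w])
                    # Bin the context word by a stable hash instead of a full vocabulary.
                    cols.append(zlib.crc32(c.encode("utf-8")) % n_buckets)
        data = np.ones(len(rows))
        return sparse.coo_matrix((data, (rows, cols)),
                                 shape=(len(vocab), n_buckets)).tocsr()

    # Matrices from different document collections can simply be added:
    # combined = hashed_cooccurrence(docs_a, vocab) + hashed_cooccurrence(docs_b, vocab)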
I must be missing something here - in step 3, PMI for x, y is calculated as:
log( P(x, y) / ( P(x) P(y) ) )
Because the skip-gram probabilities are sparse, P(x, y) is often going to be zero, so taking the log yields negative infinity. The result is a dense PMI matrix filled (mostly) with -Inf.
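The usual fix in count-based work (e.g. the positive PMI used by Levy & Goldberg) is to compute PMI only for observed pairs and clamp negative values at zero, which also keeps the matrix sparse. A minimal sketch, with my own function name:

    import numpy as np
    from scipy import sparse

    def ppmi(counts):
        """Positive PMI from a sparse co-occurrence count matrix."""
        total = counts.sum()
        row_sums = np.asarray(counts.sum(axis=1)).ravel()
        col_sums = np.asarray(counts.sum(axis=0)).ravel()
        coo = counts.tocoo()
        # PMI(x, y) = log( P(x, y) / (P(x) P(y)) ), computed only for observed pairs,
        # so log(0) never appears; negative values are then clamped to zero.
        pmi = np.log((coo.data * total) / (row_sums[coo.row] * col_sums[coo.col]))
        pmi = np.maximum(pmi, 0.0)
        return sparse.csr_matrix((pmi, (coo.row, coo.col)), shape=counts.shape)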
Also, word2vec is super fast and works great. The text has no convincing argument for why not to use it, unless you don't want to learn basic neural nets. Even then, just use Facebook's fastText: https://github.com/facebookresearch/fastText
I thought this article was going to talk about GloVe, which actually performs better than Google's word2vec without a neural network according to its paper, but I guess not.
Word2Vec is based on an approach from Lawrence Berkeley National Lab
"Google silently did something revolutionary on Thursday. It open sourced a tool called word2vec, prepackaged deep-learning software designed to understand the relationships between words with no human guidance. Just input a textual data set and let underlying predictive models get to work learning."
“This is a really, really, really big deal,” said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. “… It’s going to enable whole new classes of products that have never existed before.” https://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learn...
Lawrence Berkeley National Lab had been working on an approach more detailed than word2vec (in terms of how the vectors are structured) since 2005, judging by the bottom of their patent: http://www.google.com/patents/US7987191 The Berkeley Lab method also seems much more exhaustive, using a Fibonacci-based distance decay for proximity between words such that vectors contain up to thousands of scored and ranked feature attributes beyond the bag-of-words approach. They also use filters to control the context of the output. It was also made part of search/knowledge discovery tech that won the 2008 R&D 100 award http://newscenter.lbl.gov/news-releases/2008/07/09/berkeley-... & http://www2.lbl.gov/Science-Articles/Archive/sabl/2005/March...
We might combine these approaches, as there seems to be something fairly important happening in this area. Recommendations and sentiment analysis seem to be driving the bottom lines of companies today, including Amazon, Google, Netflix, Apple et al.
We don't need w2v precursors from 2005; we have more embeddings than we care to use, and we can use random embeddings and train them per project for even better results.
Professionals don't use pretrained word2vec vectors in the really complex deep learning models (like neural machine translation) anymore; they let the models train their own word embeddings directly, or let the models learn character-level embeddings.
What exactly do you mean by "they let the models train their own word embeddings"? Can you elaborate on this, or are there any current papers about this topic?
The embedding layer is the layer that converts the one-hot word feature into a continuous multi-dimensional vector that the deep net can learn with.
They used to pretrain that layer separately with word2vec. Now, as it's just a neural net layer, they let the translation model train it with backprop on the main task (translation / dialog / QA, etc.) as a regular layer.
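In code that is just another layer; a minimal PyTorch sketch (model shape and names are my own, and the commented-out line marks where a word2vec warm start would go):

    import torch
    import torch.nn as nn

    class TinyClassifier(nn.Module):
        def __init__(self, vocab_size, dim, n_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)   # learned from scratch by backprop
            self.out = nn.Linear(dim, n_classes)

        def forward(self, token_ids):
            # Average the token embeddings, then classify.
            return self.out(self.embed(token_ids).mean(dim=1))

    model = TinyClassifier(vocab_size=10000, dim=128, n_classes=2)
    # model.embed.weight.data.copy_(pretrained_vectors)  # optional warm start from word2vec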
I would say it depends on your task: if pre-trained ones perform better, you use pre-trained ones; otherwise, you let the model learn the embedding itself.
I find these baiting titles tiresome, and I generally assume (even if it makes an ass out of me) that the author is splitting hairs or wants to grandstand on some inefficiency that most of us knew was there already. (I'm assuming that if they had a real argument, it would have been in a descriptive title.) With titles like these I'll go straight to the comments section before giving you any ad revenue. This is HN, not Buzzfeed, and we deserve better than this.
Ok we've taken a crack at finding a representative sentence from the article to use above. If anyone can find a better (i.e. more accurate and neutral) title lurking in the article body somewhere, we can change it again.
You don't need libraries to train word vectors in all languages, you can just load precomputed vectors. In order to measure similarity between two vectors you only need 3 lines of code (it's a simple sum of products for a couple hundred real values).
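For example, cosine similarity between two precomputed vectors really is just a few lines (a numpy sketch):

    import numpy as np

    def cosine_similarity(a, b):
        # Sum of products, normalized by the vector lengths.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))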
This is excellent. This is the kind of machine learning we need, that provides understanding instead of "throw this NN at this lump of data and tweak parameters until the error is small enough."