Hi, creator here (Chris Moody).
Great question. The underlying algorithm, word2vec (https://code.google.com/p/word2vec/), isn't built for streaming data: it assumes a fixed vocabulary, determined at the start of the computation. Until the state of the art advances to handle streaming input, the whole corpus has to be rescanned to incorporate dynamic content. Furthermore, word2vec doesn't scale beyond OpenMP on a single shared-memory node. So while I did use MapReduce, it was only for cleaning and preprocessing the text, not for training the vectors, which is the hard part.
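To make the fixed-vocabulary constraint concrete, here's a minimal Python sketch. It isn't word2vec itself; the corpus, the vocabulary helper, and the 100-dimensional vectors are all illustrative. The point is that the embedding matrix is sized to the vocabulary up front, so an out-of-vocabulary word can't just be appended:

```python
# Minimal sketch (not the actual word2vec code) of why a fixed
# vocabulary forces a full rescan when new words arrive.
import numpy as np

def build_vocab(corpus):
    # word2vec scans the whole corpus once to fix the vocabulary up front.
    return {w: i for i, w in enumerate(sorted({w for doc in corpus for w in doc}))}

corpus = [["cats", "purr"], ["dogs", "bark"]]
vocab = build_vocab(corpus)

# One row per known word; the matrix shape is frozen for the rest of training.
vectors = np.random.rand(len(vocab), 100)

# A streaming update introduces an out-of-vocabulary word...
corpus.append(["ferrets", "dook"])
assert "ferrets" not in vocab

# ...which has no row in the matrix, so the only option is to rebuild
# the vocabulary and retrain over the entire corpus from scratch.
vocab = build_vocab(corpus)                 # full rescan
vectors = np.random.rand(len(vocab), 100)   # training starts over
```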
So there's some exciting work to be done in parallelizing and streaming the word2vec algorithm!