Fast word vectors with little memory usage in Python (github.com/thoughtriver)
111 points by jinqueeny on Sept 5, 2018 | 25 comments



This is interesting work. We at Plasticity (YC S17) open-sourced something similar a few months ago called Magnitude (https://github.com/plasticityai/magnitude) for querying vector embeddings quickly with low memory usage, using SQLite and a standard, universal file format (.magnitude) that works across word2vec, fastText, and GloVe. We also added features for out-of-vocabulary lookups, bulk queries, concatenating embeddings from multiple models, etc. It may also be of interest to folks looking at this.

We also have a paper on Magnitude that we will be presenting at EMNLP 2018, and ELMo support is coming soon!
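To give a flavor, querying looks roughly like this (a simplified sketch; the .magnitude file name is a placeholder, and the repo README is the authoritative reference for the API):

    from pymagnitude import Magnitude

    # Open a converted .magnitude file; vectors stay on disk in SQLite
    # and are paged in on demand rather than loaded up front.
    vectors = Magnitude("wiki-news-300d.magnitude")

    print(vectors.dim)                          # embedding dimensionality
    print(vectors.query("cat"))                 # single lookup
    print(vectors.query(["cat", "dog"]))        # bulk lookup
    print(vectors.most_similar("cat", topn=5))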


SQLite (and SQL in general) is ridiculously slow for specialized applications like this. http://www.lmdb.tech/bench/microbench/


It depends on what you are optimizing for. It's a little slower for random disk reads, but that doesn't matter much given we use an in-memory LRU cache, and most text follows a Zipf-ian distribution, so lookups hit the cache very often even with a small cache size (see my comment elsewhere in this thread).

Developers are more familiar with SQLite, it's bundled with Python, it's easy to add to nearly every language, and it has specialized features we use, like the Full-Text Search module for out-of-vocabulary lookups.
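The caching pattern is essentially this (a simplified sketch, not our actual code; the SQLite schema and cache size here are illustrative):

    import sqlite3
    from functools import lru_cache

    import numpy as np

    # Hypothetical layout: one table mapping a word to a float32 blob.
    conn = sqlite3.connect("vectors.db")

    @lru_cache(maxsize=10_000)   # small cache; Zipf's law means a handful
    def query(word):             # of frequent words absorb most lookups
        row = conn.execute(
            "SELECT vector FROM embeddings WHERE word = ?", (word,)
        ).fetchone()
        return np.frombuffer(row[0], dtype=np.float32) if row else None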


Wrappers for LMDB are available for nearly every language as well. https://symas.com/lmdb/technical/#wrappers


Oh wow, how did I not know about this :) Looks super useful.

How would ELMo work if a neural network needs to be run?


Thanks! Still working on it, but it will embed the ELMo model in the .magnitude file, wrap the ELMo model interface code so that it is standardized with Magnitude's features (querying, concatenating, iterating, etc.) and will use ELMo's OOV method natively.


Note also:

* reddit (Storage and retrieval of Word Embeddings in various databases): https://www.reddit.com/r/LanguageTechnology/comments/9cf0zk/...

* ...which links to this GitHub repo: https://github.com/martialblog/word_embedding_storage


I tried Plasticity's Magnitude (even submitted a PR*) before putting together this repo, but found it horrendously slow in comparison. It really does depend, though; since Magnitude uses an in-memory cache, if you're querying many of the same words repeatedly, it is likely to be faster.

EDIT: sorry, it wasn't a PR but an issue.


Note that the original word2vec binary format is extremely bad for low-memory use. It stores the words and vectors interleaved (and words are obviously of variable length). Of course, you could circumvent this problem by building a separate (sorted) index file.

However, newer formats, such as the fastText binary format, store the embedding matrix contiguously. In such formats you can just memory-map the embedding matrix and only load the vocabulary into memory. This is even simpler than the approach described here and in the Delft README: you don't have any serialization/deserialization overhead [1], you can let the OS decide how much to cache in memory, and you have one less dependency (mmap is in POSIX [2]).

[1] Of course, if you have a system with different endianness, you have to do byte swapping.

[2] http://pubs.opengroup.org/onlinepubs/7908799/xsh/mmap.html
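As a rough sketch of the idea (the file name, header size, and matrix shape are placeholders; a real loader reads them from the format's header):

    import numpy as np

    HEADER_BYTES = 0                  # format-dependent header/vocabulary size
    vocab_size, dim = 2_000_000, 300  # read from the header in practice

    # Map the contiguous float32 matrix straight from disk: nothing is
    # deserialized, and the OS page cache decides what stays in RAM.
    matrix = np.memmap("embeddings.bin", dtype=np.float32, mode="r",
                       offset=HEADER_BYTES, shape=(vocab_size, dim))

    word_to_row = {"the": 0, "of": 1}  # vocabulary loaded separately
    vector = matrix[word_to_row["the"]]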


Note that, as the article says, memory-mapping (mmap) vectors from disk already works in Gensim:

https://radimrehurek.com/gensim/models/keyedvectors.html
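For reference, the mmap path looks roughly like this (the file name is a placeholder; the vectors must first be saved in Gensim's native format):

    from gensim.models import KeyedVectors

    # kv.save("vectors.kv")  # done once, from an existing model
    kv = KeyedVectors.load("vectors.kv", mmap="r")  # mapped, not copied into RAM
    print(kv["cat"])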

The low memory footprint in LMDB is interesting. There were also some exciting papers recently that used compression (vector quantization) [0]. We're still evaluating whether to add those to Gensim [1].

[0] word2bits, https://twitter.com/RadimRehurek/status/976015607941517312

[1] word2bits PR, https://github.com/RaRe-Technologies/gensim/pull/2011#issuec...


Note that LMDB already uses mmap, and has the same advantages you list (no ser/deser, OS controls the cache).


But the embedding matrix will not be stored contiguously. Contiguous storage has benefits, especially when frequent words are stored next to each other (which happens in some implementations, because they sort the vocabulary by frequency), due to the page granularity of the mapping, etc. Also, in this particular implementation:

"By default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). However, it is very easy to use an alternative approach such as msgpack."

So, they use serialization/deserialization.
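So every lookup roughly does this (a generic sketch using the lmdb and pickle packages, not the repo's exact code; the key encoding and database path are assumptions):

    import pickle
    import lmdb

    env = lmdb.open("vectors.lmdb", readonly=True, lock=False)

    def query(word):
        with env.begin() as txn:
            raw = txn.get(word.encode("utf-8"))
        # The stored bytes still have to be unpickled on every lookup,
        # unlike indexing into a raw memory-mapped float matrix.
        return pickle.loads(raw) if raw is not None else None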



You can also speed up the loading of embeddings by using BPE (byte pair encoding) to segment words into a smaller dictionary of character n-grams, and learning n-gram embeddings instead of word embeddings.

You can replace a vocabulary of 500K words with 50K n-grams, and it also works on unseen words and agglutinative languages such as German. It's interesting that it can both join frequent words together and split infrequent words into pieces, depending on the distribution of characters. Another advantage is that the n-gram embedding table is much smaller, making it easy to deploy on resource-constrained systems such as mobile phones.

Neural Machine Translation of Rare Words with Subword Units

https://arxiv.org/abs/1508.07909

A Python library for BPE ngrams: sentencepiece

https://github.com/google/sentencepiece
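A minimal sketch of training and applying a BPE model with sentencepiece (the corpus path and vocabulary size are placeholders):

    import sentencepiece as spm

    # Train a 50k-piece BPE model on a raw text corpus.
    spm.SentencePieceTrainer.Train(
        "--input=corpus.txt --model_prefix=bpe --vocab_size=50000 --model_type=bpe"
    )

    sp = spm.SentencePieceProcessor()
    sp.Load("bpe.model")
    # Rare or unseen words fall apart into known subword pieces.
    print(sp.EncodeAsPieces("Donaudampfschifffahrtsgesellschaft"))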



German is not an agglutinative language.


I almost went with a mem-mapped approach for a version I wrote in Java that uses fork-join: https://github.com/spennihana/FasterWordEmbeddings

Edit: this work uses a different byte layout + a parallel reader that heaves the word vecs into memory as compressed byte arrays. Load time is seconds (haven't benchmarked with current SSDs). Memory footprint is on the order of the size of your word vecs (memory is cheap for me, but it could easily be extended to support mem-mapping if memory resources are scarce).


Could you please add a license to the source code? Thanks.


Hi, thanks for your interest. I have added a GPLv3 license now.

Thanks


Very nice work! :) Are there any benchmarks available? I'm curious how this compares to caching frequent word vectors (Zipf's law helps here) and disk-seeking the rest.


See my top-level reply about a similar library called Magnitude that we have built and open-sourced: https://github.com/plasticityai/magnitude.

It uses an in-memory LRU cache (with a configurable cache-size argument on the constructor) in front of SQLite-indexed disk seeks, so repeated queries are served from memory. And you're right: due to the Zipf-ian properties of most text, you see gains even with a small in-memory cache size :).


So far I've been using RocksDB for this use case. Is there any benchmark comparing the two databases?


We use RocksDB in production and research and have used LMDB for research purposes. Here's a benchmark:

https://www.influxdata.com/blog/benchmarking-leveldb-vs-rock...

I think RocksDB gives better combined read + write performance, whereas LMDB is heavily skewed toward read performance in our experience, which is mirrored by the benchmark.


It's not as clear-cut as that. RocksDB gives better write performance up to record sizes of about a quarter of a page; above that, LMDB's write performance is superior.

http://www.lmdb.tech/bench/ondisk/

Also, the InfluxDB blog post is missing quite a lot of the discussion that originally occurred here: https://disqus.com/home/discussion/influxdb/benchmarking_lev...


Awesome, did not know this, thanks!



