You can also speed up the loading of embeddings by using BPE (byte pair encoding) to segment words into a smaller dictionary of character ngrams, and learning ngram embeddings instead of word embeddings.

You can replace a list of 500K words with 50K ngrams, and it also works on unseen words and on agglutinative languages such as German. Interestingly, depending on the character distribution, it can both join frequent words into single tokens and split infrequent words into pieces. Another advantage is that the ngram embedding table is much smaller, making it easy to deploy on resource-constrained systems such as mobile phones.
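
To make that concrete, here is a minimal sketch of the BPE merge loop from the Sennrich et al. paper linked below: count adjacent symbol pairs weighted by word frequency and repeatedly fuse the most frequent pair. The function name and toy corpus are my own, not from the paper.

    from collections import Counter

    def learn_bpe(word_freqs, num_merges):
        """Repeatedly fuse the most frequent adjacent symbol pair."""
        vocab = {tuple(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            pair_counts = Counter()
            for symbols, freq in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pair_counts[pair] += freq
            if not pair_counts:
                break
            best = max(pair_counts, key=pair_counts.get)
            merges.append(best)
            new_vocab = {}
            for symbols, freq in vocab.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_vocab[tuple(out)] = freq
            vocab = new_vocab
        return merges, vocab

    # Toy corpus: "low" is frequent, so its characters fuse into one token;
    # the rare "lowest" stays split into reusable pieces.
    merges, vocab = learn_bpe({"low": 10, "lower": 4, "lowest": 1}, num_merges=4)
    print(merges)  # [('l','o'), ('lo','w'), ('low','e'), ('lowe','r')]
    print(vocab)   # {('low',): 10, ('lower',): 4, ('lowe','s','t'): 1}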

Neural Machine Translation of Rare Words with Subword Units

https://arxiv.org/abs/1508.07909

A library (with Python bindings) that implements BPE segmentation: SentencePiece

https://github.com/google/sentencepiece
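
For example, a minimal usage sketch with its Python bindings; the corpus file name, vocab size, and test word are placeholders of mine:

    import sentencepiece as spm

    # Train a BPE model with a 50K-piece vocabulary on a raw text corpus.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="bpe", vocab_size=50000, model_type="bpe"
    )

    sp = spm.SentencePieceProcessor(model_file="bpe.model")
    # Unseen compounds decompose into known subword pieces.
    print(sp.encode("Donaudampfschiff", out_type=str))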


German is not an agglutinative language.
