You can also speed up the loading of embeddings by using BPE (byte pair encoding) to segment words into a smaller dictionary of character ngrams, and learning ngram embeddings instead of word embeddings.

You can replace a list of 500K words with 50K ngrams, and it also works on unseen words and on agglutinative languages such as German. Interestingly, depending on the character distribution, it can both join frequent words into single tokens and split infrequent words into pieces. Another advantage is that the ngram embedding table is much smaller, making it easy to deploy on resource-constrained systems such as mobile phones.
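
To make that concrete, here is a minimal sketch of the BPE merge loop from the Sennrich et al. paper linked below: count adjacent symbol pairs weighted by word frequency and repeatedly fuse the most frequent pair. The function name and toy corpus are my own, not from the paper.

    from collections import Counter

    def learn_bpe(word_freqs, num_merges):
        """Repeatedly fuse the most frequent adjacent symbol pair."""
        vocab = {tuple(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            pair_counts = Counter()
            for symbols, freq in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pair_counts[pair] += freq
            if not pair_counts:
                break
            best = max(pair_counts, key=pair_counts.get)
            merges.append(best)
            new_vocab = {}
            for symbols, freq in vocab.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_vocab[tuple(out)] = freq
            vocab = new_vocab
        return merges, vocab

    # Toy corpus: "low" is frequent, so its characters fuse into one token;
    # the rare "lowest" stays split into reusable pieces.
    merges, vocab = learn_bpe({"low": 10, "lower": 4, "lowest": 1}, num_merges=4)
    print(merges)  # [('l','o'), ('lo','w'), ('low','e'), ('lowe','r')]
    print(vocab)   # {('low',): 10, ('lower',): 4, ('lowe','s','t'): 1}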

Neural Machine Translation of Rare Words with Subword Units

https://arxiv.org/abs/1508.07909

A library (with Python bindings) that implements BPE segmentation: SentencePiece

https://github.com/google/sentencepiece
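
For example, a minimal usage sketch with its Python bindings; the corpus file name, vocab size, and test word are placeholders of mine:

    import sentencepiece as spm

    # Train a BPE model with a 50K-piece vocabulary on a raw text corpus.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="bpe", vocab_size=50000, model_type="bpe"
    )

    sp = spm.SentencePieceProcessor(model_file="bpe.model")
    # Unseen compounds decompose into known subword pieces.
    print(sp.encode("Donaudampfschiff", out_type=str))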


German is not an agglutinative language.
