Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?
e.g. instead of the tokens ['i', 'am', 'beautiful'], having the tokens ['i am', 'beautiful'], on the premise that 'i am' is a common byte sequence that acts as a semantic token identifying a 'property of self'?
Or, taking that further, having much larger tokens derived from a statistical analysis of common phrases of roughly five words?
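To make the "statistical analysis of common phrases" idea concrete, here is a minimal sketch of scoring word pairs by pointwise mutual information and merging the strongly associated ones into single multi-word tokens. The function name, thresholds, and the `_` joining convention are all illustrative, not from any particular library:

```python
from collections import Counter
from math import log

def merge_common_bigrams(corpus, min_count=5, pmi_threshold=3.0):
    """Merge frequent, strongly associated word pairs into single tokens.

    corpus: list of token lists, e.g. [["i", "am", "beautiful"], ...]
    Returns a new corpus in which high-PMI pairs like ("i", "am")
    are rewritten as single tokens such as "i_am".
    """
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(
        (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
    )
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    # Score each candidate pair by pointwise mutual information:
    # PMI(a, b) = log( P(a, b) / (P(a) * P(b)) )
    merges = {
        (a, b)
        for (a, b), c in bigrams.items()
        if c >= min_count
        and log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        >= pmi_threshold
    }

    # Greedily rewrite each sentence, joining merged pairs with "_".
    merged_corpus = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in merges:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus
```

This is roughly what collocation/phrase-detection tools such as gensim's `Phrases` model do; applying the pass repeatedly yields progressively longer multi-word tokens, at the cost of a rapidly growing vocabulary.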
I think what you're describing amounts to applying a kind of low-rank decomposition to the vocabulary embeddings. A quick Google Scholar search suggests this can be useful in the context of multilingual tokenization.
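As a rough illustration of why low-rank decomposition matters here, below is a minimal sketch of a factorized embedding table (the same general idea as ALBERT's factorized embedding parameterization). The vocabulary size, rank, and variable names are made-up assumptions; a very large "phrase" vocabulary becomes far cheaper when the table is factorized:

```python
import numpy as np

# Hypothetical sizes: a large multi-word vocabulary, a small factorized rank.
vocab_size, hidden_dim, rank = 500_000, 768, 128

# A full table would be vocab_size * hidden_dim parameters (~384M here).
# The factorized version is vocab_size * rank + rank * hidden_dim (~64M here).
token_to_rank = (np.random.randn(vocab_size, rank) * 0.02).astype(np.float32)
rank_to_hidden = (np.random.randn(rank, hidden_dim) * 0.02).astype(np.float32)

def embed(token_ids):
    """Look up low-rank embeddings and project them up to the model width."""
    return token_to_rank[token_ids] @ rank_to_hidden

print(embed(np.array([1, 42, 7])).shape)  # (3, 768)
```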