
Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?

e.g. instead of the tokens ['i', 'am', 'beautiful'], having the tokens ['I am', 'beautiful'], on the premise that 'I am' is a common enough byte sequence to serve as a semantic token identifying a 'property of self'?

Or, taking that further, having much larger tokens based on statistical analysis of common phrases of ~5 words or so? A rough sketch of what I mean is below.
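To make the question concrete, here is a minimal sketch of the statistical-phrase idea: score word pairs by pointwise mutual information (PMI) and glue "sticky" pairs like 'I am' into single tokens before normal tokenization. The corpus, thresholds, and the function name merge_common_bigrams are purely illustrative, not any existing library's API.

    # Find statistically "sticky" word pairs via PMI and merge them into
    # single multi-word tokens. Toy corpus and thresholds for illustration.
    import math
    from collections import Counter

    def merge_common_bigrams(sentences, min_count=2, pmi_threshold=1.0):
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            words = sent.lower().split()
            unigrams.update(words)
            bigrams.update(zip(words, words[1:]))

        total_uni = sum(unigrams.values())
        total_bi = sum(bigrams.values())

        # Keep pairs that co-occur far more often than chance would predict.
        merges = set()
        for (a, b), n in bigrams.items():
            if n < min_count:
                continue
            pmi = math.log((n / total_bi) /
                           ((unigrams[a] / total_uni) * (unigrams[b] / total_uni)))
            if pmi >= pmi_threshold:
                merges.add((a, b))

        # Re-tokenize, joining merged pairs into one token ("i_am").
        merged_sentences = []
        for sent in sentences:
            words = sent.lower().split()
            out, i = [], 0
            while i < len(words):
                if i + 1 < len(words) and (words[i], words[i + 1]) in merges:
                    out.append(words[i] + "_" + words[i + 1])
                    i += 2
                else:
                    out.append(words[i])
                    i += 1
            merged_sentences.append(out)
        return merged_sentences, merges

    corpus = ["i am happy", "i am beautiful", "you are beautiful", "i am tired"]
    tokens, merges = merge_common_bigrams(corpus)
    print(merges)  # {('i', 'am')} with this toy corpus and these thresholds
    print(tokens)  # [['i_am', 'happy'], ['i_am', 'beautiful'], ...]

Extending the same scoring to ~5-word phrases is straightforward in principle, but the candidate set grows very quickly (see the vocabulary-size point raised below).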




I think what you're describing amounts to applying a kind of low-rank decomposition to the vocabulary embeddings. A quick search on Google Scholar suggests this can be useful in the context of multilingual tokenization.
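For reference, here is a minimal sketch of what a low-rank (factorized) vocabulary embedding can look like, in the spirit of ALBERT-style factorized embedding parameterization. The class name, dimensions, and parameter counts are illustrative assumptions, not taken from any particular paper's code.

    # Low-rank ("factorized") vocabulary embedding: instead of one full V x H
    # table, learn a small V x E table and project E -> H, cutting parameters
    # from V*H to V*E + E*H when E << H.
    import torch
    import torch.nn as nn

    class FactorizedEmbedding(nn.Module):
        def __init__(self, vocab_size: int, bottleneck_dim: int, hidden_dim: int):
            super().__init__()
            self.lookup = nn.Embedding(vocab_size, bottleneck_dim)            # V x E
            self.project = nn.Linear(bottleneck_dim, hidden_dim, bias=False)  # E x H

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            return self.project(self.lookup(token_ids))

    # A 250k multilingual vocabulary at H=1024 costs 256M parameters as one
    # table; with E=128 the factorized version costs about 32M + 0.13M.
    emb = FactorizedEmbedding(vocab_size=250_000, bottleneck_dim=128, hidden_dim=1024)
    print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 1024])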



Much larger tokens require a much larger token vocabulary.
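A quick way to see why: the number of distinct multi-word units in a corpus grows much faster with phrase length than the number of distinct words, so promoting common ~5-word phrases to tokens forces a huge embedding table. A rough sketch, assuming any plain-text corpus ("corpus.txt" is just a placeholder path):

    # Count distinct n-grams for n = 1..5 to illustrate how the candidate
    # "token vocabulary" explodes as phrases get longer.
    from collections import Counter

    def distinct_ngrams(words, n):
        return len(Counter(zip(*(words[i:] for i in range(n)))))

    with open("corpus.txt", encoding="utf-8") as f:
        words = f.read().lower().split()

    for n in range(1, 6):
        print(f"distinct {n}-grams: {distinct_ngrams(words, n):,}")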



