Has anyone worked on making tokens 'clusters of words with specific semantic meaning'?
e.g. instead of the tokens ['i', 'am', 'beautiful'], having the tokens ['i am', 'beautiful'], on the premise that 'i am' is a common byte sequence that acts as a semantic token identifying a 'property of self'?
Or, taking that further, having much larger tokens derived from a statistical analysis of common phrases of roughly five words?
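To make the "statistical analysis of common phrases" idea concrete, here is a minimal sketch of scoring word pairs by pointwise mutual information and merging the strongly associated ones into single multi-word tokens. The function name, thresholds, and the `_` joining convention are all illustrative, not from any particular library:

```python
from collections import Counter
from math import log

def merge_common_bigrams(corpus, min_count=5, pmi_threshold=3.0):
    """Merge frequent, strongly associated word pairs into single tokens.

    corpus: list of token lists, e.g. [["i", "am", "beautiful"], ...]
    Returns a new corpus in which high-PMI pairs like ("i", "am")
    are rewritten as single tokens such as "i_am".
    """
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(
        (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
    )
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    # Score each candidate pair by pointwise mutual information:
    # PMI(a, b) = log( P(a, b) / (P(a) * P(b)) )
    merges = {
        (a, b)
        for (a, b), c in bigrams.items()
        if c >= min_count
        and log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        >= pmi_threshold
    }

    # Greedily rewrite each sentence, joining merged pairs with "_".
    merged_corpus = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in merges:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus
```

This is roughly what collocation/phrase-detection tools such as gensim's `Phrases` model do; applying the pass repeatedly yields progressively longer multi-word tokens, at the cost of a rapidly growing vocabulary.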
I think what you're describing amounts to applying a kind of low-rank decomposition to the vocabulary embeddings. A quick Google Scholar search suggests this can be useful in the context of multilingual tokenization.
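As a rough illustration of why low-rank decomposition matters here, below is a minimal sketch of a factorized embedding table (the same general idea as ALBERT's factorized embedding parameterization). The vocabulary size, rank, and variable names are made-up assumptions; a very large "phrase" vocabulary becomes far cheaper when the table is factorized:

```python
import numpy as np

# Hypothetical sizes: a large multi-word vocabulary, a small factorized rank.
vocab_size, hidden_dim, rank = 500_000, 768, 128

# A full table would be vocab_size * hidden_dim parameters (~384M here).
# The factorized version is vocab_size * rank + rank * hidden_dim (~64M here).
token_to_rank = (np.random.randn(vocab_size, rank) * 0.02).astype(np.float32)
rank_to_hidden = (np.random.randn(rank, hidden_dim) * 0.02).astype(np.float32)

def embed(token_ids):
    """Look up low-rank embeddings and project them up to the model width."""
    return token_to_rank[token_ids] @ rank_to_hidden

print(embed(np.array([1, 42, 7])).shape)  # (3, 768)
```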