
Why don’t they also have single letters as tokens?



They do, e.g. "gvqbkpwz" is tokenised into individual characters. Actually it was a bit tricky to construct that, since I needed to find letter combinations that are very low probability in the tokeniser's training text (e.g. "gv").

So notice it doesn't contain any vowels, since almost all consonant-vowel pairs are sufficiently frequent in the training text as to be tokenised at least as a pair. E.g. "guq" is tokenised as "gu" + "q", since "gu" is common enough.

(Compare "gun" which is just tokenised as a single token "gun", as it's common enough in the training set as a word on its own, so it doesn't need to tokenise it as "gu"+"n".)

The only exceptions I found, i.e. consonant-vowel pairs that get split into two single-character tokens, were ones like "qe", tokenised as "q" + "e", or "qo" as "q" + "o". Which I guess makes sense, given these will be low-frequency pairings in the training text -- compare "qu", which is tokenised as the single token "qu".

(Though I didn't test all consonant-vowel pairs, so there may be more).
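
If anyone wants to poke at this themselves, here's a minimal sketch using OpenAI's tiktoken library. This assumes the cl100k_base encoding; which strings come out as single tokens depends on which tokeniser you load, so the exact splits above may not reproduce:

    # pip install tiktoken
    import tiktoken

    # cl100k_base is one of OpenAI's BPE encodings; other encodings split differently
    enc = tiktoken.get_encoding("cl100k_base")

    for s in ["gvqbkpwz", "guq", "gun", "qe", "qo", "qu"]:
        token_ids = enc.encode(s)
        # turn each token id back into the piece of text it covers
        pieces = [enc.decode_single_token_bytes(t).decode("utf-8", "replace") for t in token_ids]
        print(s, "->", pieces)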


My wild guess is that if the model could get things done by tokenising like that all the time, they wouldn't need to also have word-like tokens.

Whether that's an inference-time performance issue, a training-time performance issue, a model-size issue, or just total nonsense, I wouldn't know.
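
One concrete way to see the trade-off: tokenising character by character makes every sequence several times longer, and transformer inference cost grows with sequence length. A rough comparison, again assuming tiktoken and the cl100k_base encoding:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Why don't they also have single letters as tokens?"

    bpe_tokens = len(enc.encode(text))   # tokens under the learned BPE vocabulary
    char_tokens = len(text)              # tokens if every character were its own token

    print("BPE:", bpe_tokens, "characters:", char_tokens)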



