
Why don’t they also have single letters as tokens?



They do, e.g. "gvqbkpwz" is tokenised into individual characters. Actually it was a bit tricky to construct that, since I needed to find letter combinations that are very low probability in the tokeniser's training text (e.g. "gv").

So notice it doesn't contain any vowels, since almost all consonant-vowel pairs are sufficiently frequent in the training text as to be tokenised at least as a pair. E.g. "guq" is tokenised as "gu" + "q", since "gu" is common enough.

(Compare "gun" which is just tokenised as a single token "gun", as it's common enough in the training set as a word on its own, so it doesn't need to tokenise it as "gu"+"n".)

The only exceptions I found, i.e. consonant-vowel pairs that get split into two single-character tokens, were ones like "qe", tokenised as "q" + "e", or "qo" as "q" + "o". Which I guess makes sense, given these will be low-frequency pairings in the training text -- compare "qu", which is tokenised as the single token "qu".

(Though I didn't test all consonant-vowel pairs, so there may be more).
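
If anyone wants to poke at this themselves, here's a minimal sketch using OpenAI's tiktoken library. This assumes the cl100k_base encoding; which strings come out as single tokens depends on which tokeniser you load, so the exact splits above may not reproduce:

    # pip install tiktoken
    import tiktoken

    # cl100k_base is one of OpenAI's BPE encodings; other encodings split differently
    enc = tiktoken.get_encoding("cl100k_base")

    for s in ["gvqbkpwz", "guq", "gun", "qe", "qo", "qu"]:
        token_ids = enc.encode(s)
        # turn each token id back into the piece of text it covers
        pieces = [enc.decode_single_token_bytes(t).decode("utf-8", "replace") for t in token_ids]
        print(s, "->", pieces)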


My wild guess is that if the model could get things done by tokenising like that all the time, they wouldn't need to also have word-like tokens.

Whether that's an inference-time performance issue, a training-time performance issue, a model-size issue, or just total nonsense, I wouldn't know.
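
One concrete way to see the trade-off: tokenising character by character makes every sequence several times longer, and transformer inference cost grows with sequence length. A rough comparison, again assuming tiktoken and the cl100k_base encoding:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Why don't they also have single letters as tokens?"

    bpe_tokens = len(enc.encode(text))   # tokens under the learned BPE vocabulary
    char_tokens = len(text)              # tokens if every character were its own token

    print("BPE:", bpe_tokens, "characters:", char_tokens)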



