The article keeps referring to something called BPE (byte pair encoding) without defining it anywhere. Chasing links, I see it's about tokenization and how it's "weird".
> BPE tries to be efficient, so it doesn’t waste token slots on spaces if it doesn’t have to. A word is almost always preceded by a space, so instead of representing “ example text” as four tokens (space, “example,” space, “text”), it represents it as two:
> [(' Example', 17934), (' text', 2420)]
Apparently this is a problem because, depending on whether your prompt starts with a space, you feed the model different tokens and get different results.
> So far, seems innocuous, right? But what if you’re feeding a prompt into GPT-2? Unless you’re hip to this particular issue, you’ll probably type in something like
> “Example text”
> which becomes
> [('Example', 16281), (' text', 2420)]
> Compare this to the one above. Yes – instead of token #17934, with the preceding space, I’ve unwittingly fed in token #16281, without a preceding space.
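The difference is easy to reproduce. Here's a minimal sketch, assuming the Hugging Face `transformers` GPT-2 tokenizer (the article doesn't say which wrapper it used, but the underlying BPE vocabulary is the standard GPT-2 one):

```python
# Sketch: show that a leading space changes which token IDs GPT-2's BPE produces.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

for text in ["Example text", " Example text"]:
    ids = tok.encode(text)
    # GPT-2's BPE folds a leading space into the token itself (rendered as
    # "Ġ" in the token string), so "Example" and " Example" get different IDs.
    print(repr(text), list(zip(tok.convert_ids_to_tokens(ids), ids)))

# The article reports ('Example', 16281) for the space-less variant and
# (' Example', 17934) with the leading space; " text" is 2420 either way.
```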
Maybe someone should figure out how to tokenize text with a neural network in a more natural way, since the tokenizer seems to be hand-crafted and is essentially a hyperparameter that cannot be optimized with gradient descent.