The article keeps referring to something called BPE (byte pair encoding) without defining it anywhere. Chasing links, I see it's about tokenization and how it's "weird".
> BPE tries to be efficient, so it doesn’t waste token slots on spaces if it doesn’t have to. A word is almost always preceded by a space, so instead of representing “ example text” as four tokens (space, “example,” space, “text”), it represents it as two:
> [(' Example', 17934), (' text', 2420)]
Apparently this is a problem because, depending on whether your prompt starts with a space, you feed the model different tokens and get different results.
> So far, seems innocuous, right? But what if you’re feeding a prompt into GPT-2? Unless you’re hip to this particular issue, you’ll probably type in something like
> “Example text”
> which becomes
> [('Example', 16281), (' text', 2420)]
> Compare this to the one above. Yes – instead of token #17934, with the preceding space, I’ve unwittingly fed in token #16281, without a preceding space.
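The difference is easy to reproduce. Here's a minimal sketch, assuming the Hugging Face `transformers` GPT-2 tokenizer (the article doesn't say which wrapper it used, but the underlying BPE vocabulary is the standard GPT-2 one):

```python
# Sketch: show that a leading space changes which token IDs GPT-2's BPE produces.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

for text in ["Example text", " Example text"]:
    ids = tok.encode(text)
    # GPT-2's BPE folds a leading space into the token itself (rendered as
    # "Ġ" in the token string), so "Example" and " Example" get different IDs.
    print(repr(text), list(zip(tok.convert_ids_to_tokens(ids), ids)))

# The article reports ('Example', 16281) for the space-less variant and
# (' Example', 17934) with the leading space; " text" is 2420 either way.
```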
Maybe someone should figure out how to tokenize text with a neural network in a more natural way, since the tokenizer seems to be hand-crafted and is essentially a hyperparameter that cannot be optimized with gradient descent.