Does that imply they retrained the foundation model from scratch? I thought changing the tokenization was something you couldn't really retrofit to an existing model. I mean sure they might have initialized the weights from the prior GPT-4 model but it'd still require a lot of retraining.
For reference, GPT-3.5/4's tokenizer had a vocabulary of about 100k tokens. The benefit of a larger vocabulary is more efficient tokenization (and therefore cheaper/faster inference), but with massive diminishing returns: a larger vocabulary makes the model more difficult to train and tends to reduce token usage by only 10-15%.
Yep. Non-English text gets a much bigger cost drop and speedup than English does. It has always been a bummer that GPT-4 is like 5x slower and more expensive in Japanese, etc.
It says "Japanese 1.4x fewer tokens (from 37 to 26)" - some other languages get much bigger improvements though, best is "Gujarati 4.4x fewer tokens (from 145 to 33)".
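To put those quoted numbers in perspective, here's a quick sanity check in Python. It uses only the two token counts from the quote; the function names are my own, and the "percent saved" figure assumes cost and latency scale linearly with token count, which is a simplification.

```python
# Token counts taken from the quote above: (old tokenizer, new tokenizer).
quoted_counts = {
    "Japanese": (37, 26),
    "Gujarati": (145, 33),
}


def reduction_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times fewer tokens the new tokenizer needs."""
    return old_tokens / new_tokens


def percent_saved(old_tokens: int, new_tokens: int) -> float:
    """Share of tokens (and, assuming linear pricing, cost) saved."""
    return 100 * (old_tokens - new_tokens) / old_tokens


for language, (old, new) in quoted_counts.items():
    print(
        f"{language}: {reduction_factor(old, new):.1f}x fewer tokens, "
        f"about {percent_saved(old, new):.0f}% saved"
    )
```

So "1.4x fewer" for Japanese works out to roughly a 30% saving, while "4.4x fewer" for Gujarati is closer to 77%.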
How are they able to use a brand name like Tiktoken? Is it because TikTok is Chinese? It's almost as if Apple released a "Facebooken" library for something entirely unrelated to Facebook.
The new tokenizer has an increased vocab size of 200k.