Tiktoken added support for GPT-4o: https://github.com/openai/tiktoken/commit/9d0...

mike_hearn · 2024-05-13T18:14:07 1715624047

Does that imply they retrained the foundation model from scratch? I thought changing the tokenization was something you couldn't really retrofit to an existing model. I mean, sure, they might have initialized the weights from the prior GPT-4 model, but it'd still require a lot of retraining.

og_kalu · 2024-05-13T20:22:01 1715631721

Yeah and they say as much in the blog.

minimaxir · 2024-05-13T17:32:57 1715621577

For posterity, GPT-3.5/4's tokenizer had a vocabulary of about 100k tokens. The benefit of a larger vocabulary is more efficient tokenization (and therefore cheaper/faster inference), but with massive diminishing returns: the larger vocabulary makes the model more difficult to train and tends to reduce token usage by only 10-15%.

simonw · 2024-05-13T17:58:52 1715623132

Oh interesting, does that mean languages other than English won't be paying such a large penalty in terms of token lengths?

With previous tokenizers there was a notable increase in the number of tokens needed to represent non-English sentences: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
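To make the token-length penalty concrete, here is a rough sketch. tiktoken's BPE starts from UTF-8 bytes, so any script whose characters take several bytes each costs more tokens wherever the vocabulary has few learned merges for it. The two encoders below are stand-ins written for this illustration (they are not tiktoken functions), and `token_counts` is a hypothetical helper name.

```python
from typing import Callable, Dict, List


def utf8_byte_encoder(text: str) -> List[int]:
    # Stand-in tokenizer (assumption for illustration): one token per UTF-8 byte,
    # i.e. what a byte-level BPE would produce with no merges at all.
    return list(text.encode("utf-8"))


def codepoint_encoder(text: str) -> List[int]:
    # Stand-in tokenizer (assumption for illustration): one token per code point.
    return [ord(ch) for ch in text]


def token_counts(
    text: str, encoders: Dict[str, Callable[[str], List[int]]]
) -> Dict[str, int]:
    # Count how many tokens each named encoder produces for the same text.
    return {name: len(encode(text)) for name, encode in encoders.items()}


encoders = {"bytes": utf8_byte_encoder, "chars": codepoint_encoder}

# ASCII text: one byte per character, so both stand-ins agree.
print(token_counts("hello", encoders))       # {'bytes': 5, 'chars': 5}

# Hiragana: three UTF-8 bytes per character, so the byte-level count triples.
print(token_counts("こんにちは", encoders))  # {'bytes': 15, 'chars': 5}
```

With tiktoken installed, the same helper should work with real encoders, e.g. passing `tiktoken.get_encoding("cl100k_base").encode` and `tiktoken.get_encoding("o200k_base").encode` as the two entries, to compare the GPT-4 and GPT-4o tokenizers on a sentence of your choice.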

tedsanders · 2024-05-13T18:19:23 1715624363

Yep. Non-English text gets a much bigger cost drop and speedup compared to English. Has always been a bummer that GPT-4 is like 5x slower and more expensive in Japanese, etc.

simonw · 2024-05-13T18:36:02 1715625362

Just found there's a whole section about that in this post: https://openai.com/index/hello-gpt-4o/

It says "Japanese 1.4x fewer tokens (from 37 to 26)" - some other languages get much bigger improvements though, best is "Gujarati 4.4x fewer tokens (from 145 to 33)".
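Quick sanity check on those figures, as a small sketch: the "Nx fewer tokens" number is just before/after, and the share of tokens saved follows from the same pair. The function names are made up for this example, and treating the token saving as a cost saving assumes the per-token price stays the same.

```python
def reduction_factor(before: int, after: int) -> float:
    # "N x fewer tokens": old token count divided by new token count.
    return before / after


def percent_saved(before: int, after: int) -> float:
    # Share of the original tokens that no longer need to be processed.
    return 100 * (before - after) / before


# (before, after) token counts quoted from the OpenAI post.
examples = {"Japanese": (37, 26), "Gujarati": (145, 33)}

for language, (before, after) in examples.items():
    factor = reduction_factor(before, after)
    saved = percent_saved(before, after)
    print(f"{language}: {factor:.1f}x fewer tokens, {saved:.0f}% saved")

# Japanese: 1.4x fewer tokens, 30% saved
# Gujarati: 4.4x fewer tokens, 77% saved
```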

kristofferR · 2024-05-13T21:51:51 1715637111

How are they able to use a brand name like Tiktoken? Is it because TikTok is Chinese? It's almost as if Apple released a Facebooken library for something entirely unrelated to Facebook.

gemeral · 2024-05-14T06:36:14 1715668574

That's not the right analogy. The "tok" in "Tiktoken" comes from "token", not "TikTok".

meiraleal · 2024-05-18T20:23:15 1716063795

And the "tik" comes from TikTok.

moffkalast · 2024-05-13T18:51:04 1715626264

Lots of those tokens would have to be pixel patches and sound samples right?

nojvek · 2024-05-13T22:09:31 1715638171

Yep, since it’s multimodal: pictures, text, and audio all go into token space.