Tiktoken: OpenAI’s Tokenizer (github.com/openai)
153 points by azhenley on Dec 16, 2022 | 74 comments



A few interesting findings:

* the cl100k_base tokenizer has ~100k tokens -- previous tokenizers had ~50k. (enc.n_vocab gives 100277 but some numbers in that range don't work, starting at 100256)

* it has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens! (None have preceding spaces.) This is a huge improvement over GPT-2's tokenizer, which was a mess in this regard.

* there are <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|> tokens (see Efficient Training of Language Models to Fill in the Middle)

The biggest news to me is the improved handling of numbers. This could explain some improved performance on arithmetic. One disappointment is that it tokenizes from the front, e.g. "1000000" -> 100|000|0. This is one of those "so close!" moments -- I would work for free to fix this.
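A quick way to check both points from the tiktoken Python API (a minimal sketch; the values in the comments are the ones reported above):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)  # 100277 (IDs from 100256 up are special/reserved and don't all decode)

    # Numbers are chunked into 1-3 digit pieces, left to right:
    ids = enc.encode("1000000")
    print([enc.decode([i]) for i in ids])  # ['100', '000', '0']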


I know OpenAI has been getting a lot of flak about their seemingly extreme measures of "safety" (and I agree to an extent, although it's more nuanced from my perspective), but full kudos to them for open-sourcing many useful projects that can serve as building blocks for many other projects. From CLIP to Whisper, and now this project, I do appreciate that effort from their team. So thanks, if you're reading this!


The main complaints against OpenAI have been its lack of openness and its betrayal of its claimed founding principles.

In that context, "open sourcing many useful projects" seems the bare minimum it should be doing, because that was the promise still enshrined in its name - this is not a bog-standard commercial organisation where that would not be expected.

To be clear, OpenAI does deserve massive kudos for its achievements, but openness is not among them.


It's not for safety, it is to make money. The "open" storyline is a way to get relatively cheap goodwill from projects that market their paid products.


Requires az:// blob download.

I hope pypi libraries can provide complete standalone offline versions instead of requests+urllib3+some_object_storage shenanigans.

If these blobs are too large to host on pypi, maybe give us an alternative way to download them so we can deploy the full lib to a server without network access?
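For what it's worth, a rough offline workaround I'd sketch (loudly hedged: the TIKTOKEN_CACHE_DIR variable below is an assumption about the loader's behaviour and may not exist in every version; check tiktoken/load.py before relying on it) is to warm the cache on a machine with network access and copy that directory to the air-gapped server:

    import os

    # Assumption: the loader honours a cache-directory override. If your version
    # doesn't, find where it caches downloads (see tiktoken/load.py) and copy that.
    os.environ["TIKTOKEN_CACHE_DIR"] = "/srv/tiktoken-cache"

    import tiktoken
    tiktoken.get_encoding("cl100k_base")  # triggers the blob download into the cache

    # Then copy /srv/tiktoken-cache to the offline server and set the same
    # environment variable there before importing tiktoken.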


Lots of ML/AI stuff wants you to sign some sort of license before you can use it, and being able to deploy stuff offline kind of defeats that. Of course there are ways around it (download once, store it yourself, put it in the right directory), but don't expect the vendors to help you with that.


> Lots of ML/AI stuff wants to your sign some sort of license

What's the point of this MIT license, then: https://github.com/openai/tiktoken/blob/main/LICENSE


Maybe the license is for the code, not the model itself?


What's the point of an "open source project" that requires a proprietary data set with seemingly no license? It might as well just fetch & execute a random executable.


This is most likely because pypi has size restrictions on what you can upload and most users won't go the extra mile to actually pip install from your bespoke download site.


Maybe the name is a bit misleading at first sight... But the project is great!


Fitting for a company named OpenAI with closed models.


They investigated themselves and found that their products are too powerful for the public, so in the best interest of everyone they had to make the incredibly difficult, but right thing to do, choice of....

....going closed source and monetizing.


I genuinely hope they consider a rename. I wouldn’t want to use software with such a sleazy connection. It could have been “OAITokenizer”.


This is so great.

1. Name

2. OpenAI is releasing useful stuff

3. Rust in AI!


NB: huggingface has long had a Rust-based tokenizer.



Ruby?



Thank you so much!


Looks like they have 4 different tokenizers. Besides gpt2, does anyone know which models correspond to which tokenizers? One of them is probably Codex.

    "gpt2": gpt2,
    "r50k_base": r50k_base,
    "p50k_base": p50k_base,
    "cl100k_base": cl100k_base,


cl100k_base is a new tokenizer that is apparently being used in their new Embeddings project.
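If you want to poke at them side by side (a minimal sketch; only the ~50k and 100277 figures are from the comments above, the rest just prints whatever the installed data files contain):

    import tiktoken

    for name in ["gpt2", "r50k_base", "p50k_base", "cl100k_base"]:
        enc = tiktoken.get_encoding(name)
        print(name, enc.n_vocab)  # ~50k for the older ones, 100277 for cl100k_base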


Can we please have some good example inputs/outputs in the readme itself? What is the expected output of print(enc.encode("hello world"))?


Have a look at https://beta.openai.com/tokenizer which uses a JavaScript reimplementation of the GPT-2 / GPT-3 BPE tokenizer. In this case it's [31373, 995].
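You can also check it locally with the library itself (the token IDs in the comments are the same ones given above):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    print(enc.encode("hello world"))  # [31373, 995]
    print(enc.decode([31373, 995]))   # hello world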



Are GPT models even bottlenecked by input tokenization? What real world speedup does this actually translate into?


Could be used for their wrapper API, where they check and alter your input and the model's output. This way they could quickly detect "bad words" and so on and return their standard platitudes about how the model can't answer.

Or maybe they do a quick pass over websites and filter out the ones with bad words, leaving only a small fraction for the model to train on. Parsing to get the standardized tokens quickly would make that easier.


Can someone optimize this further? Seems like there is significant low-hanging fruit, as evidenced by this line: https://github.com/openai/tiktoken/blob/main/src/lib.rs#L419


Sounds like it could be optimized to run 10x faster on a single thread. ~7 MB/s is not that fast.
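Back-of-the-envelope on why single-thread throughput matters at training scale (taking the ~7 MB/s figure above at face value; the 1 TB corpus size is just an illustrative assumption):

    corpus_bytes = 1e12          # hypothetical 1 TB of training text
    bytes_per_sec = 7e6          # ~7 MB/s on one thread, per the comment above
    hours = corpus_bytes / bytes_per_sec / 3600
    print(f"{hours:.0f} hours")  # ~40 single-thread hours just to tokenize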


Any information on which human languages it works with? Some languages like Thai can be challenging so I am wondering how general this is.


It works on all human languages, just inefficiently. I ran it over a sample I found on Wikipedia:

    sample = "ฟองมันฟันหนู, ฟันหนูฟองมัน, ฝนทองฟองมัน"
    len(sample), len(enc.encode(sample))
This returns `39, 40` so it's just encoding one character at a time. It's probably like this for almost all non-English text.


Yeah, at least it does it with Russian


Would be curious to know what hardware the benchmark was run on. That drop-off beyond 16 threads is steep...


Too bad it's not compatible with huggingface tokenizer configs and needs its own.


Are there any performance benchmarks or is it not applicable for tokenization?


At first I thought this was about TikTok


What's a tokenizer?


It is a processing step in natural language processing: https://www.analyticsvidhya.com/blog/2020/05/what-is-tokeniz...


Example: Split an input into tokens.

Tokens: [split, an, input, into, tokens, .]


Example: umbrella

Tokens: [umb, rella] ?


Could be, as well, given there are many varieties of tokenisers, each with different pros and cons.

This particular tokenizer is very interesting given that it tries to be the best of both worlds (word-level and character-level tokenization).
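You can see where a particular word lands just by decoding the pieces (a small sketch; I'm not asserting the exact splits since they depend on the encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["umbrella", "tokenizer", "the"]:
        ids = enc.encode(word)
        # common words tend to stay whole, rarer ones split into subword pieces
        print(word, "->", [enc.decode([i]) for i in ids])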


I do not like the name.


The comparison is with an alternative called "huggingface". At least "Tiktoken" gives some clue as to what the component is for.


When I first encountered huggingface (without having seen its logo), it made me think of these guys: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc... - so, by trying to pick a domain that sounded welcoming (and was still available), they achieved the exact opposite (at least in my case)...


This project has nothing to do with TikTok though


And the name implies there’s a connection. Which is problematic.


Of course, it's because you are not tokenizing it right. It's tik-token.

(Could we say it is kind of a garden-path word?)


It really doesn't. Does every name that has "book" or "face" in it imply a connection to Facebook?


Maybe not but Facebooker, Facebooked, Facebocker, and Facebooken all sound like they do.


In none of your examples are the resulting words ("booked", "bocker", "booken") common nouns which also happen to relate directly to the product. It is different for a tokeniser to include "token" in its name.


Which is clearly why they pointed out the use of the whole name of the service in both cases.


I'm aware they both include the full name. That was not the point I was making. Since the difference I was highlighting was unclear:

facebook-ed: Includes the whole name, but nothing else. Obviously bad and not OK.

tik-token: Includes the whole name. Also includes a generic common noun that relates to the product at hand, which is completely fine to include in the name. Apart from that, "tik" cannot be said to "imply a connection to tiktok". This is fine.


I see "tiktok en" like en.wikipedia.org or microsoft.com/en


Huggingface is a lot more than just a tokenizer.


Same. A shame "Token McTokenface" seems to be already in use. What about "Sentence Splitter Upper"?

More seriously, "qiktoken" would at least hint at what the package is for.


It's just a little humor


What a great name!


Why is regex not in the Rust stdlib? It's one of the fundamental libraries out there.


Rust's philosophy is "batteries not included". It's weird coming from Python or Go, where the standard libraries are much larger, but it makes sense for Rust: keep the standard library small and agile (especially as it is statically linked into every binary!) so that it can be backwards-compatible forever and easily ported to different platforms, etc.

For regex specifically, the usual "regex" crate is fantastic and provides very good performance guarantees (no exponential behaviour, ever!), unlike a lot of regex libraries. However, it also comes with the tradeoff of not supporting arbitrary lookahead/lookbehind or many regex extensions, so people needing those features might need a different crate.

There's a nice discussion on another thread: https://news.ycombinator.com/item?id=33510976


> Rust's philosophy is "batteries not included".

In JavaScript this led to, well, npm. Because there's no standard library, everyone ended up reinventing everything.


And Python's philosophy leads to three different HTTP libraries in the standard library, but everyone still just does "pip install requests".

It's easy to find drawbacks of either extreme. I find Rust's approach of favoring competing, versioned libraries over standard library inclusion generally works well. Many of the libraries managed to significantly improve performance and ergonomics because they have the space to improve their APIs and explore which behaviors and guarantees are useful. The biggest downside is that it's more difficult to protect against supply chain attacks.


I am not sure that Python is a shining example of library management. Cargo seems well designed. And the package size is more a cultural thing.


JavaScript certainly has a standard library, it's just really small. Node.js has a bigger standard library (https://nodejs.org/docs/latest-v18.x/api/index.html), including http/https and crypto stuff and more; they're just not easy to use.


A weak standard lib is a poor approach imo.

Supply chain risks are real, and being scared to design a good API that will survive the test of time sounds like a poor excuse.

At worst you could steal the API design from some mature and battle-tested base class libs like .NET


I agree that supply chain is a risk but I think you are creating a false dichotomy.

There is nothing stopping us from having a small standard library and a set of official - but optional - crates for common things like regex. This is the best of both worlds in my view.


It is not a weak standard lib and nobody is making excuses. Your mental mapping doesn't apply here; look into how it works if you are interested, as it is different from what you are used to.


Yeah, from experience spinning up in Rust it is definitely not a good first-run experience.

- You type something in expecting autocomplete, it's not there. "Huh, that's weird"

- Look it up online. Sure enough, its not in the standard library. OK, what should I use then?

- Do research to figure out what 3rd party crate to use

- Tend to abstract/isolate those bits in case you need to rip it out for a different implementation later


Rust lifetimes and traits lead to different designs than .NET's more classical OO with managed memory & runtime. Also, the Rust team doesn't have Microsoft's manpower. It is a bit unfair to say "just do like .NET".

To me the right approach is to publish an "official" meta package that depends on a few core packages that work well together.


Weak stdlib is all fun and games until someone deletes their crate.


In 2022, I look for and expect my "next-lang-to-use" to have a basic and workable stdlib. By 2022 I mean we live in a connected, multi-core world. Thus I want at minimum: 1) a fantastic HTTP client 2) good-enough JSON enc/dec 3) basic DB access 4) concurrency via language primitives 5) basic crypto stuff (hashing)

Sure, it might be overkill (or "underkill"?) for some.


I don't understand why you'd want a tokenizer to be fast...

Surely you're going to stick the output of said tokenizer through a machine learning model that is orders of magnitude slower?

So it really doesn't matter if the tokenizer takes microseconds or milliseconds, when the main model takes seconds for the same input/output.


Surely there can't be a microsecond to second ratio between the two steps, because then they would never be able to crawl the hundreds of billions of documents the model is trained on. Once the model is built, sure, the front end is irrelevant.


The ratio between the tokenizer and the model is constant for both training and inference.

And I believe the ratio is that big - you just use a lot of compute for training, and that's why it's so expensive.


Inference is latency sensitive so the front end is still relevant.



