Tiktoken: OpenAI’s Tokenizer (github.com/openai)
153 points by azhenley on Dec 16, 2022 | 74 comments



A few interesting findings:

* the cl100k_base tokenizer has ~100k tokens -- previous tokenizers had ~50k. (enc.n_vocab gives 100277 but some numbers in that range don't work, starting at 100256)

* it has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens! (None have preceding spaces.) This is a huge improvement over GPT-2's tokenizer, which was a mess in this regard.

* there are <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|> tokens (see Efficient Training of Language Models to Fill in the Middle)

The biggest news to me is the improved handling of numbers. This could explain some improved performance on arithmetic. One disappointment is that it tokenizes from the front, e.g. "1000000" -> 100|000|0. This is one of those "so close!" moments -- I would work for free to fix this.
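A quick way to check both points from the tiktoken Python API (a minimal sketch; the values in the comments are the ones reported above):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)  # 100277 (IDs from 100256 up are special/reserved and don't all decode)

    # Numbers are chunked into 1-3 digit pieces, left to right:
    ids = enc.encode("1000000")
    print([enc.decode([i]) for i in ids])  # ['100', '000', '0']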


I know OpenAI has been getting a lot of flak about their seemingly extreme measures of "safety" (and I agree to an extent, although it's more nuanced from my perspective), but full kudos to them for open-sourcing many useful projects that can serve as building blocks for many other projects. From CLIP to Whisper, and now this project, I do appreciate that effort from their team. So thanks, if you're reading this!


The main complaints against OpenAI have been its lack of openness and its betrayal of its claimed founding principles.

In that context, "open sourcing many useful projects" seems the bare minimum it should be doing, because that was the promise still enshrined in its name - this is not a bog-standard commercial organisation where that would not be expected.

To be clear, OpenAI does deserve massive kudos for its achievements, but openness is not among them.


It's not for safety, it is to make money. The "open" storyline is a way to get relatively cheap goodwill from projects that market their paid products.


Requires az:// blob download.

I hope pypi libraries can provide complete standalone offline versions instead of requests+urllib3+some_object_storage shenanigans.

If these blobs are too large to host on pypi, maybe give us an alternative way to download them so we can deploy the full lib to a server without network access?
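For what it's worth, a rough offline workaround I'd sketch (loudly hedged: the TIKTOKEN_CACHE_DIR variable below is an assumption about the loader's behaviour and may not exist in every version; check tiktoken/load.py before relying on it) is to warm the cache on a machine with network access and copy that directory to the air-gapped server:

    import os

    # Assumption: the loader honours a cache-directory override. If your version
    # doesn't, find where it caches downloads (see tiktoken/load.py) and copy that.
    os.environ["TIKTOKEN_CACHE_DIR"] = "/srv/tiktoken-cache"

    import tiktoken
    tiktoken.get_encoding("cl100k_base")  # triggers the blob download into the cache

    # Then copy /srv/tiktoken-cache to the offline server and set the same
    # environment variable there before importing tiktoken.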


Lots of ML/AI stuff wants you to sign some sort of license before you can use it, and being able to deploy stuff offline kind of defeats that. Of course there are ways around it (download once, store it yourself, put it in the right directory), but don't expect the vendors to help you with that.


> Lots of ML/AI stuff wants to your sign some sort of license

What's the point of this MIT license, then: https://github.com/openai/tiktoken/blob/main/LICENSE


Maybe the license is for the code, not the model itself?


What's the point of an "open source project" that requires a proprietary data set with seemingly no license? It might as well just fetch & execute a random executable.


This is most likely because pypi has size restrictions on what you can upload and most users won't go the extra mile to actually pip install from your bespoke download site.


Maybe the name is a bit misleading at first sight... But the project is great!


Fitting for a company named OpenAI with closed models.


They investigated themselves and found that their products are too powerful for the public, so in the best interest of everyone they had to make the incredibly difficult, but right thing to do, choice of....

....going closed source and monetizing.


I genuinely hope they consider a rename. I wouldn’t want to use software with such a sleazy connection. It could have been “OAITokenizer”.


This is so great.

1. Name

2. OpenAI is releasing useful stuff

3. Rust in AI!


NB: huggingface has long had a Rust-based tokenizer.



Ruby?



Thank you so much!


Looks like they have 4 different tokenizers. Besides gpt2, does anyone know which models correspond to which tokenizers? One of them is probably Codex.

    "gpt2": gpt2,
    "r50k_base": r50k_base,
    "p50k_base": p50k_base,
    "cl100k_base": cl100k_base,


cl100k_base is a new tokenizer that is apparently being used in their new Embeddings project.
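If you want to poke at them side by side (a minimal sketch; only the ~50k and 100277 figures are from the comments above, the rest just prints whatever the installed data files contain):

    import tiktoken

    for name in ["gpt2", "r50k_base", "p50k_base", "cl100k_base"]:
        enc = tiktoken.get_encoding(name)
        print(name, enc.n_vocab)  # ~50k for the older ones, 100277 for cl100k_base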


Can we please have some good example inputs/outputs in the readme itself? What is the expected output of print(enc.encode("hello world"))?


Have a look at https://beta.openai.com/tokenizer which uses a JavaScript reimplementation of the GPT-2 / GPT-3 BPE tokenizer. In this case it's [31373, 995].
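You can also check it locally with the library itself (the token IDs in the comments are the same ones given above):

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    print(enc.encode("hello world"))  # [31373, 995]
    print(enc.decode([31373, 995]))   # hello world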



Are GPT models even bottlenecked by input tokenization? What real world speedup does this actually translate into?


Could be used for their wrapper API, where they check and alter your input and the model's output. This way they could quickly detect "bad words" and so on and return their standard platitudes about how the model can't answer.

Or maybe they do a quick pass over websites and filter out the ones with bad words, leaving only a small fraction for the model to train on. Parsing to get the standardized tokens quickly would make that easier.


Can someone optimize this further? Seems like there is significant low-hanging fruit, as evidenced by this line: https://github.com/openai/tiktoken/blob/main/src/lib.rs#L419


Sounds like it could be optimized to run 10x faster on a single thread. ~7 MB/s is not that fast.
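Back-of-the-envelope on why single-thread throughput matters at training scale (taking the ~7 MB/s figure above at face value; the 1 TB corpus size is just an illustrative assumption):

    corpus_bytes = 1e12          # hypothetical 1 TB of training text
    bytes_per_sec = 7e6          # ~7 MB/s on one thread, per the comment above
    hours = corpus_bytes / bytes_per_sec / 3600
    print(f"{hours:.0f} hours")  # ~40 single-thread hours just to tokenize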


Any information on which human languages it works with? Some languages like Thai can be challenging so I am wondering how general this is.


It works on all human languages, just inefficiently. I ran it over a sample I found on Wikipedia:

    sample = "ฟองมันฟันหนู, ฟันหนูฟองมัน, ฝนทองฟองมัน"
    len(sample), len(enc.encode(sample))
This returns `39, 40` so it's just encoding one character at a time. It's probably like this for almost all non-English text.


Yeah, at least it does it with Russian


Would be curious to know what hardware the benchmark was run on. That drop-off beyond 16 threads is steep...


Too bad it's not compatible with huggingface tokenizer configs and needs its own.


Are there any performance benchmarks or is it not applicable for tokenization?


At first I thought this was about TikTok


What's a tokenizer?


It is a processing step in natural language processing: https://www.analyticsvidhya.com/blog/2020/05/what-is-tokeniz...


Example: Split an input into tokens.

Tokens: [split, an, input, into, tokens, .]


Example: umbrella

Tokens: [umb, rella] ?


Could be, as well, given there are many varieties of tokenisers, each with different pros and cons.

This particular tokenizer is very interesting given that it tries to be the best of both worlds (word-level and character-level tokenization).
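You can see where a particular word lands just by decoding the pieces (a small sketch; I'm not asserting the exact splits since they depend on the encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["umbrella", "tokenizer", "the"]:
        ids = enc.encode(word)
        # common words tend to stay whole, rarer ones split into subword pieces
        print(word, "->", [enc.decode([i]) for i in ids])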


I do not like the name.


The comparison is with an alternative called "huggingface". At least "Tiktoken" gives some clue as to what the component is for.


When I first encountered huggingface (without having seen its logo), it made me think of these guys: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc... - so, by trying to pick a domain that sounded welcoming (and was still available), they achieved the exact opposite (at least in my case)...


This project has nothing to do with TikTok though


And the name implies there’s a connection. Which is problematic.


Of course, it's because you are not tokenizing it right. It's tik-token.

(Could we say it is kind of a garden-path word?)


It really doesn't. Does every name that has "book" or "face" in it imply a connection to Facebook?


Maybe not but Facebooker, Facebooked, Facebocker, and Facebooken all sound like they do.


In none of your examples are the resulting words ("booked", "bocker", "booken") common nouns which also happen to relate directly to the product. It is different for a tokeniser to include "token" in its name.


Which is clearly why they pointed out the use of the whole name of the service in both cases.


I'm aware they both include the full name. That was not the point I was making. Since the difference I was highlighting was unclear:

facebook-ed: Includes the whole name, but nothing else. Obviously bad and not OK.

tik-token: Includes the whole name. Also includes a generic common noun that relates to the product at hand, which is completely fine to include in the name. Apart from that, "tik" cannot be said to "imply a connection to tiktok". This is fine.


I see "tiktok en" like en.wikipedia.org or microsoft.com/en


Huggingface is a lot more than just a tokenizer.


Same. A shame "Token McTokenface" seems to be already in use. What about "Sentence Splitter Upper"?

More seriously, "qiktoken" would at least hint at what the package is for.


It's just a little humor


What a great name!


Why is regex not in the Rust stdlib? It's one of the fundamental libraries out there.


Rust's philosophy is "batteries not included". It's weird coming from Python or Go, where the standard libraries are much larger, but it makes sense for Rust: keep the standard library small and agile (especially as it is statically linked into every binary!) so that it can be backwards-compatible forever and easily ported to different platforms, etc.

For regex specifically, the usual "regex" crate is fantastic and provides very good performance guarantees (no exponential behaviour, ever!), unlike a lot of regex libraries. However, it also comes with the tradeoff of not supporting arbitrary lookahead/lookbehind or many regex extensions, so people needing those features might need a different crate.

There's a nice discussion on another thread: https://news.ycombinator.com/item?id=33510976


> Rust's philosophy is "batteries not included".

In JavaScript this led to, well, npm. Because there's no standard library, everyone ended up reinventing everything.


And Python's philosophy leads to three different HTTP libraries in the standard library, but everyone still just does "pip install requests".

It's easy to find drawbacks of either extreme. I find Rust's approach of favoring competing, versioned libraries over standard library inclusion generally works well. Many of the libraries managed to significantly improve performance and ergonomics because they have the space to improve their APIs and explore which behaviors and guarantees are useful. The biggest downside is that it's more difficult to protect against supply chain attacks.


I am not sure that Python is a shining example of library management. Cargo seems well designed. And the package size is more a cultural thing.


JavaScript certainly has a standard library, it's just really small. Node.js has a bigger standard library (https://nodejs.org/docs/latest-v18.x/api/index.html), including http/https and crypto stuff and more; they're just not easy to use.


A weak standard lib is a poor approach imo.

Supply chain risks are real, and being scared to design a good API that will survive the test of time sounds like a poor excuse.

At worst you could steal the API design from some mature and battle-tested base class libs like .NET


I agree that supply chain is a risk but I think you are creating a false dichotomy.

There is nothing stopping us from having a small standard library and a set of official - but optional - crates for common things like regex. This is the best of both worlds in my view.


It is not a weak standard lib and nobody is making excuses. Your mental mapping doesn't apply here; look into how it works if you are interested, as it is different from what you are used to.


Yeah, from experience spinning up in Rust it is definitely not a good first-run experience.

- You type something in expecting autocomplete, it's not there. "Huh, that's weird"

- Look it up online. Sure enough, its not in the standard library. OK, what should I use then?

- Do research to figure out what 3rd party crate to use

- Tend to abstract/isolate those bits in case you need to rip it out for a different implementation later


Rust lifetimes and traits lead to different designs than .NET's more classical OO with managed memory & runtime. Also, the Rust team doesn't have Microsoft's manpower. It is a bit unfair to say "just do like .NET".

To me the right approach is to publish an "official" meta package that depends on a few core packages that work well together.


Weak stdlib is all fun and games until someone deletes their crate.


In 2022, I look for and expect my "next-lang-to-use" to have a basic and workable stdlib. By 2022 I mean we live in a connected, multi-core world. Thus I want at minimum: 1) a fantastic HTTP client 2) good-enough JSON enc/dec 3) basic DB access 4) concurrency via language primitives 5) basic crypto stuff (hashing)

Sure, it might be overkill (or "underkill"?) for some.


I don't understand why you'd want a tokenizer to be fast...

Surely you're going to stick the output of said tokenizer through a machine learning model that is orders of magnitude slower?

So it really doesn't matter if the tokenizer takes microseconds or milliseconds, when the main model takes seconds for the same input/output.


Surely there can't be a microsecond to second ratio between the two steps, because then they would never be able to crawl the hundreds of billions of documents the model is trained on. Once the model is built, sure, the front end is irrelevant.


The ratio between the tokenizer and the model is constant for both training and inference.

And I believe the ratio is that big - you just use a lot of compute for training, and that's why it's so expensive.


Inference is latency sensitive so the front end is still relevant.



