* the cl100k_base tokenizer has ~100k tokens -- previous tokenizers had ~50k. (enc.n_vocab gives 100277 but some numbers in that range don't work, starting at 100256)
* it has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens! (none have preceding spaces). This is a big improvement over GPT-2's tokenizer, whose handling of numbers was a mess.
* there are <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|> tokens (see Efficient Training of Language Models to Fill in the Middle)
The biggest news to me is the improved handling of numbers. This could explain some improved performance on arithmetic. One disappointment is that it tokenizes from the front, e.g. "1000000" -> 100|000|0. This is one of those "so close!" moments -- I would work for free to fix this.
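If you want to check these claims yourself, here's a minimal sketch against the tiktoken API (the expected outputs in the comments are just what the points above describe, so treat them as illustrative):

```python
# Quick sketch verifying the tokenizer claims above (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # 100277, though some IDs from 100256 up don't map to anything usable

# Every plain 1-, 2-, or 3-digit number should encode to a single token.
assert all(len(enc.encode(str(n))) == 1 for n in range(1000))

# Longer numbers are split greedily from the front: "1000000" -> 100 | 000 | 0
print([enc.decode([t]) for t in enc.encode("1000000")])  # expect ['100', '000', '0']

# The FIM sentinels exist, but special tokens must be explicitly allowed when encoding.
print(enc.encode("<|fim_prefix|>", allowed_special={"<|fim_prefix|>"}))
```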
I know OpenAI has been getting a lot of flak about their seemingly extreme measures of "safety" (and I agree to an extent, although it's more nuanced from my perspective), but full kudos to them for open-sourcing many useful projects that can serve as building blocks for many other projects. From CLIP, to Whisper, and now this project, I do appreciate that effort from their team. So thanks, if you're reading this!
The main complaints against OpenAI have been its lack of openness and its betrayal of its claimed founding principles.
In that context, "open sourcing many useful projects" seems like the bare minimum it should be doing, because that was the promise still enshrined in its name - this is not a bog-standard commercial organisation where that would not be expected.
To be clear, OpenAI does deserve massive kudos for its achievements, but openness is not among them.
I hope PyPI libraries can provide complete standalone offline versions instead of requests+urllib3+some_object_storage shenanigans.
If these blobs are too large to host on PyPI, maybe give us an alternative way to download them all at once, so we can deploy the full lib to a server without network access?
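For what it's worth, the manual workaround is to fetch the encoding blob once on a connected machine and ship it with the deployment. A rough, hypothetical sketch -- the URL below is a placeholder (the real one lives in the library's source, tiktoken_ext/openai_public.py last I looked), and whether your installed version lets you point it at a cache directory is something to check against its docs:

```python
# Hypothetical sketch of the "download once, store it yourself" workaround.
# BLOB_URL is a placeholder -- take the real URL from the library's source.
import hashlib
import pathlib
import requests

BLOB_URL = "https://example.com/encodings/cl100k_base.tiktoken"  # placeholder
cache_dir = pathlib.Path("./tiktoken_cache")
cache_dir.mkdir(exist_ok=True)

data = requests.get(BLOB_URL, timeout=60).content
(cache_dir / "cl100k_base.tiktoken").write_bytes(data)
print("saved", len(data), "bytes, sha1:", hashlib.sha1(data).hexdigest())

# On the offline box, copy this file into whatever cache location the library
# reads (check the source/docs for your version) before importing tiktoken.
```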
Lots of ML/AI stuff wants you to sign some sort of license before you can use it, and being able to deploy it fully offline kind of defeats that. There are of course ways around it (download once, store it yourself, put it in the right directory), but the project is unlikely to officially help you do that.
What's the point of an "open source project" that requires a proprietary data set with seemingly no license? It might as well just fetch & execute a random executable.
This is most likely because PyPI has size restrictions on what you can upload, and most users won't go the extra mile to actually pip install from your bespoke download site.
They investigated themselves and found that their products are too powerful for the public, so in the best interest of everyone they had to make the incredibly difficult, but right thing to do, choice of....
Looks like they have 4 different tokenizers. Besides gpt2, does anyone know which models correspond to which tokenizers? One of them is probably Codex.
Have a look at https://beta.openai.com/tokenizer which uses a JavaScript reimplementation of the GPT-2 / GPT-3 BPE tokenizer. In this case it's [31373, 995].
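If you want to poke at the mapping yourself, here's a quick sketch with the tiktoken API (the encoding names are whatever your installed version reports; the gpt2 IDs should line up with the web tokenizer above):

```python
# Compare the bundled encodings on the same string.
import tiktoken

print(tiktoken.list_encoding_names())  # e.g. ['gpt2', 'r50k_base', 'p50k_base', 'cl100k_base']

for name in ("gpt2", "p50k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, enc.encode("hello world"))
# gpt2 gives [31373, 995], matching the JavaScript reimplementation linked above;
# cl100k_base yields different IDs because it's a different vocabulary.
# If your version ships a model->encoding map, tiktoken.encoding_for_model(...)
# answers the "which model uses which tokenizer" question directly.
```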
Could be used for their wrapper API, where they check and alter your input and the model's output. This way they could quickly detect "bad words" and so on and return their standard platitudes about how the model can't answer.
Or maybe they do a quick pass on websites and filter out websites with bad words quickly, just leaving a small fraction for the model to train on. Parsing to get the standardized tokens quickly would make that easier.
When I first encountered huggingface (without having seen its logo), it made me think of these guys: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc... - so, by trying to pick a domain that sounded welcoming (and was still available), they achieved the exact opposite (at least in my case)...
In none of your examples are the resulting words ("booked", "bocker", "booken") common nouns that also happen to relate directly to the product. It is different for a tokeniser to include "token" in its name.
I'm aware they both include the full name. That was not the point I was making. Since the difference I was highlighting was unclear:
facebook-ed: Includes the whole name, but nothing else. Obviously bad and not OK.
tik-token: Includes the whole name. Also includes a generic common noun that relates to the product at hand and is completely fine to include in the name. Apart from that, "tik" alone cannot be said to "imply a connection to tiktok". This is fine.
Rust's philosophy is "batteries not included". It's weird coming from Python or Go, where the standard libraries are much larger, but it makes sense for Rust: keep the standard library small and agile (especially as it is statically linked into every binary!) so that it can be backwards-compatible forever and easily ported to different platforms, etc.
For regex specifically, the usual "regex" crate is fantastic and provides very good performance guarantees (no exponential behaviour, ever!), unlike a lot of regex libraries. However, it also comes with the tradeoff of not supporting arbitrary lookahead/lookbehind or many regex extensions, so people needing those features might need a different crate.
And Python's philosophy leads to three different HTTP libraries in the standard library, but everyone still just does "pip install requests".
It's easy to find drawbacks of either extreme. I find Rust's approach of favoring competing, versioned libraries over standard library inclusion generally works well. Many of those libraries have managed to significantly improve performance and ergonomics because they have the space to iterate on their APIs and explore which behaviors and guarantees are useful. The biggest downside is that it's more difficult to protect against supply chain attacks.
JavaScript certainly has a standard library, it's just really small. Node.js has a bigger standard library (https://nodejs.org/docs/latest-v18.x/api/index.html), including http/https and crypto and more; it's just not easy to use.
I agree that supply chain is a risk but I think you are creating a false dichotomy.
There is nothing stopping us from having a small standard library and a set of official - but optional - crates for common things like regex. This is the best of both worlds in my view.
It is not a weak standard lib and nobody is making excuses. Your mental mapping doesn't apply here, look into how it works if you are interested as it is different from what you are used to.
Rust's lifetimes and traits lead to different designs than .NET's more classical OO with managed memory and a runtime. Also, the Rust team doesn't have Microsoft's manpower. It's a bit unfair to say "just do it like .NET".
To me the right approach is to publish an "official" meta package that depends on a few core packages that work well together.
In 2022, I look for and expect my "next-lang-to-use" to have a basic and workable stdlib.
By 2022 I mean that we live in a "connected and multi-core world".
Thus I want at min:
1) fantastic http-client
2) good enough json enc/dec
3) basic-db access
4) concurrency via lang primitives
5) Basic crypto stuff (hash)
Sure, it might be overkill (or "underkill"?) for some.
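For what it's worth, Python's standard library already covers most of that list; a minimal, purely illustrative sketch:

```python
# Minimal illustration of the wish-list using only the Python standard library
# (the point being that "batteries included" languages already ship these).
import concurrent.futures   # 4) concurrency primitives
import hashlib              # 5) basic crypto (hashing)
import json                 # 2) JSON encode/decode
import sqlite3              # 3) basic DB access
import urllib.request       # 1) HTTP client (not fantastic, but workable)

def fetch(url: str) -> bytes:
    """Tiny HTTP GET helper using only the stdlib."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

payload = json.dumps({"hello": "world"})
digest = hashlib.sha256(payload.encode()).hexdigest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT, v TEXT)")
db.execute("INSERT INTO kv VALUES (?, ?)", ("digest", digest))

with concurrent.futures.ThreadPoolExecutor() as pool:
    future = pool.submit(lambda: json.loads(payload))
    print(future.result(), digest[:8])
```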
Surely there can't be a microsecond to second ratio between the two steps, because then they would never be able to crawl the hundreds of billions of documents the model is trained on. Once the model is built, sure, the front end is irrelevant.