Hi everyone! Creator of Transformers.js here :) ...
Thanks so much to everyone for sharing! It's awesome to see the positive feedback from the community. As you'll see from the demo, everything runs inside the browser!
As of 2023/03/16, the library supports BERT, ALBERT, DistilBERT, T5, T5v1.1, FLAN-T5, GPT2, BART, CodeGen, Whisper, CLIP, Vision Transformer, and VisionEncoderDecoder models, for a variety of tasks including: masked language modelling, text classification, text-to-text generation, translation, summarization, question answering, text generation, automatic speech recognition, image classification, zero-shot image classification, and image-to-text. Of course, we plan to add many more models and tasks in the near future!
Try out some of the other models/tasks from the "Task" dropdown (like the code-completion or speech-to-text demos).
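If you'd rather poke at it from code than from the demo UI, usage is meant to mirror the Python pipelines. Here is a minimal sketch (the package name, default model, and exact options here are illustrative, so check the README for the current API):

    // Minimal sketch of basic usage; see the README for the exact, current API.
    import { pipeline } from '@xenova/transformers';

    // Build a sentiment-analysis (text classification) pipeline. With no model
    // specified, a small default model is downloaded and cached.
    const classifier = await pipeline('sentiment-analysis');

    // Run it on some text; the result is a list of { label, score } objects.
    const result = await classifier('Transformers.js runs entirely in the browser!');
    console.log(result); // e.g. [{ label: 'POSITIVE', score: 0.99 }]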
---
To respond to some comments about poor translation/generation quality: many of the models are actually quite old (e.g., T5 is from 2019)... and if you run the same prompt through the PyTorch version of the model, you will get similar outputs. The purpose of the library/project is to bring these models to the browser; we didn't train the models, so poor quality can (mostly) be blamed on the original model.
Also, be sure to play around with the generation parameters... as with many LLMs, generation parameters matter a lot.
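To make that concrete, here is roughly what tweaking those parameters looks like in code. The option names below follow the usual Hugging Face generation config (do_sample, top_k, temperature), which may be labelled slightly differently in the demo UI, so read this as a sketch:

    // Sketch: the same prompt under greedy decoding vs. sampling.
    import { pipeline } from '@xenova/transformers';

    // Text generation pipeline; the default is assumed to be a GPT-2-class model.
    const generator = await pipeline('text-generation');

    // Greedy decoding: deterministic, but tends to loop and repeat itself.
    const greedy = await generator('I enjoy walking my cute dog', {
      max_new_tokens: 50,
      do_sample: false,
    });

    // Top-k sampling: more varied and usually more natural for open-ended text.
    const sampled = await generator('I enjoy walking my cute dog', {
      max_new_tokens: 50,
      do_sample: true,
      top_k: 20,
      temperature: 0.8,
    });

    console.log(greedy, sampled);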
Yes, there are some workarounds you can do to get it working in non-browser environments. I do aim to get a permanent solution, which will ideally work out-of-the-box for both browser and node/deno environments.
Some other users also reported the issue (which stems from a bug in onnxruntime-web), and we were able to get it working in those cases.
I really liked the suggestion that, if this takes off, the web should consider exposing something like the OpenXLA intermediate representation, which powers the new PyTorch 2.0, TensorFlow, JAX, and a bunch of other top-tier ML frameworks.
It is already very well optimized for a ton of hardware (CPUs, GPUs, ML accelerators). The intermediate representation might already be close to web-safe, effectively self-sandboxing, which could make it safe to expose.
Making each webapp target & optimize ML for every possible device target sounds terrible.
The point of MLIR is that most of the optimization can happen at lower levels. Instead of everyone figuring out on their own how best to target & optimize for JS, wasm, WebGL, and/or WebGPU, you just use the industry-standard intermediate representation & let the browser figure out the tradeoffs. If there is onboard hardware, such as neural cores, it might just work!
Good to see WebML has OpenXLA on their radar... but I'm also a bit afraid, expecting some half-assed excuses for why of course we're going to make some brand-new other thing instead. The web & almost everyone else has such a bad NIH problem. WASI & the web file APIs being totally different is one example, where there's just no common cause, even though it'd make all the difference. And with ML, the cost of having your own tech versus being able to re-use the work everyone else puts in feels like a near-suicidal decision: an API that will never be good, never perform anywhere near where it could.
> Making each webapp target & optimize ML for every possible device target sounds terrible.
Yes it does.
Did something I said imply that?
OpenXLA is an intermediate layer that frameworks like PyTorch or JAX can use. It has pluggable backends, so if there were a web-compatible backend (WebGL or WASM), everyone could use it, and all models built with anything that uses OpenXLA[1] would be compatible.
[1] Not 100% sure how low-level the OpenXLA intermediate representation is. I know it's not uncommon when porting a brand-new primitive (e.g., a special kind of transformer) to a new architecture (e.g., CUDA -> Apple M1) that some operations aren't yet supported, so this might be similar.
I support having web targets. It'd be a good offering.
But it feels upside down to me from what we really all should want, which is a safe way to let the web target any backend you have. WebGPU or WebGL or wasm are going to be OK targets, but with limited hardware support & tons of constraints that mean they won't perform as well as openxla.
Also how will these targets get profiled? Do we ship the same WebGL to a 600w monster as a rpi?
Also, how will these targets get profiled? Do we ship the same WebGL to a 600W monster as to a Raspberry Pi?
> WebGPU or WebGL or wasm are going to be OK targets, but with limited hardware support & tons of constraints that mean they won't perform as well as wasm.
I don't understand. "WebGPU or WebGL or wasm".. "won't perform as well as wasm".
I don't think a high level representation is necessary for relatively straightforward FMA extensions (either outer products in the case of Apple AMX or matrix products in the case of CUDA/Intel AMX). WebGPU + tensor core support and WASM + AMX support would be simpler to implement, likely more future proof and wouldn't require maintaining a massive layer of abstraction.
The issue is, much of the performance of PyTorch, JAX, et al. comes from running a JIT that is tuned to the underlying hardware, together with support for high-level intrinsic operations that were either hand-tuned or have extra hardware support, especially ops dealing with parallelizing computation across multiple cores.
You'd probably end up representing these as external library function calls in WASM, but then the WASM JIT would have to be taught that these are magic functions to be treated specially. At that point you're just embedding HLO ops as library functions and then embedding an HLO translator into the WASM runtime; I'm not sure that's any better.
By analogy, would it be better to eliminate fragment and vertex shaders and just use WASM for sending shaders to the GPU, or are the domain-specific language and its constraints beneficial to the GPU drivers?
Do we leave it to every web app to figure out how best to serve everyone, and have them bundle their own tuning optimizers into each app? Or do we bake in a higher level abstraction that works for everyone that the browser itself will be able to help optimize?
There's some risk, & the browser APIs likely won't come with all the escape hatches the full tools have for manually fiddling with optimizations, but the idea of getting everyone to DIY seems like a recipe for a misfit: way too much code when you don't need it, not nearly enough tuning when you do. And there are other risks; the assurance that "oh, we just need one or maybe two ops on the web & then everything will be fine forever" doesn't wash with me. If we add new ops, old code won't use them.
And what about hardware that doesn't have any presence on the web? Lots of cheap embedded cores have a couple of TFLOPS of neural coprocessing, but neither wasm nor WebGPU can target that at the moment; it's much too simple a core for that kind of dynamic execution. Targeting that sea of weird, expansive hardware (and targeting it very well indeed) is OpenXLA's chief capability, and I can't imagine forgoing a middleman abstraction like it.
Check out https://mlc.ai/web-stable-diffusion, which builds on top of Apache TVM and brings models from PyTorch 2.0, ONNX, and other sources into the ML compilation flow.
Hah, ChatGPT has successfully poisoned the well. Well done sama.
This lib is great work: a JS interface for running HF models. The comments about how "bad" the outputs are are as surprising to me as they are alarming.
OAI has now set the zero-effort bar so high that even HNers (who click on .js headlines) fall into the gap they've left. That sucking sound you hear is market share being hoovered up.
No they're not, mate; it's just you. I've read the guidelines (thanks for helpfully linking them). I see this on HN: people infer offense and cite the book rather than engage.
By not highlighting what you found "snarky", your response is a definitional "shallow dismissal". I see you just "picked the most provocative thing to complain about". Not a lot of being "kind", either.
So you know what would also be great? If you held yourself to the standards you're keen to police around here.
This is the third time that a candidate has been elected. In this article I will use the names of the candidates and the candidates.
In 2016 the following is a list of the current and former U.S. presidential candidates:
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush/Bush/Bush/Bush (with Republican presidential candidates)
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush/Bush/Bush/Bush/Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush
Former Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/B-1919191929
That's pretty neat. I'm personally wondering to what extent ML compute will be done on consumer devices rather than on servers. We're currently seeing a lot of models that are so large that it doesn't seem feasible to run them locally. But I think there is reason to believe that these models carry a lot of redundancy, redundancy that could lead to orders of magnitude less memory/compute being needed.
The trick here will be using large models as data generators to distill some sub-task into a web-computable model. (I've done it a few times for vision rather than text, and it's amazing how potent it is.)
Right! In a lot of cases, just having the synthetic responses plus human filtering for your sub-task is enough, at least for less essential tasks. I'm thinking of "procedural" content useful for less sensitive things like games.
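For anyone who hasn't tried it, the data-generation side of that workflow is pretty mundane in code. A rough sketch, where labelWithLargeModel is a purely hypothetical stand-in for whatever large hosted "teacher" model you call (it is not part of any particular library):

    // Sketch of distillation-by-synthetic-data: use a large "teacher" model to
    // label examples for a narrow sub-task, filter the results, and dump them as
    // training data for a small, web-deployable "student" model.
    import { writeFileSync } from 'node:fs';

    // Hypothetical stand-in for a call to a large hosted model; in practice this
    // would hit whatever API you use. Returning null drops a bad generation.
    async function labelWithLargeModel(text) {
      return text.length > 0 ? 'PLACEHOLDER_LABEL' : null;
    }

    async function buildDistillationSet(prompts) {
      const dataset = [];
      for (const prompt of prompts) {
        const label = await labelWithLargeModel(prompt); // expensive teacher call
        if (label !== null) {                            // cheap filtering step
          dataset.push({ input: prompt, target: label });
        }
      }
      return dataset;
    }

    const prompts = ['example input one', 'example input two'];
    const dataset = await buildDistillationSet(prompts);

    // JSONL that a fine-tuning script for the small student model can consume.
    writeFileSync('distillation.jsonl', dataset.map((d) => JSON.stringify(d)).join('\n'));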
It's possible to run a full GPT-3 style language model on any device with 4GB of RAM now, so running models on consumer devices is getting more and more feasible by the day. https://simonwillison.net/2023/Mar/11/llama/
I'm mostly a layman with ML stuff, so I might be doing something wrong, but I've not been impressed with Llama even at higher levels. I've run the 35B model in my home lab and it gave some pretty nonsensical responses. The 13B did better though, so could very well be user error.
There's this gold rush going on; you're right, any B without RLHF is meh.
The things getting published as "on-device LLM" focus on bitcrushing the lowest-B model with minimal RLHF and then pronouncing that we have on-device LLMs. We'll definitely get there, but right now noise >>> signal.
First person to admit this and write their blog post with A / B tests vs. a Markov chain deserves the gold.
> I'm personally wondering in how far ML compute will be done on consumer devices, rather than on servers.
Running ML on the device has been one of Apple's value propositions for a long time. They are currently silent on everything that's unfolding, but I expect them to at least mention something at WWDC (and to try to run that something on the device).
If I understand correctly, there was an annual all-company AI day (by invitation) which was silent on recent developments.
But then ~two weeks later there was what seemed like an on-background / press leak about the XDG group that specifically mentioned AI as a current discipline. (Gurman / Bloomberg)
It seems to me that the release of Core ML Stable Diffusion (mentioned in this thread) is something of a comment in and of itself, at least in Apple's read-between-the-lines / hiding-in-plain-sight style.
The company is unveiling a new, and presumably the next, major computing platform at a quality level only they could possibly deliver.
So the relative quiet / lack of comment may be in deference to the gravity of that work.
That said, these changes are too big to ignore; at WWDC we should at least hear language that acknowledges the major developments in AI of late, and some idea of how Apple is thinking about them.
I feel like that's been the pretty consistent lesson in computing over the past decades. New technologies start out as expensive, exotic, and specialized and become cheap and commonplace over time. The more business value the technology provides, the faster it will happen as well I think.
The models will certainly get better (faster to train, less data needed, smaller param counts, etc.) too, though, just like compilers and software have evolved hugely alongside hardware.
They'll meet in the middle. That's what's already happening, and there will probably be co-processors added to consumer devices that excel specifically at the kind of processing these models need.
Hi! Creator of the library here. If you change the generation parameters to be greedy (i.e., sample=no and top_k=0), you will get "Bonjour, comment êtes-vous?"
The top_k and sample generation parameters are just there to show that they are supported :), and they are sometimes useful for the other tasks (like text generation with GPT-2, to get more variety).
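In code, that corresponds to roughly the following. The task and option names mirror the Hugging Face conventions and are meant as a sketch, not the demo's exact internals:

    // Sketch: greedy decoding for the English-to-French example above.
    import { pipeline } from '@xenova/transformers';

    // With a T5-style model, the source/target languages are selected with a
    // task prefix on the input text.
    const translator = await pipeline('text2text-generation');

    const output = await translator('translate English to French: Hello, how are you?', {
      do_sample: false, // "sample = no" in the demo UI
      top_k: 0,         // disable top-k filtering
    });
    console.log(output); // expected output: Bonjour, comment êtes-vous?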
I understand there's reasons the translation is incorrect, but if the very first example you're showing on the page is wrong, most people (who are fluent enough) will just roll their eyes and leave it at that. Maybe showcase an example that works?
I uploaded the Windows XP desktop wallpaper into the image classifier. Just the raw image file. It gave me the labels "monitor", "computer screen", "desktop". "Field", "sky", "grass", that kind of thing were nowhere to be found.
I know this is more of a comment on the state of AI models than Transformers.js. It's probably not even representative of state-of-the-art image classifier models. Just a fun example of how these things learn.
Haha, very interesting! I assume it's because that type of image is only found on computer screens, so the model thinks the grass "contributes to its idea of what a computer screen is".
... and of course, the library only ports those models to the browser; if you train a better model, you can always convert it to the ONNX format, then use it with the library.
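For what it's worth, reproducing that kind of experiment outside the demo UI is just another pipeline call. A rough sketch (the default model and the accepted input types may differ between versions, and the image URL is only a placeholder):

    // Sketch: image classification on an arbitrary image.
    import { pipeline } from '@xenova/transformers';

    // Build an image-classification pipeline with the default (ViT-style) model.
    const classifier = await pipeline('image-classification');

    // The image can be passed by URL (or, in a browser, whatever the demo
    // produces from a file upload). This URL is just a placeholder.
    const predictions = await classifier('https://example.com/bliss-wallpaper.jpg');
    console.log(predictions); // array of { label, score } objects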
Even the default example of "Hello, how are you?" from English to French yields an awfully wrong result ("Hello, what is your experience?")...
I wouldn't trust them for anything else.
The other models are not better, here's the text generation output from "I enjoy walking my cute dog":
> I enjoy walking with my cute dog, I have been going to the park, and I just happened to like walking with my cute dog. I like to play with the dog.
My dog (Hannah) has been on my way home since December and when she came home she told me to go out and stay back. I told her that she had been too busy. I had to start working and had to go outside and go see myself again.
If it were just an algorithm that generates random sentences, it wouldn't make any less sense.
I think it's worth pointing out that the library just gets the models working in the browser. The correctness of the translation is dependent on the model itself.
If you run the model using Hugging Face's Python library, you will get the same results (I've tested it, since I wasn't too happy with those default translations and generations).
With regards to the text generation output, this is also similar to what you will get from the PyTorch model. Check out this blog post from HuggingFace themselves which discusses this: https://huggingface.co/blog/how-to-generate.
> Even the default example of "Hello, how are you?" from English to French yields an awfully wrong result ("Hello, what is your experience?")...
Really? For me that gives "Bonjour, comment êtes-vous?" with the default settings.
> text generation output
Yeah, text generation is really something that requires a big model. The LLaMA 7B model is about 13 GB in fp16 (roughly 4 GB quantized to 4-bit), and that is the smallest model I'd actually attempt to use for unconstrained text generation.
« Bonjour, comment êtes-vous? » barely translates to « Hi, how are you feeling today? » or, depending on the context, to something like « Hi, please describe yourself » to a native French speaker.
This is great. Awesome work. I selected the model for sentiment analysis and changed the prompt. It took a while to download the roughly 170 MB model file, but I understand that's just a one-time thing. And it did the work without crashing. I can imagine this being used in many devices with an embedded browser.
Curious whether this library can be integrated with WebGPU - a recent post (https://news.ycombinator.com/item?id=35191687) announced that WebGPU can now be used for large models.
As I mentioned in another comment, the library just allows the models to be run in the browser. The models generally give the same outputs as if they were run with their PyTorch equivalents, so, the quality can (for the most part) be blamed on the original model.
Also, remember to play around with generation parameters. Some tasks like code completion and speech-to-text work best with greedy sampling (sample=false, top_k=0), while others like text generation work best with random sampling (sample=true, top_k>0)
Looking at the code, it seems like it's only running with WASM SIMD so far. I think the creator said something about the WebGL backend being inaccurate when quantized, or something.
I'd like to use this kind of transformer model in Rust (because it's on the backend, because I can do the data munging there and it will be faster, and for other reasons). rust-bert looks like a good library! But it doesn't compile on Apple Silicon due to weird linking issues that aren't apparent - https://github.com/guillaume-be/rust-bert/issues/338. I've spent a large part of today and yesterday trying to find out why. The only other library I've found for doing this kind of thing programmatically (particularly sentiment analysis) is spark-nlp (https://github.com/JohnSnowLabs/spark-nlp). Some of its models look a little older, which is OK, but it does mean I'd have to do this in another language.
Does anyone know of any sentiment analysis software that can be tuned (other than VADER - I'm looking for something more along the lines of a transformer model, like BERT), that is pretrained, and that can be used from Rust or Python? Otherwise I'll probably end up using spark-nlp and spinning up another process.
---
If you want to keep up-to-date with the development, check us out on twitter: https://twitter.com/xenovacom :)