Transformers.js (xenova.github.io)
378 points by skilled on March 16, 2023 | 75 comments



Hi everyone! Creator of Transformers.js here :) ...

Thanks so much to everyone for sharing! It's awesome to see the positive feedback from the community. As you'll see from the demo, everything runs inside the browser!

As of 2023/03/16, the library supports BERT, ALBERT, DistilBERT, T5, T5v1.1, FLAN-T5, GPT2, BART, CodeGen, Whisper, CLIP, Vision Transformer, and VisionEncoderDecoder models, for a variety of tasks including: masked language modelling, text classification, text-to-text generation, translation, summarization, question answering, text generation, automatic speech recognition, image classification, zero-shot image classification, and image-to-text. Of course, we plan to add many more models and tasks in the near future!

Try out some of the other models/tasks from the "Task" dropdown (like the code-completion or speech-to-text demos).

---

To respond to some comments about poor translation/generation quality, many of the models are actually quite old (e.g., T5 is from 2020)... and if you run the same prompt through the PyTorch version of the model, you will get similar outputs. The purpose of the library/project is to bring these models to the browser; we didn't train the models, so, poor quality can (mostly) be blamed on the original model.

Also, be sure to play around with the generation parameters... as with many LLMs, generation parameters matter a lot.
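For anyone curious what that looks like in code, here is a rough sketch of loading a pipeline and switching between greedy decoding and sampling. The option names follow HuggingFace-style generation config and the model id is only illustrative; check the README for the exact API.

    import { pipeline } from '@xenova/transformers';

    // Model id is illustrative; any supported text-generation model works.
    const generator = await pipeline('text-generation', 'gpt2');

    // Greedy decoding: deterministic, usually better for translation,
    // code completion, and speech-to-text.
    const greedy = await generator('I enjoy walking my cute dog', {
      max_new_tokens: 50,
      do_sample: false,
    });

    // Sampling: more variety, usually better for open-ended text generation.
    const sampled = await generator('I enjoy walking my cute dog', {
      max_new_tokens: 50,
      do_sample: true,
      top_k: 5,
    });

    console.log(greedy, sampled);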

---

If you want to keep up-to-date with the development, check us out on twitter: https://twitter.com/xenovacom :)


Can I use it in Deno? It requires a worker (it fails in Node because of "self").


Yes, there are some workarounds you can do to get it working in non-browser environments. I do aim to provide a permanent solution, which will ideally work out-of-the-box for both browser and Node/Deno environments.

Some other users also reported the issue (which stems from a bug in onnxruntime-web), and we were able to get it working in these cases:

1. https://github.com/xenova/transformers.js/issues/4
2. https://github.com/xenova/transformers.js/issues/19
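For reference, one generic workaround (not taken from those issues, just the usual shim for bundled browser code that assumes a global "self") looks roughly like this in Node:

    // Define `self` before importing, since the bundled onnxruntime-web
    // code expects it to exist. This is a stopgap, not an official fix.
    globalThis.self = globalThis;

    const { pipeline } = await import('@xenova/transformers');
    const classifier = await pipeline('sentiment-analysis');
    console.log(await classifier('Transformers.js is pretty neat'));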


Thanks, I will be following


Is there an Optimus model yet for Prime number encoding?


Good one


What did Optimus Prime say when he first learned about machine learning? "Autobots, roll out the algorithms!"


I really liked the suggestion that if it takes off, the web should consider trying to expose something like the OpenXLA intermediate model, which powers the new PyTorch 2.0, TensorFlow, Jax, and a bunch of other top tier ML frameworks.

It already is very well optimized for a ton of hardware (cpus, gpus, ml chips). The Intermediate Representation might already be a web-safe-ish model, effectively self-sandboxing, which could make it safe to expose.

https://news.ycombinator.com/item?id=35078410


Shouldn't it be possible to build a WebGL backend for OpenXLA?

Edit: There seems to be some progress on a WASM backend for OpenXLA here: https://github.com/openxla/iree/issues/8327

and a proposed WebML working group at W3C: https://www.w3.org/2023/03/proposed-webmachinelearning-chart... that references OpenXLA


Making each webapp target & optimize ML for every possible device target sounds terrible.

The purpose of MLIR is that most of the optimization can be done at lower levels. Instead of everyone figuring out & deciding on their own how best to target & optimize for js, wasm, webgl, and/or webgpu, you just use the industry standard intermediate representation & let the browser figure out the tradeoffs. If there is inboard hardware, neural cores, they might just work!

Good to see WebML has OpenXLA on their radar... but also a bit afraid, expecting some half-assed excuses why of course we're going to make some brand new other thing instead. The web & almost everyone else has such a bad NIH problem. WASI & web file APIs being totally different is one example, where there's just no common cause, even though it'd make all the difference. And with ML, the cost of having your own tech versus being able to re-use the work everyone else puts in feels like a near suicidal decision to make an API that will never be good, never perform anywhere near where it could.


> Making each webapp target & optimize ML for every possible device target sounds terrible.

Yes it does.

Did something I said imply that?

OpenXLA is an intermediate layer that frameworks like PyTorch or JAX can use. It has pluggable backends, and so if there was a web-compatible backend (WebGL or WASM) then everyone could use it and all models that were built using something that used OpenXLA[1] would be compatible.

[1] Not 100% sure how low-level the OpenXLA intermediate representation is. I know it's not uncommon when porting a brand new primitive (eg a special kind of transformer etc) to a new architecture (eg CUDA->Apple M1) that some operations aren't yet supported, so this might be similar.


I support having web targets. It'd be a good offering.

But it feels upside down to me from what we really all should want, which is a safe way to let the web target any backend you have. WebGPU or WebGL or wasm are going to be OK targets, but with limited hardware support & tons of constraints that mean they won't perform as well as openxla.

Also how will these targets get profiled? Do we ship the same WebGL to a 600W monster as to a Raspberry Pi?

There's a lot of really good reasons to want OpenXLA under the browser, rather than above/before it.


> WebGPU or WebGL or wasm are going to be OK targets, but with limited hardware support & tons of constraints that mean they won't perform as well as wasm.

I don't understand. "WebGPU or WebGL or wasm".. "won't perform as well as wasm".


*OpenXLA, edited


I don't think a high level representation is necessary for relatively straightforward FMA extensions (either outer products in the case of Apple AMX or matrix products in the case of CUDA/Intel AMX). WebGPU + tensor core support and WASM + AMX support would be simpler to implement, likely more future proof and wouldn't require maintaining a massive layer of abstraction.


The issue is, much of the performance of Pytorch, JAX, et al comes from running a JIT that is tuned to the underlying HW, and come with support for high level intrinsic operations that were either hand-tuned or have extra hardware support, especially ops dealing with parallelizing computation across multiple cores.

You'd probably end up representing these as external library function calls in WASM, but then the WASM JIT would have to be taught that these are magic functions that are potentially treated specially. So at that point you're just embedding HLO ops as library functions, and then embedding an HLO translator into the WASM runtime; I'm not sure that's any better.

By analogy: would it be better to eliminate fragment and vertex shaders and just use WASM for sending shaders to the GPU, or are the domain-specific language and its constraints beneficial to the GPU drivers?


Yes! In short:

Do we leave it to every web app to figure out how best to serve everyone, and have them bundle their own tuning optimizers into each app? Or do we bake in a higher level abstraction that works for everyone that the browser itself will be able to help optimize?

There's some risk, & the browser APIs likely won't come with all the escape-hatches the full tools might have for manually jiggering with optimizations, but the idea of getting everyone to DIY seems like a recipe for a bad fit: way too much code when you don't need it, not nearly enough tuning when you do. And there are other risks; the assurance that oh, we just need one or maybe two ops on the web & then everything will be fine forever doesn't wash with me. If we add new ops, the old code won't use them.

And what about hardware that doesn't have any presence on the web? Lots of cheap embedded cores have a couple of TFLOPS of neural coprocessing, but neither WASM nor WebGPU can target that at the moment; it's much too simple a core for that kind of dynamic execution. The sea of weird, expansive hardware that OpenXLA helps one target (and target very well indeed) is its chief capability, and I can't imagine forgoing a middleman abstraction like it.


Check out https://mlc.ai/web-stable-diffusion, which builds on top of Apache TVM and brings models from PyTorch 2.0, ONNX, and other sources into the ML compilation flow.


Hah, ChatGPT has successfully poisoned the well. Well done sama.

This lib is great work, a JS interface for running HF models. The comments about how "bad" the outputs are seem to me as surprising as they are alarming.

OAI has now set the zero-effort bar so high that even HNers (who click on .js headlines) fall into the gap they've left. That sucking sound you hear is market share being hoovered up.


your comments are very snarky

It would be great if we all tried to keep the tone respectful and avoid snarkiness, to maintain a constructive discussion.

https://news.ycombinator.com/newsguidelines.html


No they're not mate, it's just you. I've read the guidelines (thanks for helpfully linking them). I see this on HN, people infer offense and cite the book rather than engage.

By not highlighting what you found "snarky" your response is a definitional "shallow dismissal". I see you just "picked the most provocative thing to complain about". Not a lot of being "kind" either.

So you know what would also be great? If you held yourself to the standards you're keen to police around here.


I typed in 1 2 3 4 5 6 in a text generation task with length=500 and got this:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 4142 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 1 2 3 4 5 6 7 8 9 10 11 12 13 15 15 16 16 18 19 20 21 22 23 24 25 25 26 27 28 29 30 31 32 32 33 34 35 36 37 38 39 41 42 44 45 46 47 48 50 51 53 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 85 86 87 88 89 90 92 93 94 95 97 98 99 100

This is the third time that a candidate has been elected. In this article I will use the names of the candidates and the candidates. In 2016 the following is a list of the current and former U.S. presidential candidates: Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush/Bush/Bush/Bush (with Republican presidential candidates) Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush/Bush/Bush/Bush/Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush Former Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/Bush/B-1919191929


Gonna use this as lyrics, thanks.


That's pretty neat. I'm personally wondering to what extent ML compute will be done on consumer devices, rather than on servers. We're currently seeing a lot of models that are so large that it doesn't seem feasible to run them locally. But I think there is reason to believe that these models carry a lot of redundancy. Redundancy that could lead to orders of magnitude less memory/compute being needed.

Or perhaps hardware will catch up first.


The trick here will be using large models as data generators to distill some sub task into a web computable model. (I’ve done it a few times for vision rather than text and it’s amazing how potent it is.)
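For the text case, the loop is roughly this (a sketch only; labelWithLargeModel() is a hypothetical stand-in for whatever big hosted model you use as the data generator):

    import { writeFileSync } from 'node:fs';

    const unlabeled = ['great product, would buy again', 'arrived broken' /* ... */];

    const rows = [];
    for (const text of unlabeled) {
      // Hypothetical call to the large "teacher" model.
      const label = await labelWithLargeModel(
        `Classify the sentiment of this review as positive or negative:\n${text}`
      );
      rows.push(JSON.stringify({ text, label: label.trim() }));
    }

    // The resulting JSONL becomes the training set for a small, web-computable
    // "student" model (e.g. a DistilBERT classifier later exported to ONNX).
    writeFileSync('distilled-dataset.jsonl', rows.join('\n'));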


Right! In a lot of cases, just having the synthetic responses plus human filtering for your sub task is enough for less essential tasks. I’m thinking of “procedural” content useful for less sensitive things like games.


Can you describe the vision bit? I have a general idea but would like to know the details, e.g. which models you used.


It's possible to run a full GPT-3 style language model on any device with 4GB of RAM now, so running models on consumer devices is getting more and more feasible by the day. https://simonwillison.net/2023/Mar/11/llama/


It's possible to run an RLHF-tuned LLaMA 7B model. Whether this is "full GPT-3 style" is up for debate.


I'm mostly a layman with ML stuff, so I might be doing something wrong, but I've not been impressed with Llama even at higher levels. I've run the 35B model in my home lab and it gave some pretty nonsensical responses. The 13B did better though, so could very well be user error.


There’s this gold rush going on, you’re right, any B without RLHF is meh.

The things getting published as “on device LLM” focus on bitcrushing the lowest B model with minimal RLHF and then pronouncing we have on device LLMs. We’ll definitely get there but signal >>> noise currently.

First person to admit this and write their blog post with A / B tests vs. a Markov chain deserves the gold.


Have you tried Alpaca yet? It's a massive improvement on base LLaMA.


> I'm personally wondering in how far ML compute will be done on consumer devices, rather than on servers.

Running ML on the device has been one of Apple's value propositions for a long time. They are currently silent on everything that's unfolding, but I expect them to at least mention something at WWDC (and try to run that something on the device).


If I understand correctly, there was an annual AI day, to which the whole company was invited, that was silent on recent developments.

But then ~two weeks later there was what seemed like an on-background / press leak about the XDG group that specifically mentioned AI as a current discipline. (Gurman / Bloomberg)

It seems to me that the release of Core ML Stable Diffusion (mentioned ITT) is something of a comment in and of itself. At least in the read-between-the-lines / hiding-in-plain-sight style of Apple.

The company is unveiling a new and presumably next major computing platform at a quality level only they could possibly deliver.

So the relative quiet / lack of comment may be in deference to the gravity of that work.

That said, these changes are too big to ignore - we should at least hear language that acknowledges the major developments in AI of late at WWDC and some idea for how Apple is thinking about them.


They’re there, released Core ML Stable Diffusion a couple months ago.


I am not a Swift dev but it seemed like the speed of this release was very fast by Apple standards.

Can anyone in the know confirm that?


> Or perhaps hardware will catch up before.

I feel like that's been the pretty consistent lesson in computing over the past decades. New technologies start out as expensive, exotic, and specialized and become cheap and commonplace over time. The more business value the technology provides, the faster it will happen as well I think.

The models will certainly get better (faster to train, less data needed, smaller param counts, etc.) too, though, just like compilers and software have evolved hugely alongside hardware.


they'll meet in the middle. that's what's already happening, and there will probably be co-processors added into consumer devices that excel specifically at the kind of processing that these models need.


> there will probably be co-processors added into consumer devices that excel specifically at the kind of processing that these models need.

There already are, e.g., Google Edge TPU, Apple Neural Engine, etc.


They don't help with the memory requirements of these LLMs though.


Are any of the LLM or image-AI (like Stable Diffusion) fine-tuning methods leveraging the Apple Neural Engine?

the best I've seen is a renderer leveraging "metal"


Hmm, this works with literal translation, then?

    Hello, how are you?
is literally,

    Bonjour, comment êtes-vous?
But usually you would say,

    Bonjour, comment ça va?
(Hello, how goes it?)

Which the model likes to translate to,

    Bonjour, comment est-ce faite?
Which no french person would ever say to you because that's a lot of words and doesn't really sound very... French.

And of course are you talking to someone familiar... so on and so forth.


Hi! Creator of the library here. If you change the generation parameters to be greedy (i.e., sample=no and top_k=0), you will get "Bonjour, comment êtes-vous?"

The top_k and sample generation parameters are just there to show that they are supported :), and they are sometimes useful for the other tasks (like text generation w/ GPT-2, to get more variety).
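In code form, the greedy setting looks something like this (task/model identifiers and option names are placeholders based on the demo, not copied from the docs):

    import { pipeline } from '@xenova/transformers';

    const translator = await pipeline('translation_en_to_fr', 't5-small');
    const output = await translator('Hello, how are you?', {
      do_sample: false, // greedy decoding (the demo's sample=no)
      top_k: 0,
    });
    console.log(output); // e.g. "Bonjour, comment êtes-vous?"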


I understand there's reasons the translation is incorrect, but if the very first example you're showing on the page is wrong, most people (who are fluent enough) will just roll their eyes and leave it at that. Maybe showcase an example that works?


I did a couple of tries with simple sentences in French and the results were not great. But it’s still impressive.


I uploaded the Windows XP desktop wallpaper into the image classifier. Just the raw image file. It gave me the labels "monitor", "computer screen", "desktop". "Field", "sky", grass", that kind of thing were nowhere to be found.

I know this is more of a comment on the state of AI models than Transformers.js. It's probably not even representative of state-of-the-art image classifier models. Just a fun example of how these things learn.


Haha very interesting! I assume it's because that type of image is only found on computer screens, so the model thinks the grass "contributes to its idea of what a computer screen is".

... and of course, the library only ports those models to the browser; if you train a better model, you can always convert it to the ONNX format, then use it with the library.


Even the default example of "Hello, how are you?" from English to French yields an awfully wrong result ("Hello, what is your experience?")...

I wouldn't trust them for anything else.

The other models are not better, here's the text generation output from "I enjoy walking my cute dog":

> I enjoy walking with my cute dog, I have been going to the park, and I just happened to like walking with my cute dog. I like to play with the dog. My dog (Hannah) has been on my way home since December and when she came home she told me to go out and stay back. I told her that she had been too busy. I had to start working and had to go outside and go see myself again.

If it were just an algorithm generating random sentences, it wouldn't make any less sense.


Hi there! Creator of Transformers.js here :)

I think it's worth pointing out that the library just gets the models working in the browser. The correctness of the translation is dependent on the model itself.

If you run the model using HuggingFace's Python library, you will also get the same results (I've tested it, since I wasn't too happy with those default translations and generations).

With regards to the text generation output, this is also similar to what you will get from the PyTorch model. Check out this blog post from HuggingFace themselves which discusses this: https://huggingface.co/blog/how-to-generate.


> Even the default example of "Hello, how are you?" from English to French yields an awfully wrong result ("Hello, what is your experience?")...

Really? For me that gives "Bonjour, comment êtes-vous?" with the default settings.

> text generation output

Yeah, text generation is really something that requires a big model. The Llama 7B param model quantized to 4bit is 13G and that is the smallest model I'd actually attempt to use for unconstrained text generation.


> "Bonjour, comment êtes-vous?"

The idiomatic translation here would be "Bonjour, comment allez-vous?"


As shown in the demo video (on GitHub [1], or Twitter [2]), you do get that result sometimes (with randomness)

Using greedy sampling (sample=false and top_k=0) you get "Bonjour, comment êtes-vous?", which appears to be a very direct translation.

As mentioned in one of my previous comments, these inaccuracies also occur in the PyTorch models, and so, it's not the library's fault :')

[1] https://github.com/xenova/transformers.js [2] https://twitter.com/xenovacom/status/1628895478749315073


« Bonjour, comment êtes-vous? » barely translates to « Hi, how are you feeling today? » or, depending on the context, to something like « Hi, please describe yourself » to a native French speaker.


Yes I love being able to run ML models without dealing with Python package management!


I guess single words just don't give it enough context to go on - I got some pretty weird results by just switching the input text to: Hi!

Often it would say "Bonjour", but then it would say things like:

ce sujet, je peux dire tout à fait que les médias sont vraiment un grand plus beau jeu de tas d'élevage dans mon ensemble.

or

Voir le chapitre intitulé “E-Malonie”, à l’adresse : http://www.mythuana.com/index_f.php!

and once simply: o


How does this compare to tensorflow.js [1] ?

[1] https://www.tensorflow.org/js


They solve different issues. This is a library akin to Hugging Face Transformers, while TensorFlow.js is akin to TensorFlow or PyTorch.


This project is a wrapper over ONNX-converted PyTorch models; ONNX Runtime [1] with its JavaScript backend would be the tensorflow.js equivalent.

[1] https://onnxruntime.ai/pytorch
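To make the comparison concrete, here's roughly what using onnxruntime-web directly looks like, assuming a BERT-style model whose inputs are input_ids and attention_mask (the token ids below are made up); the library wraps the tokenization and post-processing around this:

    import * as ort from 'onnxruntime-web';

    const session = await ort.InferenceSession.create('./model.onnx');

    // You still have to tokenize the text yourself and build the feeds.
    const feeds = {
      input_ids: new ort.Tensor('int64', BigInt64Array.from([101n, 7592n, 102n]), [1, 3]),
      attention_mask: new ort.Tensor('int64', BigInt64Array.from([1n, 1n, 1n]), [1, 3]),
    };

    const results = await session.run(feeds);
    console.log(Object.keys(results)); // raw logits, no pipeline post-processing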


This is great. Awesome work. I selected the model for sentiment analysis and changed the prompt. It took a while to download the roughly 170MB model file, but I understand it is just a one-time thing. And it did the work without crashing. I can imagine this being used in many devices with an embedded browser.


Curious if this library can be integrated with WebGPU - a recent post (https://news.ycombinator.com/item?id=35191687) announced that WebGPU can now be used for large models.


Once ONNX runtime releases their WebGPU backend, we will add support for it! :)

It should also be noted that browser support for it isn’t very high at the moment… so, unfortunately, we are stuck with WASM (CPU) for now.


How does this compare to this project: https://github.com/visheratin/web-ai

Does it support using custom ONNX models?


On one hand it's impressive how much it can do, and on the other, is it not useful for anything more than making a videogame character more interesting?


And also privacy-respecting services.


Are there more accurate models available?

All my tests seem to give poor results. I assume that's because the models have to be a downloadable size?


Here is the full list of available models: https://huggingface.co/Xenova/transformers.js/tree/main/quan...

As I mentioned in another comment, the library just allows the models to be run in the browser. The models generally give the same outputs as if they were run with their PyTorch equivalents, so, the quality can (for the most part) be blamed on the original model.

Also, remember to play around with generation parameters. Some tasks like code completion and speech-to-text work best with greedy sampling (sample=false, top_k=0), while others like text generation work best with random sampling (sample=true, top_k>0)


What's performance like, compared to regular PyTorch (running on CPU)?


This runs on GPU with WebGL, so it will depend on what GPU you have.


Looking at the code it seems like it's only running using simd so far. I think the creator said something about the WebGL models being inaccurate when quantized or something.



Right - currently, everything runs using WASM (32-bit, with 64-bit coming soon [1,2]), and I plan to add support for WebGPU soon!

(WebGPU, the successor to WebGL, is coming out in April 2023 [3])

[1] https://github.com/WebAssembly/memory64/issues/36#issuecomme... [2] https://groups.google.com/a/chromium.org/g/blink-dev/c/VomzP... [3] https://github.com/microsoft/onnxruntime/issues/11695#issuec...


If I run transformer.js using Node on a machine that has GPU (such as Nvidia Jetson Nano) will it take advantage of the GPU?


Currently the library only runs on the CPU (it's only a few weeks old). WebGPU support is planned though (WebGPU itself releases soon [1]).

[1] https://groups.google.com/a/chromium.org/g/blink-dev/c/VomzP...


This isn't a robot in disguise! Imposter!


I'd like to use this transformer model in Rust (because it's on the backend, because I can use data munging and it will be faster, and for other reasons). It looks like a good model! But it doesn't compile on Apple Silicon due to weird linking issues that aren't apparent - https://github.com/guillaume-be/rust-bert/issues/338. I've spent a large part of today and yesterday attempting to find out why. The only other library that I've found for doing this kind of thing programmatically (particularly sentiment analysis) is this (https://github.com/JohnSnowLabs/spark-nlp). Some of the models look a little older, which is OK, but it does mean that I'd have to do this in another language.

Does anyone know of any sentiment analysis software that can be tuned (other than VADER - I'm looking for more along the lines of a transformer model, like BERT), but is pretrained and can be used in Rust or Python? Otherwise I'll probably be using spark-nlp and having to spin up another process.

Thanks.



