
This is super cool, but unfortunately it also seems super impractical. Models tend to be quite large, so even if a browser can run them, getting them to the browser involves either:

1. Large downloads on every visit to a website.

2. Large downloads and high storage consumption for each website using large models. (150 websites x 800 MB models => 120 GB of storage used)

Both of those options seem terrible.

I think it might make sense for browsers to ship with some models built in and be exposed via standardized web APIs in the future, but I haven't heard of any efforts to make that happen yet.




We’ve put out a ton of demos that use much smaller models (10-60 MB), including the following (minimal usage sketch after the list):

- (44MB) In-browser background removal: https://huggingface.co/spaces/Xenova/remove-background-web. (We also put out a WebGPU version: https://huggingface.co/spaces/Xenova/remove-background-webgp...).

- (51MB) Whisper Web for automatic speech recognition: https://huggingface.co/spaces/Xenova/whisper-web (just select the quantized version in settings).

- (28MB) Depth Anything Web for monocular depth estimation: https://huggingface.co/spaces/Xenova/depth-anything-web

- (14MB) Segment Anything Web for image segmentation: https://huggingface.co/spaces/Xenova/segment-anything-web

- (20MB) Doodle Dash, an ML-powered sketch detection game: https://huggingface.co/spaces/Xenova/doodle-dash

… and many many more! Check out the Transformers.js demos collection for some others: https://huggingface.co/collections/Xenova/transformersjs-dem....
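To give a sense of what usage looks like, here is a minimal sketch with the pipeline API (the checkpoint name is just one example from the Hub, and the audio URL is a placeholder; the quantized ONNX weights are what actually get downloaded):

    // Minimal Transformers.js sketch (browser or Node).
    // 'Xenova/whisper-tiny.en' is one example checkpoint; any compatible
    // model id from the Hub works. The audio URL is a placeholder.
    import { pipeline } from '@xenova/transformers';

    const transcriber = await pipeline(
      'automatic-speech-recognition',
      'Xenova/whisper-tiny.en'
    );

    const result = await transcriber('https://example.com/audio.wav'); // placeholder URL
    console.log(result.text);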

Models are cached on a per-domain basis (using the Web Cache API), meaning you don’t need to re-download the model on every page load. If you would like to persist the model across domains, you can create browser extensions with the library! :)
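Roughly, the caching pattern looks like this (an illustrative hand-rolled sketch of the Web Cache API, not our exact internals; the library handles this for you automatically):

    // Illustrative per-origin model caching with the Web Cache API.
    // Cache name and model URL are placeholders.
    async function fetchModelCached(url) {
      const cache = await caches.open('model-cache');
      const hit = await cache.match(url);
      if (hit) return hit.arrayBuffer();        // already on disk, no re-download

      const response = await fetch(url);
      await cache.put(url, response.clone());   // persist for future page loads
      return response.arrayBuffer();
    }

    const modelUrl = 'https://example.com/model_quantized.onnx'; // placeholder
    const weights = await fetchModelCached(modelUrl);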

As for your last point, there are efforts underway, but nothing I can speak about yet!


Why is only one of them on WebGPU? Is it because there are additional tricky steps required to make a model work on WebGPU, or is there a limitation on which ops are supported there?

I'm keen to do more stuff with WebGPU, so very interested to learn about challenges and limitations here.


We have some other WebGPU demos, including:

- WebGPU embedding benchmark: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchm...

- Real-time object detection: https://huggingface.co/spaces/Xenova/webgpu-video-object-det...

- Real-time background removal: https://huggingface.co/spaces/Xenova/webgpu-video-background...

- WebGPU depth estimation: https://huggingface.co/spaces/Xenova/webgpu-depth-anything

- Image background removal: https://huggingface.co/spaces/Xenova/remove-background-webgp...

You can follow the progress for full WebGPU support in the v3 development branch (https://github.com/xenova/transformers.js/pull/545).

To answer your question, while there are certain ops missing, the main limitation at the moment is for models with decoders... which are not very fast (yet) due to inefficient buffer reuse and many redundant copies between CPU and GPU. We're working closely with the ORT team to fix these issues though!
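If you want to poke at WebGPU directly in the meantime, here is a rough onnxruntime-web sketch (the model path and input name are placeholders, not taken from any of our demos; depending on the onnxruntime-web version, the WebGPU backend may live in a separate bundle):

    // Rough sketch: run an ONNX model on the WebGPU execution provider.
    // 'model_quantized.onnx' and the input name 'pixel_values' are placeholders
    // for whatever your exported model actually uses.
    import * as ort from 'onnxruntime-web/webgpu';

    const session = await ort.InferenceSession.create('model_quantized.onnx', {
      executionProviders: ['webgpu', 'wasm'],   // fall back to WASM where needed
    });

    const input = new ort.Tensor(
      'float32',
      new Float32Array(1 * 3 * 224 * 224),      // dummy image tensor
      [1, 3, 224, 224]
    );
    const outputs = await session.run({ pixel_values: input });
    console.log(Object.keys(outputs));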


Thank you for the reply. Seems like all of the links are down at the moment, but it does sound a bit more feasible for some applications than I had assumed.

Really glad to hear the last part. Some of the new capabilities seem fundamental enough that they ought to be in browsers, in my opinion.


Odd, the links seem to work for me. What error do you see? Can you try on a different network (e.g., mobile)?


Error is "xenova-segment-anything-web.static.hf.space unexpectedly closed the connection."

Works on mobile network, though, so might just be my internet connection.


Basically the same problem that has plagued games on the web ever since the first Unreal/Unity asm.js demos a decade ago, and pretty much no progress has been made towards a solution in that time. You just can't practically build a web app that needs gigabytes of data on the client, because there's no reliable way to make sure it stays cached for as long as the user wants it to. And as you say, even if you could reliably cache it, the download and storage would still be duplicated per site using the same model because of browsers' cache partitioning policies.


Been beating this drum for years. Wrote this 9 years ago! https://max.io/articles/the-state-of-state-in-the-browser/



Actually it goes back further: Java applets, Flash (Unreal initially targeted Flash on the web), and PNaCl, all before asm.js came to be.

The Unreal 3 demo using Flash is still on YouTube.

And this is why most game studios are playing wait-and-see and betting on streaming instead: proper native 3D APIs, easier-to-debug tooling (the web still has nothing better than SpectorJS), and no constraints on asset size.


The File System Access API seems promising.


I'm not sure more APIs are the solution. LocalStorage could already, in theory, fill the role of a persistent large data store if browsers didn't cap it at 5-10 MB for UX reasons. Removing that cap would require user-facing changes that let users manage the storage used by sites and clean it up manually when it inevitably gets bloated. Any new API that lets sites save stuff on the client is going to have the same issue.


>Any new API which lets sites save stuff on the client is going to have the same issue.

I don't think it would have the same issues, because the files could be stored in a user specified location outside the browser's own storage area.

Browser vendors can't just delete stuff that may be used by other software on a user's system. And they cannot put a cap on it either, because users can store whatever they like in those directories, bypassing the browser entirely.

But I have never used this API, so maybe I misunderstand how it's supposed to work.


If that's how it works then it would avoid the problem I mentioned, but the UX around using that to cache data internal to the site implementation sounds pretty terrible. You click on "Listen to this article" on a webpage and it opens a file chooser expecting you to open t2s-hq-fast-en+pl.model if you already have it? Users won't be able to make any sense of that.


The API (or at least the Chrome implementation) appears to be unfinished, but the plan seems to be to eventually support persistent directory permissions.

So the web app could ask the user to pick a directory for model storage and henceforth store and load models from there without further interaction.
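If it does land that way, I'd imagine the flow looks roughly like this (a sketch against today's Chromium-only API; the handle can be stashed in IndexedDB so the user is only re-prompted for permission, not for the directory itself):

    // Sketch: pick a model directory once, then read/write models there later.
    // Chromium-only today; file names and modelUrl are placeholders.
    const dir = await window.showDirectoryPicker({ mode: 'readwrite' });

    // On a later visit (after loading the saved handle from IndexedDB),
    // the browser may ask the user to re-grant permission:
    if ((await dir.queryPermission({ mode: 'readwrite' })) !== 'granted') {
      await dir.requestPermission({ mode: 'readwrite' });
    }

    // Write a downloaded model into the chosen directory...
    const modelUrl = 'https://example.com/model_quantized.onnx'; // placeholder
    const fileHandle = await dir.getFileHandle('model_quantized.onnx', { create: true });
    const writable = await fileHandle.createWritable();
    await writable.write(await (await fetch(modelUrl)).blob());
    await writable.close();

    // ...and on later visits, load it back without re-downloading.
    const file = await (await dir.getFileHandle('model_quantized.onnx')).getFile();
    const weights = await file.arrayBuffer();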


This is pure free-association: the models here are below 80 MB; the rest are LLMs and aren't in scope. Whisper is 40 MB, the embedding model is 23 MB. (n.b. parts of the original comment that actively disclaim understanding: "seems super impractical. Models tend to be quite large...150 websites x 800 MB models")


Some of the models are quite small and worth running on-device rather than sending all the data to a server to process. The other huge benefit here is that Transformers.js runs in Node.js, and getting things running is way easier than trying to get some odd combination of Python and its dependencies to work.


It's an inherent problem with on-device AI processing, not just in the browser. I think this will only get better when operating systems start to preinstall models and provide an API that browser vendors can use as well.

Even then I think cloud hosted models will probably always be far better for most tasks.


This specific problem is certainly not one for all on-device AI processing. As someone else mentioned, there are unique UX and browser constraints that come from serving large, compute-intensive binary blobs through the browser (constraints that are almost identically shared by games).

Separately, having to rely on preinstallation very likely means stagnating on overly sanitized, poorly done official instruction-tunes. With the exception of Mixtral 8x7B, the trend has been that the community over time arrives at finetunes which far eclipse the official ones.


> I think cloud hosted models will probably always be far better for most tasks

It might depend on just how good you need it to be. There are lots of use-cases where an LLM like GPT 3.5 might be "good enough" such that a better model won't be so noticeable.

Cloud models will likely have the advantage of being more cutting-edge, but running "good enough" models locally will probably be more economical.


I agree. The economic advantages of a hybrid approach could be very significant.


Apple's future is predicated on local machine learning instead of cloud machine learning. They're betting big on it, and you can see the chess pieces being moved into place. They desperately do not want to become a thin client for magical cloud AI.

I'd look to see Apple doing some stuff here.


Not using Transformers.js, but we do object detection in the browser with small quantized YOLO models that are about 7 MB and run at 30+ fps on modern laptops via tensorflow.js and onnxruntime-web.

Lots of cool demos and real-world applications you can build with it. E.g. we powered an AR card ID feature for Magic: The Gathering, built a scavenger hunt for SXSW, a test proctoring assistant (to warn you if you're likely to get DQ'd for, e.g., wearing headphones), and a pill counter for pharmacists. Really powerful for distribution: users don't have to install an app or need anything other than their smartphone.
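Not our exact code, but the in-browser loop looks roughly like this with tensorflow.js (the model path and 640x640 input size are placeholders; a real YOLO head still needs box decoding and NMS):

    // Rough sketch of a webcam detection loop with tensorflow.js.
    // '/models/yolo_web/model.json' is a placeholder path; decoding the raw
    // YOLO output into boxes (plus NMS) is model-specific and omitted here.
    import * as tf from '@tensorflow/tfjs';

    const model = await tf.loadGraphModel('/models/yolo_web/model.json');

    async function detectFrame(video) {
      const predictions = tf.tidy(() => {
        const input = tf.browser.fromPixels(video)
          .resizeBilinear([640, 640])
          .div(255)
          .expandDims(0);                // [1, 640, 640, 3]
        return model.execute(input);     // raw predictions; decode + NMS elsewhere
      });
      // ...draw the decoded boxes on a canvas, then:
      tf.dispose(predictions);
      requestAnimationFrame(() => detectFrame(video));
    }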


If they are single files or directories, they could be drag-and-dropped on use. Not very convenient though.

Maybe some sort of API that gives the website fine-grained access to the filesystem would be enough. You'd specify a directory or single file the website can read from at any time.

However, at some point you will have to download large files, and I feel that when it's done implicitly it's a bad user experience.

On top of that, the developer should implement a robust downloading system that can resume downloads, check for validity, etc. Developers rarely bother with this, so the user experience is that it sucks.
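For what it's worth, even a hand-rolled resume isn't much code if the server honors Range requests; a sketch (integrity checking and persistence of partial progress are left out):

    // Sketch of a resumable download using HTTP Range requests.
    // Assumes the server supports Range; checksum validation is omitted.
    async function downloadWithResume(url, existing = new Uint8Array(0)) {
      const headers = existing.length ? { Range: `bytes=${existing.length}-` } : {};
      const res = await fetch(url, { headers });

      if (res.status === 200) {
        // Server ignored the Range header: take the full body instead.
        return new Uint8Array(await res.arrayBuffer());
      }
      if (res.status !== 206) throw new Error(`HTTP ${res.status}`);

      const rest = new Uint8Array(await res.arrayBuffer());
      const full = new Uint8Array(existing.length + rest.length);
      full.set(existing);
      full.set(rest, existing.length);
      return full; // caller persists progress between attempts
    }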



Still requires drag&drop on most browsers because the File/DirectoryPicker API isn't universally supported.


The origin private file system is supported in all modern browsers. That does make sharing models between origins difficult at best, but for a single origin it works fine.
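For reference, a minimal OPFS sketch (the file name and URL are placeholders; note that createWritable may still require a worker in some browsers):

    // Minimal origin private file system sketch: cache a model per-origin
    // so it survives reloads. 'model.onnx' and the URL are placeholders.
    const root = await navigator.storage.getDirectory();

    async function getCachedModel(url) {
      try {
        const handle = await root.getFileHandle('model.onnx');
        return await (await handle.getFile()).arrayBuffer();    // cache hit
      } catch {
        const bytes = await (await fetch(url)).arrayBuffer();   // cache miss
        const handle = await root.getFileHandle('model.onnx', { create: true });
        const writable = await handle.createWritable();         // may need a worker in Safari
        await writable.write(bytes);
        await writable.close();
        return bytes;
      }
    }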

And in any case it's easier to direct users to install Chrome (or preferably Chromium), or to instruct them to drag-and-drop, than to walk them through the brittle, error- and bitrot-prone virtualenv-pip-docker-git song and dance.


This is why I was hoping the startup MightyApp would succeed. Then it would be practical to build web apps that operated on GB/TB of data in a single tab. Most of the time people would use their normal browser, but for big data jobs you would use your Mighty browser with persistence and unlimited RAM streamed from the cloud. A path to get the best of web apps with the power of native apps. Glad they gave it a shot. Definitely was an idea worth trying.


Browsers can store downloaded data, e.g. using the File System Access API, and those files can be accessed from multiple websites. Browser applications can run offline with service workers.

JS/browser-based solutions seem to be very often knee-jerk dismissed based on a decade-old understanding of browser capabilities.


This is probably a really stupid question, but can the models be streamed as they're being run, so that the browser wouldn't need to wait for the entire download first? Or is there even a concept of "ordered" transformer models that could run that way?

Even as I ask, it seems wrong to me, but just to confirm.


Usually the inference time is small compared with download time so even if this were technically feasible you wouldn’t save much time.

For reference, I have a 31 MB vision transformer I run in my browser. Building the inputs, running inference, and parsing the response takes less than half a second.


> Usually the inference time is small compared with download time so even if this were technically feasible you wouldn’t save much time.

I can understand that but where time is not a factor and solely a question of data, can a model be streamed?


LLMs like ChatGPT only generate one token at a time. To generate more you run inference repeatedly until you reach a stop token or some other predetermined limit.

I don't see streaming helping anything besides maybe Time-To-First-Inference, but regardless, you're still not getting any output until the entire weights are downloaded.
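Schematically, generation is just this loop (illustrative JavaScript, greedy decoding; model.forward and eosTokenId are stand-ins, not a real API). Even the first iteration already needs every layer's weights, which is why streaming them in doesn't buy you early output:

    // Schematic autoregressive generation loop (greedy decoding).
    // model.forward and eosTokenId are illustrative stand-ins.
    async function generate(model, promptTokens, { maxNewTokens = 128, eosTokenId = 2 } = {}) {
      const tokens = [...promptTokens];
      for (let i = 0; i < maxNewTokens; i++) {
        const logits = await model.forward(tokens);       // full pass over all layers/weights
        const last = logits[logits.length - 1];
        const next = last.indexOf(Math.max(...last));     // greedy: argmax of the last position
        tokens.push(next);
        if (next === eosTokenId) break;                   // stop token reached
      }
      return tokens;
    }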


Might make more sense for web apps and Electron applications.





