Per instance (a worker serving an API request), it requires 8x GPUs. I believe they have thousands of these instances and scale them up with load.

Because the model isn't dynamic (it doesn't learn at serving time), it is stateless and can be scaled elastically.
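
As a toy sketch of what that buys you (hypothetical Python with made-up names like FrozenModel and Worker, not their actual stack): since no request mutates the weights, any replica can serve any request, so scaling out is just adding replicas behind a load balancer.

    import itertools

    class FrozenModel:
        """Stand-in for the real model: weights are read-only at serve time."""
        def generate(self, prompt):
            return f"completion for: {prompt}"

    class Worker:
        def __init__(self):
            self.model = FrozenModel()  # no per-user state lives here

        def handle(self, prompt):
            return self.model.generate(prompt)  # nothing carried between calls

    # Scaling up under load is just appending more Workers to this list.
    workers = [Worker() for _ in range(3)]
    dispatch = itertools.cycle(workers)  # round-robin: any replica will do

    def serve(prompt):
        return next(dispatch).handle(prompt)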


Ah okay, that makes a lot more sense, thank you!


I expect some level of caching, and even request bucketing by similarity, is possible.
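
A minimal sketch of what a response cache might look like (hypothetical helper names, not a real API; run_model stands in for the expensive multi-GPU call). Note the key has to cover everything that affects the output, not just the prompt text, and a cached answer is only valid if decoding is deterministic:

    import hashlib

    def run_model(prompt, temperature, seed):
        # Placeholder for the expensive multi-GPU forward pass.
        return f"completion for: {prompt}"

    cache = {}

    def cache_key(prompt, temperature, seed):
        # Key on everything that affects the output, not just the prompt.
        raw = f"{prompt}|{temperature}|{seed}".encode()
        return hashlib.sha256(raw).hexdigest()

    def serve_cached(prompt, temperature=0.0, seed=0):
        key = cache_key(prompt, temperature, seed)
        if key not in cache:
            cache[key] = run_model(prompt, temperature, seed)
        return cache[key]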

How many users come with the same prompt?


In my experience, running the same prompt always gets different results. Maybe they cache across different people, but I'm not sure that'd be worth the cache space at that point? Although 8x A100s is a lot to not have caching...
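
The different results are consistent with sampling: the model outputs a probability distribution over next tokens, and with temperature > 0 the server samples from it instead of taking the argmax, so identical prompts legitimately diverge. A toy illustration (not the actual decoder):

    import random

    def sample_next_token(probs, temperature=1.0):
        if temperature == 0:
            return max(probs, key=probs.get)  # greedy decoding: deterministic
        # Reweight by temperature, then draw a sample.
        weights = {t: p ** (1.0 / temperature) for t, p in probs.items()}
        total = sum(weights.values())
        tokens = list(weights)
        return random.choices(tokens, weights=[weights[t] / total for t in tokens])[0]

    probs = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
    print([sample_next_token(probs) for _ in range(5)])     # varies run to run
    print([sample_next_token(probs, 0) for _ in range(5)])  # always "cat"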



