
I would be curious how much less computationally expensive these models are. Full-blown LLMs are overkill for most of the things I do with them. Does running these smaller models significantly affect battery life on mobile devices? This could actually end up saving a ton of electricity. (Or maybe induce even more demand...)
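
For a rough sense of scale: a decoder-only transformer's forward pass costs on the order of 2 FLOPs per parameter per generated token, so inference cost tracks parameter count roughly linearly. A quick back-of-envelope sketch (the 70B figure is just my stand-in for a "full-blown" model, and the rule of thumb ignores attention overhead at long context):

    # Rule of thumb: ~2 FLOPs per parameter per generated token
    # (ignores KV-cache/attention cost, which grows with context length).
    def flops_per_token(params: float) -> float:
        return 2 * params

    for name, params in [("1.1B", 1.1e9), ("3B", 3e9), ("70B", 70e9)]:
        print(f"{name}: ~{flops_per_token(params) / 1e9:.1f} GFLOPs per token")
    # 1.1B: ~2.2 GFLOPs/token, 3B: ~6.0, 70B: ~140.0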



It probably helps that Apple Silicon dedicates die space to the Neural Engine - essentially a TPU. No good for training, great for inference.


I’ve been reading up on this recently, but devs say the ANE is kinda a pain in the ass to leverage; most OSS uses the GPU instead


these most likely aren't using the Neural Engine

the ANE seemed to be optimised for small vision models like you might run on an iPhone a couple of years ago

these will be running on the GPU


I bet these can all run on ANE. I’ve run gpt2-xl 1.5B on ANE [1] and WhisperKit [2] also runs larger models on it.

The smaller ones (1.1B and below) will be usably fast, and with quantization I suspect the 3B one will be as well. The GPU will still be faster, but the current trade-off is lower power draw in exchange for less speed.

[1] 7 tokens/sec https://x.com/flat/status/1719696073751400637 [2] https://www.takeargmax.com/blog/whisperkit


indeed, but probably not as written currently?

i.e. they would need converting with, e.g., your work in more-ane-transformers
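
Roughly, the conversion path looks something like the sketch below: trace the PyTorch model, then convert it with coremltools while requesting the Neural Engine. This is only illustrative - the toy model, names, and shapes are placeholders, and a real transformer's attention blocks generally need restructuring (which is what more-ane-transformers does) before the ANE will actually pick them up.

    import torch
    import coremltools as ct

    # Toy stand-in for an LLM block; a real transformer's attention layers
    # usually need reshaping for the ANE (e.g. via more-ane-transformers).
    toy = torch.nn.Sequential(
        torch.nn.Linear(256, 1024),
        torch.nn.GELU(),
        torch.nn.Linear(1024, 256),
    ).eval()
    example = torch.randn(1, 128, 256)
    traced = torch.jit.trace(toy, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="hidden_states", shape=(1, 128, 256))],
        compute_units=ct.ComputeUnit.CPU_AND_NE,   # request ANE, fall back to CPU
        minimum_deployment_target=ct.target.iOS16,
    )
    mlmodel.save("toy.mlpackage")  # profile in Xcode to confirm what actually runs on the ANE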


I wonder if there could be a tiered system where a "dumber" LLM fields your requests but passes them on to a smarter LLM only when its confidence falls below some threshold.
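
Something like that cascade is easy to prototype. A minimal sketch, assuming each model returns an answer plus a scalar confidence (e.g. mean token log-probability mapped to [0, 1]); both model functions and the 0.7 cutoff here are made-up placeholders:

    from typing import Callable, Tuple

    # Placeholder model callables: each returns (answer, confidence in [0, 1]).
    # In practice confidence might come from token log-probs, self-reported
    # certainty, or a separate verifier model.
    def small_model(prompt: str) -> Tuple[str, float]:
        return "small-model answer to: " + prompt, 0.55

    def large_model(prompt: str) -> Tuple[str, float]:
        return "large-model answer to: " + prompt, 0.95

    def tiered_answer(prompt: str, threshold: float = 0.7,
                      small: Callable = small_model,
                      large: Callable = large_model) -> str:
        answer, confidence = small(prompt)
        if confidence >= threshold:
            return answer            # cheap path: small model is confident enough
        fallback, _ = large(prompt)  # expensive path: escalate to the bigger model
        return fallback

    if __name__ == "__main__":
        print(tiered_answer("Summarize this paragraph..."))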



