I would be curious how much less computationally expensive these models are. Full-blown LLMs are overkill for most of the things I do with them. Does running them affect battery life of mobile devices in a major way? This could actually end up saving a ton of electricity. (Or maybe induce even more demand...)
I bet these can all run on the ANE. I've run gpt2-xl (1.5B) on the ANE [1], and WhisperKit [2] also runs larger models on it.
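For anyone curious what targeting the ANE looks like in practice, here's a minimal sketch using coremltools to load an already-converted model with GPU excluded so Core ML can schedule it on the Neural Engine. The file and feature names are hypothetical, not from this thread:

```python
# Minimal sketch: loading a Core ML model so it can run on the ANE.
# Assumes coremltools is installed and you already have a converted
# .mlpackage; "TinyLM.mlpackage" is a hypothetical file name.
import coremltools as ct

model = ct.models.MLModel(
    "TinyLM.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # allow CPU + Neural Engine, exclude GPU
)

# The prediction interface depends on how the model was converted;
# "input_ids" here is a placeholder feature name.
# out = model.predict({"input_ids": token_array})
```

Note that Core ML treats the compute-unit setting as a constraint, not a guarantee: ops the ANE can't handle silently fall back to the CPU.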
The smaller ones (1.1B and below) will be usably fast, and with quantization I suspect the 3B one will be as well. The GPU will still be faster, but right now you're trading power draw for that speed.
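To make the quantization step concrete, a hedged sketch of post-training 8-bit weight quantization with the coremltools optimize API (coremltools 7+), assuming `model` is an ML Program model like the one loaded above:

```python
# Sketch: post-training 8-bit weight quantization via coremltools 7+.
# Assumes `model` is an mlprogram-typed MLModel; actual speed/quality
# trade-offs vary per model and aren't claimed by this thread.
import coremltools.optimize.coreml as cto

config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")  # 8-bit weights
)
quantized = cto.linear_quantize_weights(model, config=config)
quantized.save("TinyLM-w8.mlpackage")  # hypothetical output name
```

Weight-only quantization like this mainly shrinks memory traffic, which is usually the bottleneck for small-batch on-device inference.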
I wonder if there could be a tiered system where a "dumber" LLM fields your requests but passes them on to a smarter LLM only if its confidence level falls below some threshold.
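That tiered idea is essentially a model cascade. A minimal sketch, assuming two hypothetical generate functions (a small on-device model and a larger fallback) that return text plus mean token log-probability as a crude confidence signal:

```python
# Sketch of a confidence-gated model cascade. The two callables are
# hypothetical stand-ins for a small local model and a larger remote one;
# mean token log-prob is just one simple choice of confidence signal.
import math
from typing import Callable, Tuple

def cascade(
    prompt: str,
    small: Callable[[str], Tuple[str, float]],  # returns (text, mean token logprob)
    large: Callable[[str], Tuple[str, float]],
    threshold: float = math.log(0.5),           # escalate if avg token prob < 0.5
) -> str:
    text, confidence = small(prompt)
    if confidence >= threshold:
        return text               # the cheap model was confident enough
    answer, _ = large(prompt)     # otherwise pay for the smarter model
    return answer
```

In practice you might not trust the small model's own logprobs, since LLMs are often confidently wrong; a separately trained router or verifier deciding when to escalate is a common alternative.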