I would be curious how much less computationally expensive these models are. Full-blown LLMs are overkill for most of the things I do with them. Does running them affect battery life of mobile devices in a major way? This could actually end up saving a ton of electricity. (Or maybe induce even more demand...)
I bet these can all run on the ANE. I've run gpt2-xl (1.5B) on the ANE [1], and WhisperKit [2] also runs larger models on it.
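For anyone curious what targeting the ANE looks like in practice, here's a minimal sketch using coremltools to load an already-converted model with GPU excluded so Core ML can schedule it on the Neural Engine. The file and feature names are hypothetical, not from this thread:

```python
# Minimal sketch: loading a Core ML model so it can run on the ANE.
# Assumes coremltools is installed and you already have a converted
# .mlpackage; "TinyLM.mlpackage" is a hypothetical file name.
import coremltools as ct

model = ct.models.MLModel(
    "TinyLM.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # allow CPU + Neural Engine, exclude GPU
)

# The prediction interface depends on how the model was converted;
# "input_ids" here is a placeholder feature name.
# out = model.predict({"input_ids": token_array})
```

Note that Core ML treats the compute-unit setting as a constraint, not a guarantee: ops the ANE can't handle silently fall back to the CPU.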
The smaller ones (1.1B and below) will be usably fast, and with quantization I suspect the 3B one will be as well. The GPU will still be faster, but right now you're trading power draw for that speed.
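To make the quantization step concrete, a hedged sketch of post-training 8-bit weight quantization with the coremltools optimize API (coremltools 7+), assuming `model` is an ML Program model like the one loaded above:

```python
# Sketch: post-training 8-bit weight quantization via coremltools 7+.
# Assumes `model` is an mlprogram-typed MLModel; actual speed/quality
# trade-offs vary per model and aren't claimed by this thread.
import coremltools.optimize.coreml as cto

config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")  # 8-bit weights
)
quantized = cto.linear_quantize_weights(model, config=config)
quantized.save("TinyLM-w8.mlpackage")  # hypothetical output name
```

Weight-only quantization like this mainly shrinks memory traffic, which is usually the bottleneck for small-batch on-device inference.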
I wonder if there could be a tiered system where a "dumber" LLM fields your requests but passes them on to a smarter LLM only if its confidence level falls below some threshold.
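That tiered idea is essentially a model cascade. A minimal sketch, assuming two hypothetical generate functions (a small on-device model and a larger fallback) that return text plus mean token log-probability as a crude confidence signal:

```python
# Sketch of a confidence-gated model cascade. The two callables are
# hypothetical stand-ins for a small local model and a larger remote one;
# mean token log-prob is just one simple choice of confidence signal.
import math
from typing import Callable, Tuple

def cascade(
    prompt: str,
    small: Callable[[str], Tuple[str, float]],  # returns (text, mean token logprob)
    large: Callable[[str], Tuple[str, float]],
    threshold: float = math.log(0.5),           # escalate if avg token prob < 0.5
) -> str:
    text, confidence = small(prompt)
    if confidence >= threshold:
        return text               # the cheap model was confident enough
    answer, _ = large(prompt)     # otherwise pay for the smarter model
    return answer
```

In practice you might not trust the small model's own logprobs, since LLMs are often confidently wrong; a separately trained router or verifier deciding when to escalate is a common alternative.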