
For decent on-device inference, you need enough memory, high enough memory bandwidth, and enough processing power connected to that memory.
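To put rough numbers on the bandwidth point: autoregressive decode has to read (more or less) the whole weight set for every generated token, so bandwidth sets a hard ceiling regardless of compute. A back-of-envelope sketch, with illustrative figures rather than benchmarks:

    # Rough ceiling on decode speed, assuming every generated token reads the
    # full weight set from memory (the bandwidth-bound regime).
    def max_tokens_per_sec(weights_gb, bandwidth_gb_s):
        return bandwidth_gb_s / weights_gb

    # e.g. a 7B model at ~4 bits is about 3.5 GB of weights; on a 100 GB/s
    # machine that caps decode near 28 tokens/s, however fast the cores are.
    print(max_tokens_per_sec(3.5, 100))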

The traditional PC architecture is ill-suited to this, which is why for most computers a GPU (which offers all three in a single package) is currently the best approach... except that, at the moment, even a 4090 with its 24 GB only offers enough memory to load and run moderately-sized models.
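For a sense of what "moderately-sized" means on a 24 GB card, a back-of-envelope capacity check (weights only; the KV cache and activations need room on top, so treat these as optimistic):

    def weights_gb(params_billion, bytes_per_param):
        return params_billion * bytes_per_param

    print(weights_gb(7, 2))     # 7B at fp16    ~= 14 GB: fits
    print(weights_gb(13, 2))    # 13B at fp16   ~= 26 GB: doesn't fit
    print(weights_gb(34, 0.5))  # 34B at ~4-bit ~= 17 GB: fits, little headroom
    print(weights_gb(70, 0.5))  # 70B at ~4-bit ~= 35 GB: doesn't fit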

The architecture behind Apple's ARM computers is (by design or happenstance) far better suited: unified memory maximises the size of model that can be loaded, some higher-end options have decently high memory bandwidth, and any of the different processing units can access that unified memory. The weaknesses are cost, and processors that aren't yet powerful enough to compete at the top end.
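Plugging ballpark Apple-silicon figures into the same rule of thumb (assumed numbers, not measurements) shows both sides: a large unified-memory machine can hold a 70B model at ~4 bits (roughly 35 GB of weights), but at 400-800 GB/s of memory bandwidth decode still tops out somewhere around 10-20 tokens/s:

    def max_tokens_per_sec(weights_gb, bandwidth_gb_s):
        return bandwidth_gb_s / weights_gb

    print(max_tokens_per_sec(35, 400))  # ~11 tokens/s
    print(max_tokens_per_sec(35, 800))  # ~23 tokens/s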

So there's currently an empty middle ground to be won, and it's interesting to watch and see how it will be filled, e.g.

- affordable GPUs with much larger memory for the big models? (i.e. the GPU approach, but optimised for capacity)

- affordable unified memory computers with more processing power (i.e. Apple's approach, but optimised)

- something else (probably on the software side) making large-model inference more efficient, either on the processing side (i.e. faster for the same output) or on the memory-utilisation side (e.g. loading or streaming parts of large models to cope with smaller device memory, or smaller models for the same output; a rough sketch of the streaming idea is below)
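On the streaming idea in that last point, a minimal sketch of the shape of it (the one-file-per-layer layout and the matmul stand-in are assumptions for illustration, not any particular library's API): peak memory drops to a single layer, at the cost of re-reading weights from disk on every pass, so SSD bandwidth becomes the new ceiling.

    import numpy as np

    # Hypothetical layout: each layer's weight matrix saved as its own .npy
    # file, so only one layer ever needs to be resident at a time.
    def run_streaming(layer_files, hidden):
        for path in layer_files:
            weights = np.load(path, mmap_mode="r")  # map from disk rather than copying it all into RAM
            hidden = hidden @ weights               # stand-in for the real per-layer computation
        return hidden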



