I don't see why on-device inference is the future. For consumers, only a small set of use cases cannot tolerate the increased latency. Corporate customers will be satisfied if the model can be hosted within their borders. Pooling compute is less wasteful overall as a collective strategy.
This argument really only reaches its tipping point when massive models no longer offer a gotta-have-it difference over smaller models.
On-device inference will succeed the way Linux did: it is "free" in that the user only needs to acquire a model to run, rather than paying for processing. It protects privacy, and it doesn't require an internet connection. It may not take over for all users, but it will be around.
This assumes that openly developed (or at least weight-available) models are available for free, and continue being improved.