I think it came from this paper, TinyStories (https://arxiv.org/abs/2305.07759). IIRC this was also the inspiration for the Phi family of models. The essential point of the TinyStories paper is: if we train a model on text meant for 3-4 year olds, shouldn't that much simpler text require far fewer parameters? Which turns out to be correct. In the original they train a model of around 32 million parameters and compare it to GPT-2 (1.5 billion parameters), and the small model does much better at generating coherent children's stories. Microsoft has been interested in this because smaller models mean less resource usage, which means they can run on consumer devices. You can easily run a TinyStories-sized model on your phone, which is presumably what Microsoft wants to do too.
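For a sense of scale, here's a rough sketch of loading one of the released checkpoints and generating a story with the Hugging Face transformers library (assuming the roneneldan/TinyStories-33M repo on the Hub; the exact checkpoint name and sampling settings are my guesses, not from the paper):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name on the Hugging Face Hub (one of the released TinyStories models)
model_name = "roneneldan/TinyStories-33M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Count parameters to see how small it is compared to GPT-2's 1.5B
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")

# Generate a short story continuation
prompt = "Once upon a time there was a little dog named"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

At a few tens of millions of parameters, the weights are small enough (well under a gigabyte even in fp32) that running this on a phone is entirely plausible.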