I wonder how much you can push the "deeper and thinner" part. At some point your entire FFN fits into your L2 cache, and you're bound to get some performance jumps.
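A rough back-of-envelope sketch of when that happens (all numbers are illustrative assumptions: fp16 weights, the common 4x FFN expansion, and a ~2 MB per-core L2):

```python
# Back-of-envelope: how narrow does a transformer-style FFN have to be
# before its weights fit in L2? Assumes fp16 weights, 4x expansion
# (two d_model x 4*d_model matrices), and ~2 MB of L2 per core --
# all illustrative numbers, not tied to any specific model or CPU.

BYTES_PER_PARAM = 2          # fp16
EXPANSION = 4                # FFN hidden dim = 4 * d_model (common choice)
L2_BYTES = 2 * 1024 * 1024   # ~2 MB L2; varies widely by CPU

def ffn_bytes(d_model: int) -> int:
    """Weight footprint of one FFN block: up-projection + down-projection."""
    params = 2 * d_model * (EXPANSION * d_model)   # ignores biases
    return params * BYTES_PER_PARAM

# Largest d_model whose single FFN block fits entirely in L2.
d = 1
while ffn_bytes(d + 1) <= L2_BYTES:
    d += 1
print(d, ffn_bytes(d))   # roughly d_model ~ 362 under these assumptions
```

So under those assumptions you'd need per-layer widths in the low hundreds before a whole FFN block sits in L2, i.e. this really is a "thin" regime.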
Other research from Meta FAIR actually suggests that you should prune the deeper layers if you want to improve performance while maintaining accuracy [1]. So there must be a cutoff in model size below which the "deeper and thinner" approach still works, otherwise the results are contradictory. Or we could drastically improve these new models even further.