The biggest problem might be the memory requirements, given so many parameters. It won't be as cheap as a high-end computer in the foreseeable future.
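A rough back-of-the-envelope sketch (the parameter count and precision below are assumptions for illustration, not figures for any particular model):

```python
# Rough memory estimate for just holding model weights in RAM/VRAM.
# Parameter counts and byte widths are illustrative assumptions.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, ignoring activations and caches."""
    return num_params * bytes_per_param / 1e9

# e.g. a hypothetical 175B-parameter model in 16-bit precision:
print(f"{weight_memory_gb(175e9):.0f} GB")     # ~350 GB, far beyond one consumer GPU
# the same model quantized to 8-bit:
print(f"{weight_memory_gb(175e9, 1):.0f} GB")  # ~175 GB, still several high-end GPUs
```

Even aggressive quantization only buys a constant factor, so the weights alone dwarf what a high-end desktop holds.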
There is probably a space-time trade-off that needs to be explored here. It might be possible to preload some of the most likely next tokens into the cache and/or RAM. These are glorified auto-complete algorithms that are still poorly understood, as DeepMind's optimizations appear to show. For the English language, there are probably only so many grammatically correct selections for the next token, for example.
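For what it's worth, here's a minimal sketch of the kind of space-time trade-off I mean: precompute the most likely continuations for contexts seen so far and keep them in a small in-memory table. The corpus, the one-token context, and k are all invented for illustration; a real system would condition on much longer contexts.

```python
from collections import Counter, defaultdict

# Toy space-time trade-off: precompute the most likely next tokens for each
# observed context and keep them in a small in-memory table ("the cache").
# Corpus, context length, and k are illustrative assumptions.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

k = 2
next_token_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_token_counts[prev][nxt] += 1

# Preloaded table: for each context, the k most likely continuations.
likely_next = {ctx: [tok for tok, _ in counts.most_common(k)]
               for ctx, counts in next_token_counts.items()}

print(likely_next["the"])  # e.g. ['cat', 'mat']
```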
Glorified autocomplete? Autocomplete can guess the next word... sometimes. GPT-3 goes hundreds of words ahead. On generic topics it can be hard to distinguish from human text.
And it can't cache tokens, because every token is evaluated in the context of all the other tokens, so they don't have the same representations when they recur at different positions.
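A tiny sketch of why the representations differ: even before attention mixes in the surrounding tokens, adding positional information already gives the same token id a different vector at each position. The dimensions and sinusoidal encoding below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: the same token id produces different vectors at
# different positions once positional encodings are added, before attention
# even mixes in the surrounding context.
d_model, vocab_size = 8, 100
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))

def positional_encoding(pos: int) -> np.ndarray:
    i = np.arange(d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

token_id = 42
vec_at_3 = embedding[token_id] + positional_encoding(3)
vec_at_7 = embedding[token_id] + positional_encoding(7)
print(np.allclose(vec_at_3, vec_at_7))  # False: same token, different representation
```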
They're evaluated in the context of the last 2^n tokens; for many models the scanning window is 1024, 2048, or 4096 tokens. The tokens (words and sometimes punctuation) are represented by integer values, so the last 2^n tokens would certainly qualify for storage in a cache. Next-token selection then only has so many possible choices in any given language model because of grammatical limitations. This is only one such optimization; there could also be optimizations around the likelihood of certain words given the presence of certain previous tokens, and so on.
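To make the "only so many valid next tokens" point concrete, here's a minimal sketch of masking a model's output distribution down to an allowed set before sampling; the vocabulary, logits, and allowed set are all invented for illustration.

```python
import numpy as np

# Illustrative sketch: restrict next-token selection to a small allowed set
# (e.g. grammatically valid continuations) by masking the logits.
# Vocabulary, logits, and the allowed set are invented for illustration.
vocab = ["the", "cat", "sat", "on", "mat", "ran"]
logits = np.array([0.1, 2.0, 1.5, 0.3, 0.2, 1.0])  # pretend model output

allowed = {"sat", "ran"}  # pretend only these are grammatical after "the cat"
mask = np.array([w in allowed for w in vocab])
masked_logits = np.where(mask, logits, -np.inf)

probs = np.exp(masked_logits - masked_logits.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))  # only "sat" and "ran" get nonzero mass
```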
But, yes, tokens are chosen one at a time based on the previous content, similar to earlier auto-completion algorithms.
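In that spirit, the whole generation loop is just repeated next-token selection. A toy greedy version, where the hypothetical `score_next` table stands in for a real model:

```python
# Toy greedy decoding loop: pick the highest-scoring next token, append it,
# and repeat. `score_next` is a made-up stand-in for a real language model.
def score_next(context: list[str]) -> dict[str, float]:
    # Hypothetical scores conditioned on the last token only, for illustration.
    table = {
        "the": {"cat": 0.6, "mat": 0.4},
        "cat": {"sat": 0.7, "ran": 0.3},
        "sat": {"on": 0.9, "down": 0.1},
        "on": {"the": 0.8, "a": 0.2},
    }
    return table.get(context[-1], {"<eos>": 1.0})

tokens = ["the"]
for _ in range(6):
    scores = score_next(tokens)
    best = max(scores, key=scores.get)
    if best == "<eos>":
        break
    tokens.append(best)

print(" ".join(tokens))  # "the cat sat on the cat sat" -- greedy and repetitive, like autocomplete
```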
I’ve been saying this for years: language models are the ML equivalent of the billionaire space race. It’s just a bunch of orgs with unlimited funding spending millions of dollars on compute to get more parameters than their rivals. It could be decades before we start to see them scale down or make meaningful optimizations. This paper is a good start, but I’d be willing to bet everyone will ignore it and continue breaking the bank.
Can you say that about any other task in ML? When Inceptionv3 came out I was able to run the model pretty comfortably on a 1060. Even pix2pix and most GANs fit comfortably on consumer hardware, and top-of-the-line massive models can still run inference on a 3090. It’s so unbelievably ironic that one of the major points Transformers aimed to solve when they were introduced was the compute inefficiency of recurrent networks, and it’s devolved into “how many TPUs can daddy afford” instead.
Is that fair? My Pixel phone seems to run nothing but ML models of various kinds, and they run locally, which is madness, pure madness. It can recognize songs and my speech without talking to the cloud at all. That's pretty much the definition of optimization!
It's just about where the software development incentives are. Big shops have an incentive to sell service models. I think of it as a return to the mainframe days, with an old-IBM-like mindset.
However, the upside of pocket-sized intelligence will eventually win out. It's just a question of when someone will scrape together the required investment.
If you mean putting the model weights into gates directly, it’d be useless because users would get bored of the model as soon as they figured out what its style looked like. Also, large models can memorize their training data so eventually you’ll get it to output something copyrighted.