The biggest problem might be the memory requirements, given so many parameters. It won't be as cheap as a high-end computer in the foreseeable future.
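A rough back-of-the-envelope sketch (the parameter count and precision below are assumptions for illustration, not figures for any particular model):

```python
# Rough memory estimate for just holding model weights in RAM/VRAM.
# Parameter counts and byte widths are illustrative assumptions.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, ignoring activations and caches."""
    return num_params * bytes_per_param / 1e9

# e.g. a hypothetical 175B-parameter model in 16-bit precision:
print(f"{weight_memory_gb(175e9):.0f} GB")     # ~350 GB, far beyond one consumer GPU
# the same model quantized to 8-bit:
print(f"{weight_memory_gb(175e9, 1):.0f} GB")  # ~175 GB, still several high-end GPUs
```

Even aggressive quantization only buys a constant factor, so the weights alone dwarf what a high-end desktop holds.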
There is probably a space-time trade-off that needs to be explored here. It might be possible to preload some of the most likely next tokens into the cache and/or RAM. These are glorified auto-complete algorithms that are still poorly understood, as DeepMind's optimizations appear to show. For the English language, there are probably only so many grammatically correct selections for the next token, for example.
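For what it's worth, here's a minimal sketch of the kind of space-time trade-off I mean: precompute the most likely continuations for contexts seen so far and keep them in a small in-memory table. The corpus, the one-token context, and k are all invented for illustration; a real system would condition on much longer contexts.

```python
from collections import Counter, defaultdict

# Toy space-time trade-off: precompute the most likely next tokens for each
# observed context and keep them in a small in-memory table ("the cache").
# Corpus, context length, and k are illustrative assumptions.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

k = 2
next_token_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_token_counts[prev][nxt] += 1

# Preloaded table: for each context, the k most likely continuations.
likely_next = {ctx: [tok for tok, _ in counts.most_common(k)]
               for ctx, counts in next_token_counts.items()}

print(likely_next["the"])  # e.g. ['cat', 'mat']
```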
Glorified autocomplete? Autocomplete can guess the next word... sometimes. GPT-3 goes hundreds of words ahead. On generic topics it can be hard to distinguish from human text.
And it can't cache tokens, because every token is evaluated in the context of all the other tokens, so they don't have the same representations when they recur at different positions.
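A tiny sketch of why the representations differ: even before attention mixes in the surrounding tokens, adding positional information already gives the same token id a different vector at each position. The dimensions and sinusoidal encoding below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: the same token id produces different vectors at
# different positions once positional encodings are added, before attention
# even mixes in the surrounding context.
d_model, vocab_size = 8, 100
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))

def positional_encoding(pos: int) -> np.ndarray:
    i = np.arange(d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

token_id = 42
vec_at_3 = embedding[token_id] + positional_encoding(3)
vec_at_7 = embedding[token_id] + positional_encoding(7)
print(np.allclose(vec_at_3, vec_at_7))  # False: same token, different representation
```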
They're evaluated in the context of the last 2^n tokens; for many models the scanning window is 1024, 2048, or 4096 tokens. The tokens (words and sometimes punctuation) are represented by integer values, so the last 2^n tokens would certainly qualify for storage in a cache. Next-token selection then only has so many possible choices in any given language model because of grammatical limitations. This is only one such optimization; there could also be optimizations around the likelihood of certain words given the presence of certain previous tokens, and so on.
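To make the "only so many valid next tokens" point concrete, here's a minimal sketch of masking a model's output distribution down to an allowed set before sampling; the vocabulary, logits, and allowed set are all invented for illustration.

```python
import numpy as np

# Illustrative sketch: restrict next-token selection to a small allowed set
# (e.g. grammatically valid continuations) by masking the logits.
# Vocabulary, logits, and the allowed set are invented for illustration.
vocab = ["the", "cat", "sat", "on", "mat", "ran"]
logits = np.array([0.1, 2.0, 1.5, 0.3, 0.2, 1.0])  # pretend model output

allowed = {"sat", "ran"}  # pretend only these are grammatical after "the cat"
mask = np.array([w in allowed for w in vocab])
masked_logits = np.where(mask, logits, -np.inf)

probs = np.exp(masked_logits - masked_logits.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))  # only "sat" and "ran" get nonzero mass
```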
But, yes, tokens are chosen one at a time based on the previous content, similar to earlier auto-completion algorithms.
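In that spirit, the whole generation loop is just repeated next-token selection. A toy greedy version, where the hypothetical `score_next` table stands in for a real model:

```python
# Toy greedy decoding loop: pick the highest-scoring next token, append it,
# and repeat. `score_next` is a made-up stand-in for a real language model.
def score_next(context: list[str]) -> dict[str, float]:
    # Hypothetical scores conditioned on the last token only, for illustration.
    table = {
        "the": {"cat": 0.6, "mat": 0.4},
        "cat": {"sat": 0.7, "ran": 0.3},
        "sat": {"on": 0.9, "down": 0.1},
        "on": {"the": 0.8, "a": 0.2},
    }
    return table.get(context[-1], {"<eos>": 1.0})

tokens = ["the"]
for _ in range(6):
    scores = score_next(tokens)
    best = max(scores, key=scores.get)
    if best == "<eos>":
        break
    tokens.append(best)

print(" ".join(tokens))  # "the cat sat on the cat sat" -- greedy and repetitive, like autocomplete
```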
I’ve been saying this for years: language models are the ML equivalent of the billionaire space race. It’s just a bunch of orgs with unlimited funding spending millions of dollars on compute to get more parameters than their rivals. It could be decades before we start to see them scale down or make meaningful optimizations. This paper is a good start, but I’d be willing to bet everyone will ignore it and continue breaking the bank.
Can you say that about any other task in ML? When Inceptionv3 came out I was able to run the model pretty comfortably on a 1060. Even pix2pix and most GANs fit comfortably on consumer hardware, and top-of-the-line massive models can still run inference on a 3090. It’s so unbelievably ironic that one of the major points Transformers aimed to solve when they were introduced was the compute inefficiency of recurrent networks, and it’s devolved into “how many TPUs can daddy afford” instead.
Is that fair? My Pixel phone seems to run nothing but ML models of various kinds, and they run locally, which is madness, pure madness. It can recognize songs and my speech without talking to the cloud at all. That's pretty much the definition of optimization!
It's just about where the software development incentives are. Big shops have an incentive to sell service models. I think of it as a return to the mainframe days, with an old-IBM-like mindset.
However, the upside of pocket-sized intelligence will eventually win out. It's just a question of when someone will scrape together the required investment.
If you mean putting the model weights into gates directly, it’d be useless because users would get bored of the model as soon as they figured out what its style looked like. Also, large models can memorize their training data so eventually you’ll get it to output something copyrighted.