
540B parameters means ~1TB just for the weights (assuming BFLOAT16, 2 bytes per parameter). Quadruple that for other associated overhead, and you'd need a machine with roughly 4TB of RAM.
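Back-of-the-envelope, in case anyone wants to check the arithmetic (the 4x multiplier here is just the rough headroom factor from the comment above, not a measured number):

    params = 540e9
    bytes_per_param = 2                        # BFLOAT16
    weights_tb = params * bytes_per_param / 1e12
    total_tb = 4 * weights_tb                  # rough headroom for everything else
    print(f"weights ~{weights_tb:.2f} TB, with overhead ~{total_tb:.2f} TB")
    # weights ~1.08 TB, with overhead ~4.32 TB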



right - and even if you did happen to have a machine with 4TB of RAM - what kind of latency would you get running this as a service on a single machine? how many machines would you need to match Google Translate performance?

doesn't seem like you can run this as a service, yet.


The total memory of the model matters less than the memory needed to compute one batch. I’ve worked with recommendation models used in serving that were 10ish terabytes. The simple trick was that most of the memory was embeddings, and only a small subset of those embeddings was needed to do inference for one batch. If you fetch those embeddings as if they were features, you can run very large models on normal-ish compute. You never need to load the entire model into RAM at once.
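Rough sketch of what that looks like in serving code; embedding_store and dense_model are hypothetical placeholders (not any particular library's API), the point is that only the rows referenced by the batch ever get pulled into memory:

    import numpy as np

    def lookup(embedding_store, table, ids):
        # Fetch only the embedding rows this batch actually references.
        return np.stack([embedding_store.get(table, i) for i in ids])

    def predict_batch(batch, embedding_store, dense_model):
        user_emb = lookup(embedding_store, "user", batch["user_ids"])
        item_emb = lookup(embedding_store, "item", batch["item_ids"])
        features = np.concatenate(
            [user_emb, item_emb, batch["dense_features"]], axis=1)
        # Only the small dense part of the model has to live in RAM.
        return dense_model.predict(features)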

Another trick is to load only some layers of the model into RAM at a time (with prefetching to minimize stalls).
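A minimal sketch of that idea, assuming each layer's weights are stored as a separate .npy file and treating a layer as a plain matmul + ReLU (purely illustrative, not how any particular framework does it):

    import threading
    import numpy as np

    def run_streaming(layer_paths, x):
        current = np.load(layer_paths[0])
        for i in range(len(layer_paths)):
            prefetched = {}
            thread = None
            if i + 1 < len(layer_paths):
                # Start loading the next layer's weights in the background.
                def fetch(path=layer_paths[i + 1]):
                    prefetched["w"] = np.load(path)
                thread = threading.Thread(target=fetch)
                thread.start()
            x = np.maximum(x @ current, 0.0)   # compute overlaps with the next load
            if thread is not None:
                thread.join()                  # only stalls if the load isn't done yet
                current = prefetched["w"]
        return x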

Or, if you are Google, enjoy the fact that TPU pods have a silly amount of RAM.



