
The total memory of the model is less important than the memory needed to compute one batch. I've worked with recommendation models used in serving that were 10-ish terabytes. The simple trick is that most of the memory was embeddings, and only a small subset of those embeddings is needed to run inference for one batch. If you fetch those embeddings as if they were features, you can run very large models on normal-ish compute; you never need to load the entire model into RAM at once.
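A minimal sketch of that idea, with hypothetical names (EmbeddingStore, score_batch) standing in for whatever feature/embedding store you actually serve from; the point is that only the rows for the current batch are ever pulled in:

    import numpy as np

    class EmbeddingStore:
        """Stands in for an external store holding the multi-terabyte
        embedding table; only the requested rows travel to the server."""
        def __init__(self, dim: int):
            self.dim = dim
            self._rows: dict[int, np.ndarray] = {}  # lazily materialized here

        def lookup(self, ids: list[int]) -> np.ndarray:
            # Fetch just the rows for this batch; unseen ids get a fresh vector
            # in this toy version (a real store would return trained rows).
            out = []
            for i in ids:
                if i not in self._rows:
                    self._rows[i] = np.random.randn(self.dim).astype(np.float32)
                out.append(self._rows[i])
            return np.stack(out)

    def score_batch(store: EmbeddingStore, item_ids: list[int], dense_model) -> np.ndarray:
        # The small dense part of the model sees only a (batch, dim) slice,
        # never the full embedding table.
        emb = store.lookup(item_ids)
        return dense_model(emb)

    # Usage: a toy dense head so the sketch runs end to end.
    store = EmbeddingStore(dim=64)
    dense_head = lambda x: x.sum(axis=1)  # placeholder for the real network
    print(score_batch(store, item_ids=[3, 17, 42], dense_model=dense_head).shape)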

Another trick you can use is to load only some layers of the model into RAM at a time (with prefetching to minimize stalls).
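Roughly like this sketch, assuming one weight file per layer and a simple matmul+ReLU layer; the file names and helper functions are hypothetical, but the pattern is the same: fetch layer i+1 in the background while computing layer i.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def load_layer(path: str) -> np.ndarray:
        # Stand-in for reading one layer's weights from disk or object storage.
        return np.load(path)

    def run_streaming(x: np.ndarray, layer_paths: list[str]) -> np.ndarray:
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(load_layer, layer_paths[0])
            for i in range(len(layer_paths)):
                weights = future.result()  # wait for the current layer's weights
                if i + 1 < len(layer_paths):
                    future = pool.submit(load_layer, layer_paths[i + 1])  # prefetch next
                x = np.maximum(x @ weights, 0.0)  # compute overlaps with the fetch
            return x

    # Usage: activations flow through layers that never all sit in RAM at once.
    # out = run_streaming(np.random.randn(8, 1024), [f"layer_{i}.npy" for i in range(12)])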

Or, if you are Google, enjoy the fact that TPUs have a silly amount of RAM; TPU pods have a ton of it.
