It's a $33/hour machine on AWS, so about $1250 for one training run. Not cheap, ...

belter · on Jan 11, 2023

Or $9/hour if you use Spot :-)

https://aws.amazon.com/ec2/spot/pricing/

snerbles · on Jan 11, 2023

Hopefully your progress gets saved in time when the spot instance inevitably gets terminated in the midst of training.

belter · on Jan 11, 2023

"Managed Spot Training..."

"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."

https://docs.aws.amazon.com/sagemaker/latest/dg/model-manage...

acetabulum · on Jan 11, 2023

If you use Horovod Elastic, I think you can avoid this problem working across a cluster of Spot instances.

https://horovod.readthedocs.io/en/stable/elastic_include.htm...

bobbyi · on Jan 11, 2023

If you're doing something new/ custom (which you presumably are if you aren't using someone else's prebuilt model), it could take a lot of runs to figure out the best training data and finetune settings.

(I assume. I've never worked with GPT, but have done similar work in other domains).

weird-eye-issue · on Jan 11, 2023

After training don't you have to keep it running if you want to use it?

wongarsu · on Jan 11, 2023

Just download the model and run it on something much smaller and cheaper. Bigger models like GPT-J are a bit of a pain to run, but GPT2-sized models run just fine on consumer GPUs.

weird-eye-issue · on Jan 12, 2023

Ahh okay, thanks. So how big is the model? Seems like it should be available to download so people don't have to train it. I understand you can train it on custom data but for a "default" model are there any available to download?

bilsbie · on Jan 11, 2023

What’s required to run the model?

wongarsu · on Jan 11, 2023

The biggest GPT2 (1.5B params) takes about 10GB VRAM, meaning it runs on a RTX 2080 TI, or the 12GB version of the RTX 3080

renewiltord · on Jan 11, 2023

What's the largest language model I can run on a 3090 with 24 GiB RAM?

lossolo · on Jan 11, 2023

Depends on precision, you can run ~5B model with fp32 precision or ~11B fp16 model max. Int8 is really bad for real world use case so not mentioning it.

But if you are looking to get performance of ChatGPT or GPT-3 then don't waste your time, all GPT-3 like small LLM models (below at least 60B params) are useless for any real world use case, they are just toys.

haldujai · on Jan 11, 2023

If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.

Fortunately very few real world use cases need to be this general.

If you are training a LLM on a domain specific corpus or finetuning on specific downstream tasks even relatively tiny models at 330m params are definitely useful and not “toys” and can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.

lossolo · on Jan 11, 2023

> If you specifically mean a general LLM trained on a general language corpus with instruction finetuning this is correct.

Yes, thanks, that's what I meant.

> If you are training a LLM on a domain specific corpus or finetuning on specific downstream tasks even relatively tiny models at 330m params are definitely useful and not “toys” and can be used to accurately perform tasks such as semantic text search, document summarization and named entity recognition.

Agree, BERT family is a good example here.

renewiltord · on Jan 11, 2023

Okay, thank you. Perfect response.