Great question. I wish they said how long the 10 epochs took, so we could figure out the cost (or better, just posted the time and cost together):
"For the 7B and 13B models, we used 16xA10Gs, and for the 70B model, we used 32xA10Gs (across 4x g5.48xlarge instances). When using Ray, there's no need to secure A100s to perform full-parameter fine-tuning on these models! The process is simply repeated for each task. Figures below show an example run based on a context length of 512, with a total of 3.7M effective tokens per epoch on GSM8k dataset.
We ran the training for a maximum of 10 epochs and selected the best checkpoint according to the minimum perplexity score on the validation set."
"For the 7B and 13B models, we used 16xA10Gs, and for the 70B model, we used 32xA10Gs (across 4x g5.48xlarge instances). When using Ray, there's no need to secure A100s to perform full-parameter fine-tuning on these models! The process is simply repeated for each task. Figures below show an example run based on a context length of 512, with a total of 3.7M effective tokens per epoch on GSM8k dataset.
We ran the training for a maximum of 10 epochs and selected the best checkpoint according to the minimum perplexity score on the validation set."