Yes, they use a pre-trained model, but they do further training (please correct me if I misread; I also realize my comment above could be read as claiming they train a new model entirely from scratch).
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to 1 × 10^-4 while the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when the training stages switch.
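For concreteness, here is a minimal sketch (not the authors' code) of what that setup might look like with Hugging Face `transformers`: fine-tune the pre-trained GPT-2 at lr 1e-4 and re-create the optimizer whenever the training stage switches. The stage names and data loader are placeholders, and gradient accumulation to reach the effective batch size of 128 is omitted.

```python
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Pre-trained base model, further trained (fine-tuned), not trained from scratch.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

LEARNING_RATE = 1e-4  # from the quoted setup; effective batch size 128 not shown here

def get_stage_batches(stage):
    # Placeholder: yield lists of training strings for the given stage.
    yield ["example text for " + stage]

stages = ["stage_0", "stage_1", "stage_2"]  # hypothetical curriculum stages

for stage in stages:
    # Resetting the optimizer at each stage switch discards the previous
    # stage's Adam moment estimates, so each stage starts with fresh state.
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

    for batch_texts in get_stage_batches(stage):
        enc = tokenizer(batch_texts, return_tensors="pt", padding=True)
        out = model(**enc, labels=enc["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

So the model weights start from the released GPT-2 checkpoint, but the optimizer state does not carry over across stages.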