* BERT used GPT's architecture but trained it in a different way. Instead of the usual left-to-right language-modelling objective, they trained the model to predict masked-out tokens ("holes") in the text and to predict whether two sentences follow each other (https://arxiv.org/abs/1810.04805). There is a short sketch of the masked-token objective after this list.
* The paper in the top post found that fine-tuning several models in the same way as BERT improves each of the fine-tuned models.
* The BERT paper also introduced BERT Base, which has 12 layers and approximately the same number of parameters as GPT, yet still outperforms GPT on GLUE.
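
To make the "predicting holes" idea concrete, here is a minimal sketch using the Hugging Face `transformers` library; the `fill-mask` pipeline and the `bert-base-uncased` checkpoint are just convenient choices for the illustration, not anything from the paper:

```python
from transformers import pipeline

# A pretrained BERT fills in the "hole" marked by [MASK] --
# the same masked-token objective it was trained on.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("BERT is trained to predict [MASK] tokens in a sentence."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```

During pre-training, roughly 15% of the input tokens are masked this way and the model is optimized to recover them; the next-sentence objective is trained jointly on pairs of sentences.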