* BERT used GPT's architecture but trained it in a different way. Instead of the usual left-to-right language-modelling objective, they trained the model to predict masked-out tokens ("holes") in the text and to predict whether two sentences follow each other (https://arxiv.org/abs/1810.04805). There is a short sketch of the masked-token objective after this list.
* The paper in the top post found that fine-tuning several models in the same way as BERT improves each of the fine-tuned models.
* The BERT paper also introduced BERT Base, which has 12 layers and approximately the same number of parameters as GPT, yet still outperforms GPT on GLUE.
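
To make the "predicting holes" idea concrete, here is a minimal sketch using the Hugging Face `transformers` library; the `fill-mask` pipeline and the `bert-base-uncased` checkpoint are just convenient choices for the illustration, not anything from the paper:

```python
from transformers import pipeline

# A pretrained BERT fills in the "hole" marked by [MASK] --
# the same masked-token objective it was trained on.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("BERT is trained to predict [MASK] tokens in a sentence."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```

During pre-training, roughly 15% of the input tokens are masked this way and the model is optimized to recover them; the next-sentence objective is trained jointly on pairs of sentences.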