
I get the impression that you are massively underselling auto-predict.

If you took GPT3's architecture and scaled it down to the size and training set of a typical auto-predict, it would produce near identical results. You wouldn't be able to tell the two apart.

Likewise, if we took an auto-predict architecture from 8 years ago, scaled it up to the size of GPT3, and could train it on GPT3's training set, it would produce similar output to GPT3, and we would see the exact same emergent intelligence capabilities. (Though it's probably not possible to complete training in a practical time-frame; the real innovation of GPT3 was optimising the architecture to make training such a large model practical.)
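
To fix ideas on what "auto-predict" means here: a next-word predictor. A toy bigram-count sketch in Python, purely illustrative; a real 2015-era keyboard model was a bigger n-gram or small neural model:

    from collections import Counter, defaultdict

    def train_bigram(corpus):
        # Count how often each word follows each other word.
        counts = defaultdict(Counter)
        words = corpus.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
        return counts

    def predict(counts, prev, k=3):
        # Suggest the k most likely next words, like a keyboard's suggestion bar.
        return [w for w, _ in counts[prev].most_common(k)]

    model = train_bigram("the cat sat on the mat the cat ran")
    print(predict(model, "the"))  # ['cat', 'mat']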

I think it's very insightful to point out just how similar the two are, because it shows the capabilities of language models are not due to any particular architectural element, but are emergent from the model and its training data.

It also makes me excited for what will happen when they move beyond the "beefed up auto-predict" architecture. (Arguably, Bing has taken a small step in this direction by bolting a search engine onto it)




>If you took GPT3's architecture and scaled it down to the size and training set of a typical auto-predict, it would produce near identical results.

This is almost certainly not true. The number of parameters is an important feature related to the quality of output. If you scaled the architecture down significantly, it would be significantly less capable[1]. But perhaps I misunderstand your point.
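
To put rough numbers on it: [1] fits test loss as a power law in parameter count, L(N) ≈ (N_c/N)^α_N. A quick sketch using the fitted constants reported in [1] (indicative only, not exact):

    # Scaling law from [1]: loss as a power law in non-embedding parameters.
    ALPHA_N = 0.076
    N_C = 8.8e13

    def loss(n_params):
        return (N_C / n_params) ** ALPHA_N

    for n in [1e6, 1e9, 175e9]:  # toy auto-predict scale .. GPT3 scale
        print(f"{n:.0e} params -> predicted loss {loss(n):.2f}")

The predicted loss keeps falling smoothly as parameters grow, which is why shrinking the architecture costs you so much capability.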

>Likewise, if we took an auto-predict architecture from 8 years ago, scaled it up to the size of GPT3 and could train it on GPT3's training set, it would produce similar output to GPT3

This is also not true. The transformer is a key piece in the emergent abilities of language models. The difficulties in scaling RNNs are well known. Self-supervised learning is powerful, but it needs to be paired with a flexible architecture to see the kinds of gains we see with LLMs.

Stacked Transformers with self-attention are extremely flexible in finding novel circuits in service of modelling the training data. The question is how to characterize this model in a way that doesn't short-sell what it is doing. Reductively describing it in terms of its training regime treats the resulting model as explanatorily irrelevant. But the complexity and the capabilities are in the information dynamics encoded in the model parameters. The goal is to understand that.
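
To make the mechanism concrete, here's a minimal single-head self-attention in numpy (no learned projections, batching, or masking; just the position-mixing step):

    import numpy as np

    def self_attention(X):
        # X: (seq_len, d) token representations.
        # Each position mixes information from every other position,
        # weighted by similarity -- this is the flexibility in question.
        d = X.shape[-1]
        scores = X @ X.T / np.sqrt(d)                   # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ X                              # (seq_len, d)

    X = np.random.randn(5, 8)
    print(self_attention(X).shape)  # (5, 8)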

[1] https://arxiv.org/abs/2001.08361


> If you scaled the architecture down significantly, it would be significantly less capable[1]. But perhaps I misunderstand your point.

No, my point is that if you scaled a transformer-based architecture down to the equivalent parameter size and training set of a typical 2015-era auto-predict, it would produce near-identical results to a 2015-era auto-predict.

> The difficulties in scaling RNNs are well known

The scaling issues in training RNNs are completely irrelevant to my point.

Transformers are computationally equivalent to RNNs. It's possible to convert a pre-trained Transformer model into an RNN [1]. There is nothing magical about the Transformer architecture that makes it better at generation.
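
Roughly, the trick in [1] is to replace softmax attention with a kernel feature map φ, after which causal attention becomes a recurrence over running sums, i.e. an RNN. A toy numpy sketch of that recurrence (φ = elu+1 here, as in the linear-attention literature; [1]'s actual method fine-tunes the pretrained weights rather than using them as-is):

    import numpy as np

    def phi(x):
        # A positive feature map (elu(x) + 1), standard in linear attention.
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention_rnn(Q, K, V):
        # Causal attention computed step by step: S and z are the RNN state.
        S = np.zeros((Q.shape[-1], V.shape[-1]))  # running sum of phi(k) v^T
        z = np.zeros(Q.shape[-1])                 # running sum of phi(k)
        out = []
        for q, k, v in zip(Q, K, V):
            fq, fk = phi(q), phi(k)
            S += np.outer(fk, v)
            z += fk
            out.append(fq @ S / (fq @ z + 1e-9))
        return np.stack(out)

    T, d = 6, 4
    Q = K = V = np.random.randn(T, d)
    print(linear_attention_rnn(Q, K, V).shape)  # (6, 4)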

[1] https://arxiv.org/abs/2103.13076


>it would produce near identical results to a 2015 era auto-predict.

I don't know that this is true, but it is plausible enough. But the benefit of Transformers is that they are stupid easy to scale. It is at scale that they are able to perform so remarkably across so many domains. Comparing underparameterized versions of two models and concluding that the model classes are functionally equivalent because they perform identically in that regime is a mistake.

The value of an architecture is in its practical ability to surface functional models. In theory, an MLP with enough parameters can model any function. But in reality, finding the model parameters that solve real-world problems becomes increasingly difficult. The inductive biases of Transformers are crucial in allowing them to efficiently find substantial models that provide real solutions. The Transformer architecture is doing real, substantial, independent work in the successes of current models.



