This statement is meaningless without controlling for model complexity and data type. Given their simplicity, ANNs generalize well on a wide variety of data. GPT-3 yields almost human-level generalization ability for some tasks.
I also clarified that the generalization is probably not far. There is not much complex "realism" to be found in low-to-mid-level features; they're almost mathematical in their simplicity, similar to basis functions.
>> GPT-3 yields almost human-level generalization ability for some tasks.
That's an extravagant claim. There is no machine learning system, algorithm, or technique that can approach the ability of humans to "generalise" in any task, no matter how you want to define "generalisation". The models built by ANNs in particular are shallow and over-specialised, and have none of the depth or complexity of whatever "models" of the world, and of the entities in it, that humans build in our heads.
Evaluations that show "superhuman" ability are poorly designed. Machine learning research is following benchmarks and metrics that mean nothing and show nothing, beyond the ability to beat said benchmarks with said metrics, which is then blithely taken to mean "progress" towards the approximation of human intelligence. This then leads to hyperbole like in your comment.
> no matter how you want to define "generalisation"
... and how you want to define "task". For some prompts/"tasks", GPT-3 does generate impressive (more than trivial) outputs that cannot be found on the internet and that are indistinguishable from what a human would respond, so it generalizes in that sense. Maybe human ingenuity and generalization are also just slightly perturbed interpolation? It is very difficult to produce something truly novel, so we are also rather tightly limited by prior experience. Who knows? Also, who cares if submarines swim? Anyhow, it seems 50% of the internet is bikeshedding about definitions.
Language generation is a very good example of a task that is very hard to evaluate with any degree of objectivity and for which there are no good metrics.
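To make the "no good metrics" point concrete: the automatic metrics that do exist for language generation (BLEU and its relatives) mostly count surface n-gram overlap with a reference text. A toy sketch of modified n-gram precision, the core of BLEU, shows the failure mode: a perfect paraphrase scores near zero while a verbatim copy scores perfectly. (The example sentences and function name here are illustrative, not from the thread.)

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference
    (with clipped counts, as in BLEU's modified precision)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference  = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"  # same meaning, different words
copy       = "the cat sat on the mat"

# A human judge would accept both outputs, but the metric only rewards
# the verbatim copy: the paraphrase shares just one token ("the").
print(ngram_precision(paraphrase, reference))  # ~0.17
print(ngram_precision(copy, reference))        # 1.0
```

The metric says nothing about meaning, fluency, or "human-ness"; it only measures overlap with one particular reference, which is exactly why beating such metrics is weak evidence of progress.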
So, suppose you say that a particular bit of text generated by GPT-3 is "indistinguishable from what a human would respond". If I say it isn't, how can we decide who is right in a way that we can both agree on?
And that's all before we try to figure out "generalisation".
In the Turing Test you have one human judge. I'm asking what happens when two humans disagree about the human-ness of some automatically generated text.