The paper you linked claims on page 10 that machines have been performing comparably on the task since 2012, so I'm not sure exactly what the paper is supposed to show in this context.
Am I to conclude that we've had a comparably intelligent machine since 2012?
Given the similar performance between GPT4 and O1 on this task, I wonder if GPT3.5 is significantly better than a human, too.
Sorry if my thoughts are a bit scattered, but it feels like that benchmark shows how good statistical methods are in general, not that LLMs are better reasoners.
You've probably read and understood more than me, so I'm happy for you to clarify.
The figure also shows that the non-LLM algorithm from 2012 was as capable as, or more capable than, a human: was it as intelligent as a well-educated human?
If not, why is the study sufficient evidence for the LLM, but not sufficient evidence for the previous system?
Again, it feels like statistical methods are winning out in general.
> Perhaps it’s better that you ask a statistician you trust
Maybe we can shortcut this conversation by each of us simply consulting O1 :^)
1) It’s an example of a domain where an LLM can do better than humans. A 2012 system could not do the myriad other things LLMs can do, and thus does not qualify as general intelligence.
2) As mentioned in the chart label, earlier systems required manual symptom extraction.
3) An important point well articulated by a cancer genomics faculty member at Harvard:
“… Now, back to today: The newest generation of generative deep learning models (genAI) is different.
For cancer data, the reason these models hold so much potential is exactly the reason why they were not preferred in the first place: they make almost no explicit data assumptions.
These models are excellent at learning whatever implicit distribution from the data they are trained on.
Such distributions don’t need to be explainable. Nor do they even need to be specified.
When presented with tons of data, these models can just learn, internalize & understand …”