They came up with the SuperGLUE benchmark because they found that the GLUE benchmark was flawed and too easy to game. Spurious correlations in the dataset let models get questions right without real understanding, so the results didn't generalize.
Could the same thing happen again with the better benchmark due to more subtle correlations? These things are tough to judge, so I'd say wait and see if it turns out to be a real result.