(I work in this field, although not specifically on benchmarking)
I think that this article makes a good point, and correctly identifies weaknesses.
However, I also think that humans often take very similar shortcuts. There are good reasons why "bag of words" approaches work much of the time. Additionally, there's lots of evidence that very rapid reading by humans does not imply deep understanding.
I think it's very important that people are aware of the weaknesses of these types of models. However, I think it's interesting that these weaknesses are becoming harder and harder to find.