I fully agree with this. As a data scientist, I have always thought of this as a "natural" consequence of one of the main (if not _the_ main) metrics used to evaluate machine translation algorithms, which is BLEU: https://en.wikipedia.org/wiki/BLEU

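According to this metric, if you have a moderately long sentence like "I am not the person who said the president should be reelected" and your translation missed the "not", every one of the 11 words you output would still match the 12-word reference. The only penalties are a mild one for being a word short and a few broken n-grams around the gap, so the score stays high. And, as far as I know, word order only matters through short n-grams (up to 4 words), so "I am the person who said the president should not be reelected", while wrong, would get a perfect word-level match and still a fairly high overall score.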
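To make this concrete, here is a minimal sketch of sentence-level BLEU against a single reference. It assumes lowercase text, whitespace tokenization, uniform weights over 1- to 4-grams and no smoothing; real evaluations are usually corpus-level and use a standard tool. The function names (`ngrams`, `modified_precision`, `bleu`) are made up for illustration and are not taken from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: matched candidate n-grams / candidate n-grams."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with a single reference (illustrative only)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean collapses if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: only candidates shorter than the reference are penalized.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)


reference = "i am not the person who said the president should be reelected".split()
dropped_not = "i am the person who said the president should be reelected".split()
moved_not = "i am the person who said the president should not be reelected".split()

for name, cand in [("dropped 'not'", dropped_not), ("moved 'not'", moved_not)]:
    print(name,
          "unigram precision:", round(modified_precision(cand, reference, 1), 2),
          "BLEU:", round(bleu(cand, reference), 2))
```

On these two sentences the sketch gives a unigram precision of 1.0 in both cases, with BLEU of roughly 0.78 for the translation that drops "not" and roughly 0.63 for the one that moves it. Both are far from zero even though the meaning is inverted.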

Of course these are rather artificial examples, and in general machine translation algorithms and their evaluation work because it's "easier" to create an algorithm that gets the right translation than one that unintentionally but systematically fools the metric. Nevertheless, if the research community used a metric that punished this kind of mistake more strongly, I suspect that over time a few new algorithms could come up that improve on this specific point.

Alas, I don't know of any such metric (nor would I know how to design one, of course, otherwise I'd publish it ;-) ).