I didn't know anything about SuperGLUE before (turns out it's a benchmark for language understanding tasks), so I clicked around their site where they show different examples of the tasks.
One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
One of their examples, though, didn't make any sense to me:
1. The pilot managed to land the airplane safely
2. The enemy landed several of our aircrafts
It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.
My mother got a perfect 800 score on the GRE English test many years ago when she wanted to go back to graduate school after her children were grown up enough (high school/college age).
She told me that the way she got her perfect score was by realizing when the questions were wrong and thinking of what answer the test creators believed to be correct.
She had to outguess the test creators and answer the questions wrong -- in the "right" way.
I've had the 'pleasure' of taking some 'Microsoft certifications' at various companies I worked at in the past and this sounds extremely familiar.
"I probably won't ever do it like that and/or there's a syntax error in all four of the answers... but this is the answer you want to hear. It's wrong, mind you, but it's what you want to hear."
Reminds me of the 1 question I got "wrong" on a DOS test (years ago) at TAFE.
The question was "How do you delete all files in the current directory?". Using DOS 6.22 (I think, it's from memory).
My answer "del." was marked incorrect. Because the teacher didn't know enough about DOS to understand that's the standard shortcut for "del .". And the teacher refused to even try out the command, lets alone fix the incorrect mark. sigh
It's not always insanity, sometimes just sub-optimal / way over-engineered in my opinion.
They're getting better at it though. More recently I've done their devops certification and it looks like they're recommending somewhat more sane practices now...
There were still questions where even after three or four tries at certification / reading up on whatever Microsoft thinks is 'good' we didn't find 'the correct answer' according to Microsoft though... ¯\_(ツ)_/¯
I'm a spatial thinker, and I have a similar problem: I see all the answers as correct. E.g. "which one follows this sequence?", and I can find a pattern that fits every alternative. So I have to figure out which option the test author thinks is correct.
Back when I took the 'C# certification' (70-483 I think?) there were multiple questions in the style 'which of the following answers will make the program compile', where all four answers had a syntax error, or the program had a syntax error at a different line that would cause an issue regardless of your answer.
I tried the dispute process but it's basically impossible to dispute / report broken questions unless you have a photographic memory.
I have achieved similar results by similar means in both English and certain other subjects wherein one would assume a “true academic” would “know better” (picking out Sin[x]=2 as being “evidence of error in prior working” when x could merely be Complex, or marking “f[f[n]]=-n” as “unsolvable” when it just requires a bit of lateral thinking). This always depresses me, like when (as a Brit) I hear Americans say “I could care less” as an indicator of disregard, when actually that indicates they are somewhere above the point of minimal regard.
I think this is really interesting, because "the enemy landed several of our aircraft(s)" is the sort of sentence I'd have hauled a student up for using as a teacher, because 1) it's a none standard, arguably incorrect usage they've used either because they're a none native speaker or because they're trying to be clever and failing, and 2) because the plural of aircraft is aircraft. Nevertheless the author of this sentence almost certainly meant land to mean something different (shot down) than the author of the first, and we can infer the author's intended meaning despite the none standard usage.
This poorly written sentence is the sort of thing you see all the time in the real world, especially from none native speakers, children, and people writing about a topic outside their expertise. If a program can spot the difference in the usage of the word land between these two sentences and infer what the intended meaning in the second sentence is, then it's doing pretty well. Just inferring that land is used to mean something different in the two sentences is less impressive but still pretty cool and I'm not sure which claim is being made.
If you teach others English, please learn the difference between "none" and "non". You mean "non-standard" in all your examples here (if British) or perhaps "nonstandard" (if American).
I would have assumed the second used the term landed to mean acquired. But only after being told that its meaning is supposed to be different from the first. With no other context from those two sentences, I’d have guessed #2 meant land the same way as #1.
One other point: I’ve never heard the term “landed” to mean “grounded”, which is maybe the actual intent of #2, but maybe the ai sentence generation is off.....
The example directly below that: "Justify the margins" and "The end justifies the means" is the one I find dubious. Obviously the former could mean to format a document, but those exact words in that structure could be a demand for someone to justify a financial margin for example. It is both true and false depending on the context.
It sounds like you're talking about garden-path sentences [0], and in particular: "time flies like an arrow; fruit flies like a banana" [1]. These are sentences whose structure tricks the reader into making an incorrect parse. My favourite of these has always been: "The horse raced past the barn fell".
I've always enjoyed the multiple valid parses of "Time flies like an arrow". I can't wait for AI to generate more Escher sentences like "More people have been to Russia than I have" ( https://en.m.wikipedia.org/wiki/Comparative_illusion )
You know, I only just now got the second interpretation of that sentence. I always thought of it like "Time flies like an arrow (straight and in one direction), Fruit flies like a banana (when thrown)"
"The horse raced past the barn fell, which has been haunted since all those teenagers were murdered there."
(Noun-adjective is a rare formation, but amusingly more common in the same situations where the author uses rare and archaic definitions like the adjective "fell".)
"I eat my rice with butter." could mean that you use butter as a utensil to eat your rice with. There is often an unlikely way of parsing the sentence that gives an alternate meaning. The point is to test the computer to see if it can distinguish the likely parse from an unlikely one.
These aren't really alternate _parses_ though (in the sense that they don't give different parse trees). They do highlight the different possible meanings of "with" though.
I think "I eat my rice with chicken" vs "I eat my rice with children" vs "I eat my rice with chopsticks" is the canonical example here.
There's a whole field in NLP involved in showing what changes happen to entities mentioned in a sentence as a side effect of the sentence, and this example shows it pretty well.
I think it's clearer if you say "I usually eat X with Y", i.e. Y is either the company, the tool or the condiment that you eat with (contrasted with "I'm eating my X", where X is a dish like "rice with chicken")
Not to mention something that almost all NLP systems are resoundingly terrible at - short-term memory. If we've been talking about corporate financials for an hour and I say 'Justify the margins', it should be crystal clear what I mean. But most automated systems try to operate without a hint of memory or 'state' being tracked.
I'm guessing this is intentional. To a human, although this could be somebody being asked to justify their financial margins that's not a very likely answer. The human can easily see that, while it's possible they're the same meaning, given the lack of any other context the answer is that they're not.
The enemy could have landed several of our aircraft on one of their runways. Agassi may have beaten Becker over the head with his tennis racket. I suspect part of the test is that there can be other meanings that do technically work.
Would a native English speaker use the word "landed" in this way? In the context of aircraft? "Landed" is badly ambiguous here and several distinct meanings are plausible. Captured is the most natural word given your interpretation.
Honestly that sentence -- the use of landed and that awful plural -- approaches engrish. Is that deliberate or is the use of English here just badly flawed? I can't see any other possibilities.
There are a lot of native English speakers in the world and not all of them use the same idioms that you do. This seems like perfectly valid English to me; some other words that could be used instead of “landed” in the aircraft sentence include “bagged”, “nabbed”, “poached”, “got” and “did in”. One of the entertaining aspects of English is the multitude of ways it can be used.
Those are all good synonyms for "got" in the context of shooting at things. But none of the others already has a strong meaning in the context of aircraft, and this other meaning does create some confusion, which is why many speakers would avoid it (if thinking clearly).
I wouldn't use it that way myself, but at the same time the intended meaning is clear as day to me from the context. I'm surprised by the reactions. "Enemy" should give it away immediately.
I'm surprised too. This algorithm is about understanding language, and surely that includes understanding the intended usage. This is something humans have to do all the time. So what if there isn't a formally archived consensus on the definition of "landed" as used in the example. The intended meaning is clear, and so hats off to the algorithm for rolling with it, that is in my mind the fundamental goal of understanding language.
It's more or less impressive depending on whether the algorithm already ate a dictionary; then it's the difference between inferring from context, as people do, and simply knowing all of the known unconventional usages in a very inhuman way.
I don't know. I guess I understood the sentence with 'landed' the same as I would have if someone told me that they'd 'landed a big job'. I wouldn't really say this myself, though I do hear people say 'landed a big catch' when they're talking about fishing.
I don't think anyone would use that particular construction, unless it's some weird dialect of pilot-speak or argot among anti-aircraft folk that I'm not aware of. It's just really awkward and unnatural. Possibly correct, but not the way that anybody actually talks.
Possibly, you could say the planes were landed, as in forced to stay on the ground (because of damage, fear of enemy fire, or damage to the runway). But grounded would be better.
Or just average. There's contextual dependencies in most speech, and (as displayed in this subthread) not every speaker of a language has the same context. It's a fallacy to think that if you lack context for one of the examples, you will automatically score less than average -- other people may miss context for things obvious to you.
If taking the "captured" interpretation, I think it could be reasonably inferred that they successfully landed the aircraft at an airfield afterwards (same meaning). This was my initial read of it and it does not seem strange to me on reflection.
I would like also to point out that even if we do interpret the second as meaning "destroyed", the first could then be interpreted as a combat aviator shooting down an opposing aircraft, bringing us back to the same meaning. Or perhaps both of my interpretations are correct and the meanings are different...
What this tells me is that the benchmark is not very useful.
The benchmark is useful primarily because it puts humans and computers on a level playing field. Human readers will misinterpret written language, and human writers will poorly represent concepts.
The propensity to make mistakes in comprehension is unavoidable; humans only approach 90% accuracy, and computers are getting close to the same level of accuracy on the same base materials as humans.
The other way of testing would be to devise a test where there is only a single interpretation, where the context is clear, and there is no ambiguity in meaning. In that case a competent human and computer algorithm could be expected to answer all questions perfectly.
The purpose of this benchmark on the other hand is to test comprehension when meaning is not explicit and context clues are implied, something humans have had the advantage at over computers until quite recently. The computer won't be 100% accurate, but that's not the purpose of this test.
Aircraft typically get captured on the ground, or get forced to land by threat of being shot down. “Landed”, for me, would require the enemy to actively land the plane, just as “landing a fish” requires both the fisherman’s action and moving the fish from water to land.
I also wouldn’t use “landed” for destroying an enemy plane (neither by shooting it down nor by destroying it on the ground)
That, realistically, leaves hacking the plane’s electronics and then directing it to one’s own airfield.
Yes -- if the sentence had been "grounded the aircraft", then the meaning is obvious. But even though "land" is a synonym for "ground" I don't think there's an equivalence of meaning here. I'm struggling to find a sense in which "landing an enemy aircraft" is a meaningful concept short of jumping out of one plane to land on another one, removing the pilot, and landing the plane, which is a bit much for the single word "landed" to carry.
- The enemy stole the aircrafts, and after some drama in flight managed to land several of them.
- The enemy used remote control to force them to land.
- The enemy used coercive force to force our pilots to land them.
- The enemy captured them.
- The enemy shot them down.
- During a friendly event while we set our differences with our enemy aside and agreed to fly each other's aircraft at an airshow for some reason, we landed several of theirs, and they landed several of ours.
- There was a hearing mistake and "energy" (as in energy beam beamed by a UFO) was accidentally transcribed as "enemy."
- The writer is just screwing with us.
- The writer is not a native speaker of English, and they made a mistake and actually meant that the enemy boarded several of our (parked) aircrafts.
- The writer is creative with language and believes that it would be cute to say that when an enemy projectile struck one of our aircrafts, then the enemy has "landed" that aircraft as one would land men on the moon or land rovers (no pun intended) on Mars.
- An ML algorithm from the future traveled back in time, writing specific SuperGLUE examples to poison AI research, thereby preventing the emergence of a competitive AI which would also master the secrets of closed timelike curves
Actually the algo was able to determine that we exist in a simulation and performed metaprogramming by hacking the sim infrastructure (higher-order dimensions of spacetime) and rewriting the future, which to us appears as if it traveled to the past.
Ahh, I just found where that example is taken from: https://glosbe.com/en/en/land. If you find on that page you'll see the exact sentence "the enemy landed several of our aircraft" (without the s after aircraft) which it says means "shoot down".
I have still never heard landed used in that way, and again in other dictionaries I searched I couldn't find that definition either. Thus, this is a case where the "AI" may get it "right", and I, the human, would get it "wrong", but that still feels like it's missing a huge point. It feels like you could get a number of errors by the human which the AI gets "right", but in fact the human is better able to detect what is rare, uncommon or at least ambiguous.
I've worked in aviation for 8 years and also didn't understand this use of "landed". I've heard "grounded" used like this: "The maintenance issues grounded the jet," but not "landed".
Working in aviation probably puts you in a mindset that makes it harder to parse. It's not being used in a way that is related to flight or aircraft.
It's like if people were discussing where to have a conference, and one of them proposed a hotel. Then another person suggested a resort. Then a third person floated a cruise ship. Cruise ships do float, but it has nothing to do with anything. They are floating the idea of the ship as a venue.
Do you normally "float" a cruise ship though? A more apt analogy might be "dock". Maybe a news report says that a vacation company has broken some regulation so the government docked a cruise ship, meaning they took away a cruise ship like you would dock someone points. It's ambiguous at best.
You could float the idea of it, and you might also think that to float a ship means the process by which it is landed in the water when coming out of a dock?
I think the sentence is referring to aircraft that have been forced to land by the enemy, in contrast to "grounded" aircraft that had not taken flight.
I haven't worked in aviation so my understanding of terminology could be wrong, but either way it is definitely an unusual example.
"The enemy landed 4 of our aircraft" without context wouldn't generally mean "forced to land" imo (as a native speaker). It would mean that they either destroyed them or managed to acquire them.
For example I might say that "they landed 4 aircraft with their daring" if they forced us to abandon an aircraft carrier (e.g. by sinking it) and then managed to steal 4 of the planes (before it sank). Or I might say "they landed 4 aircraft with that bomb" if they dropped a bomb on an airfield and it destroyed 4 aircraft.
Right, I think you understand the word as I do: 'verb' + ed. "The enemy landed the jet" as in they forced the jet to land either directly or indirectly. This would mean that the two sentences use "landed" the same way. But my understanding is SuperGLUE's official answer is that these use "landed" differently, with the rationale that "landed" is idiomatic and just means to procure or bring about (e.g. "I landed the job") and it happens to be used with planes.
I think if we really looked at it, it likely comes from fishing where "to land" a fish means to succeed in quite literally getting it onto land from the water. But we use it as "to successfully get" (something typically uncertain) in many other contexts.
I agree, AI should realistically be able to detect the rare/uncommon/ambiguous usage as well, and rated for that.
I suppose in some cases it could score better than humans on the SuperGLUE benchmark... but eventually it will have to come back down to near the human score as it gets more accurate.
Why? In many of those benchmarks the average human score is not 100, but the AI progression doesn't really have a ceiling or a slow down at the human number. It should go through it and settle somewhere above. Plus we create these tests with our own limitations. There may be a world of more complexity or subtlety that we all fail to grasp but the AI will.
I think humans are already behind at the face recognition task for example.
>If you find on that page you'll see the exact sentence "the enemy landed several of our aircraft" (without the s after aircraft) which it says means "shoot down".
They're not shy about illustrating a military application up front!
I've never seen "landed" used as in the second sentence, but I was definitely able to understand from context that it was not being used to mean the same thing as in the first sentence.
I haven't, though I'm familiar with that use of "landed" for fish.
As a lifelong native speaker (PNW English), I've also never heard "landed" used to refer to shooting down or capturing enemy airplanes. I could understand it from context, which is what I suppose the software is also going for, but I'd mark it with a red pen if someone showed me that sentence, just for clarity's sake (i.e. understandable from context but should be replaced).
'Landing' an aircraft does not imply shooting it down. 'Downing' an aircraft does imply that.
These uses of 'land' and 'down' are military euphemisms for the use of force to compel a reluctant pilot to land. The difference is the degree of violence used.
Involuntary 'landing' implies the aircraft is forced to land by a party other than the pilot because if the pilot did not comply the plane would be shot down or collide or crash. It usually implies survival of the pilot. 'Downing' also means involuntary removal of the aircraft from the sky, but does not denote that a violent landing did occur, only that the likelihood of violence is much greater because a (more abrupt) landing was forced upon the pilot. From what I've read, 'downing' usually implies the plane crashed.
I think the difference in these sentences is about the way to land. In sentence 1, the pilot of the aircraft is in control. In sentence 2, the pilots are not in control, the enemy forced them to land (whatever the means).
If I read these two sentences in context of some news, they would evoke very different "landing" scenes in my head.
In looking through many of the replies to this downstream, it appears that the system is actually correct in that there's an obscure use of 'land' at play in the second sentence.
It makes me think that there's going to be many adversarial examples of text that humans parse one way because of common usage while machines parse another way because of details like this.
For #2, my immediate read was that the planes had been shot down. If the context were to suggest that the enemy had somehow hijacked the planes, then of course the word land would mean the same in both sentences.
I have never used or heard 'land a plane' in this context, but the sentence didn't immediately strike me as unnatural, incorrect or unclear.
> I have never used or heard 'land a plane' in this context, but the sentence didn't immediately strike me as unnatural, incorrect or unclear.
It struck me as pretty awkward and very ambiguous. It probably means 'obtained' but 'captured' would be a far better word in that case. The suggestions that it means 'hit/shot' don't work because in that case it's not the aircraft that is landed but the shot, which is landed on the aircraft.
Also the use of the incorrect plural "aircrafts" when 'aircraft' is both singular and plural makes me think it's just a poor question.
The very fact that there's so much discussion about it is evidence that it's not straightforward even among native English speaking humans.
Seems like a really odd way of saying it but that’s what I’d think too, as in “landed their shots”.
This is either a poor question, or a really great question, if the goal of the test is to confuse computers where a human would normally say “huh, weird way of saying that but I guess they mean...”.
From the abstract of the associated paper: "performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research."
It occurred to me that hn_throwaway_99's question, and the responses to it, are the sort of dialog in which one could find additional headroom for further research into natural language understanding. We can understand, for example, that while the two uses of 'landed' are different, they are not completely unrelated, and we can explain how they are related, for example by introducing a third construct, 'landed a fish', as a couple of replies have done.
I'd argue that greater-than-human language ability is by definition useless.
Language is specifically a human communication tool; there's no value in surpassing the language skill that humans have, if indeed such a thing is even meaningful (what does it mean to be better than the best French person at French?)
I disagree, greater-than-human-average is not useless. There's a lot of room for misinterpretation in human language. We compensate for that by non-verbal communication (posture, expression) or by asking for clarification. On top of that, most places have local expressions or idioms that are not necessarily globally recognized.
So there are two ways in which a language automaton must be better than a human: it cannot rely on non-verbal hints nor can it easily ask for clarification, and it must be able to interpret many different dialects and idioms correctly -- many more than an average human would need to.
I do not think this result is that close to a greater-than-human language ability in general, and I do not think they are claiming it. I think the point is that, with scores on this test closely approaching average human scores, there is not much headroom for this particular test to drive, or measure, further progress.
So, here is the thing. ML shouldn't just be about learning rules. It should be about actually learning, and understanding.
Just because you've never heard the word used that way, you were able to infer it meant something different. Even with the use of aircrafts.
We all make mistakes when writing or speaking. We don't let that get in the way of interpreting the information being passed. Even if we post comments that contain errors.
Yes, the second should be, "The enemy downed several of our aircraft." Landed can be used to mean "bagged," as in, "We finally landed the Smith account," (it's a fishing term), but it should not be used in this figurative sense when referring to aircraft, because of the obvious confusion with the common, concrete sense of the word. And, yes, it should be aircraft.
It depends on what your goal is. But in most cases, I'd say no. If the goal has anything to do with understanding real language written by real humans, it's better for the system to be able to handle texts with errors.
True, but having some noise in the label is actually good for generalization. If it's only learned on perfectly correct sentences then its tolerance for mistakes will be very low.
It's weird, because I understood the second one as meaning shoot down, yet to me that's the same definition of landed. You just assume the enemy didn't land them gracefully without a scratch, because they are, well, enemies.
So I would have answered that the word meant the same thing.
> One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
Can anyone explain what makes this difficult for a machine? What existing knowledge does the machine start with? At a glance, it doesn't feel like it should be difficult if the machine had a large corpus to train on that showed many examples of each word in different contexts.
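For example, the naive approach I have in mind would be to compare contextual embeddings of the shared word in the two sentences. A rough sketch, assuming the transformers and torch packages are available (the model choice, mean pooling, and similarity threshold are all arbitrary here, and the real WiC data supplies the target word's position instead of making you search for it):

    # Rough sketch: call it the "same sense" if the contextual vectors of the
    # shared word are similar enough. Threshold and pooling are arbitrary.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def word_vector(sentence, word_form):
        """Mean-pool the hidden states of the subword tokens covering word_form."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]              # (seq_len, dim)
        target = tokenizer(word_form, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(target) + 1):                 # crude position search
            if ids[i:i + len(target)] == target:
                return hidden[i:i + len(target)].mean(dim=0)
        raise ValueError(f"{word_form!r} not found in {sentence!r}")

    v1 = word_vector("The pilot managed to land the airplane safely", "land")
    v2 = word_vector("The enemy landed several of our aircraft", "landed")
    same_sense = torch.cosine_similarity(v1, v2, dim=0).item() > 0.7

Presumably the hard part is that the boundary between "same sense" and "different sense" is nothing like a fixed similarity threshold, which is why the benchmark exists.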
1) The pilot [voluntarily] brought down his aircraft.
2) The pilots [involuntarily] brought down their aircraft [because some authority figure(s) forced them down.]
The active verb 'land' can be performed by different actors: pilot vs a more powerful agent (usually who flies an armed aircraft). The voluntary/involuntary agency is a subtle difference that only those familiar with this military practice are likely to grok.
Possible, but also the worst context to use land in. You can land a car, but if the game show host said you landed a small airplane, there’d be a laugh from the crowd.
The examples look like they weren't written by a native English speaker. It's funny reading English tests from non-English-speaking countries, because a lot of them focus on pedantic points that are long lost while following conventions that to us just feel _different_.
Yeah I don’t get the s at the end of aircraft. Landed, it would seem, would be land as in acquire, although that’s a bit odd of a construction. It seems rather forced. It may possibly mean that the aircraft were forced to land by the enemy. So it’s a tortured construction.
Still ambiguous. Landed as in make it contact the ground or landed as in obtain, like in landing a job?
For me taking an airborne object and making it touch the ground is pretty much the same meaning whether it's from the inside or remotely or shooting it down.
Well "he landed the deal" implies a score or a hit. So to say they "landed" the planes could vaguely make sense but it is hardly good English. They might have been thinking of "grounded"?
'Grounded' means the plane could not take off. It was on the ground and must remain there.
Landing a deal (or a fish) is like landing a plane. A human acts to cause a desired outcome. Unlike forcing a pilot to involuntarily land a plane, the perspective of the fish as involuntarily being forced to land is not a necessary inference for this use of 'land'.
I think people are digging too deep for an answer here... it seems to me to be a simple mistake, which on the scale at which they're evaluating those models is not statistically significant.
It's being used by analogy with "landing a fish". I've never heard it either, but I could believe it's in the argot of military airmen in some English-speaking country.
It's conceptually the same - having an entity go from water or air to the ground. The hard part would be to associate the fact that there's no way for an 'enemy' to land the aircraft other than to do so forcibly, which implies shooting it down.
The second implies that the aircrafts were shot down; the first states that the aircraft landed safely. It looks like this reduces to the machine being able to figure out whether or not something is good or bad for the speaker.
Good point, and most of the replies ignore the key point to me, which is: you are right about the plural of aircraft, and the benchmark is horribly wrong, so why should we take any notice of this benchmark?
One thing to always point out in these cases is that the human baseline isn't "how well people do at this task," like it's often hyped to be. It's "how well does a person quickly and repetitively doing this do, on average." The 'quickly and repetitively' part is important because we all make more boneheaded errors in this scenario. The 'on average' part is important because the errors the algo makes aren't just fewer than people, they're different. The algos often still get certain things wrong that humans almost never would.
This is really really super great, let's be clear. It's just not up to the hype "omg super human" usually gets.
It seems to mean "How well does Mechanical Turk do the task?" which is a separate thing again. And yes - error type is at least as revealing as error frequency.
I have no idea where the real human baseline is, or how to find it.
Also, consider this discussion. GLUE winners may be able to make informed parsing guesses about single text blocks, but they're years away from being able to make a useful contribution to a discussion like this one.
Regarding the type of errors, it seems like the benchmark should be able to take that into account. That is, get a load of humans to do the task on the same specific examples, then for each example you know how hard it is, and what acceptable answers are (I bet a lot of the ground truth is wrong or ambiguous).
Then you can benchmark your AI but penalise it more heavily for getting things wrong that are obvious to a human.
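Concretely, the weighting could use per-example human accuracy collected from those annotators, something like this toy sketch (the numbers are made up):

    # Toy difficulty-weighted score: missing an example most humans get right
    # costs more than missing a genuinely ambiguous one.
    def weighted_score(model_correct, human_accuracy):
        """model_correct: list of bools; human_accuracy: fraction of annotators correct per example."""
        total = sum(human_accuracy)
        earned = sum(h for ok, h in zip(model_correct, human_accuracy) if ok)
        return earned / total

    print(weighted_score([True, False], [0.95, 0.40]))   # ~0.70: missed only the hard one
    print(weighted_score([False, True], [0.95, 0.40]))   # ~0.30: missed the "obvious" one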
That would be ideal, if money weren't a factor. Since money is a factor, I wonder what the tradeoff is between labelling each instance N more times versus just getting N times more instances labeled.
There was an article[1] posted to HN recently about these benchmarks, and it was pretty skeptical.
Regarding SuperGLUE specifically, it asked:
"Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that's specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does just it mean that science has gotten better at teaching machines to the test?"
This feels hollow. Can't this be said about any benchmark? It seems natural and proper that as one benchmark becomes saturated, we introduce harder benchmarks.
I don't think anyone in the field thinks that once we match human performance on benchmark X, we're officially done. It just means it's time for more interesting benchmarks.
Over time, if it starts to become difficult to design benchmarks that humans can outperform machines on, then that will prompt interesting conceptual work about what exactly the difference between human and machine language competency is. And then that will lead either to more sophisticated benchmarks or alternatively gradually more sophisticated and persuasive arguments that machines really have surpassed us in language competence.
I don't think we're yet at a point where we don't know how to make harder benchmarks, and if and when we do hit such a point, I'd definitely bet the result will be a conceptual advance in benchmark design rather than declaring machine superiority once and for all. At least for the first few rounds of this cycle.
"But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing."
This is true, and absolutely a weakness of these tests.
However they don't publish how well a human performs on the dataset without "not" in it.
They do initially note that "even human beings don’t do particularly well on this task without practice".
I've looked at the warrant task. It's pretty tricky! I'd bet real money that untrained humans perform much, much lower than the 80% correct rate they get on the full set on ones without "not". I don't think it would be as low as the 53% BERT gets, but it would drop significantly.
I find the HANS analysis[1] much more compelling, but again I'd note that humans suffer on this dataset too (although again - not as badly as models do).
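For reference, the "not" cue from the quoted passage amounts to a baseline like this (the data layout below is my guess at the warrant-pair format, not the actual dataset files):

    # How far does the single rule "pick the warrant containing 'not'" get you?
    # examples: list of (warrant_a, warrant_b, correct_index) triples.
    def cue_accuracy(examples):
        hits = 0
        for warrant_a, warrant_b, correct in examples:
            guess = 0 if "not" in warrant_a.split() else 1
            hits += int(guess == correct)
        return hits / len(examples)

If that comes out near 0.61 rather than 0.50, the labels leak through the cue and a model can look competent without doing any real reasoning.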
Yes. This is more generally known as Goodhart's Law[0]: when a metric is used as a goal, then people will game the metric in order to win, making the metric useless.
There is no fundamental way to overcome this problem, except by not using metrics as goals.
Even when you will be able to have a 100% coherent and deep discussion with an AI over a niche technical domain, there will be people to pretend that the AI "fakes" it.
Systems like GPT-2, incredibly (I used to be a skeptic of a pure statistical approach) manage to extract meaning, keep a theme, and understand the intent behind a sentence. They are amazing.
When you have a system that displays all the characteristics of understanding something, it is irrelevant whether or not it "fakes" it. No one ever proved that humans are not "faking" intelligence either.
As long as they're not training on the test data, and they're not submitting hundreds of submissions tweaking parameters trying to improve their score, I don't see what the problem is. If the algorithm can do a great job at classifying hundreds of new test cases it has never seen, and it isn't over-fitted, then that means it is good at that specific task. Of course the task itself may or may not be useful, and you can have some meta discussion about what "understanding language" is, but the computer definitely is doing a super human job at that given task.
(I work in this field, although not specifically on benchmarking)
I think that this article makes a good point, and correctly identifies weaknesses.
However, I also think that humans often take very similar shortcuts. There are good reasons why "bag of words" approaches work much of the time. Additionally there's lots of evidence showing that very rapid reading by humans does not imply deep understanding.
I think it's very important that people are aware of the weaknesses of these types of models. However, I think it's interesting that these weaknesses are becoming harder and harder to find.
The machines are always trained with the same dataset for each task. The biggest difference right now is small technical modifications on models that are also pre-trained on gigantic unlabelled datasets. This doesn't feel like we're teaching them to do the test specifically at all.
AX-b "is the broad-coverage diagnostic task, scored using
Matthews’ correlation (MCC). "
This is how the paper describes this test
"
Analyzing Linguistic and World Knowledge in Models GLUE includes an expert-constructed,
diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and
world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that
indicate the phenomena that characterize the relationship between the two sentences. Submissions
to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI
classifier on the diagnostic dataset, and analyses of the results were shown alongside the main
leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain
it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction
and neutral into a single not_entailment label, and request that submissions include predictions
on the resulting set from the model used for the RTE task. We collect non-expert annotations to
estimate human performance, following the same procedure we use for the main benchmark tasks
(Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the
two-class variant of the R3 metric used in GLUE) of 0.77.
"
If you look at the scores, humans are estimated to score 0.77. Google T5 scores -0.4 on the test.
How did T5 get such a high score if it scored so abysmally on the AX-b test?
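For reference, MCC for a binary task is computed from the confusion-matrix counts, roughly like this (a minimal sketch):

    # Matthews correlation coefficient from a binary confusion matrix.
    from math import sqrt

    def mcc(tp, fp, tn, fn):
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

It ranges from -1 to +1 with 0 at chance level, so -0.4 on AX-b would be worse than random guessing, while the human estimate of 0.77 is far above chance.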
The AX scores are not included in the total score.
From the paper: "The Avg column is the overall benchmarkscore on non-AX∗ tasks."
If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates.
Hi, one of the paper's authors here. We didn't submit our model's predictions for the AX-b task yet, we just copied over the predictions from the example submission. We will submit predictions for AX-b in the next few days.
RcouF1uZ4gsC makes a compelling case for the results on this test to potentially be a significant caveat to the results, and also to the claims of achieving a near-human level of performance. If so, then why would you make such claims before you have these results? Or at least mention this caveat at the points where you are making the claim, such as in the abstract.
To be clear, here is the claim we make in the paper (we did not write the title of this post to HN):
> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was designed to comprise of tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.
I'm not sure why the SuperGLUE/GLUE benchmark was designed to omit the AX-* scores from the benchmark score. It may be that they have no corresponding training set.
My mistake - I had overlooked the AX-* scores being expressly omitted from these benchmarks. Maybe it is possible, then, that they could provide the additional headroom for further research?
Regardless of the status of the AX-* tests, I am very impressed by your results on the SuperGLUE benchmark.
Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?
For example their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume is also scraped from the web.
Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:
- Common Crawl overall is a sparse web dump, it is unlikely that the month we used includes any of the data that are in any of the test sets.
- In order for the data to be useful to our model, it would have to be in the correct preprocessed format. ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format so I don't believe this would ever happen.
- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.
- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once ever over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
However note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data) but I would be happy to be proven otherwise.
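To make the preprocessing point above concrete, here is a toy version of that text-to-text conversion (illustrative only; the exact prefix string and field order come from our preprocessing code, which this does not reproduce verbatim):

    # Toy text-to-text conversion for an MNLI example. The model only learns to
    # map `source` strings to `target` strings, so a raw benchmark page sitting
    # in a web crawl does not look like useful supervision for these tasks.
    def mnli_to_text_to_text(premise, hypothesis, label):
        source = f"mnli: hypothesis: {hypothesis} premise: {premise}"
        target = label                  # "entailment", "neutral", or "contradiction"
        return source, target

    src, tgt = mnli_to_text_to_text(
        premise="The pilot managed to land the airplane safely.",
        hypothesis="The airplane landed.",
        label="entailment",
    )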
I think that this is a good question that I would also like to know the answer to. Additionally, are there other benchmarks or tests where this issue (possibly) presents itself?
This surprised me a bit, on the creation of the corpus they use for training:
"We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”."
I don't understand this decision. This list contains words that can be used in a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc.
I can understand that they want to avoid websites full of expletives and with no useful content, but outright excluding any website with even one occurrence of such words sounds too radical. If we ask this model a text comprehension question about a legitimized bastard that inherited the throne, or about fecal transplants, I suppose it would easily fail. Strange way of limiting such a powerful model.
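To make the objection concrete, the filter amounts to something like this (the matching details and the tiny word list here are my own illustration, not the actual C4 pipeline code):

    # Blanket filter: drop the whole page if any token is on the blocklist.
    BAD_WORDS = {"anus", "bastard", "erotic", "eunuch", "fecal"}   # tiny illustrative subset

    def keep_page(text):
        tokens = {t.strip(".,;:!?'\"()").lower() for t in text.split()}
        return tokens.isdisjoint(BAD_WORDS)

    keep_page("A legitimized bastard inherited the throne.")        # False: page dropped
    keep_page("Fecal transplants are an active research area.")     # False: page dropped
    keep_page("The pilot managed to land the airplane safely.")     # True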
They say they removed pages, not websites. Having false positives isn't a problem when you're still left with 750GB of data—quality matters more than slightly higher quantity at that point.
Sorry, I was thinking about pages even though I said websites. Native language interference (typically, we use the same term for pages and websites in my language).
Anyway, my point is not a matter of quantity. The way they're doing it, they have 750 GB of data, but they have exactly zero data that talks about bastards, fecal transplants, etc. So they may have a hard time answering questions about those specific subjects.
As someone working in the field, I congratulate the excellent accomplishment but agree with the authors that we shouldn't get too excited yet (their quote below after the four reasons). Here are some reasons:
1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here: http://www.ericswallace.com/triggers
2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.
3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful but human-level understanding is more than correlations.
4) The performance on the datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).
"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."
I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.
---
We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:
* Test for conceptual understanding in a non-multiple-choice format. Example: Write a summary for a New Yorker article, rather than standard news pieces (which tend to follow repeated patterns).
* Commonsense test with longer chains of inference than those needed for solving Winograd Schema and set in non-standard situations (e.g. fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.
* Understanding novel, creative metaphors like those used in some essays by professional writers or some of the Economist's title articles.
I think that the point about the majority of tests being multiple-choice is the most important one to underline.
Structuring a problem as a multiple choice task is basically turning it into a classification problem, but it doesn't really answer the question everyone wants answered: is it really possible to reduce the problem of language understanding to classification? i.e. is it really possible to understand human language with no other ability than the ability to identify the classes of objects?
But that is a question that has to be answered before any performance on benchmarks that reduce language understanding to classification can be appraised correctly. If accurate classification is not sufficient for language understanding, then beating benchmarks like SuperGLUE tells us nothing new (we already know we have good classifiers).
The problem here is that we have no good measures of language understanding, of humans or machines- because we have a poor, er, understanding of our own language ability. Until we know more about what it means to understand language it won't be possible to evaluate automated language understanding systems very well.
Hopefully though, the skepticism I've observed around results like the one above, will lead to a renewed effort to research our language ability, and perhaps our intelligence in general.
> 2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.
...but, humans evolved the ability to use language over hundreds of generations... So... Maybe that's not such a bad thing?
Indeed this is important to realize: Training such a generic model from scratch does not only reiterate learning, but the entire evolutionary process that led to the emergence of neural circuits actually capable of such learning. That perspective makes many of the current achievements -- error-prone as they might be -- even more impressive!
> 1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here
Humans are susceptible to adversarial triggers too, so this doesn't necessarily make the model less impressive. It is a big problem in practical use though.
I don't think universal triggers exist, since at that point they are just language features. But there are plenty of less universal triggers
Let's imagine that in the brain everything goes through a series of models: first tokenization into words, then we build something like an abstract syntax tree, then we analyse meaning in the context, etc.; and each time one of these steps reaches a nonsensical result we start over with additional parsing time allocated. It's probably not true, but close enough to be a useful model.
Now what you consider an adversarial example depends on how far down the stack it has to go until it's caught:
- "The old man the boat." fails in the early parsing steps. We reliably miscategorize old as adjective when it's a noun.
- "More people have been to Russia than I have, said Escher" goes a step further, it parses just fine but makes no sense. The tricky thing is that you might initially not notice that it makes no sense. This is about the level where AI is today.
- "Time flies like an arrow; fruit flies like a banana" makes perfect sense, but you could notice that the straight forward way to parse it leads to a non-sequitur and parsing it as "time-flies love eating arrows; fruit-flies love eating bananas" is probably a better way to parse it.
Of course that's just the parsing steps. You can trick human "sentiment analysis" by swapping words without changing the meaning. Compare "this bag is made from fake leather" to "this bag is made from vegan leather". PR and marketing have made a science out of how to make bad things sound good. Similarly PR is great at finding adversarial examples for reading comprehension, where they say one thing that's nearly universally understood to mean something different (or to mean nothing at all; or where something that seems to mean nothing at all actually means something very siginicant).
Of course we assume all text to be targeted to humans; so if something is widely misunderstood by humans we blame the sender for writing such a bad message; when it's widely misunderstood by AI we blame the AI for being so bad at reading.
"The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems."
"We take into account the lessons learnt from original GLUE benchmark and present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard."
Assuming that the baseline human score was set according to the performance of adult humans, then according to these results T5 has a language understanding ability at least as accurate as a human child.
In fact it's not just T5 that should be able to understand language as well as a human child, but also BERT++, BERT-mtl and RoBERTa, each of which has a score of 70 or more. There really shouldn't be anything else on the planet that has 70% of human language understanding, other than humans.
So if the benchmarks mean what they think they mean, there are currently fully-fledged strongly artificially intelligent systems. That must mean that, in a very short time we should see strong evidence of having created human-like intelligence.
Because make no mistake: language understanding is not like image recognition, say, or speech processing. Understanding anything is an AI-complete task, to use a colloquial term.
Let's wait and see then. It shouldn't take more than five or six years to figure out what all this means.
To clarify, I meant this comment as an expression of skepticism- I don't believe that the SuperGLUE benchmark really evaluates language understanding, or that BERT and friends are within a few percents of human language understanding. I think SuperGLUE is just another benchmark that is measuring something else than what it's supposed to be measuring (machine learning benchmarks usually do).
It seems that the teams behind the attempts to beat such benchmarks are aware of the weaknesses of the benchmarks though, so that's encouraging.
I attended one of Sam Bowman's talks (1).
His talk was about "Task-Independent Language Understanding" and he also talked about GLUE and SuperGLUE; he mentioned that some models are passing the average person in experiments. They did some experiments to understand BERT's performance (2) (similar to the article 'NLP's Clever Hans Moment'), but they found a different answer to the question of "what BERT really knows," so he was skeptical about all conclusions. Check these out if you are interested.
The AIs in the benchmark are all trained exclusively on text, correct?
My assumption has always been that to get human-level understanding, the AI systems need to be trained on things like visual data in addition to text. This is because there is a fair amount of information that is not encoded at all in text, or at least is not described in enough detail.
I mean, humans can't learn to understand language properly without using their other senses. You need something visual or auditory to associate with the words, which are really supposed to represent full systems that are complex and detailed.
I think it would be much more obvious if there were questions that involved things like spatial reasoning, or combining image recognition with that and comprehension.
Mmm. The philosophical position that it's essential to be embodied in order to have intelligence seems intuitively reasonable but is very much unproven. You will find philosophers and cognitive scientists who are sure you're right, but they don't have much hard evidence, and you will also find people like me who are pretty sure you're wrong but likewise have no hard evidence.
In the specific remember that deaf-blind people exist, so if you're sure that you "need something visual or auditory" then those people are not, according to your beliefs, able to understand language. I think they'll disagree with you quite strongly.
> remember that deaf-blind people exist [... ...] able to understand language
I got curious if/how deafblind people learn to communicate in the first place, if they are completely deafblind from birth. If humans can learn not just communication but language without either vision or hearing, that seems to suggest either extreme adaptability or language learning being quite decoupled from vision and hearing. From an evolutionary standpoint, I imagine that both deafness and blindness are probably uncommon enough that language learning could have explicit dependencies on both hearing and vision.
I found an old-looking video about communication with deafblind people. At the linked timestamp is a woman who is deafblind since age 2.
I think maybe CLEVR[0] dataset is what you are talking about?
Keep in mind that most of the current ML systems have diverged from biology. A majority of the recent breakthroughs come from mathematics; the rationale is that just because the human brain does it in a certain way does not necessarily mean it is the only way to do it.
It's not just grounding the language in vision, but the embodiment, first person perspective and ability to interact with the environment. Humans have had the benefit of slowly evolving in a complex environment which is too expensive to recreate for artificial agents. We can only create very limited sims vs the real world.
"Attention is all you need", indeed. Of course, our instinct tells us there is more to language inference than word proximity. And so results approaching or exceeding expert-level human baseline raise more questions than providing cause for popping champagne corks.
Question Answering is also advancing rapidly with insights from transformers and denoising auto-encoders, but is still far from the human baseline. The ease with which these models can answer a sample question such as "Who was the first human in space?" demonstrates both their efficacy and their limitations. In the large corpus of text they are pre-trained on, almost every document that contains the name "Yuri Gagarin" will in its near vicinity describe him in relation to the pioneering accomplishment for which he became a cultural icon.
And for even more generalizable scenarios, such as "What might you find on a Mayan monument?", it becomes imperative that an agent explain its reasoning in natural language as well, to enable self-correcting backpropagation of error correction.
Language may be considered low-dimensional relatively speaking. And sentence prediction across quotidian tasks manageable in current state-of-the-art architectures. But looking at how difficult it is to predict the next N frames of video given a short input example demonstrates the intractability of the problem in higher dimensional spaces.
Neural Models for Speech and Language: Successes, Challenges, and the Relationship to Computational Models of the Brain - Michael Collins
They came up with the SuperGLUE benchmark because they found that the GLUE benchmark was flawed and too easy to game. There were correlations in the dataset that made it possible to get questions right without real understanding, and so the results didn't generalize.
Could the same thing happen again with the better benchmark due to more subtle correlations? These things are tough to judge, so I'd say wait and see if it turns out to be a real result.
My experience with image classification benchmarks was that they approached human levels only because the scoring only counts how much they get “right” and doesn’t penalize completely whack answers as much as they should (like getting full credit for being pretty sure a picture of a dog was either a dog or an alligator). I suspect there’s something similar going on in these language benchmarks.
Use of Natural Language Understanding term in context of this benchmark is preposterous. No understanding takes place there. Please stick to NLP (Natural Language Processing) term for the next couple of decades. Thank you.
This clearly demonstrates once again that Google is miles ahead of the competition in AI. I mean, they just have the best data.
If you want an everyday example of Google's AI skills: switch your phone's keyboard to GBoard, especially all you iOS users, and you will see a night and day difference compared to any other keyboard, especially the stock one. When using multiple languages at the same time, the gap to other keyboards gets even bigger.
GBoard is my phone's killer app, and if Google dropped it for iOS I'd leave for Android the same day.
That's how I used to feel, but it's turning into a nuisance.
It used to stick to single words or sometimes splitting one if missing a space, but now will sometimes attempt to "correct" the sum of two perfectly valid standalone words after the fact, 97% of the time resulting in nonsense.
I have the opposite experience. Yes, some of the suggestions from GBoard are useful, but I feel there's an equal number of times where I've typed a complete word, only to hit space and have the word auto-corrected to what GBoard was expecting. As a typing aid, it's almost unusable because of that.
Several of the systems in this leaderboard utilize the BERT model, a clever approach devised by Google for natural language processing. A nice layman's guide to BERT:
My understanding is that a lot of these really high performance models that reach for every percentage-point possible require an absurd amount of hardware - specifically an absurd amount of GPU memory.
For example I have what I consider a fairly "high end" rig for being a hobbyist individual, with 32GB of RAM, i7 8700k, 1080ti - there's 0 chance their model would fit on my system.
So I mean maybe if you have a ton of money? Usually what happens is a slimmer model with not "quite" as high of a score gets released that actually fits on consumer hardware.
Maybe I'm oversimplifying, but it seems to me that once you have the model trained, it should be possible to partition it somehow when inferencing, to fit smaller machines. At least for a proof of concept it should be possible.
I'm not aware of any "partioning" strategies per se (at least during inference), but it's now common practice to distill a larger model to a smaller one by either
(a) training a smaller "student" network to replicate the larger "teacher" network, or
(b) pruning smaller weights from the larger network to reduce the size.
Just brainstorming here, but a vanilla network partition strategy might be to load each layer's weight into memory and perform the forward pass sequentially. I think that would be prohibitively slow - some of these models (e.g. BERT) can already take up to 3-4 seconds to perform a single forward pass on a CPU, and that's with all model weights already loaded into main memory. I suspect fetching/loading each layer separately would blow this out by an order of magnitude.
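For what it's worth, a toy sketch of that sequential idea, assuming the model had been saved as one checkpoint file per layer (the file layout and layer count here are hypothetical):

    # Sequential forward pass that only ever holds one layer's weights in
    # memory, trading RAM for a lot of disk I/O on every inference call.
    import torch

    def forward_layer_by_layer(x, layer_files):
        for path in layer_files:
            layer = torch.load(path)    # load just this layer's module/weights
            with torch.no_grad():
                x = layer(x)
            del layer                   # release it before loading the next one
        return x

    # hypothetical usage:
    # logits = forward_layer_by_layer(embedded_input,
    #                                 [f"layer_{i:02d}.pt" for i in range(24)])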
The problem is that there are so many weights in the model that they don't fit in memory. You can lower the number of weights, which will lower the effectiveness of the model.
The thing is that when you're going for leaderboards you're reaching for every last percentage point, so the efficiency of the model size/performance isn't a concern; you want to ramp up the resource usage as far as you have access to.
TL;DR - Yeah basically most people will run a "slimmed down" version of the model that isn't "as" performant, but is still an improvement over previous models and actually fits on your machine.
One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
One of their examples, though, didn't make any sense to me:
1. The pilot managed to land the airplane safely
2. The enemy landed several of our aircrafts
It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.