I didn't know anything about SuperGLUE before (turns out it's a benchmark for language understanding tasks), so I clicked around their site where they show different examples of the tasks.
One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
One of their examples, though, didn't make any sense to me:
1. The pilot managed to land the airplane safely
2. The enemy landed several of our aircrafts
It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.
My mother got a perfect 800 score on the GRE English test many years ago when she wanted to go back to graduate school after her children were grown up enough (high school/college age).
She told me that the way she got her perfect score was by realizing when the questions were wrong and thinking of what answer the test creators believed to be correct.
She had to outguess the test creators and answer the questions wrong -- in the "right" way.
I've had the 'pleasure' of taking some 'Microsoft certifications' at various companies I worked at in the past and this sounds extremely familiar.
"I probably won't ever do it like that and/or there's a syntax error in all four of the answers... but this is the answer you want to hear. It's wrong, mind you, but it's what you want to hear."
Reminds me of the 1 question I got "wrong" on a DOS test (years ago) at TAFE.
The question was "How do you delete all files in the current directory?". Using DOS 6.22 (I think, it's from memory).
My answer "del." was marked incorrect. Because the teacher didn't know enough about DOS to understand that's the standard shortcut for "del .". And the teacher refused to even try out the command, lets alone fix the incorrect mark. sigh
It's not always insanity, sometimes just sub-optimal / way over-engineered in my opinion.
They're getting better at it though. More recently I've done their devops certification and it looks like they're recommending somewhat more sane practices now...
There were still questions where even after three or four tries at certification / reading up on whatever Microsoft thinks is 'good' we didn't find 'the correct answer' according to Microsoft though... ¯\_(ツ)_/¯
I'm a spatial thinker, and I have a similar problem: I see all the answers as correct. E.g. "which one follows this sequence?", and I can find a pattern that fits every alternative. So I have to figure out which option the test author thinks is correct.
Back when I took the 'C# certification' (70-483 I think?) there were multiple questions in the style 'which of the following answers will make the program compile', where all four answers had a syntax error, or the program had a syntax error at a different line that would cause an issue regardless of your answer.
I tried the dispute process but it's basically impossible to dispute / report broken questions unless you have a photographic memory.
I have achieved similar results by similar means in both English and certain other subjects wherein one would assume a “true academic” would “know better” (picking out Sin[x]=2 as being “evidence of error in prior working” when x could merely be Complex, or marking “f[f[n]]=-n” as “unsolvable” when it just requires a bit of lateral thinking). This always depresses me, like when (as a Brit) I hear Americans say “I could care less” as an indicator of disregard, when actually that indicates they are somewhere above the point of minimal regard.
I think this is really interesting, because "the enemy landed several of our aircraft(s)" is the sort of sentence I'd have hauled a student up for using as a teacher, because 1) it's a none standard, arguably incorrect usage they've used either because they're a none native speaker or because they're trying to be clever and failing, and 2) because the plural of aircraft is aircraft. Nevertheless the author of this sentence almost certainly meant land to mean something different (shot down) than the author of the first, and we can infer the author's intended meaning despite the none standard usage.
This poorly written sentence is the sort of thing you see all the time in the real world, especially from none native speakers, children, and people writing about a topic outside their expertise. If a program can spot the difference in the usage of the word land between these two sentences and infer what the intended meaning in the second sentence is, then it's doing pretty well. Just inferring that land is used to mean something different in the two sentences is less impressive but still pretty cool and I'm not sure which claim is being made.
If you teach others English, please learn the difference between "none" and "non". You mean "non-standard" in all your examples here (if British) or perhaps "nonstandard" (if American).
I would have assumed the second used the term landed to mean acquired. But only after being told that its meaning is supposed to be different from the first. With no other context from those two sentences, I’d have guessed #2 meant land the same way as #1.
One other point: I’ve never heard the term “landed” to mean “grounded”, which is maybe the actual intent of #2, but maybe the ai sentence generation is off.....
The example directly below that: "Justify the margins" and "The end justifies the means" is the one I find dubious. Obviously the former could mean to format a document, but those exact words in that structure could be a demand for someone to justify a financial margin for example. It is both true and false depending on the context.
It sounds like you're talking about garden-path sentences [0], and in particular: "time flies like an arrow; fruit flies like a banana" [1]. These are sentences whose structure tricks the reader into making an incorrect parse. My favourite of these has always been: "The horse raced past the barn fell".
I've always enjoyed the multiple valid parses of "Time flies like an arrow". I can't wait for AI to generate more Escher sentences like "More people have been to Russia than I have" ( https://en.m.wikipedia.org/wiki/Comparative_illusion )
You know, I only just now got the second interpretation of that sentence. I always thought of it like "Time flies like an arrow (straight and in one direction), Fruit flies like a banana (when thrown)"
"The horse raced past the barn fell, which has been haunted since all those teenagers were murdered there."
(Noun-adjective is a rare formation, but amusingly more common in the same situations where the author uses rare and archaic definitions like the adjective "fell".)
"I eat my rice with butter." could mean that you use butter as a utensil to eat your rice with. There is often an unlikely way of parsing the sentence that gives an alternate meaning. The point is to test the computer to see if it can distinguish the likely parse from an unlikely one.
These aren't really alternate _parses_ though (in the sense that they don't give different parse trees). They do highlight the different possible meanings of "with" though.
I think "I eat my rice with chicken" vs "I eat my rice with children" vs "I eat my rice with chopsticks" is the canonical example here.
There's a whole field in NLP involved in showing what changes happen to entities mentioned in a sentence as a side effect of the sentence, and this example shows it pretty well.
I think it's clearer if you say "I usually eat X with Y", i.e. Y is either the company, the tool or the condiment that you eat with (contrasted with "I'm eating my X", where X is a dish like "rice with chicken")
Not to mention something that almost all NLP systems are resoundingly terrible at - short-term memory. If we've been talking about corporate financials for an hour and I say 'Justify the margins', it should be crystal clear what I mean. But most automated systems try to operate without a hint of memory or 'state' being tracked.
I'm guessing this is intentional. To a human, although this could be somebody being asked to justify their financial margins that's not a very likely answer. The human can easily see that, while it's possible they're the same meaning, given the lack of any other context the answer is that they're not.
The enemy could have landed several of our aircraft on one of their runways. Agassi may have beaten Becker over the head with his tennis racket. I suspect part of the test is that there can be other meanings that do technically work.
Would a native English speaker use the word "landed" in this way? In the context of aircraft? "Landed" is badly ambiguous here and several distinct meanings are plausible. Captured is the most natural word given your interpretation.
Honestly that sentence -- the use of landed and that awful plural -- approaches engrish. Is that deliberate or is the use of English here just badly flawed? I can't see any other possibilities.
There are a lot of native English speakers in the world and not all of them use the same idioms that you do. This seems like perfectly valid English to me; some other words that could be used instead of “landed” in the aircraft sentence include “bagged”, “nabbed”, “poached”, “got” and “did in”. One of the entertaining aspects of English is the multitude of ways it can be used.
Those are all good synonyms for "got" in the context of shooting at things. But none of the others already has a strong meaning in the context of aircraft, and this other meaning does create some confusion, which is why many speakers would avoid it (if thinking clearly).
I wouldn't use it that way myself, but at the same time the intended meaning is clear as day to me from the context. I'm surprised by the reactions. "Enemy" should give it away immediately.
I'm surprised too. This algorithm is about understanding language, and surely that includes understanding the intended usage. This is something humans have to do all the time. So what if there isn't a formally archived consensus on the definition of "landed" as used in the example. The intended meaning is clear, and so hats off to the algorithm for rolling with it, that is in my mind the fundamental goal of understanding language.
It's more or less impressive depending on whether the algorithm already ate a dictionary; then it's the difference between inferring from context, as people do, and simply knowing all of the known unconventional usages in a very inhuman way.
I don't know. I guess I understood the sentence with 'landed' the same as I would have if someone told me that they'd 'landed a big job'. I wouldn't really say this myself, though I do hear people say 'landed a big catch' when they're talking about fishing.
I don't think anyone would use that particular construction, unless it's some weird dialect of pilot-speak or argot among anti-aircraft folk that I'm not aware of. It's just really awkward and unnatural. Possibly correct, but not the way that anybody actually talks.
Possibly, you could say the planes were landed, as in forced to stay on the ground (because of damage, fear of enemy fire, or damage to the runway). But grounded would be better.
Or just average. There's contextual dependencies in most speech, and (as displayed in this subthread) not every speaker of a language has the same context. It's a fallacy to think that if you lack context for one of the examples, you will automatically score less than average -- other people may miss context for things obvious to you.
If taking the "captured" interpretation, I think it could be reasonably inferred that they successfully landed the aircraft at an airfield afterwards (same meaning). This was my initial read of it and it does not seem strange to me on reflection.
I would like also to point out that even if we do interpret the second as meaning "destroyed", the first could then be interpreted as a combat aviator shooting down an opposing aircraft, bringing us back to the same meaning. Or perhaps both of my interpretations are correct and the meanings are different...
What this tells me is that the benchmark is not very useful.
The benchmark is useful primarily because it puts humans and computers on a level playing field. Human readers will misinterpret written language, and human writers will poorly represent concepts.
The propensity to make mistakes in comprehension is unavoidable; humans only approach 90% accuracy, and computers are getting close to the same level of accuracy on the same base materials as humans.
The other way of testing would be to devise a test where there is only a single interpretation, where the context is clear, and there is no ambiguity in meaning. In that case a competent human and computer algorithm could be expected to answer all questions perfectly.
The purpose of this benchmark on the other hand is to test comprehension when meaning is not explicit and context clues are implied, something humans have had the advantage at over computers until quite recently. The computer won't be 100% accurate, but that's not the purpose of this test.
Aircraft typically get captured on the ground, or get forced to land by threat of being shot down. “Landed”, for me, would require the enemy to actively land the plane, just as “landing a fish” requires both the fisherman’s action and moving the fish from water to land.
I also wouldn’t use “landed” for destroying an enemy plane (neither by shooting it down nor by destroying it on the ground)
That, realistically, leaves hacking the plane’s electronics and then directing it to one’s own airfield.
Yes -- if the sentence had been "grounded the aircraft", then the meaning is obvious. But even though "land" is a synonym for "ground" I don't think there's an equivalence of meaning here. I'm struggling to find a sense in which "landing an enemy aircraft" is a meaningful concept short of jumping out of one plane to land on another one, removing the pilot, and landing the plane, which is a bit much for the single word "landed" to carry.
- The enemy stole the aircrafts, and after some drama in flight managed to land several of them.
- The enemy used remote control to force them to land.
- The enemy used coercive force to force our pilots to land them.
- The enemy captured them.
- The enemy shot them down.
- During a friendly event while we set our differences with our enemy aside and agreed to fly each other's aircraft at an airshow for some reason, we landed several of theirs, and they landed several of ours.
- There was a hearing mistake and "energy" (as in energy beam beamed by a UFO) was accidentally transcribed as "enemy."
- The writer is just screwing with us.
- The writer is not a native speaker of English, and they made a mistake and actually meant that the enemy boarded several of our (parked) aircrafts.
- The writer is creative with language and believes that it would be cute to say that when an enemy projectile struck one of our aircrafts, then the enemy has "landed" that aircraft as one would land men on the moon or land rovers (no pun intended) on Mars.
- An ML algorithm from the future traveled back in time, writing specific SuperGLUE examples to poison AI research, thereby preventing the emergence of a competitive AI which would also master the secrets of closed timelike curves
Actually the algo was able to determine that we exist in a simulation and performed metaprogramming by hacking the sim infrastructure (higher-order dimensions of spacetime) and rewriting the future, which to us appears as if it traveled to the past.
Ahh, I just found where that example is taken from: https://glosbe.com/en/en/land. If you find on that page you'll see the exact sentence "the enemy landed several of our aircraft" (without the s after aircraft) which it says means "shoot down".
I have still never heard landed used in that way, and again in other dictionaries I searched I couldn't find that definition either. Thus, this is a case where the "AI" may get it "right", and I, the human, would get it "wrong", but that still feels like it's missing a huge point. It feels like you could get a number of errors by the human which the AI gets "right", but in fact the human is better able to detect what is rare, uncommon or at least ambiguous.
I've worked in aviation for 8 years and also didn't understand this use of "landed". I've heard "grounded" used like this: "The maintenance issues grounded the jet," but not "landed".
Working in aviation probably puts you in a mindset that makes it harder to parse. It's not being used in a way that is related to flight or aircraft.
It's like if people were discussing where to have a conference, and one of them proposed a hotel. Then another person suggested a resort. Then a third person floated a cruise ship. Cruise ships do float, but it has nothing to do with anything. They are floating the idea of the ship as a venue.
Do you normally "float" a cruise ship though? A more apt analogy might be "dock". Maybe a news report says that a vacation company has broken some regulation so the government docked a cruise ship, meaning they took away a cruise ship like you would dock someone points. It's ambiguous at best.
You could float the idea of it, and you might also think that to float a ship means the process by which it is landed in the water when coming out of a dock?
I think the sentence is referring to aircraft that have been forced to land by the enemy, in contrast to "grounded" aircraft that had not taken flight.
I haven't worked in aviation so my understanding of terminology could be wrong, but either way it is definitely an unusual example.
"The enemy landed 4 of our aircraft" without context wouldn't generally mean "forced to land" imo (as a native speaker). It would mean that they either destroyed them or managed to acquire them.
For example I might say that "they landed 4 aircraft with their daring" if they forced us to abandon an aircraft carrier (e.g. by sinking it) and then managed to steal 4 of the planes (before it sank). Or I might say "they landed 4 aircraft with that bomb" if they dropped a bomb on an airfield and it destroyed 4 aircraft.
Right, I think you understand the word as I do: 'verb' + ed. "The enemy landed the jet" as in they forced the jet to land either directly or indirectly. This would mean that the two sentences use "landed" the same way. But my understanding is SuperGLUE's official answer is that these use "landed" differently, with the rationale that "landed" is idiomatic and just means to procure or bring about (e.g. "I landed the job") and it happens to be used with planes.
I think if we really looked at it, it likely comes from fishing where "to land" a fish means to succeed in quite literally getting it onto land from the water. But we use it as "to successfully get" (something typically uncertain) in many other contexts.
I agree, AI should realistically be able to detect the rare/uncommon/ambiguous usage as well, and rated for that.
I suppose in some cases it could score better than humans on the SuperGLUE benchmark... but eventually it will have to come back down to near the human score as it gets more accurate.
Why? In many of those benchmarks the average human score is not 100, but the AI progression doesn't really have a ceiling or a slow down at the human number. It should go through it and settle somewhere above. Plus we create these tests with our own limitations. There may be a world of more complexity or subtlety that we all fail to grasp but the AI will.
I think humans are already behind at the face recognition task for example.
>If you find on that page you'll see the exact sentence "the enemy landed several of our aircraft" (without the s after aircraft) which it says means "shoot down".
They're not shy about illustrating a military application up front!
I've never seen "landed" used as in the second sentence, but I was definitely able to understand from context that it was not being used to mean the same thing as in the first sentence.
I haven't, though I'm familiar with that use of "landed" for fish.
As a lifelong native speaker (PNW English), I've also never heard "landed" used to refer to shooting down or capturing enemy airplanes. I could understand it from context, which is what I suppose the software is also going for, but I'd mark it with a red pen if someone showed me that sentence, just for clarity's sake (i.e. understandable from context but should be replaced).
'Landing' an aircraft does not imply shooting it down. 'Downing' an aircraft does imply that.
These uses of 'land' and 'down' are military euphemisms for the use of force to compel a reluctant pilot to land. The difference is the degree of violence used.
Involuntary 'landing' implies the aircraft is forced to land by a party other than the pilot because if the pilot did not comply the plane would be shot down or collide or crash. It usually implies survival of the pilot. 'Downing' also means involuntary removal of the aircraft from the sky, but does not denote that a violent landing did occur, only that the likelihood of violence is much greater because a (more abrupt) landing was forced upon the pilot. From what I've read, 'downing' usually implies the plane crashed.
I think the difference in these sentences is about the way to land. In sentence 1, the pilot of the aircraft is in control. In sentence 2, the pilots are not in control, the enemy forced them to land (whatever the means).
If I read these two sentences in context of some news, they would evoke very different "landing" scenes in my head.
In looking through many of the replies to this downstream, it appears that the system is actually correct in that there's an obscure use of 'land' at play in the second sentence.
It makes me think that there's going to be many adversarial examples of text that humans parse one way because of common usage while machines parse another way because of details like this.
For #2, my immediate read was that the planes had been shot down. If the context were to suggest that the enemy had somehow hijacked the planes, then of course the word land would mean the same in both sentences.
I have never used or heard 'land a plane' in this context, but the sentence didn't immediately strike me as unnatural, incorrect or unclear.
> I have never used or heard 'land a plane' in this context, but the sentence didn't immediately strike me as unnatural, incorrect or unclear.
It struck me as pretty awkward and very ambiguous. It probably means 'obtained' but 'captured' would be a far better word in that case. The suggestions that it means 'hit/shot' don't work because in that case it's not the aircraft that is landed but the shot, which is landed on the aircraft.
Also the use of the incorrect plural "aircrafts" when 'aircraft' is both singular and plural makes me think it's just a poor question.
The very fact that there's so much discussion about it is evidence that it's not straightforward even among native English speaking humans.
Seems like a really odd way of saying it but that’s what I’d think too, as in “landed their shots”.
This is either a poor question, or a really great question, if the goal of the test is to confuse computers where a human would normally say “huh, weird way of saying that but I guess they mean...”.
From the abstract of the associated paper: "performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research."
It occurred to me that hn_throwaway_99's question, and the responses to it, are the sort of dialog in which one could find additional headroom for further research into natural language understanding. We can understand, for example, that while the two uses of 'landed' are different, they are not completely unrelated, and we can explain how they are related, for example by introducing a third construct, 'landed a fish', as a couple of replies have done.
I'd argue that greater-than-human language ability is by definition useless.
Language is specifically a human communication tool; there's no value in surpassing the language skill that humans have, if indeed such a thing is even meaningful (what does it mean to be better than the best French person at French?)
I disagree, greater-than-human-average is not useless. There's a lot of room for misinterpretation in human language. We compensate for that by non-verbal communication (posture, expression) or by asking for clarification. On top of that, most places have local expressions or idioms that are not necessarily globally recognized.
So there are two ways in which a language automaton must be better than a human: it cannot rely on non-verbal hints nor can it easily ask for clarification, and it must be able to interpret many different dialects and idioms correctly -- many more than an average human would need to.
I do not think this result is that close to a greater-than-human language ability in general, and I do not think they are claiming it. I think the point is that, with scores on this test closely approaching average human scores, there is not much headroom for this particular test to drive, or measure, further progress.
So, here is the thing. ML shouldn't just be about learning rules. It should be about actually learning, and understanding.
Just because you've never heard the word used that way, you were able to infer it meant something different. Even with the use of aircrafts.
We all make mistakes when writing or speaking. We don't let that get in the way of interpreting the information being passed. Even if we post comments that contain errors.
Yes, the second should be, "The enemy downed several of our aircraft." Landed can be used to mean "bagged," as in, "We finally landed the Smith account," (it's a fishing term), but it should not be used in this figurative sense when referring to aircraft, because of the obvious confusion with the common, concrete sense of the word. And, yes, it should be aircraft.
It depends on what your goal is. But in most cases, I'd say no. If the goal has anything to do with understanding real language written by real humans, it's better for the system to be able to handle texts with errors.
True, but having some noise in the label is actually good for generalization. If it's only learned on perfectly correct sentences then its tolerance for mistakes will be very low.
It's weird, because I understood the second one as meaning shoot down, yet to me that's the same definition of landed. You just assume the enemy didn't land them gracefully without a scratch, because they are, well, enemies.
So I would have answered that the word meant the same thing.
> One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
Can anyone explain what makes this difficult for a machine? What existing knowledge does the machine start with? At a glance, it doesn't feel like it should be difficult if the machine had a large corpus to train on that showed many examples of each word in different contexts.
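For example, the naive approach I have in mind would be to compare contextual embeddings of the shared word in the two sentences. A rough sketch, assuming the transformers and torch packages are available (the model choice, mean pooling, and similarity threshold are all arbitrary here, and the real WiC data supplies the target word's position instead of making you search for it):

    # Rough sketch: call it the "same sense" if the contextual vectors of the
    # shared word are similar enough. Threshold and pooling are arbitrary.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def word_vector(sentence, word_form):
        """Mean-pool the hidden states of the subword tokens covering word_form."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]              # (seq_len, dim)
        target = tokenizer(word_form, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(target) + 1):                 # crude position search
            if ids[i:i + len(target)] == target:
                return hidden[i:i + len(target)].mean(dim=0)
        raise ValueError(f"{word_form!r} not found in {sentence!r}")

    v1 = word_vector("The pilot managed to land the airplane safely", "land")
    v2 = word_vector("The enemy landed several of our aircraft", "landed")
    same_sense = torch.cosine_similarity(v1, v2, dim=0).item() > 0.7

Presumably the hard part is that the boundary between "same sense" and "different sense" is nothing like a fixed similarity threshold, which is why the benchmark exists.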
1) The pilot [voluntarily] brought down his aircraft.
2) The pilots [involuntarily] brought down their aircraft [because some authority figure(s) forced them down.]
The active verb 'land' can be performed by different actors: pilot vs a more powerful agent (usually who flies an armed aircraft). The voluntary/involuntary agency is a subtle difference that only those familiar with this military practice are likely to grok.
Possible, but also the worst context to use land in. You can land a car, but if the game show host said you landed a small airplane, there’d be a laugh from the crowd.
The examples look like they weren't written by a native English speaker. It's funny reading English tests from non-English-speaking countries, because a lot of them focus on pedantic points that are long lost while following conventions that to us just feel _different_.
Yeah I don’t get the s at the end of aircraft. Landed, it would seem, would be land as in acquire, although that’s a bit odd of a construction. It seems rather forced. It may possibly mean that the aircraft were forced to land by the enemy. So it’s a tortured construction.
Still ambiguous. Landed as in make it contact the ground or landed as in obtain, like in landing a job?
For me taking an airborne object and making it touch the ground is pretty much the same meaning whether it's from the inside or remotely or shooting it down.
Well "he landed the deal" implies a score or a hit. So to say they "landed" the planes could vaguely make sense but it is hardly good English. They might have been thinking of "grounded"?
'Grounded' means the plane could not take off. It was on the ground and must remain there.
Landing a deal (or a fish) is like landing a plane. A human acts to cause a desired outcome. Unlike forcing a pilot to involuntarily land a plane, the perspective of the fish as involuntarily being forced to land is not a necessary inference for this use of 'land'.
I think people are digging too deep for an answer here... it seems to me to be a simple mistake, which on the scale at which they're evaluating those models is not statistically significant.
It's being used by analogy with "landing a fish". I've never heard it either, but I could believe it's in the argot of military airmen in some English-speaking country.
It's conceptually the same - having an entity go from water or air to the ground. The hard part would be to associate the fact that there's no way for an 'enemy' to land the aircraft other than to do so forcibly, which implies shooting it down.
The second implies that the aircrafts were shot down; the first states that the aircraft landed safely. It looks like this reduces to the machine being able to figure out whether or not something is good or bad for the speaker.
Good point, and most of the replies ignore the key point to me, which is: you are right about the plural of aircraft, and the benchmark is horribly wrong, so why should we take any notice of this benchmark?
One thing to always point out in these cases is that the human baseline isn't "how well people do at this task," like it's often hyped to be. It's "how well does a person quickly and repetitively doing this do, on average." The 'quickly and repetitively' part is important because we all make more boneheaded errors in this scenario. The 'on average' part is important because the errors the algo makes aren't just fewer than people, they're different. The algos often still get certain things wrong that humans almost never would.
This is really really super great, let's be clear. It's just not up to the hype "omg super human" usually gets.
It seems to mean "How well does Mechanical Turk do the task?" which is a separate thing again. And yes - error type is at least as revealing as error frequency.
I have no idea where the real human baseline is, or how to find it.
Also, consider this discussion. GLUE winners may be able to make informed parsing guesses about single text blocks, but they're years away from being able to make a useful contribution to a discussion like this one.
Regarding the type of errors, it seems like the benchmark should be able to take that into account. That is, get a load of humans to do the task on the same specific examples, then for each example you know how hard it is, and what acceptable answers are (I bet a lot of the ground truth is wrong or ambiguous).
Then you can benchmark your AI but penalise it more heavily for getting things wrong that are obvious to a human.
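Concretely, the weighting could use per-example human accuracy collected from those annotators, something like this toy sketch (the numbers are made up):

    # Toy difficulty-weighted score: missing an example most humans get right
    # costs more than missing a genuinely ambiguous one.
    def weighted_score(model_correct, human_accuracy):
        """model_correct: list of bools; human_accuracy: fraction of annotators correct per example."""
        total = sum(human_accuracy)
        earned = sum(h for ok, h in zip(model_correct, human_accuracy) if ok)
        return earned / total

    print(weighted_score([True, False], [0.95, 0.40]))   # ~0.70: missed only the hard one
    print(weighted_score([False, True], [0.95, 0.40]))   # ~0.30: missed the "obvious" one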
That would be ideal, if money weren't a factor. Since money is a factor, I wonder what the tradeoff is between labelling each instance N more times versus just getting N times more instances labeled.
There was an article[1] posted to HN recently about these benchmarks, and it was pretty skeptical.
Regarding SuperGLUE specifically, it asked:
"Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that's specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does just it mean that science has gotten better at teaching machines to the test?"
This feels hollow. Can't this be said about any benchmark? It seems natural and proper that as one benchmark becomes saturated, we introduce harder benchmarks.
I don't think anyone in the field thinks that once we match human performance on benchmark X, we're officially done. It just means it's time for more interesting benchmarks.
Over time, if it starts to become difficult to design benchmarks that humans can outperform machines on, then that will prompt interesting conceptual work about what exactly the difference between human and machine language competency is. And then that will lead either to more sophisticated benchmarks or alternatively gradually more sophisticated and persuasive arguments that machines really have surpassed us in language competence.
I don't think we're yet at a point where we don't know how to make harder benchmarks, and if and when we do hit such a point, I'd definitely bet the result will be a conceptual advance in benchmark design rather than declaring machine superiority once and for all. At least for the first few rounds of this cycle.
"But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing."
This is true, and absolutely a weakness of these tests.
However they don't publish how well a human performs on the dataset without "not" in it.
They do initially note that "even human beings don’t do particularly well on this task without practice".
I've looked at the warrant task. It's pretty tricky! I'd bet real money that untrained humans perform much, much lower than the 80% correct rate they get on the full set on ones without "not". I don't think it would be as low as the 53% BERT gets, but it would drop significantly.
I find the HANS analysis[1] much more compelling, but again I'd note that humans suffer on this dataset too (although again - not as badly as models do).
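For reference, the "not" cue from the quoted passage amounts to a baseline like this (the data layout below is my guess at the warrant-pair format, not the actual dataset files):

    # How far does the single rule "pick the warrant containing 'not'" get you?
    # examples: list of (warrant_a, warrant_b, correct_index) triples.
    def cue_accuracy(examples):
        hits = 0
        for warrant_a, warrant_b, correct in examples:
            guess = 0 if "not" in warrant_a.split() else 1
            hits += int(guess == correct)
        return hits / len(examples)

If that comes out near 0.61 rather than 0.50, the labels leak through the cue and a model can look competent without doing any real reasoning.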
Yes. This is more generally known as Goodhart's Law[0]: when a metric is used as a goal, then people will game the metric in order to win, making the metric useless.
There is no fundamental way to overcome this problem, except by not using metrics as goals.
Even when you will be able to have a 100% coherent and deep discussion with an AI over a niche technical domain, there will be people to pretend that the AI "fakes" it.
Systems like GPT-2, incredibly (I used to be a skeptic of a pure statistical approach) manage to extract meaning, keep a theme, and understand the intent behind a sentence. They are amazing.
When you have a system that displays all the characteristics of understanding something, it is irrelevant whether or not it "fakes" it. No one ever proved that humans are not "faking" intelligence either.
As long as they're not training on the test data, and they're not submitting hundreds of submissions tweaking parameters trying to improve their score, I don't see what the problem is. If the algorithm can do a great job at classifying hundreds of new test cases it has never seen, and it isn't over-fitted, then that means it is good at that specific task. Of course the task itself may or may not be useful, and you can have some meta discussion about what "understanding language" is, but the computer definitely is doing a super human job at that given task.
(I work in this field, although not specifically on benchmarking)
I think that this article makes a good point, and correctly identifies weaknesses.
However, I also think that humans often take very similar shortcuts. There are good reasons why "bag of words" approaches work much of the time. Additionally there's lots of evidence showing that very rapid reading by humans does not imply deep understanding.
I think it's very important that people are aware of the weaknesses of these types of models. However, I think it's interesting that these weaknesses are becoming harder and harder to find.
The machines are always trained with the same dataset for each task. The biggest difference right now is small technical modifications on models that are also pre-trained on gigantic unlabelled datasets. This doesn't feel like we're teaching them to do the test specifically at all.
AX-b "is the broad-coverage diagnostic task, scored using
Matthews’ correlation (MCC). "
This is how the paper describes this test
"
Analyzing Linguistic and World Knowledge in Models GLUE includes an expert-constructed,
diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and
world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that
indicate the phenomena that characterize the relationship between the two sentences. Submissions
to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI
classifier on the diagnostic dataset, and analyses of the results were shown alongside the main
leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain
it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction
and neutral into a single not_entailment label, and request that submissions include predictions
on the resulting set from the model used for the RTE task. We collect non-expert annotations to
estimate human performance, following the same procedure we use for the main benchmark tasks
(Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the
two-class variant of the R3 metric used in GLUE) of 0.77.
"
If you look at the scores, humans are estimated to score 0.77. Google T5 scores -0.4 on the test.
How did T5 get such a high score if it scored so abysmally on the AX-b test?
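For reference, MCC for a binary task is computed from the confusion-matrix counts, roughly like this (a minimal sketch):

    # Matthews correlation coefficient from a binary confusion matrix.
    from math import sqrt

    def mcc(tp, fp, tn, fn):
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

It ranges from -1 to +1 with 0 at chance level, so -0.4 on AX-b would be worse than random guessing, while the human estimate of 0.77 is far above chance.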
The AX scores are not included in the total score.
From the paper: "The Avg column is the overall benchmarkscore on non-AX∗ tasks."
If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates.
Hi, one of the paper's authors here. We didn't submit our model's predictions for the AX-b task yet, we just copied over the predictions from the example submission. We will submit predictions for AX-b in the next few days.
RcouF1uZ4gsC makes a compelling case for the results on this test to potentially be a significant caveat to the results, and also to the claims of achieving a near-human level of performance. If so, then why would you make such claims before you have these results? Or at least mention this caveat at the points where you are making the claim, such as in the abstract.
To be clear, here is the claim we make in the paper (we did not write the title of this post to HN):
> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was designed to comprise of tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.
I'm not sure why the SuperGLUE/GLUE benchmark was designed to omit the AX-* scores from the benchmark score. It may be that they have no corresponding training set.
My mistake - I had overlooked the AX-* scores being expressly omitted from these benchmarks. Maybe it is possible, then, that they could provide the additional headroom for further research?
Regardless of the status of the AX-* tests, I am very impressed by your results on the SuperGLUE benchmark.
Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?
For example their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume is also scraped from the web.
Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:
- Common Crawl overall is a sparse web dump, it is unlikely that the month we used includes any of the data that are in any of the test sets.
- In order for the data to be useful to our model, it would have to be in the correct preprocessed format. ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format so I don't believe this would ever happen.
- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.
- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once ever over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
However note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data) but I would be happy to be proven otherwise.
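To make the preprocessing point above concrete, here is a toy version of that text-to-text conversion (illustrative only; the exact prefix string and field order come from our preprocessing code, which this does not reproduce verbatim):

    # Toy text-to-text conversion for an MNLI example. The model only learns to
    # map `source` strings to `target` strings, so a raw benchmark page sitting
    # in a web crawl does not look like useful supervision for these tasks.
    def mnli_to_text_to_text(premise, hypothesis, label):
        source = f"mnli: hypothesis: {hypothesis} premise: {premise}"
        target = label                  # "entailment", "neutral", or "contradiction"
        return source, target

    src, tgt = mnli_to_text_to_text(
        premise="The pilot managed to land the airplane safely.",
        hypothesis="The airplane landed.",
        label="entailment",
    )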
I think that this is a good question that I would also like to know the answer to. Additionally, are there other benchmarks or tests where this issue (possibly) presents itself?
This surprised me a bit, on the creation of the corpus they use for training:
"We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”."
I don't understand this decision. This list contains words that can be used in a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc.
I can understand that they want to avoid websites full of expletives and with no useful content, but outright excluding any website with even one occurrence of such words sounds too radical. If we ask this model a text comprehension question about a legitimized bastard that inherited the throne, or about fecal transplants, I suppose it would easily fail. Strange way of limiting such a powerful model.
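To make the objection concrete, the filter amounts to something like this (the matching details and the tiny word list here are my own illustration, not the actual C4 pipeline code):

    # Blanket filter: drop the whole page if any token is on the blocklist.
    BAD_WORDS = {"anus", "bastard", "erotic", "eunuch", "fecal"}   # tiny illustrative subset

    def keep_page(text):
        tokens = {t.strip(".,;:!?'\"()").lower() for t in text.split()}
        return tokens.isdisjoint(BAD_WORDS)

    keep_page("A legitimized bastard inherited the throne.")        # False: page dropped
    keep_page("Fecal transplants are an active research area.")     # False: page dropped
    keep_page("The pilot managed to land the airplane safely.")     # True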
They say they removed pages, not websites. Having false positives isn't a problem when you're still left with 750GB of data—quality matters more than slightly higher quantity at that point.
Sorry, I was thinking about pages even though I said websites. Native language interference (typically, we use the same term for pages and websites in my language).
Anyway, my point is not a matter of quantity. The way they're doing it, they have 750 GB of data, but they have exactly zero data that talks about bastards, fecal transplants, etc. So they may have a hard time answering questions about those specific subjects.
As someone working in the field, I congratulate the excellent accomplishment but agree with the authors that we shouldn't get too excited yet (their quote below after the four reasons). Here are some reasons:
1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here: http://www.ericswallace.com/triggers
2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.
3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful but human-level understanding is more than correlations.
4) The performance on the datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).
"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."
I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.
---
We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:
* Test for conceptual understanding in a non-multiple-choice format. Example: Write a summary for a New Yorker article, rather than standard news pieces (which tend to follow repeated patterns).
* Commonsense test with longer chains of inference than those needed for solving Winograd Schema and set in non-standard situations (e.g. fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.
* Understanding novel, creative metaphors like those used in some essays by professional writers or some of the Economist's title articles.
I think that the point about the majority of tests being multiple-choice is the most important one to underline.
Structuring a problem as a multiple choice task is basically turning it into a classification problem, but it doesn't really answer the question everyone wants answered: is it really possible to reduce the problem of language understanding to classification? i.e. is it really possible to understand human language with no other ability than the ability to identify the classes of objects?
But that is a question that has to be answered before any performance on benchmarks that reduce language understanding to classification can be appraised correctly. If accurate classification is not sufficient for language understanding, then beating benchmarks like SuperGLUE tells us nothing new (we already know we have good classifiers).
The problem here is that we have no good measures of language understanding, of humans or machines- because we have a poor, er, understanding of our own language ability. Until we know more about what it means to understand language it won't be possible to evaluate automated language understanding systems very well.
Hopefully though, the skepticism I've observed around results like the one above, will lead to a renewed effort to research our language ability, and perhaps our intelligence in general.
> 2) T5 was trained with ~750GB of texts or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20.
...but, humans evolved the ability to use language over hundreds of generations... So... Maybe that's not such a bad thing?
Indeed this is important to realize: Training such a generic model from scratch does not only reiterate learning, but the entire evolutionary process that led to the emergence of neural circuits actually capable of such learning. That perspective makes many of the current achievements -- error-prone as they might be -- even more impressive!
> 1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here
Humans are susceptible to adversarial triggers too, so this doesn't necessarily make the model less impressive. It is a big problem in practical use though.
I don't think universal triggers exist, since at that point they are just language features. But there are plenty of less universal triggers
Let's imagine that in the brain everything goes through a series of models: first tokenization into words, then we build something like an abstract syntax tree, then we analyse meaning in the context, etc.; and each time one of these steps reaches a nonsensical result we start over with additional parsing time allocated. It's probably not true, but close enough to be a useful model.
Now what you consider an adversarial example depends on how far down the stack it has to go until it's caught:
- "The old man the boat." fails in the early parsing steps. We reliably miscategorize old as adjective when it's a noun.
- "More people have been to Russia than I have, said Escher" goes a step further, it parses just fine but makes no sense. The tricky thing is that you might initially not notice that it makes no sense. This is about the level where AI is today.
- "Time flies like an arrow; fruit flies like a banana" makes perfect sense, but you could notice that the straight forward way to parse it leads to a non-sequitur and parsing it as "time-flies love eating arrows; fruit-flies love eating bananas" is probably a better way to parse it.
Of course that's just the parsing steps. You can trick human "sentiment analysis" by swapping words without changing the meaning. Compare "this bag is made from fake leather" to "this bag is made from vegan leather". PR and marketing have made a science out of how to make bad things sound good. Similarly PR is great at finding adversarial examples for reading comprehension, where they say one thing that's nearly universally understood to mean something different (or to mean nothing at all; or where something that seems to mean nothing at all actually means something very siginicant).
Of course we assume all text to be targeted to humans; so if something is widely misunderstood by humans we blame the sender for writing such a bad message; when it's widely misunderstood by AI we blame the AI for being so bad at reading.
"The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems."
"We take into account the lessons learnt from original GLUE benchmark and present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard."
Assuming that the baseline human score was set according to the performance of adult humans, then according to these results T5 has a language understanding ability at least as accurate as a human child.
In fact it's not just T5 that should be able to understand language as well as a human child, but also BERT++, BERT-mtl and RoBERTa, each of which has a score of 70 or more. There really shouldn't be anything else on the planet that has 70% of human language understanding, other than humans.
So if the benchmarks mean what they think they mean, there are currently fully-fledged strongly artificially intelligent systems. That must mean that, in a very short time we should see strong evidence of having created human-like intelligence.
Because make no mistake: language understanding is not like image recognition, say, or speech processing. Understanding anything is an AI-complete task, to use a colloquial term.
Let's wait and see then. It shouldn't take more than five or six years to figure out what all this means.
To clarify, I meant this comment as an expression of skepticism- I don't believe that the SuperGLUE benchmark really evaluates language understanding, or that BERT and friends are within a few percents of human language understanding. I think SuperGLUE is just another benchmark that is measuring something else than what it's supposed to be measuring (machine learning benchmarks usually do).
It seems that the teams behind the attempts to beat such benchmarks are aware of the weaknesses of the benchmarks though, so that's encouraging.
I attended one of Sam Bowman's talks (1).
His talk was about "Task-Independent Language Understanding" and he also talked about GLUE and SuperGLUE; he mentioned that some models are passing the average person in experiments. They did some experiments to understand BERT's performance (2) (similar to the article 'NLP's Clever Hans Moment'), but they found a different answer to the question of "what BERT really knows," so he was skeptical about all conclusions. Check these out if you are interested.
The AIs in the benchmark are all trained exclusively on text, correct?
My assumption has always been that to get human-level understanding, the AI systems need to be trained on things like visual data in addition to text. This is because there is a fair amount of information that is not encoded at all in text, or at least is not described in enough detail.
I mean, humans can't learn to understand language properly without using their other senses. You need something visual or auditory to associate with the words, which are really supposed to represent full systems that are complex and detailed.
I think it would be much more obvious if there were questions that involved things like spatial reasoning, or combining image recognition with that and comprehension.
Mmm. The philosophical position that it's essential to be embodied in order to have intelligence seems intuitively reasonable but is very much unproven. You will find philosophers and cognitive scientists who are sure you're right, but they don't have much hard evidence, and you will also find people like me who are pretty sure you're wrong but likewise have no hard evidence.
In the specific remember that deaf-blind people exist, so if you're sure that you "need something visual or auditory" then those people are not, according to your beliefs, able to understand language. I think they'll disagree with you quite strongly.
> remember that deaf-blind people exist [... ...] able to understand language
I got curious if/how deafblind people learn to communicate in the first place, if they are completely deafblind from birth. If humans can learn not just communication but language without either vision or hearing, that seems to suggest either extreme adaptability or language learning being quite decoupled from vision and hearing. From an evolutionary standpoint, I imagine that both deafness and blindness are probably uncommon enough that language learning could have explicit dependencies on both hearing and vision.
I found an old-looking video about communication with deafblind people. At the linked timestamp is a woman who is deafblind since age 2.
I think maybe CLEVR[0] dataset is what you are talking about?
Keep in mind that most of the current ML systems have diverged from biology. A majority of the recent breakthroughs come from mathematics; the rationale is that just because the human brain does it in a certain way does not necessarily mean it is the only way to do it.
It's not just grounding the language in vision, but the embodiment, first person perspective and ability to interact with the environment. Humans have had the benefit of slowly evolving in a complex environment which is too expensive to recreate for artificial agents. We can only create very limited sims vs the real world.
"Attention is all you need", indeed. Of course, our instinct tells us there is more to language inference than word proximity. And so results approaching or exceeding expert-level human baseline raise more questions than providing cause for popping champagne corks.
Question Answering is also advancing rapidly with insights from transformers and denoising auto-encoders, but is still far from the human baseline. The ease with which these models can answer a sample question such as "Who was the first human in space?" demonstrates both their efficacy and their limitations. In the large corpus of text they are pre-trained on, almost every document that contains the name "Yuri Gagarin" will in its near vicinity describe him in relation to the pioneering accomplishment for which he became a cultural icon.
And for even more generalizable scenarios, such as "What might you find on a Mayan monument?", it becomes imperative that an agent explain its reasoning in natural language as well, to enable self-correcting backpropagation of error correction.
Language may be considered low-dimensional relatively speaking. And sentence prediction across quotidian tasks manageable in current state-of-the-art architectures. But looking at how difficult it is to predict the next N frames of video given a short input example demonstrates the intractability of the problem in higher dimensional spaces.
Neural Models for Speech and Language: Successes, Challenges, and the Relationship to Computational Models of the Brain - Michael Collins
They came up with the SuperGLUE benchmark because they found that the GLUE benchmark was flawed and too easy to game. There were correlations in the dataset that made it possible to get questions right without real understanding, and so the results didn't generalize.
Could the same thing happen again with the better benchmark due to more subtle correlations? These things are tough to judge, so I'd say wait and see if it turns out to be a real result.
My experience with image classification benchmarks was that they approached human levels only because the scoring only counts how much they get “right” and doesn’t penalize completely whack answers as much as they should (like getting full credit for being pretty sure a picture of a dog was either a dog or an alligator). I suspect there’s something similar going on in these language benchmarks.
Use of Natural Language Understanding term in context of this benchmark is preposterous. No understanding takes place there. Please stick to NLP (Natural Language Processing) term for the next couple of decades. Thank you.
This clearly demonstrates once again that Google is miles ahead of the competition in AI. I mean, they just have the best data.
If you want an everyday example of Google's AI skills: switch your phone's keyboard to GBoard, especially all you iOS users, and you will see a night and day difference compared to any other keyboard, especially the stock one. When using multiple languages at the same time, the gap to other keyboards gets even bigger.
GBoard is my phone's killer app, and if Google dropped it for iOS I'd leave for Android the same day.
That's how I used to feel, but it's turning into a nuisance.
It used to stick to single words or sometimes splitting one if missing a space, but now will sometimes attempt to "correct" the sum of two perfectly valid standalone words after the fact, 97% of the time resulting in nonsense.
I have the opposite experience. Yes, some of the suggestions from GBoard are useful, but I feel there's an equal number of times where I've typed a complete word, only to hit space and have the word auto-corrected to what GBoard was expecting. As a typing aid, it's almost unusable because of that.
Several of the systems in this leaderboard utilize the BERT model, a clever approach devised by Google for natural language processing. A nice layman's guide to BERT:
My understanding is that a lot of these really high performance models that reach for every percentage-point possible require an absurd amount of hardware - specifically an absurd amount of GPU memory.
For example I have what I consider a fairly "high end" rig for being a hobbyist individual, with 32GB of RAM, i7 8700k, 1080ti - there's 0 chance their model would fit on my system.
So I mean maybe if you have a ton of money? Usually what happens is a slimmer model with not "quite" as high of a score gets released that actually fits on consumer hardware.
Maybe I'm oversimplifying, but it seems to me that once you have the model trained, it should be possible to partition it somehow when inferencing, to fit smaller machines. At least for a proof of concept it should be possible.
I'm not aware of any "partioning" strategies per se (at least during inference), but it's now common practice to distill a larger model to a smaller one by either
(a) training a smaller "student" network to replicate the larger "teacher" network, or
(b) pruning smaller weights from the larger network to reduce the size.
Just brainstorming here, but a vanilla network partition strategy might be to load each layer's weight into memory and perform the forward pass sequentially. I think that would be prohibitively slow - some of these models (e.g. BERT) can already take up to 3-4 seconds to perform a single forward pass on a CPU, and that's with all model weights already loaded into main memory. I suspect fetching/loading each layer separately would blow this out by an order of magnitude.
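For what it's worth, a toy sketch of that sequential idea, assuming the model had been saved as one checkpoint file per layer (the file layout and layer count here are hypothetical):

    # Sequential forward pass that only ever holds one layer's weights in
    # memory, trading RAM for a lot of disk I/O on every inference call.
    import torch

    def forward_layer_by_layer(x, layer_files):
        for path in layer_files:
            layer = torch.load(path)    # load just this layer's module/weights
            with torch.no_grad():
                x = layer(x)
            del layer                   # release it before loading the next one
        return x

    # hypothetical usage:
    # logits = forward_layer_by_layer(embedded_input,
    #                                 [f"layer_{i:02d}.pt" for i in range(24)])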
The problem is that there are so many weights in the model that they don't fit in memory. You can lower the number of weights, which will lower the effectiveness of the model.
The thing is that when you're going for leaderboards you're reaching for every last percentage point, so the efficiency of the model size/performance isn't a concern; you want to ramp up the resource usage as far as you have access to.
TL;DR - Yeah basically most people will run a "slimmed down" version of the model that isn't "as" performant, but is still an improvement over previous models and actually fits on your machine.
One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
One of their examples, though, didn't make any sense to me:
1. The pilot managed to land the airplane safely
2. The enemy landed several of our aircrafts
It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.