More States Opting to 'Robo-Grade' Student Essays by Computer (npr.org)
48 points by happy-go-lucky on July 2, 2018 | 47 comments



> Developers of so-called “robo-graders” say they understand why many students and teachers would be skeptical of the idea. But they insist, with computers already doing jobs as complicated and as fraught as driving cars, detecting cancer, and carrying on conversations, they can certainly handle grading students’ essays.

> One year, she says, a student who wrote a whole page of the letter “b” ended up with a good score. Other students have figured out that they could do well writing one really good paragraph and just copying that four times to make a five-paragraph essay that scores well. Others have pulled one over on the computer by padding their essays with long quotes from the text they’re supposed to analyze, or from the question they’re supposed to answer.

The science is just not nearly there yet. We can barely determine the sentiment of IMDB reviews. How are we supposed to grade an entire essay?

As someone working in NLP, it really bugs me when people overstate what can be done like this. It creates massive expectations that are bound to be disappointed, and then the entire field suffers.


So much this. Even if you can create a decently accurate ML system to score student essays, it absolutely doesn't handle the adversarial nature of scoring, and it's frankly insulting given how much students are paying for tuition these days. The problem is AI-complete.

It's a good tool for feedback, and it's acceptable for a MOOC or low-cost course as long as there are adequate disclaimers and the student's final score is determined by a properly human-graded assignment or exam.


I used to tutor kids in AP US History and World History when I was a senior and freshman/sophomore in college. The key success factors were getting good scores on two essays and a document-based longer essay where you would use primary source material provided in the test booklet.

These tests were all about test strategy. The written material was graded (in the 90s) by tables of people basically working through a checklist for a few different things. If you had insight into how they graded (our teacher had been a grader and had our class start grading each other's work when I took the class), you could get a great score on the essay questions even in an area where your knowledge was weak. Another trick was to know 3-4 general concepts from a well-regarded historian to support whatever conclusion you cooked up.

I think you could score these with NLP, with a reduced number of humans to fine-tune it. (In the 90s, I think up to 4-5 people would touch a paper.) The exam would get dumbed down over time as a result, as the test developers narrowed its scope to make scoring easier, but I'm sure that's happening anyway as the testing vendors shave costs.


> Working in NLP, it really bugs me when people overstate what can be done like this.

But they have to. Otherwise funding goes away.

> It creates massive expectations that are bound to fail, and then the entire field suffers.

Yes. But by then the people overstating now will have made their money and moved on.

I agree with you, but honestly companies and investors are so shallow it isn't hard to see what is going on.


> Yes. But by then the people overstating now will have made their money and moved on.
Yep, and those people will likely be sending THEIR kids to elite private schools with 5:1 student-teacher ratios and 20-40K/year tuition for elementary and high school. These schools will DEFINITELY NOT use "AI" for grading.


It's weird that what is a luxury today was once the norm: organic produce, a stay-at-home parent, natural (as opposed to synthetic) clothing, etc.

I won't be surprised if human teachers and caregivers end up reserved for the upper classes.


If you think you can get a 5:1 student-teacher ratio on 40k a year, I have a bridge you can buy. For 10:1 you'd be looking at 100k.


That is fraud.


>Others have pulled one over on the computer by padding their essays with long quotes from the text they’re supposed to analyze, or from the question they’re supposed to answer.

This is a long-standing tradition for students on pretty much every humanities essay.


If you apply to study at the University of Cambridge (UK, not Boston), an actual faculty member (not a teaching assistant or anything like that) of the University of Cambridge (not some lesser institution) will take the time to sit down with you for an extended period of time for an interview, even for undergraduates (for graduates, it goes without saying). That courtesy is extended to a boatload of applicants who make that final stage, and the stages before that certainly don't involve robo-grading.

In the U.S., you take the SAT or GRE or whatever, which gets graded by a mechanical Turk or even an actual machine. And even the good colleges don't have the good sense to ignore the thing (the exception being MIT, at least back when I applied there, where stating your GRE score was optional for grad applicants). So a less-than-astronomical score will immediately make it impossible to get into any decent college. That system is f*cked up.

For me, that meant the only university that would have me also happened to be the one that routinely places first or second in international university rankings, that university being Cambridge. And I didn't get a single offer from the US, because of a bad GRE score.

After getting my doctorate in NLP there, I can attest to how fucked up the idea of scoring essays with NLP actually is.


The sicker thing is that this whole business of "we'll use machines to do human work, and badly, because it saves a significant amount of money" is endemic to the US at large. And sure, it makes a lot of money right now, which is what the stock markets and VCs care about.

I could point at individual companies (Google, Facebook, Amazon, etc.): they all use various forms of "AI" (cough) to do all sorts of human-related tasks. And they do it somewhat OK for the standard case, but fail horrendously all around the edges. But the failing around the edges is an intended side effect. It costs too much to do it right, so "right now" is what's selected for.

It's also why HN, Twitter, and Facebook are the customer support channels of last resort. The companies' own support portals are echo chambers of "we're very sorry, but we really don't care".


This seems easy to game, especially if you have any insight into the model behind the grading. I doubt that those running it will release any specifics, but with a large industry[0] behind test prep, general guidance on how to beat the system will be available for those who can pay for it.

That said, the SAT writing section was equally silly when graded by humans. All of the examples of what a good score looked like (as of when I took it ~10 years ago) were highly formulaic, rewarding those who invested in playing the game.

[0]: https://www.quora.com/How-big-is-the-SAT-test-prep-market


If you read the article, it actually goes into a lot of fun detail on how the graders are already being gamed, with adversarial attacks where you just generate garbage that earns a high score:

> Called the Babel ("Basic Automatic B.S. Essay Language") Generator, it works like a computerized Mad Libs, creating essays that make zero sense, but earn top scores from robo-graders.

  "History by mimic has not, and presumably never will be precipitously but blithely ensconced. Society will always encompass imaginativeness; many of scrutinizations but a few for an amanuensis. The perjured imaginativeness lies in the area of theory of knowledge but also the field of literature. Instead of enthralling the analysis, grounds constitutes both a disparaging quip and a diligent explanation."
Along with human examples:

> One year, she says, a student who wrote a whole page of the letter "b" ended up with a good score. Other students have figured out that they could do well writing one really good paragraph and just copying that four times to make a five-paragraph essay that scores well. Others have pulled one over on the computer by padding their essays with long quotes from the text they're supposed to analyze, or from the question they're supposed to answer.

This whole thing is a nice case study for adversarial machine learning.
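
For anyone curious what "computerized Mad Libs" means in practice, here's a rough sketch of the idea in Python. This is not the actual Babel code; the word lists and templates are invented for illustration. The point is to stuff obscure vocabulary into fixed sentence skeletons so the output has the surface features graders reward (rare words, long sentences, connective phrases) while meaning nothing.

  # Toy illustration of a Babel-style generator (not the real tool):
  # fill fixed sentence templates with grandiose vocabulary so the output
  # scores well on surface features while meaning nothing at all.
  import random

  FANCY_NOUNS = ["imaginativeness", "scrutinization", "amanuensis",
                 "quandary", "verisimilitude", "exegesis"]
  FANCY_ADJS = ["perjured", "precipitous", "blithe", "disparaging",
                "diligent", "ineffable"]
  TEMPLATES = [
      "{topic} by {noun1} has not, and presumably never will be, {adj1} but {adj2}.",
      "Society will always encompass {noun1}; many a {noun2}, but a few for the {noun3}.",
      "Instead of enthralling the analysis, {topic} constitutes both a {adj1} quip and a {adj2} explanation.",
  ]

  def babble(topic, sentences=4):
      # Assemble grammatical-looking nonsense around the prompt's topic word.
      out = []
      for _ in range(sentences):
          out.append(random.choice(TEMPLATES).format(
              topic=topic,
              noun1=random.choice(FANCY_NOUNS),
              noun2=random.choice(FANCY_NOUNS),
              noun3=random.choice(FANCY_NOUNS),
              adj1=random.choice(FANCY_ADJS),
              adj2=random.choice(FANCY_ADJS),
          ))
      return " ".join(out)

  print(babble("History"))

None of this requires any understanding of the prompt; it only requires knowing which surface features the scoring model counts.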


Just because the essays are auto-graded doesn't mean no human will ever see them. It takes far, far longer to grade an essay than it does to randomly skim 3 sentences, confirm that it's bullshit, and get the student expelled for attempting to cheat.


That will stop people from just copy-pasting one statement over and over again. But it will still create a metagame of trying to write text the machine learning model likes, as opposed to good text.


It was already a bit of a game in 2005 when I took the 'new' SAT, which included an essay component. Research from the same Professor Perelman quoted in this article had shown that longer essays were graded more favorably (by overworked humans, I believe), so I made sure to fill the whole space at the expense of quality writing, and got a perfect score.

https://mobile.nytimes.com/2005/05/04/education/sat-essay-te...


You're totally missing the point. The existence of cheating strategies implies that there's something wrong. The thing that's wrong is not the cheating. What's wrong is that when an examiner has the power to severely diminish a student's prospects for future educational and career development, you would expect that examiner to be someone whose academic standing is way in advance of the student's, not someone whose intelligence is short of human.


Adversarial ML where the attack has to fit in the mind of the test taker and iteration is slow is an interesting problem.

I can't say that I trust the original humans grading thousands of SAT essays prior to any ML assistance though.


> I can't say that I trust the original humans grading thousands of SAT essays prior to any ML assistance though.

I think you are underestimating human beings here. Sure, humans would make errors, but literacy and reading comprehension are skills that all of the graders would have. If ML solutions are not recognizing obvious repetition, Markov-chain text, excessive excerpts/quotations, or "bbb bbbb bbbbbbb bbb bb", then they are likely grading at a level of interpretation entirely below the humans.
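
To make that concrete, here's a hypothetical pre-filter in Python (the function name and thresholds are invented for illustration, not taken from any real grader) that would catch the failure modes the article describes: single-character spam, a copy-pasted paragraph, and quote padding.

  # Hypothetical sanity checks a grader could run before scoring an essay;
  # all thresholds are made up for illustration.
  def looks_like_gaming(essay, prompt, source_text=""):
      words = essay.lower().split()
      if not words:
          return True
      # 1. Repetition spam: one token dominating the essay ("bbb bbb bbb ...").
      most_common = max(set(words), key=words.count)
      if words.count(most_common) / len(words) > 0.25:
          return True
      # 2. Copy-paste padding: the same paragraph repeated to fill five paragraphs.
      paragraphs = [p.strip() for p in essay.split("\n\n") if p.strip()]
      if len(paragraphs) != len(set(paragraphs)):
          return True
      # 3. Quote padding: most of the vocabulary lifted from the prompt or source.
      lifted = set(words) & set((prompt + " " + source_text).lower().split())
      if len(lifted) / len(set(words)) > 0.8:
          return True
      return False

If a scoring system sails past even checks this crude, it's a fair bet it isn't reading for meaning at all.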


I was unclear in my previous comment. Humans are fine at recognizing gibberish answers, but I'm skeptical of their ability to differentiate great writing from merely decent writing.


That's fine for this purpose because even the most exclusive universities can't really expect great writing as a precondition for admission. These AIs can't even distinguish decent writing from complete gibberish.


This reminds me of the Versant test, which is a computer-graded test of speaking ability in various languages. A friend of mine had to take it, and I quickly realized it was both inaccurate and highly gameable.

The full story is that my friend was very fluent in English (though still far from native) but did very poorly twice. I spent some time reading the company's PR and an interview with the founder and realized that they weren't even trying to grade content, only fluency.

So I advised my friend to just speak fast and constantly, even if it didn't make sense. If your story goes down a dead end, or you realize your last verb didn't match its object, don't pause, don't go back, just keep the words coming out. It worked perfectly and she aced it the third time.


If it's possible to replicate the grades (i.e., different teachers give the same work the same mark), then it's possible to write rules for the grades. If it's possible to write rules, it eventually becomes possible to automate them. In either case it's possible to game the system if you know the rules (as you've said).

The underlying problem is high-stakes testing. Remove that, and you remove the need for formulaic testing, and the incentive to game the testing. I'm not suggesting that means dropping rigor and consistency from the curriculum; there are other ways to achieve those, and they involve human effort, i.e. talking to people. Which costs time and money.


Just because humans can replicate it does not mean that the rules can be expressed in any useful way. Plenty of tasks that replicate easily across humans (such as driving) currently are not automated.

I'm not saying it cannot be done, but it definitely is non-trivial.

If the rules become good enough at capturing good writing, then "gaming" them is the same as producing good writing!


I'm surprised they didn't even mention the morale issues. How will it feel to know your answer isn't even going to be read by a person?


Math homework can be auto-graded as well. I wouldn't shed a tear that no one reads whether I got the arithmetic questions right or wrong. The point of all the work is to improve your abilities. Your first short story, written at the age of 7, isn't going to be a blockbuster; it's just trash you write to improve your ability to write. Using robo-grading means more absolutely horrible essays that have no value on their own can be written and graded without a teacher having to spend the huge amount of time it takes to grade them.


  The answer is incorrect.
  Your answer: 18
  The answer: 18

Anyone who has taken an autograded test or two will know they have bugs. Sometimes they don't trim whitespace (see above), or a correct answer isn't expressed in a way the computer recognizes as equivalent.
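
As a guess at the kind of bug behind that output (hypothetical code, not from any real grading system): the checker compares raw strings, so trailing whitespace or an equivalent numeric form gets marked wrong unless someone remembers to normalize first.

  # Hypothetical auto-grader comparison. The naive version rejects "18 "
  # (trailing space) or "18.0" even though the answer is right.
  def naive_check(submitted, expected):
      return submitted == expected              # "18 " != "18" -> marked wrong

  def normalized_check(submitted, expected):
      submitted, expected = submitted.strip(), expected.strip()
      if submitted == expected:
          return True
      try:                                      # accept numerically equal forms
          return float(submitted) == float(expected)
      except ValueError:
          return False

  print(naive_check("18 ", "18"))       # False: the bug in question
  print(normalized_check("18 ", "18"))  # True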

In the case in the article, we are talking about mass-graded assessments, and we can see that those who have access to wealth also have access to people who can teach them to game the system. For something as small as assessments, it probably doesn't matter on an individual level; that's always been true, and it's one of the ways that wealth keeps its statistical advantage. However, this will also creep into the poorer school districts, and as that happens, it gets truly problematic.

While you are right that the point is to improve your ability to write, the feedback loop from a trained professional is a key part of that improvement. Getting students trained by machine feedback will be a further reinforcement of privilege: having your writing judged by computers will become a tell-tale sign of your upbringing.


You might be right that the writing I did at age 7 was absolutely horrible trash with no value.

But my teachers went out of their way to give me the opposite impression. Why do you think that was?


I think that robo-grading is missing the ability to give cohesive feedback on someone's performance. A teacher who reads a student's papers over the course of a year (or years) will be able to provide guidance to the student on how to improve. The robo-graders might be able to spit out a semi-accurate grade, but they won't be able to tell the student what they could improve and how.


What's more, with black-box ML techniques it may be the case that "advice" exists but cannot be effectively interpreted.


> The point of all the works is to improve your abilities.

And why should one try to improve his ability to write?

Writing is for human consumption. Essays should be scored by humans.


Fallacious argument I think.

I use a timer on my jog; a robot is measuring me! Should it be a human coach instead? Dog food is for dogs; should dogs be manufacturing it?

Silly. Emotionally based, and valid for that aspect. But not a good logical argument.


Perhaps not logical, but of great emotional importance.

Having kids know no one will read what they have to say is in effect telling them what they do isn't important, isn't worth the time to read...

I remember a few times when I had interesting exchanges with my teachers based on my writing, and this was certainly motivating to me.

It's like when a 3 year old draws something for you: it's objectively terrible, but you appreciate the effort, and want to encourage more of the same. What would a machine think of it?


You jog for your own benefit, not for other people, so you can measure your jogging however you like.

Dog food, however, is for the benefit of dogs. At some point in the manufacturing process a dog has to taste the food, because dogs are supposed to eat it. Or at least a human has to make sure it smells nice, because humans are supposed to buy it.

Same with essays. They are for humans to read, so humans have to test them.


If the teachers are using automatic grading, the students should be allowed to use automatic answering.

In fact, why not stay at home and let all interaction take place via Google's Duplex!


Stating that multiple choice tests are free from bias is already highly suspect.

The decision-makers on this seem to be both tech-illiterate (or pushing hidden agendas) and hilariously arrogant, to the point of thinking that epistemology is something that happens to other people.


“they insist, with computers already doing jobs as complicated and as fraught as driving cars, detecting cancer, and carrying on conversations, they can certainly handle grading students' essays.”

Except that computers can do none of those things, and those three things are all much simpler tasks than critically grading a student’s essay.

Also, of course the companies selling these products will say they work great. They're the very last people to trust in this conversation.


Honestly it might be better than the current system.

I graded papers for national high school exams for a few years (non-US), and there was a strict grading guide that you had to follow, which took any and all subjectivity out of the grading.

The essay-section grading guide was essentially a template that you had to follow, and any deviation would result in penalties.

You had about 30-45 minutes to grade exams that could take students 4-6 hours to complete, and none of the questions were multiple choice; they were all long-form questions. With ~30 questions per exam, that time included saving the answers in the electronic grading pad and, in the later years, in the shitty VB application they provided. (Which I ended up hacking to add a spreadsheet import function; this got me kicked out of exam grading because I had tampered with the application...)

We are already robo-grading papers, just with mechanical Turks, and in a process that is much less accurate, which is why we use 2-3 graders per exam.

Until we find a better overall solution for evaluating student performance, I don't see anything wrong with robo-grading exams and student work; if anything, it might give teachers some more free time to actually teach.


I think the reality of this is fairly well captured in the fourth-to-last paragraph:

> Even human readers, who may have two minutes to read each essay, would not take the time to fact check those kind of details, he says. "But if the goal of the assessment is to test whether you are a good English writer, then the facts are secondary."

I would go a step further and say the goal of standardized tests is probably not to test if you're a good English writer but rather to see if you're good at that test. AI has tremendous ability to streamline the grading of standardized tests, which are narrowly focused measures, and I think that's fine if it comes with proper checks. As many people mention here, I don't think AI will be able to objectively measure good general writing, and I doubt we'd fully cede artistic evaluation to a computer as a society, but I agree with being wary of this.

Standardized tests in general, though, seem to me like one of the greatest examples of a measure becoming a target and losing a lot of its effectiveness (Goodhart's law). This article can't help but remind me of Paul Graham's essay on nerds[1], because the effective point of school, from an evaluation standpoint, is these tests. Yet now we've built a computer that grades these tests nearly as well as its human counterparts, and it's glaringly obvious that this evaluation is tremendously gameable. Which resonates with Paul's argument that "The problem with most schools is, they have no purpose."

[1] http://paulgraham.com/nerds.html


>> But Nitin Madnani, senior research scientist at Educational Testing Service (ETS), the company that makes the GRE's automated scoring program, says [...] "If someone is smart enough to pay attention to all the things that an automated system pays attention to, and to incorporate them in their writing, that's no longer gaming, that's good writing," he says. "So you kind of do want to give them a good grade."

The presence of this quote in the article makes my day; doubly so given its source. It's a picturesque example of a statement that could conceivably get a 10/10 from a robo-grader based on the language and structure, but the thinking it represents is completely backwards.


Write to your audience! That's always the best rule, and a reasonable metric for 'good writing'. If the audience is a robot...


I heard a recent radio ad pushing a "Work from home as a Medical Coder" training seminar. Sounds dubious.

I wonder if you could create a legitimate service that provides "Work from home as an Essay Grader."


I stumbled upon this company at ISTE.

They outsource essay grading, similar to what you are suggesting.

https://www.thegraidenetwork.com/


Pearson has work-from-home test scorers.

The positions are temporary/seasonal and I believe that the only qualification is a bachelor's degree.


edX can automate short essay grading with edx/edx-ora2 "Open Response Assessment Suite" [1] and edx/ease "Enhanced AI scoring engine" [2].

1: https://github.com/edx/edx-ora2
2: https://github.com/edx/ease

... I believe there's also a tool for peer feedback.


Peer feedback/grading on MOOCs is pretty bad in my experience. There’s too much diversity of skills, language ability, etc. And too many people who bring their own biases and mostly ignore any grading instructions.

Peer discussion and feedback are useful in things like college classes. Much less so with MOOCs.


> Marder and Henderson worry robo-graders will just encourage the worst kind of formulaic writing.

> they will quickly learn they can fool the algorithm by using lots of big words, complex sentences, and some key phrases - that make some English teachers cringe.

Spot on! That's already the case with real teachers; with a computer it'll be hundreds of times easier (and horribly counterproductive for the students' education).

Back in the day, when I was a student, we already did a LOT of padding in our essays: writing pages upon pages (length mattered enormously!) of loooong sentences designed, basically, to make the discourse look much more clever (and accurate) than it really was. I'm still puzzled why our teachers did not notice the obvious B.S. that it was! Seriously, I managed to pass some classes (at university) with almost no real meat in the text. I just managed to write a coherent & logical argument, with a minimum of (vaguely remembered) facts, and that was enough to get a passing grade (but not a great grade, of course). Ridiculous.

---

Another fundamental problem is that even teachers don't have the same views on what makes a good essay. For the worst essays, I think they will agree: bad grammar, typos, no paragraphs, no logical structure, etc. But once you enter the average-or-better range, it becomes much more difficult to agree on which text is better! There is a LOT of subjectivity and personal taste, no matter what directives are given by the educational board, so your results would vary a lot depending on your current teacher.

Some teachers, for instance, are very narrow-minded: they expect students to "regurgitate" their lessons almost word for word, while others appreciate a student's attempts at insight, originality, and independence of thought. I've been in the French education system for a long time (including 6 years in 3 different universities), and it was shocking to see how much of a grade would depend on the teacher and on their perception of you (because that's also a factor, even if the teachers will swear it's not the case).

Even for technical subjects it's not obvious AT ALL that the teacher is right and rewarding the best answers. Now that I have many years of programming and project management behind me, I plainly see the mistakes & B.S. that some teachers would give us. And I've had coworkers who were still part-time students explain some of the incredible misconceptions of their teachers; it was almost beyond belief! Those teachers really lived in another world; they had no clue.

Well, at least, if you are graded by a program, you know that you will only have to learn that program's "tastes/patterns", instead of facing the unknown of being graded by a random teacher (who may like your style or not). That's a small consolation.



