Robots May Be Grading Your Kid’s Essays, With Bias

Students at Albany School of Humanities wait to board a bus after school on Monday, April 13, 2015, in Albany, N.Y. With standardized English tests set to begin Tuesday in New York schools, some parents are again planning to have their children sit out the exams. (Mike Groll/AP)

My favorite poster depicts a robot atop the single word OBEY. But while I chuckle at this riff on sci-fi lore, it probably doesn’t amuse those haunted by the notion of a real-life Robocalypse, the mass layoffs allegedly bearing down on humanity Terminator-like due to automation.

This dystopian fear seems mainly to afflict tech types who’ve OD’d on “Star Trek” episodes; there’s good evidence that bots aren’t about to replace everyone on the payroll just yet. But they are taking over one job, and not for the better: grading our children’s tests, from elementary school to graduate school entrance exams.

Artificial intelligence isn’t evaluating just straightforward multiple-choice tests but essays, determining whether you write with the verve of Joan Didion and the unassailable logic of a Cognoscenti contributor. “From the fall of the Roman Empire to the pros and cons of government regulations,” NPR reports, robo-readers are weighing in on students’ insights.

Technology has been a boon to anyone who does research, putting a universe of information at our fingertips. But for connoisseurs of literate, thoughtful writing, it can be a soul-seducing succubus. For all the fears of automatons outperforming humans, robo-graders can’t tell a Shakespeare from a schmuck.

“Flawed Algorithms Are Grading Millions of Students’ Essays,” trumpeted a recent story by the tech news site Motherboard, citing the use of AI as “the primary or secondary grader on standardized tests in 21 states.” (Tests increasingly are taken online rather than with paper and pencil, allowing auto-grading.) In 18 of those 21 states, only a fraction of essays get a human reader double-checking the AI assessment; the machines not only spit out test results in mere minutes but are cheaper than flesh-and-blood backup.

But you get what you pay for, and AI grading is flawed for two reasons.

First, robots don’t always know good writing when they see, er, scan it. Much as slinging big words impresses less literate people, some AI grading systems “can be fooled by nonsense essays with sophisticated vocabulary,” wrote Motherboard’s reporter, who ran two MIT-generated essays, deliberately laden with BS, by the AI grader for the online practice tool of the Graduate Record Examination, or GRE.

Both essays scored 4 out of a possible 6, indicating “competent examination of the argument and convey[ing] meaning with acceptable clarity.” Here’s the competent clarity with which one of those essays opened: “Invention for precincts has not, and presumably never will be undeniable in the extent to which we inspect the reprover.” It was downhill from there.

In Utah, which has bought into robo-grading big time, one student filled an entire essay page with the letter “b” and earned a good score, NPR reported.

That AI graders are suckers for big words is, in a weird way, related to their second flaw: They can be biased against certain groups.

For example, the Educational Testing Service has done multiple studies of its robotic “E-rater,” used on tests including the GRE. The studies confirmed that E-rater gave higher scores to test takers from China than human graders did, while giving lower scores to African Americans.

The reason? Chinese students got graded up for essay length and sophisticated vocabulary, while getting lower grades on grammar and sentence mechanics. Combined, those results suggested to researchers that “many students from mainland China were using significant chunks of pre-memorized shell text,” Motherboard reported. “African Americans, meanwhile, tended to get low marks from E-rater for grammar, style, and organization—a metric closely correlated with essay length—and therefore received below-average scores. But when expert humans graded their papers, they often performed substantially better.”

Bots’ bias, it turns out, comes from us. Artificial intelligence isn’t intelligent enough to judge writing quality; rather, it’s fed hundreds of example essays and correlates them with the high or low grades assigned by actual human graders. “They then predict what score a human would assign an essay, based on those patterns,” Motherboard’s reporter wrote. The problem, a computer scientist told the reporter, “is that bias is another kind of pattern” that an AI grader can absorb.
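For readers curious what that pattern-matching looks like under the hood, here is a deliberately crude sketch. Everything in it is invented for illustration: real systems like E-rater use many surface features, not just essay length, and train on far more than four essays. The point the toy makes is the article’s point: the model never reads for meaning, so whatever correlates with the human scores in its training data (here, sheer word count) becomes the grade.

```python
# Toy essay "grader": fit human-assigned scores to a single surface
# feature (word count), then score new essays by that fit alone.
# All essays and scores below are invented for illustration.

def word_count(text):
    return len(text.split())

# Hypothetical human-scored training essays (score scale 1-6).
train = [
    ("dogs are nice", 2),
    ("dogs are nice and cats are nice too", 3),
    ("dogs are nice and cats are nice and this essay explains why pets matter", 4),
    ("dogs are nice and cats are nice and this essay explains why pets matter "
     "to families in towns and cities", 5),
]

# Simple linear regression: predicted score = a + b * word_count.
xs = [word_count(t) for t, _ in train]
ys = [s for _, s in train]
n = len(train)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def grade(text):
    """Score an essay using only the learned length pattern."""
    return a + b * word_count(text)

# A long nonsense essay outscores a short coherent one, because
# length is the only "pattern" the model ever learned.
nonsense = ("invention for precincts has not and presumably never will be "
            "undeniable in the extent to which we inspect the reprover")
print(grade(nonsense) > grade("dogs are nice"))  # True
```

Swap in "average word length" for "word count" and the same mechanics reward pre-memorized sophisticated vocabulary; swap in any feature that happens to track demographics and the same mechanics reproduce human graders' biases.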

ETS says its essays are double-checked, and sometimes triple-checked, by human graders. Yet one ETS researcher concedes that algorithms teach you to write big words rather than to write well. And as mentioned, on other standardized tests, many states have few humans reading over the robots’ shoulders.

Education is hardly the only arena where algorithmic bias rears itself. African Americans can receive inferior medical care because software that’s programmed to correlate high medical spending with risk of serious illness doesn’t know that low-spending blacks often are medically underserved, and thus misses their risk.

“It is easier to remove biases from algorithms than from people,” says one computer scientist. His point about human fallibility is indisputable, his faith in algorithms less so. Unquestionably, they’re economical, high-performing grading tools in certain contexts. But even granting that science will improve machine graders, will they ever be able to judge good and creative writing—or truth?

A San Francisco tutoring company head told NPR that he advises students to make up evidence from fake experts (some use their roommate’s name) to appease the machine, which often returns a good grade. “Yeah, we see a lot of that,” an ETS researcher conceded, but since human graders might not have time to do fact-checking either, “it’s not the end of the world.”

Maybe not, but in the age of Trump, it could be the end of reasoned, factual writing. Could it be our honesty and devotion to fact, not our jobs, that the Robocalypse will take?

Rich Barlow, Cognoscenti contributor, writes for BU Today, Boston University’s news website.
