MOOCs and Peer Grading - Rubric Scoring

While most professors participating in the MOOC experiment still come from US universities, the student body is global.

This international reach is one of the most celebrated virtues of free online learning, providing opportunities for students in nearly every nation to participate and interact in flat, global classrooms. But this global audience also presents challenges to various MOOC components, including the one I began discussing yesterday: peer grading.

To cite the most obvious example, one of the criteria for peer-grading an essay in my Modernism class is “Exposition” (i.e., quality of writing). But when I read an essay that seems well thought out but confusingly worded, it’s difficult to determine whether I’m looking at something that is not well written (thus deserving a low score) or a remarkable effort by someone for whom English is not a first language.

If a student identifies himself or herself as a non-native English speaker in the body of the essay, I’ve tended to be more lenient when scoring it for exposition. But, as you might guess, few people include such identifying information in their work. And even if they do, how consistent can my scoring be with that of peer graders who don’t follow a rule that I made up for my own use?

This highlights the bigger issue of inter-rater reliability.

As mentioned yesterday, MOOC peer-review is modeled on standardized essay scoring used to grade exams such as the AP or essay portions of the SAT or ACT.

But when paid graders have to go through thousands of submissions for AP History (for example), they are not simply e-mailed a rubric and a bunch of essays and told to get on with it. Rather, they are all flown into the same location and put through hours or days of training to ensure they are all grading consistently.

This usually includes sharing examples (called exemplars) of essays representing each score on a rubric (giving graders models to work from). It will also include mechanisms for sharing and confirming scores between graders and bringing in additional evaluators to break ties or settle disputes.

The point of all this activity is to squeeze as much inconsistency out of the process as possible so that the major source of subjectivity in a rubric-graded scoring exercise (idiosyncrasies between those doing the grading) is minimized.

Needless to say, no such training or collaboration is available when I’m scoring 3-4 essays from my home in Boston (and applying my own extra rules – such as the non-native English one mentioned above) while someone else is scoring their 3-4 from their villa on the Turkish coast (and applying his or her own idiosyncratic rules as they work).

Now some inter-rater reliability issues are mitigated by the fact that all of us are using the same, relatively simple rubric, the same rubric we’ll be using to score all the essays peer reviewed for this class.

But consistency and simplicity create a different set of issues. For when essays will be scored using the same three criteria time and time again, it’s hard to resist the urge to write the same essay time and time again (essentially replicating a formula that worked – i.e., earned you a high score – previously).

Similarly, it’s only natural that a professor trying to create assignments that will make use of a simple, consistent rubric will ask simply worded, similar questions each time they assign an essay (which may explain why all of the questions I’ve been asked to answer in my Modernism essay assignments essentially boil down to: “Compare Author A with Author B” – with the pair of writers being the only variation from one essay to the next).

Now there is still a considerable difference between comparing Kant to Darwin vs. comparing Marx to Virginia Woolf. But I suspect that the relatively high correlation between peer scores and how professors say they would grade the same essay doesn’t take into account the fact that professors would likely vary their essay questions considerably more if classes were smaller (and they had to read and grade each essay).

As a final rubric-dweeby (dweebie?) point, in theory a rubric that allows you to assign scores of 0-3 to three different criteria means total scores should range from 0 to 9. But since 0 scores (at least in my Modernism rubric) are all based on the same “Nothing was submitted” criterion, the real range for all essays that were submitted (which are the only ones being subjected to rubric grading) is 3-9.

And even here, only an unclear, poorly written argument that provides no evidence will earn the lowest score of 3. Meaning even a mediocre, somewhat-coherent argument that includes a sprinkling of semi-relevant quotes can garner the 5 score needed to achieve a passing grade in the course.
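
To make that arithmetic concrete, here is a minimal Python sketch. It assumes what is described above: three criteria, each scored 1-3 once something has been submitted (0 being reserved for "nothing submitted"), and a passing total of 5. The names in it are my own illustration and have nothing to do with the course platform.

```python
from collections import Counter
from itertools import product

CRITERIA = 3                   # number of rubric criteria described in the post
SUBMITTED_SCORES = (1, 2, 3)   # per-criterion scores for a submitted essay (0 = nothing submitted)
PASSING_TOTAL = 5              # passing threshold mentioned in the post


def total_distribution():
    """Count how many per-criterion score combinations produce each total."""
    combos = product(SUBMITTED_SCORES, repeat=CRITERIA)
    return Counter(sum(combo) for combo in combos)


dist = total_distribution()
all_combos = sum(dist.values())
passing = sum(count for total, count in dist.items() if total >= PASSING_TOTAL)

# Print each reachable total and the number of combinations that yield it.
for total in sorted(dist):
    print(total, dist[total])
print(f"{passing} of {all_combos} combinations reach {PASSING_TOTAL} or more")
```

Under those assumptions, 23 of the 27 possible score combinations land at 5 or above. Counting combinations is not the same as counting essays, of course, but it does show how little of the nominal range sits below the passing line.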

This is not to say that assignments and associated rubrics should be made harder for the sole purpose of penalizing students (especially those aforementioned students who are putting in the extra effort to do good work in a language that might be their second or third). But given that even the simple rubric I’ve been describing has enough placeholders to allow scores to spread out across a wider range, it’s worth putting more time into defining those ranges (and thus defining broader scoring categories) in order to give students more room to work (including more room to succeed or fail).

Comments

Paul says

April 27, 2013 at 10:13 am

Peer assessment is, in MOOC terms, a necessary evil; given the need to make qualitative judgements of work completed there is currently no alternative to using the students themselves to complete the evaluations. In practice, peer assessment can work, but is almost always an inferior option for the student being assessed compared to receiving feedback from an experienced teacher. Peers tend to make fairly shallow rubric-driven assessments and well-meaning feedback may be misleading if the student-assessor does not have a firm grip on the material.

- Roger Brown says
  
  July 21, 2013 at 8:19 am
  
  Peer Assessment is an interesting concept, but hardly able to create reliable qualitative judgements.
  
  Experienced teacher assessments are a well known proven basis for making academic judgements.
  
  Otherwise Nobel would probably have been criticised for creating ‘explosives’.
  
well... says

June 2, 2013 at 9:04 am

this might be irrelevant, but you should get your articles translated into other languages to reach more people