A friend recently gave a presentation at a local Ted event, and while his was my first interface with the whole bean-bag-chair, Google-Glass, clearly-of-West-Coast-origin Ted experience, it did remind me that I still needed to watch this Ted presentation by Daphne Koller, co-founder of Coursera. She gave the talk in 2012, back when MOOC-mania was just getting started and before people realized that schools like Harvard and Stanford giving away their best courses for free represented some hideous nightmare that needed to be stopped at all costs.
Lest you think this is yet another anti-backlash piece, fear not! For I actually just want to use something she said during her talk to hang yet another bit of testing-related dweebiness on you (which should free most everyone from having to read beyond this paragraph).
Some of you still sticking around? OK, let’s do it!
About thirteen minutes into her Ted talk, Koller used a test-related example to illustrate a point she was making about the power of data-driven decision-making that huge MOOC enrollments enabled.
Specifically, she pulled an example from her co-founder Andrew Ng’s Machine Learning class in which huge numbers of students were answering the same test question incorrectly in the exact same way. Pointing out how hard a problem like this would be to discover if just 2-3 kids out of 30 were making the same mistake in a classroom-taught version of the course, she then went on to demonstrate how much easier it was to spot a testing-related issue once data from thousands of students became available to plot out on a cool graph.
Now I suspect my reaction to this observation would be similar to the one a PhD in computer science might have if a layman presented a commonplace in the field as some kind of breathtaking discovery. But for us assessment dorks, Coursera’s discovery was simply an example of a distractor analysis based on a pretty sizable (although hardly unprecedented) data set.
For those not in the field (which means everyone but you, Cliff), distractors are the incorrect answers to a test question. In the case of a multiple-choice question, they would be the incorrect choices, but in a question requiring numerical input they would be every number in the universe other than the right one.
Whenever a test item fails to perform (based on the statistics I described back in this piece), there is usually a reason. Sometimes a question is so easy that 100% of test takers get it right. But if the problem is that lots and lots of people are getting it wrong, that might mean it’s a good but difficult question, or it might mean there’s something wrong with it.
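To make that a bit more concrete, here’s a minimal sketch (in Python, with made-up response data) of two standard item statistics along those lines: difficulty (the proportion of test takers who answer an item correctly) and discrimination (how well getting the item right tracks with performance on the rest of the test). I’m assuming these are roughly the statistics in question; treat it as an illustration, not anything Coursera actually runs.

```python
import numpy as np

# Made-up response matrix: rows are test takers, columns are items,
# 1 = answered correctly, 0 = answered incorrectly.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
])

total_scores = responses.sum(axis=1)

for item in range(responses.shape[1]):
    correct = responses[:, item]
    difficulty = correct.mean()          # proportion answering this item correctly
    rest = total_scores - correct        # score on everything except this item
    if correct.std() == 0 or rest.std() == 0:
        # Everyone got this item right (or wrong): discrimination is undefined.
        discrimination = float("nan")
    else:
        discrimination = np.corrcoef(correct, rest)[0, 1]
    print(f"Item {item + 1}: difficulty={difficulty:.2f}, "
          f"discrimination={discrimination:.2f}")
```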
Distractor analysis can help determine if a question is problematic by looking at how many people choose each of the wrong answers. Ideally, the people who get a four-option multiple-choice question wrong should divide evenly among the three distractors. But if this distribution is lopsided (for instance, if everyone getting it wrong is picking the same distractor), that might mean an answer that is supposed to be wrong is actually right (or at least no less plausible than the correct answer). And if results show that 50% of people pick the right answer and 50% pick the same wrong answer, this could mean two of your distractors are so obviously wrong that your four-option multiple-choice question really just has two choices worth picking from.
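For readers who like to see these things spelled out, here’s a minimal sketch of such a distractor analysis in Python. The answer key, the responses, and the cutoff for calling a distribution “lopsided” are all invented for illustration.

```python
from collections import Counter

# Invented data: the answer key and responses for one four-option item.
answer_key = "B"
responses = ["B", "C", "B", "C", "C", "B", "A", "C",
             "B", "C", "B", "D", "C", "B"]

wrong = [r for r in responses if r != answer_key]
counts = Counter(wrong)
n_wrong = len(wrong)
expected_share = 1 / 3  # wrong answers ideally split evenly across three distractors

print(f"{len(responses)} responses, {n_wrong} incorrect")
for option, count in counts.most_common():
    share = count / n_wrong
    # Flagging anything that draws more than twice its "fair share" of wrong
    # answers is an arbitrary rule of thumb, not a standard cutoff.
    flag = "  <-- lopsided" if share > 2 * expected_share else ""
    print(f"  distractor {option}: {count} ({share:.0%} of wrong answers){flag}")
```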
In the case of that Machine Learning question (which required students to input two numbers), a sizable percentage of people were inputting the same incorrect pair of values.
Theoretically, this could be the result of collusion (not implausible in a MOOC, where students share information in forums and testing is done behind closed doors). Alternatively, the two wrong numbers might reflect a typical mistake one would make when solving this kind of problem, in which case the question is still good, since the commonly submitted wrong pair simply represents a reasonable error students need to learn not to make. But there is also the chance that confusion in the way the question is worded is steering people towards this particular incorrect submission, which means the question’s wording (and possibly programming) should be given a second look.
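The same idea carries over to a numeric-input question: tally the incorrect submissions and check whether one wrong pair dominates. Here’s a minimal sketch with invented values (not the actual numbers from the Machine Learning question):

```python
from collections import Counter

# Invented data: the correct pair and student submissions for a question
# that asks for two numbers.
correct = (3.0, 7.0)
submissions = [
    (3.0, 7.0), (2.0, 8.0), (3.0, 7.0), (2.0, 8.0), (2.0, 8.0),
    (3.0, 7.0), (5.0, 1.0), (2.0, 8.0), (3.0, 7.0), (2.0, 8.0),
]

wrong = Counter(pair for pair in submissions if pair != correct)
n_wrong = sum(wrong.values())

# If one incorrect pair dominates, that's the signal worth investigating.
for pair, count in wrong.most_common(3):
    print(f"{pair}: {count} of {n_wrong} incorrect submissions "
          f"({count / n_wrong:.0%})")
```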
In short, there’s a lot of insight to be gained by looking at some fairly easy-to-calculate statistics that can be drawn from the very data Coursera used to generate those Ted slides.
But there’s a catch. (Isn’t there always?) For the types of analyses one can do on test data work best when assessments have been given in controlled environments, or at least environments where you can count on your test takers having had roughly the same testing experience. But if some students take a MOOC test by themselves and others take it in groups, or if some students struggle with the material for reasons having nothing to do with the content (such as language challenges), these inconsistencies become confounding factors when you try to get a grip on what the data is telling you.
And when you’ve got this MOOC that gives you 100 chances to get a set of four-option multiple-choice questions right, or that one that uses the same items in both in-video questions and final exams, when some tests consist of just 2-3 items and others have so many true-false questions that flipping a coin can get you close to a passing grade, then fancy-pants statistical analysis is not required to explain why there seems to be no rhyme or reason to student performance. It’s simply the result of MOOC testing that bases a final grade on assessments that measure nothing.
Raja says
Dear Jonathan,
Very informative post, and the conclusion you draw in your last paragraph is very sound, except that by limiting the number of tries on quizzes or assessments, rather than allowing ‘any number of attempts till you get the answer right,’ you are probably taking away one of the prime motivators for MOOC students to take the class.
Basically, there is no sense of failure in such a system, which is independent of the number of attempts on any question. Take that away and the result is the same old ‘you lose if you are wrong’ mentality, or something like that, which can be quite discouraging for some people, as the emphasis is then on getting everything right rather than learning everything the right way (as in cool, calm and relaxed).
Would love to hear your thoughts on this?
P.S: Keep up the good work with degreeoffreedom!
DegreeofFreedom says
My attitude towards testing is that it can (and should) be put to multiple uses within any course.
For instance, the type of testing that gives everyone the opportunity to get 100% of the questions right can form an important part of the learning process IF (1) the questions are well designed and challenging and (2) they include an instructional component to help students learn something on their way to getting an answer right (vs. just giving them enough opportunities to guess their way to a perfect score). But assessments with any stakes (i.e., those meant to separate those who have mastered the material from those who have not) will always have winners (the people who learn the material well enough to pass the test) and losers (those who didn’t master the material and thus fail).
Their high-stakes nature is the reason tests that separate one group from another have a special obligation to be well designed, well written and meaningful. And I don’t see any reason why a professor can’t make their own choices about how much testing, and what kind of testing, they would like to include in their courses. But as much as I hate the notion of someone who has put time into a class being disappointed, I think the educational value of the course overall is diminished for everyone if it’s possible to pass the course without learning the material.
Thanks for the kind words and now it’s back to class!