A friend recently gave a presentation at a local Ted event, and while his was my first interface with the whole bean-bag-chair, Google-Glass, clearly-of-West-Coast-origin Ted experience, it did remind me that I still needed to watch this Ted presentation by Daphne Koller, co-founder of Coursera. She gave the talk in 2012, back when MOOC-mania was just getting started and before people realized that schools like Harvard and Stanford giving away their best courses for free represented some hideous nightmare that needed to be stopped at all costs.
Lest you think this is yet another anti-backlash piece, fear not! For I actually just want to use something she said during her talk to hang yet another bit of testing-related dweebiness on you (which should free most everyone from having to read beyond this paragraph).
Some of you still sticking around? OK, let’s do it!
About thirteen minutes into her Ted talk, Koller used a test-related example to illustrate a point she was making about the power of data-driven decision-making that huge MOOC enrollments enabled.
Specifically, she pulled an example from her co-founder Andrew Ng’s Machine Learning class, where huge numbers of students were answering the same test question incorrectly in the exact same way. Pointing out how a problem like this would be hard to discover if just 2-3 kids out of 30 were making the same mistake in a classroom-taught version of the course, she went on to demonstrate how much easier it was to spot a testing-related issue once data from thousands of students became available to plot on a cool graph.
Now I suspect my reaction to this observation would be similar to the one a PhD in computer science might have if a layman presented a commonplace in the field as some kind of breathtaking discovery. But for us assessment dorks, Coursera’s discovery was simply an example of a distractor analysis based on a pretty sizable (although hardly unprecedented) data set.
For those not in the field (which means everyone but you, Cliff), distractors are the incorrect answers to a test question. In the case of a multiple-choice question, they would be the incorrect choices, but in a question requiring numerical input they would be every number in the universe other than the right one.
Whenever a test item fails to perform (based on the statistics I described back in this piece), there is usually a reason. Sometimes a question is so easy that 100% of test takers get it right. But if the problem is that lots and lots of people are getting it wrong, that might mean it’s a good but difficult question, or it might mean there’s something wrong with it.
Distractor analysis can help determine if a question is problematical by analyzing how many people are answering each of the wrong answers. Ideally, the number of people who get a four-option multiple-choice question wrong should divide evenly between the three incorrect distractors. But if this distribution is lopsided (for instance, if everyone getting it wrong is picking the same distractor), that might mean an answer that is supposed to be wrong is actually right (or at least no less plausible than the correct answer). And if results show that 50% of people pick the right answer and 50% pick the same wrong answer, this could mean two of your distractors are so obviously wrong that your four-option multiple-choice question really just has two choices worth picking from.
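The tally described above is easy to automate. Here's a minimal sketch (my own illustration, not anything Coursera published) that counts how often each option was chosen and flags any distractor attracting more than half of all wrong answers — the 50% threshold is an assumption for the example, not a psychometric standard:

```python
from collections import Counter

def distractor_analysis(responses, correct):
    """Tally option choices and flag a lopsided spread among wrong answers.

    responses: list of chosen options (e.g. "A", "B", ...)
    correct:   the keyed (right) option
    """
    counts = Counter(responses)
    total = len(responses)
    wrong_total = total - counts[correct]

    report = {}
    for option, n in sorted(counts.items()):
        pct_of_wrong = None
        if option != correct and wrong_total:
            pct_of_wrong = round(100 * n / wrong_total, 1)
        report[option] = {
            "count": n,
            "pct_of_all": round(100 * n / total, 1),
            "pct_of_wrong": pct_of_wrong,
        }

    # Flag any distractor that soaks up more than half of the wrong answers
    flagged = [opt for opt, row in report.items()
               if row["pct_of_wrong"] is not None and row["pct_of_wrong"] > 50]
    return report, flagged

# Toy data: 50% pick the key, but nearly everyone who misses picks "B"
responses = ["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2
report, flagged = distractor_analysis(responses, correct="A")
print(flagged)  # ['B'] — a distractor worth a second look
```

On the toy data, "B" collects 90% of the wrong answers — exactly the 50/50 pattern described above, where two of the distractors are doing no work at all.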
In the case of that Machine Learning question (which required students to input two numbers), a sizable percentage of people were inputting the same incorrect pair of values.
Theoretically, this could be the result of collusion (not infeasible in a MOOC where students share information in forums and testing is done behind closed doors). Alternatively, the two wrong numbers might be the result of a typical mistake one would make when solving this kind of problem, which means the question is still good, since the commonly submitted wrong number pair simply represents a reasonable error students need to learn not to make. But there is also the chance that confusion in the way the question is worded is steering people towards this particular incorrect submission, which means the question wording (and possibly programming) should be given a second look.
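Spotting that kind of pile-up on a numeric-input question is just frequency counting over the submitted value pairs. A quick sketch, with made-up data standing in for the real Machine Learning class responses:

```python
from collections import Counter

def common_wrong_pairs(submissions, correct_pair, top=3):
    """Count each incorrect (x, y) submission to surface a wrong
    answer that many students converge on."""
    wrong = Counter(pair for pair in submissions if pair != correct_pair)
    return wrong.most_common(top)

# Toy data: the key is (2.0, 3.0), but a third of the class swaps the values
subs = ([(2.0, 3.0)] * 60 + [(3.0, 2.0)] * 30 +
        [(1.0, 4.0)] * 5 + [(0.0, 0.0)] * 5)
print(common_wrong_pairs(subs, correct_pair=(2.0, 3.0)))
# the top wrong pair, (3.0, 2.0), accounts for 30 of 40 misses
```

A spike like that (here, a swapped pair) is the signal that tells you whether you're looking at a classic student error or a badly worded question — the statistics can only point; a human still has to decide which explanation fits.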
In short, there’s a lot of insight to be gained by looking at some fairly easy-to-calculate statistics that can be drawn from the very data Coursera used to generate those Ted slides.
But there’s a catch. (Isn’t there always?) For the types of analyses one can do on test data work best when assessments have been given in controlled environments, or at least environments where you can count on your test takers having had roughly the same testing experience. But if some students are taking MOOC tests by themselves and others are taking them in groups, or if some students struggle with the material for reasons having nothing to do with the content (such as language challenges), these inconsistencies become confounding factors when trying to get a grip on what the data is telling you.
And when you’ve got this MOOC that gives you 100 chances to get a set of four-option multiple-choice questions right, or that one that uses the same items in both in-video questions and final exams; when some tests consist of just 2-3 items and others have so many true-false questions that flipping a coin can get you close to a passing grade, then no fancy-pants statistical analysis is required to explain why there seems to be no rhyme or reason to student performance: it’s simply the result of MOOC testing that bases a final grade on assessments that measure nothing.