Most test development begins and ends with the writing of test questions.
For instance, a teacher preparing a quiz or exam for his or her students usually starts not with the type of statistically validated test plan described in the previous post (the kind created to drive development of a content-valid standardized exam), but with a blank sheet of paper.
This is not to say that tests developed without a content map are inherently poor ones. In fact, most of the hundreds of tests I took in K-12 and college were thoughtful and challenging. This is largely due to the fact that teachers have an extremely strong grasp of what they’d like to cover, and many are experienced item writers who may have also had training in testing basics during their undergraduate or graduate education degree programs.
But the type of planning professional test developers use does avoid the problem of basing content decisions on which subjects are most easily tested (vs. what needs to be covered). For instance, many multiple-choice tests tend to lean heavily on assessing an understanding of definitions of terms since vocabulary questions are easy to write. While this is not a problem in and of itself, having some kind of content checklist to work from can help identify subjects that should be included in a test, even if items covering those subjects might take more time to come up with.
Because of the high-volume nature of MOOCs, tests associated with such courses lean heavily on traditional “linear” item formats (true/false, multiple-choice, multiple-response, matching) that can be easily automated. And, as I mentioned previously, the bulk of the tests I’ve taken as part of a MOOC seem to consist of repurposed content from the class the MOOC was based on.
Unfortunately, a number of these tests tend to break one or more of the following rules for professional item development (taught to me years ago by one of the most talented professional exam developers in the industry):
- Each item should measure a single, clearly defined objective
- Items should be written in language that is at or below the average reading level of the students targeted by the assessment
- Items should be clearly written, unambiguous and free of cultural bias
- Responses should be of equivalent length and arranged in a logical order (by length, alphabetically, etc.)
- Avoid true/false items or multiple-choice items with responses such as “All of the Above” or “None of the Above”
- Distractors (i.e., incorrect answers) should all be plausible and of equivalent plausibility
- Avoid trick questions or joke responses
- Multiple response items should indicate the number of required correct answers
- Exhibits (additional material needed to answer the question, such as a table of data or reading passage) should relate directly to the test question and not provide distracting or extraneous information
- Avoid items that may disclose answers to other items in the same test
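To make a few of these guidelines more concrete, here is a minimal sketch (in Python) of how some of them could be checked automatically. The `Item` class and the specific checks are my own illustrations rather than part of any real item-banking tool, but they show how a simple script could flag an item that forgets to state how many answers to select or that leaves its responses unordered.

```python
# A minimal, hypothetical sketch of how a few of these guidelines could be
# checked automatically. The Item class and the rule checks below are my own
# illustrations, not part of any real item-banking or testing tool.
from dataclasses import dataclass, field


@dataclass
class Item:
    stem: str                                         # the question text
    responses: list[str]                              # answer choices as authored
    correct: set[str] = field(default_factory=set)    # the correct choices

    def warnings(self) -> list[str]:
        issues = []
        # Multiple-response items should say how many answers to select
        # (a crude keyword check, purely for illustration)
        if len(self.correct) > 1 and "select" not in self.stem.lower():
            issues.append("Does not state how many answers to select.")
        # Responses should appear in a logical order (alphabetical here)
        if self.responses != sorted(self.responses):
            issues.append("Responses are not arranged in a logical order.")
        # Avoid "All of the Above" / "None of the Above" responses
        if any(r.lower() in ("all of the above", "none of the above")
               for r in self.responses):
            issues.append("Uses an 'All/None of the Above' response.")
        return issues


# The flawed binary-digit item discussed below trips two of the checks
flawed = Item("A binary number consists of what digits?",
              ["3", "1", "2", "0", "4"], correct={"0", "1"})
print(flawed.warnings())
```

Running the same checks against the corrected version of that item (shown a bit further down) produces no warnings, which is roughly the point of working from a checklist in the first place.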
To show you one simple example, the following item breaks a couple of important rules:
A binary number consists of what digits?
A. 3
B. 1
C. 2
D. 0
E. 4
First, it’s a multiple-response item (i.e., a multiple-choice question with more than one correct answer) that doesn’t tell you how many answers you need to select, and second, the responses are not arranged in a logical order. In this case, some slight editing is all that is needed to create a properly formatted version of the same item:
Select the two digits that make up a binary number.
A. 0
B. 1
C. 2
D. 3
E. 4
Now technology does add a few confounding factors to this analysis. For instance, some automated testing systems will randomize answers so that two people looking at the same question might see different choices for A, B, C, etc. (which would break the guideline for organizing answers in some logical order, such as alphabetically or by length).
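That said, randomization and logical ordering don’t have to be mutually exclusive. Here is a small, hypothetical sketch (again in Python, with invented names) of how a delivery system might shuffle responses per student while still keeping track of which letters map to the correct answers:

```python
import random


def shuffle_responses(responses, correct, seed=None):
    """Shuffle an item's responses for one student and re-map the answer key.

    `responses` is the list of choices in their authored (logical) order,
    `correct` is the set of correct choices, and `seed` lets the system
    reproduce the exact ordering a given student saw. Names are illustrative.
    """
    rng = random.Random(seed)
    shuffled = responses[:]        # copy so the authored ordering is untouched
    rng.shuffle(shuffled)
    letters = "ABCDEFGH"[:len(shuffled)]
    presented = list(zip(letters, shuffled))
    key = {letter for letter, resp in presented if resp in correct}
    return presented, key


# Two students can see the same content under different letterings
print(shuffle_responses(["0", "1", "2", "3", "4"], {"0", "1"}, seed=1))
print(shuffle_responses(["0", "1", "2", "3", "4"], {"0", "1"}, seed=2))
```

Whether that trade-off (a consistent, logical ordering for every student vs. making answer-copying harder) is worth it is exactly the kind of judgment call the next point is about.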
But this just highlights the fact that the list above should be treated as a set of recommended guidelines, rather than unbreakable commandments. And the overriding rule should be that any item type or format you’d never see on a professional test like the SAT (such as a true-false question) should be avoided in any test meant to truly measure student achievement.
While it would be easy to pull specific test questions from my various MOOC courses to highlight when these rules are broken, I’d instead like to take a look at some of the best test items I’ve been given during my Degree of Freedom courses in order to demonstrate what can be done with machine-scored tests when quality item writing is made a priority.
ger tielemans says
If you choose a good automatic test system, you can override the randomize function at the question level. (Mine does.)
If you use a good system (like mine), it follows the convention that a question with a single correct answer shows a set of radio buttons, while one where you can choose multiple alternatives shows a set of checkboxes. (IBM SAA / Common User Access, 1987)