Posted by Austin Fossey
A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.
At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.
But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.
Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:
- Remove items that are not relevant to the domain.
- Ensure that the blueprint is covered.
- Check that items are assigned to the correct topic.
If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.
For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:
“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)
Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well a final content review of a draft test form before it becomes operational.
In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina also made an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if notified in advance that some items might not be counted (as one might do in field testing), students might not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.
This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign an item to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.
We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.