We have already discussed criterion, construct, and content validity, which are the pillars of validity in an assessment. We have also talked about the newer argument-based approach to validity and the more abstract concept of face validity.
While all of these concepts relate to the validity of the assessment instrument, we must also consider the validity of the research used in assessment management and the validity of the research that an assessment or survey supports.
In their 1963 book, Experimental and Quasi-Experimental Designs for Research, Donald Campbell and Julian Stanley describe two research design concepts: internal validity and external validity.
Internal validity is the idea that observed differences in a dependent variable (e.g., test score) are directly related to an independent variable (e.g., participant's true ability). External validity refers to how generalizable our results are. For example, would we expect the same results with other samples of participants, other research conditions, or other operational conditions?
The item analysis report, which provides statistics about the difficulty and discrimination of an item, is an example of research that is used for assessment management. Assessment managers often use these statistics to decide if an unscored field test item is fit to become a scored operational item on an assessment.
When we use the item analysis report to decide if the item is worth keeping, we are conducting research. The internal validity of the research may be threatened if something other than participant ability is affecting the item statistics.
For example, I recall a company that field tested two new test forms, and later found out that one participant had been trying to sabotage the statistics by persuading others to purposefully get a low score on the assessment. Fortunately, this person’s online campaign was ineffective, but it is a good example of an event that could have seriously disrupted the internal validity of the item analysis research.
When considering external validity, the most common threat is a non-representative sample. When field testing items for the first time, some assessment managers will find that volunteer participants are not representative of the general population of participants.
In some of my past experiences, I have had samples of field test volunteers who were either high-ability participants or who were planning to teach a test prep workshop. We would not expect the item statistics from this sample to remain stable when the items go live in the general population.
So how can we control these threats? Try using separate groups of participants so you can compare results. Be consistent in how assessments are administered, and when items are not administered to all participants, make sure they are randomly assigned. Document your sample to demonstrate that it is representative of your participant population, and when possible, try to replicate your findings.