Item analysis is a hot-button topic for social conversation (okay, maybe just for some people). I thought it might be useful to talk about Classical Test Theory (CTT) and item analysis in a series of blog posts over the next few weeks. This first post will focus on some of the theory and background of CTT. In subsequent posts I will lay out a high-level overview of item analysis and then drill down into the details. Other testing theories include Item Response Theory (IRT), which might be fun to talk about in another post (at least fun for me).
CTT is a body of theory and research regarding psychological testing that predicts/explains the difficulty of questions, provides insight into the reliability of assessment scores, and helps us represent what examinees know and can do. In a similar manner to theories regarding weather prediction or ocean current flow, CTT provides a theoretical framework for understanding educational and psychological measurement. The essential basis of CTT is that many questions combine to produce a measurement (assessment score) representing what a test taker knows and can do.
CTT has been around a long time (since the early 20th century) and is probably the most widely used theory in educational and psychological testing. CTT works well for most assessment applications because it can work with smaller sample sizes (e.g., 100 or fewer) and because its statistics are relatively simple to compute and understand.
The general CTT model is based on the notion that the observed score test takers obtain on an assessment is composed of a theoretical, unmeasurable “true score” plus error. Just as most measurement devices have some error inherent in their measurement (e.g., a thermometer may be accurate to within 0.1 degree 9 times out of 10), so too do assessment scores. For example, if a participant’s observed score (the score reported back to them) on an exam was 86%, their “true score” may actually lie somewhere between 80% and 92%.
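In symbols, the model is usually written as below. The letters X, T, and E are the conventional textbook notation rather than anything introduced in this post, so treat them as labels chosen for illustration.

```latex
% X = observed score, T = true score, E = error of measurement
X = T + E
% For the exam example: 86 = T + E, where neither T nor E is known on its own
```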
Measurement error can be estimated and relates back to reliability: greater assessment score reliability means less error of measurement. Why does error relate so directly to reliability? Well, reliability has to do with measurement consistency. If a participant could take the same assessment an infinite number of times with no memory effects, the average of all the scores they obtained would be their true score. The more reliable the measurement, the less the scores would vary each time the participant took that assessment over eternity. (This would be a great place for an afterlife joke but I digress…)
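To make the link between reliability and error concrete, here is a minimal sketch using the common textbook estimate of the standard error of measurement (SEM), which is the score standard deviation times the square root of one minus reliability. The numbers are assumptions picked for illustration: a score standard deviation of 10 points and a reliability of 0.91 are not from any real exam, and the function names are made up for this post. Centering the band on the observed score is a common simplification, and the simulated retakes assume normally distributed error.

```python
import math
import random
import statistics


def standard_error_of_measurement(score_sd, reliability):
    """Textbook CTT estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed, sem, z=1.96):
    """Approximate band around an observed score (z=1.96 for roughly 95%)."""
    return observed - z * sem, observed + z * sem


def simulate_retakes(true_score, sem, n_retakes, seed=0):
    """Pretend a participant retakes the test many times with no memory effects.

    Each observed score is the true score plus normally distributed error.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, sem) for _ in range(n_retakes)]


if __name__ == "__main__":
    # Assumed values for illustration only.
    sem = standard_error_of_measurement(score_sd=10.0, reliability=0.91)
    low, high = score_band(observed=86.0, sem=sem)
    print(f"SEM is about {sem:.1f} points")
    print(f"Band around 86%: roughly {low:.1f}% to {high:.1f}%")

    # Higher reliability means retake scores cluster more tightly.
    for reliability in (0.70, 0.91, 0.99):
        sem_r = standard_error_of_measurement(10.0, reliability)
        scores = simulate_retakes(true_score=86.0, sem=sem_r, n_retakes=10_000)
        print(
            f"reliability={reliability:.2f}: "
            f"mean={statistics.mean(scores):.2f}, "
            f"spread (SD)={statistics.stdev(scores):.2f}"
        )
```

With these assumed inputs the band comes out close to the 80% to 92% range used in the exam example, and the average of the simulated retakes settles near the true score while the spread shrinks as reliability goes up.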
For a more detailed overview of CTT that won’t make your lobes fall off, try Chapter 5 in Dr. Theresa Kline’s book, “Psychological Testing: A Practical Approach to Design and Evaluation.”
In my next post I will provide a high-level picture of item analysis to continue this conversation.