Posted By Doug Peterson
In this, the 10th and final installment of the Test Design and Delivery series, we take a look at evaluating the test. Statistical analysis improves as the number of test takers goes up, but data from even a few attempts can provide useful information. In most cases, we recommend performing analysis on data from at least 100 participants; data from 250 or more is considered more trustworthy.
Analysis falls into two categories: item statistics and analysis (the performance of an individual item), and test analysis (the performance of the test as a whole). Questionmark provides both of these analyses in our Reports and Analytics suites.
Item statistics provide information on things like how many times an item has been presented and how many times each choice has been selected. This information can point out a number of problems:
- An item that has been presented a lot may need to be retired. There is no hard and fast number as far as how many presentations is too many, but items on a high-stakes test should be changed fairly frequently.
- If the majority of test-takers are getting the question wrong but they are all selecting the same choice, the wrong choice may be flagged as the correct answer, or the training might be teaching the topic incorrectly.
- If no choice is being selected a majority of the time, it may indicate that the test-takers are guessing, which could in turn indicate a problem with the training. It could also indicate that no choice is completely correct.
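As a rough illustration of the checks above, here is a minimal Python sketch that tallies how often each choice was selected for one item and raises the two warning signs just described. The function name, the input format (a plain list of selected choice labels) and the flag names are all hypothetical; this is not how Questionmark's reports are implemented.

```python
from collections import Counter


def choice_summary(responses, correct_choice):
    """Summarise choice selections for a single item.

    responses: list of choice labels picked by test takers (assumed format).
    correct_choice: the label currently keyed as the correct answer.
    """
    if not responses:
        raise ValueError("the item has not been presented yet")

    counts = Counter(responses)
    presented = len(responses)
    top_choice, top_count = counts.most_common(1)[0]
    top_share = top_count / presented

    flags = []
    if top_choice != correct_choice and top_share > 0.5:
        # Most test takers agree on the same "wrong" choice:
        # check the answer key and what the training teaches.
        flags.append("possible_miskey")
    if top_share <= 0.5:
        # No single choice reaches a majority: possible guessing,
        # or no choice is completely correct.
        flags.append("no_majority_choice")

    return {"presented": presented, "counts": dict(counts), "flags": flags}


summary = choice_summary(["B"] * 7 + ["A"] * 2 + ["C"], correct_choice="A")
print(summary["flags"])
```

A flag here is only a prompt to review the item by hand; it does not say which of the possible causes applies.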
Item analysis typically provides two key pieces of information: the Difficulty Index and the Point-Biserial Correlation.
- Difficulty Index: P value = % who answered correctly
  - Too high = too easy
  - Too low = too hard, confusing or misleading, or a problem with content or instruction
- Point-Biserial Correlation: how well the item discriminated between those who did well on the exam and those who did not
  - Positive value = those who got the item correct also did well on the exam, and those who got the item wrong also did poorly on the exam
  - Negative value = those who did well on the test got the item wrong, and those who did poorly on the test got the item right
  - +0.10 or above is typically required to keep an item
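To make the two statistics concrete, here is a small Python sketch that computes them for one item. It assumes each item is scored 1 (correct) or 0 (incorrect) and uses the standard textbook point-biserial formula with the population standard deviation of total scores. The function names and sample data are illustrative only, and the total score here includes the item itself, which some analyses prefer to exclude.

```python
from math import sqrt
from statistics import mean, pstdev


def difficulty_index(item_scores):
    """P value: proportion of test takers who answered the item correctly."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total test score.

    Returns None when the statistic is undefined (everyone got the item
    right or wrong, or every total score is identical).
    """
    p = mean(item_scores)
    spread = pstdev(total_scores)
    if p in (0, 1) or spread == 0:
        return None

    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    return (mean(right) - mean(wrong)) / spread * sqrt(p * (1 - p))


# Made-up data: five test takers, one item, and their total test scores.
item = [1, 1, 1, 0, 0]
totals = [9, 8, 7, 4, 2]

p_value = difficulty_index(item)
r_pb = point_biserial(item, totals)
keep_item = r_pb is not None and r_pb >= 0.10  # rule of thumb from the list above
print(p_value, round(r_pb, 2), keep_item)
```

In this made-up data the test takers who got the item right are also the ones with the highest totals, so the correlation comes out strongly positive.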
Test analysis typically comes down to determining a Reliability Coefficient. In other words, does the test measure knowledge consistently – does it produce similar results under consistent conditions? (Please note that this has nothing to do with validity. Reliability does not address whether or not the assessment tests what it is supposed to be testing. Reliability only indicates that the assessment will return the same results consistently, given the same conditions.)
- Reliability Coefficient: range of 0 – 1.00
  - Acceptable value depends on consequences of testing error
    - If failing means having to take some training again, a lower value might be acceptable
    - If failing means the health and safety of coworkers might be in jeopardy, a high value is required
There are a number of different types of consistency:
- Test – Retest: repeatability of test scores with the passage of time
- Alternate / Parallel Form: consistency of score across two or more forms by same test taker
- Inter-Rater: consistency of test score when rated by different raters
- Internal Consistency: extent to which items on a test measure the same thing
  - Most common: Kuder Richardson-20 (KR-20) or Coefficient Alpha
  - Items must be single answer (right/wrong)
  - May be low if test measures several different, unrelated objectives
  - Low value can also indicate many very easy or hard items, poorly written items that do not discriminate well, or items that do not test the proper content
- Mastery Classification Consistency
  - Criterion-referenced tests
  - Not affected by items measuring unrelated objectives
  - 3 common measures:
    - Phi coefficient
    - Agreement coefficient
  - For more information, see Criterion-Referenced Test Development by Shrock and Coscarelli
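For readers who want to see the arithmetic behind a few of these measures, the Python sketch below computes KR-20 from a matrix of right/wrong item scores, and the agreement and phi coefficients from master/non-master decisions on two administrations of a test. These are the standard textbook formulas, not necessarily what any particular reporting tool does. The function names, input layouts and sample data are assumptions made for illustration, and KR-20 here uses the population variance of total scores (some sources use the sample variance instead).

```python
from math import sqrt
from statistics import mean, pvariance


def kr20(score_matrix):
    """KR-20 for rows of 0/1 item scores (one row per test taker)."""
    k = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    total_variance = pvariance(totals)  # assumption: population variance
    if k < 2 or total_variance == 0:
        return None  # undefined for a single item or identical totals

    pq_sum = 0.0
    for j in range(k):
        p = mean([row[j] for row in score_matrix])  # item difficulty
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / total_variance)


def classification_consistency(first, second):
    """Agreement and phi coefficients for two sets of mastery decisions.

    first, second: lists of booleans (True = classified as master) for
    the same test takers on two administrations.
    """
    a = sum(1 for x, y in zip(first, second) if x and y)          # master both times
    b = sum(1 for x, y in zip(first, second) if x and not y)      # master first time only
    c = sum(1 for x, y in zip(first, second) if not x and y)      # master second time only
    d = sum(1 for x, y in zip(first, second) if not x and not y)  # non-master both times

    agreement = (a + d) / (a + b + c + d)
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denominator if denominator else None
    return agreement, phi


# Made-up data: four test takers on a three-item test.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(scores))

# Made-up data: eight test takers classified on two administrations.
first = [True, True, True, False, False, False, True, False]
second = [True, True, False, False, False, True, True, False]
print(classification_consistency(first, second))
```

Samples this small are for demonstration only; as noted at the start of this post, analysis is more trustworthy with data from at least 100 participants.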
Doug will share these and other best practices for test design and delivery at the Questionmark Users Conference in Baltimore, March 3-6. The program includes an optional pre-conference workshop on Criterion-Referenced Test Development led by Sharon Shrock and Bill Coscarelli. Click here for conference and workshop registration.