In this study, we compared estimates of conditional reliability for 32 Swiss high-stakes medical exams under both Classical Test Theory (CTT) and Item Response Theory (IRT), with a special focus on the cut score and on factors influencing conditional reliability at the cut score. The first research question was whether previous findings regarding the areas of high and low precision in CTT and IRT could be replicated. As anticipated, we found that conditional reliability behaves in an inverse manner in the two theories. The second research question concerned how conditional reliability at the cut score compares between the two theories. At the cut score, IRT showed higher conditional reliability than CTT, and the difference was statistically significant. Third, we analyzed whether the range of examinees’ performance, year of study and number of items influence conditional reliability in the two theories. We found that conditional reliability decreased with a narrower observed range of examinees’ scores and with a smaller number of items, in both IRT and CTT. The range of scores and year of study were highly correlated (r = −0.78). The decrease in the magnitude of the estimates was more pronounced in CTT. The medical school and the percentage of multiple true-false (MTF) items did not influence the results.
As expected, we found differences between conditional reliability as estimated in CTT and IRT. Across exams, conditional reliability was at its maximum for the very high and the very low scores in CTT, whereas in IRT, conditional reliability was at its minimum for the very high scores.
In contrast to previous findings, we did not find extremely low reliability (i.e. <0.70) for the very low scores in IRT. This might be due to the restricted range of examinees’ scores, as all candidates were well prepared, with only a small percentage failing the exam and no examinees receiving zero points. Furthermore, measurement precision in IRT depends on the characteristics of the items included in the test. All exams in this study included a number of easy items, which provide information (and thereby measurement precision) at the lower end of the ability continuum. Similar to Raju et al., we found comparably low conditional reliability in IRT for the very high scores.
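The inverse precision patterns described above can be illustrated with a small sketch. Under the binomial-error model, the CTT conditional SEM vanishes at the raw-score extremes, whereas Rasch test information, and hence conditional reliability, drops toward extreme abilities. All numbers here (test length, difficulty spread, ability values) are hypothetical and chosen only for illustration:

```python
import math

def ctt_csem(x, k):
    """Binomial-error conditional SEM for raw score x on a k-item test."""
    return math.sqrt(x * (k - x) / (k - 1))

def irt_conditional_reliability(theta, difficulties):
    """Rasch conditional reliability I/(I + 1), assuming unit trait variance."""
    info = 0.0
    for b in difficulties:
        p = 1 / (1 + math.exp(-(theta - b)))
        info += p * (1 - p)  # Rasch item information
    return info / (info + 1)

k = 100
bs = [-2 + 4 * j / (k - 1) for j in range(k)]  # difficulties spread over [-2, 2]

# CTT: conditional error is zero at the score extremes, largest mid-range
print([round(ctt_csem(x, k), 2) for x in (0, 50, 100)])
# IRT: precision peaks mid-range and falls off toward extreme abilities
print([round(irt_conditional_reliability(t, bs), 2) for t in (-4.0, 0.0, 4.0)])
```

Because conditional error and conditional reliability move in opposite directions, the two frameworks locate their best precision at opposite ends of the score range, exactly as observed in the exams analyzed here.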
Conditional reliability at the cut score
At the cut score, estimates of conditional reliability were higher in IRT than in CTT, a difference that was statistically significant. Indeed, in IRT, reliability at the cut score was above 0.8 for 97% of the exams, whereas this was the case for only 30% of the exams under a CTT framework. This result was to be expected, since cut scores lay, on average, at a percentage-correct score of 56.7%, and, as delineated above, we observed the highest estimates of conditional reliability for IRT in exactly this range of test scores. This means that, depending on the theory applied, rather different conclusions might be drawn on whether a sufficient level of measurement precision for making defensible pass-fail decisions has been reached. This finding might also have relevant practical implications, which will be addressed below.
With regard to influencing variables, we analyzed the range of examinees’ performance, year of study and number of items. We found that the range of examinees’ performance and year of study were highly correlated (r = −0.78), which indicates that cohorts indeed become more homogeneous as they progress through their studies. The smaller the range of examinees’ performance, the lower the measurement precision at the cut score. This effect was more pronounced in CTT. The finding is in line with the literature, which considers estimates in IRT to be independent of sample characteristics, whereas in CTT, sample characteristics affect test statistics. In CTT, conditional reliability at the cut score fell as low as 0.56 for very homogeneous groups. The second analyzed variable was the number of items, which also showed a significant influence on conditional reliability at the cut score. In both theories, a higher number of items led to higher conditional reliability at the cut score.
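The effect of test length follows directly from the additivity of test information over items: doubling the items (roughly) doubles the information at the cut score and therefore raises conditional reliability there. A minimal Rasch sketch, with a hypothetical cut value and hypothetical item difficulties:

```python
import math

def conditional_reliability(theta, difficulties):
    # Rasch: test information is the sum of p*(1-p) over items;
    # conditional reliability = I/(I+1), assuming unit latent-trait variance
    info = 0.0
    for b in difficulties:
        p = 1 / (1 + math.exp(-(theta - b)))
        info += p * (1 - p)
    return info / (info + 1)

theta_cut = 0.3                               # hypothetical cut on the theta scale
bs = [-1.5 + 3 * j / 59 for j in range(60)]   # hypothetical 60-item exam

short_rel = conditional_reliability(theta_cut, bs)
long_rel = conditional_reliability(theta_cut, bs + bs)  # same items, test doubled
print(round(short_rel, 3), round(long_rel, 3))  # reliability rises with test length
```

Note that because reliability is a saturating function of information, each additional item yields diminishing returns once conditional reliability is already high.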
We included the medical school and the percentage of MTF items as control variables. Neither variable affected the results. This shows that results are comparable across the three schools and might therefore also be transferable to other medical schools. The included exams consisted of both Type A and MTF items. MTF items are not the most commonly used item type, and we showed that the percentage of MTF items included did not influence the results. However, this percentage ranged between 18.97% and 53.33%, and none of the included exams consisted solely of Type A items. Nevertheless, results regarding the distribution of conditional reliability were similar to those of Raju et al., who used ‘dichotomously scored multiple choice items’. We therefore assume that results would be similar for exams consisting of Type A items only, although further research on this topic is needed.
To our knowledge, this is the first study to analyze conditional reliability in medical education assessment as well as potential influencing factors. Moreover, the study included a large sample of high-stakes medical education assessments with content-based cut scores and high-quality control and compared these aspects in two relevant psychometric theories. The sample included exams conducted at three different Swiss medical schools and represented all years of study.
The study included 32 high-stakes medical education exams. As all of these exams were end-of-term assessments with the aim of establishing minimum competency, the assessments had similar characteristics. All cut scores were established in a content-based manner and clustered around 55%. All exams included large numbers of items. The results might differ for exams with small samples or different cut scores.
Discussions about which theory to use in medical education assessment are still ongoing. Various studies comparing the practical implications of IRT and CTT found that many indices such as item difficulty, discrimination, global reliability and estimates of examinees’ ability are highly correlated [14, 31, 32, 33, 34]. In this study, however, we demonstrated that regarding the concept of measurement precision, there is a noteworthy difference between IRT and CTT in terms of estimates of conditional reliability at the cut score. In addition, our results highlight that conditional reliability in IRT is more consistent across exams than in CTT. In particular, estimates based on IRT were less affected by decreasing between-person differences.
The finding that IRT and CTT lead to rather different estimates of conditional reliability at the cut score raises the question of which theory should be used under which conditions. While a thorough discussion of this topic is beyond the scope of the present paper, we argue that choosing a psychometric approach merely based on which provides higher estimates would be a dubious practice. However, we believe that IRT provides a number of important features that do not easily translate into CTT. We will briefly discuss three noteworthy features of IRT below.
First, an intriguing feature of IRT is that it readily provides the basis for criterion-referenced interpretations of test scores: because items and persons are explicitly linked to each other, the likelihood of answering an item correctly is a direct function of characteristics of the item and the examinee’s ability [14, 35]. As the aim of most exams in competency-based medical education is to ensure minimal ability, criterion-based standard setting is commonly used. Here, IRT offers a good fit for medical education assessments. Second, from a more technical perspective, IRT can be used for analyzing categorical data, which constitute the most common type of data in medical education assessment, as items are mostly answered either correctly or incorrectly. Third, from a conceptual point of view, IRT might be a more adequate fit for modeling the response process in typical clinical scenarios, since it conceives of the relation between ability and success on an item as an inherently stochastic process. This is an important conceptual feature, since more recent accounts of diagnostic inference and decision-making argue for the ‘probabilistic nature of diagnostic inference’ and describe the physician as being situated in a probabilistic environment. If such a probabilistic environment can legitimately be assumed, methods developed within IRT may be a theoretically appropriate fit for modeling the process of responding to tasks and items in medical education assessments. While the question of how and why to employ a specific psychometric framework warrants further debate, we believe there are a number of reasonable arguments for opting for an IRT framework for typical medical education assessments, where minimal competency is crucial and criterion-referenced standard setting is applied.
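The stochastic view of the response process is exactly what an IRT model encodes: given an ability θ and an item difficulty b, success on the item is a Bernoulli draw, not a deterministic outcome. A minimal Rasch simulation, with hypothetical ability and difficulty values, makes this concrete:

```python
import math
import random

def p_correct(theta, b):
    # Rasch model: the probability of a correct response depends only on theta - b
    return 1 / (1 + math.exp(-(theta - b)))

random.seed(7)
theta = 0.5                          # hypothetical examinee ability
bs = [-1.0, -0.5, 0.0, 0.5, 1.0]     # hypothetical item difficulties

# Two attempts by the same examinee on the same items can differ,
# because the response process itself is modeled as probabilistic.
for attempt in (1, 2):
    responses = [int(random.random() < p_correct(theta, b)) for b in bs]
    print(attempt, responses)
```

When θ equals b, the model assigns a success probability of exactly 0.5, which mirrors the probabilistic environment described in the accounts of diagnostic inference cited above.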
Using IRT, and thus conditional reliability in IRT, to ensure the measurement precision of pass-fail decisions may have practical implications for quality assurance and assessment design. As shown in our study, the number of items influences conditional reliability at the cut score, and even exams with a small number of items showed high conditional reliability (>0.8) in IRT. These findings indicate that the concept of conditional reliability in IRT could inform exam design, for example by allowing for a smaller number of items where the blueprint permits. In terms of quality assurance, tests could be designed to comprise mainly items that offer relevant information at the cut score. Thus, conditional reliability at the cut score could be increased while the overall number of items is reduced.
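This design idea can be sketched as follows: rank a hypothetical item pool by Rasch information at the cut score and keep only the most informative items, trading breadth of coverage for precision where the pass-fail decision is made. All values (pool size, difficulty spread, cut value) are invented for illustration:

```python
import math

def item_info(theta, b):
    # Rasch item information at ability theta: p * (1 - p)
    p = 1 / (1 + math.exp(-(theta - b)))
    return p * (1 - p)

def rel_at_cut(theta, difficulties):
    info = sum(item_info(theta, b) for b in difficulties)
    return info / (info + 1)   # assumes unit latent-trait variance

theta_cut = 0.3                                 # hypothetical cut on the theta scale
pool = [-3 + 6 * j / 119 for j in range(120)]   # hypothetical 120-item pool

# Targeted exam: the 60 pool items that are most informative at the cut score
targeted = sorted(pool, key=lambda b: -item_info(theta_cut, b))[:60]
# Broad exam: every other pool item, spanning the whole difficulty range
broad = pool[::2]

print(round(rel_at_cut(theta_cut, targeted), 3),
      round(rel_at_cut(theta_cut, broad), 3))
```

With equal test lengths, the targeted exam achieves higher conditional reliability at the cut score than the broad-range exam; in practice such targeting would of course have to be balanced against blueprint coverage requirements.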