Advances in Health Sciences Education, Volume 23, Issue 1, pp 217–232

Why assessment in medical education needs a solid foundation in modern test theory

  • Stefan K. Schauber
  • Martin Hecht
  • Zineb M. Nouns


Despite the frequent use of state-of-the-art psychometric models in medical education, a growing body of literature questions their usefulness in the assessment of medical competence. Essentially, a number of authors have raised doubts about the appropriateness of psychometric models as a guiding framework for securing and refining current approaches to the assessment of medical competence. A phenomenon specific to this controversy is case specificity: broadly speaking, the finding that performance is unstable across clinical cases, tasks, or problems. Because stability of performance is a central assumption in psychometric models, case specificity may limit their applicability, and it has probably supplied critiques of psychometrics with a substantial amount of apparent empirical support. This article aims to explain the fundamental ideas employed in psychometric theory and how they might be problematic in the context of assessing medical competence. We further aim to show why and how some critiques hold not for the field of psychometrics as a whole, but only for specific psychometric approaches. Accordingly, we highlight approaches that, from our perspective, offer promising possibilities when applied to the assessment of medical competence. In conclusion, we advocate a more differentiated view of psychometric models and their usage.


Keywords: Measurement error · Assessment · Medical competence · Post-psychometric era · Case specificity · Latent variables
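The instability the abstract describes can be made concrete with a generalizability-theory style variance decomposition for a crossed persons × cases design. The sketch below simulates scores under hypothetical variance components (all numbers are assumptions for illustration, not data from the article), chosen so that the person × case interaction dwarfs the person variance — the signature of case specificity — and then recovers the components with standard ANOVA expected mean squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scores for a persons-by-cases (p x c) design, one score per cell.
# Hypothetical variance components mimicking "case specificity": the
# person-by-case interaction is much larger than the person variance.
n_p, n_c = 200, 10
sd_p, sd_c, sd_pc = 0.5, 0.4, 1.0   # person, case, person-x-case (incl. error)

person = rng.normal(0, sd_p, (n_p, 1))
case = rng.normal(0, sd_c, (1, n_c))
interaction = rng.normal(0, sd_pc, (n_p, n_c))
scores = person + case + interaction

# ANOVA-style variance component estimates for the crossed p x c design
grand = scores.mean()
ms_p = n_c * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_c = n_p * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_c - 1)
resid = (scores - scores.mean(axis=1, keepdims=True)
         - scores.mean(axis=0, keepdims=True) + grand)
ms_pc = (resid ** 2).sum() / ((n_p - 1) * (n_c - 1))

var_pc = ms_pc                        # interaction, confounded with error
var_p = max((ms_p - ms_pc) / n_c, 0)  # person (the "true" competence signal)
var_c = max((ms_c - ms_pc) / n_p, 0)  # case difficulty

# Generalizability coefficient for a mean score over n_c cases
g_coef = var_p / (var_p + var_pc / n_c)
print(f"var_p={var_p:.2f}  var_c={var_c:.2f}  var_pc={var_pc:.2f}  G={g_coef:.2f}")
```

Under these assumed components, the interaction estimate far exceeds the person estimate, yet averaging over enough cases can still yield an acceptable generalizability coefficient — which is why case specificity constrains some psychometric approaches more than the framework as a whole.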



Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  • Stefan K. Schauber (1)
  • Martin Hecht (2)
  • Zineb M. Nouns (3)

  1. Centre for Educational Measurement at the University of Oslo (CEMO) and Centre for Health Sciences Education, University of Oslo, Oslo, Norway
  2. Department of Psychology, Humboldt–Universität zu Berlin, Berlin, Germany
  3. Institute of Medical Education, Faculty of Medicine, University of Bern, Bern, Switzerland
