Abstract
Examiner effects and content specificity are two well-known sources of construct-irrelevant variance that pose substantial challenges in performance-based assessments. National medical organizations responsible for large-scale performance-based assessments face an additional challenge because they administer qualification examinations to physician candidates at several locations and institutions. This study explores the impact of site location as a source of score variation in a large-scale national assessment used to measure the readiness of internationally educated physician candidates for residency programs. Data from the Medical Council of Canada’s National Assessment Collaboration were analyzed using hierarchical linear modeling and Rasch analyses. Consistent with previous research, problematic variance due to examiner effects and content specificity was found. Site location was also identified as a potential source of construct-irrelevant variance in examination scores.
Cite this article
Sebok, S.S., Roy, M., Klinger, D.A. et al. Examiners and content and site: Oh My! A national organization’s investigation of score variation in large-scale performance assessments. Adv in Health Sci Educ 20, 581–594 (2015). https://doi.org/10.1007/s10459-014-9547-z
DOI: https://doi.org/10.1007/s10459-014-9547-z