Examiners and content and site: Oh My! A national organization’s investigation of score variation in large-scale performance assessments

Sebok, Stefanie S.; Roy, Marguerite; Klinger, Don A.; De Champlain, André F.

doi:10.1007/s10459-014-9547-z

Published: 28 August 2014

Volume 20, pages 581–594, (2015)
Advances in Health Sciences Education

Abstract

Examiner effects and content specificity are two well-known sources of construct-irrelevant variance that present great challenges in performance-based assessments. National medical organizations responsible for large-scale performance-based assessments face an additional challenge: they must administer qualification examinations to physician candidates at several locations and institutions. This study explores the impact of site location as a source of score variation in a large-scale national assessment used to measure the readiness of internationally educated physician candidates for residency programs. Data from the Medical Council of Canada’s National Assessment Collaboration were analyzed using hierarchical linear modeling and Rasch analyses. Consistent with previous research, problematic variance due to examiner effects and content specificity was found. Site location was also identified as a potential source of construct-irrelevant variance in examination scores.

Author information

Authors and Affiliations

Faculty of Education, Queen’s University, Kingston, Canada
Stefanie S. Sebok & Don A. Klinger
Medical Council of Canada, Ottawa, Canada
Marguerite Roy & André F. De Champlain

Corresponding author

Correspondence to Stefanie S. Sebok.

About this article

Cite this article

Sebok, S.S., Roy, M., Klinger, D.A. et al. Examiners and content and site: Oh My! A national organization’s investigation of score variation in large-scale performance assessments. Adv in Health Sci Educ 20, 581–594 (2015). https://doi.org/10.1007/s10459-014-9547-z

Received: 13 March 2014
Accepted: 14 August 2014
Published: 28 August 2014
Issue Date: August 2015
DOI: https://doi.org/10.1007/s10459-014-9547-z
