This paper aims to assess the interrater reliability of standardized patients (SPs) as they assess the clinical skills of medical students and to detect possible rating bias in SPs. The ratings received by 6 students examined in 4 clinical stations by 13 SPs were examined. Each SP contributed at least 3 and at most 10 pairwise ratings, with an average of approximately 5 ratings per SP. The standard Cohen's kappa statistic was calculated, and the distribution of scores among SPs was compared via both ANOVA and the Kruskal-Wallis H test (one-way ANOVA by ranks). Furthermore, the numbers of discrepancies between pairwise raters (showing either "positive" or "negative" bias in the rating) were analyzed using ANOVA and a χ2 goodness-of-fit test. The conventional method, which compared the statistics of the raters' kappa scores (including the prevalence-adjusted bias-adjusted kappa scores), did not reject the null hypothesis that the raters (SPs) are similar. However, the analysis of the distribution of the discrepancies among the raters revealed that the differences between raters cannot be attributed to chance, particularly when a distinction was made between their overall positive and negative bias. A strong (p < 0.001) negative bias was detected, and the SPs responsible for this bias were identified. The statistical method suggested here, which explicitly takes into account the positive and negative bias of the raters, is more sensitive than the conventional method (Cohen's kappa). Since the outliers (the biased SPs) affect the fairness of the grading of the medical students, it is important to detect any statistically significant bias in the rating and to adjust the SPs' assessments accordingly.
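To illustrate the two kinds of statistics the abstract contrasts, here is a minimal sketch in plain Python: Cohen's kappa for a pair of raters, and a Pearson χ2 goodness-of-fit statistic applied to counts of positive versus negative discrepancies. The rating data and discrepancy counts below are hypothetical placeholders, not values from the study, and this is an illustrative implementation rather than the authors' actual analysis code.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters'
    categorical ratings of the same set of encounters."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    # Observed proportion of agreement
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement by chance, from each rater's marginal frequencies
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / n ** 2
    return (po - pe) / (1 - pe)

def chi2_gof(observed, expected):
    """Pearson chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical pass/fail (1/0) ratings by two SPs of the same 8 encounters
sp_a = [1, 1, 0, 1, 0, 1, 1, 0]
sp_b = [1, 0, 0, 1, 0, 1, 0, 0]
print(round(cohens_kappa(sp_a, sp_b), 3))  # → 0.529 (moderate agreement)

# Hypothetical counts of negative vs. positive discrepancies, tested
# against a uniform (no-bias) expectation; compare the statistic to the
# chi-square critical value with df = 1 (3.84 at alpha = 0.05)
observed = [14, 6]
expected = [10, 10]
print(round(chi2_gof(observed, expected), 2))  # → 3.2
```

The point of the paper is that the second test, by separating the direction of the discrepancies, can flag biased raters even when the kappa scores alone look unremarkable.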
Research reported in this paper was supported by the National Institute of General Medical Sciences of the National Institutes of Health under linked Award Numbers RL5GM118969, TL4GM118971, and UL1GM118970. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Manciu, M., Trevino, R., Mulla, Z.D., et al. Detection of Biased Rating of Medical Students by Standardized Patients: Opportunity for Improvement. Med. Sci. Educ. 27, 497–502 (2017). https://doi.org/10.1007/s40670-017-0418-0
Keywords:
- Standardized patients
- Inter-rater agreement