Detection of Biased Rating of Medical Students by Standardized Patients: Opportunity for Improvement


This paper aims to assess the interrater reliability of standardized patients (SPs) as they assess the clinical skills of medical students and to detect possible rating bias in SPs. The ratings received by 6 students examined in 4 clinical stations by 13 SPs were examined. Each SP contributed at least 3 and at most 10 pairwise ratings, with an average of approximately 5 ratings per SP. The standard Cohen’s kappa statistic was calculated, and the distribution of scores among SPs was compared via both ANOVA and the Kruskal-Wallis H test (one-way ANOVA by ranks). Furthermore, the numbers of discrepancies between pairwise raters (showing either “positive” or “negative” bias in the rating) were analyzed using ANOVA and a χ² goodness-of-fit test. The conventional method, which compared the statistics of the raters’ kappa scores (including the prevalence-adjusted bias-adjusted kappa scores), did not reject the null hypothesis that the raters (SPs) are similar. However, the analysis of the distribution of the discrepancies among the raters revealed that the differences between raters cannot be attributed to chance, particularly when a distinction was made between their overall positive and negative bias. A strong (p < 0.001) negative bias was detected, and the SPs responsible for this bias were identified. The statistical method suggested here, which explicitly takes into account the positive and the negative bias of the raters, is more sensitive than the conventional method (Cohen’s kappa). Since the outliers (the biased SPs) affect the fairness of the grading of the medical students, it is important to detect any statistically significant bias in the rating and to adjust the SPs’ assessments accordingly.
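The statistics named in the abstract — Cohen’s kappa for pairwise agreement, the prevalence-adjusted bias-adjusted kappa (PABAK) for binary ratings, and a χ² goodness-of-fit test on per-rater discrepancy counts — can be sketched in a few lines of plain Python. The ratings and discrepancy counts below are hypothetical illustrations, not the study’s data:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n       # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / n ** 2       # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

def pabak(r1, r2):
    """Prevalence-adjusted bias-adjusted kappa for binary (pass/fail) ratings."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    return 2 * p_o - 1

def chi2_gof(observed, expected):
    """Chi-square goodness-of-fit statistic; compare against a chi-square
    table with len(observed) - 1 degrees of freedom."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical pass/fail ratings of six students by two SPs.
sp_a = [1, 1, 0, 1, 0, 1]
sp_b = [1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(sp_a, sp_b), 3))   # 0.667
print(round(pabak(sp_a, sp_b), 3))          # 0.667

# Hypothetical counts of negative-bias discrepancies per SP; under the
# null hypothesis of no rater effect, the counts would be uniform.
neg_bias = [8, 2, 2, 2, 2]
expected = [sum(neg_bias) / len(neg_bias)] * len(neg_bias)
print(chi2_gof(neg_bias, expected))         # 9.0
```

The point of the paper’s approach is visible in the last test: tabulating signed discrepancies per rater and testing them against a uniform expectation can flag an outlying rater even when summary kappa statistics look unremarkable.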


Fig. 1
Fig. 2


  1. Hawkins RE, Swanson DB, Dillon GF, Clauser BE, King AM, Scoles PV, et al. The introduction of clinical skills assessment into the United States medical licensing examination (USMLE): description of USMLE step 2 clinical skills (CS). J Med Licensure Discipline. 2005;91:21–5.

  2. Dillon GF, Boulet JR, Hawkins RE, Swanson DB. Simulations in the United States medical licensing examination (USMLE). Qual Saf Health Care. 2004;13(Suppl 1):141–5. doi:10.1136/qshc.2004.010025.

  3. 2015 National Board of Medical Examiners (NBME) annual report.

  4. Van der Vleuten CPM, Swanson DB. Assessment of clinical skills with standardized patients: state of the art. Teach Learn Med. 1990;2:58–76.

  5. Stillman P, Swanson D, Regan MB, et al. Assessment of clinical skills of residents utilizing standardized patients: a follow-up study and recommendations for application. Ann Intern Med. 1991;114:393–401.

  6. Epstein RM. Assessment in medical education. N Engl J Med. 2007;356(4):387–96. doi:10.1056/NEJMra054784.

  7. Fiscella K, Franks P, Srinivasan M, Kravitz RL, Epstein R. Ratings of physician communication by real and standardized patients. Ann Fam Med. 2007;5(2):151–8. doi:10.1370/afm.643.

  8. Dabrh AM, Murad MH, Newcomb RD, Buchta WG, Steffen MW, Wang Z, et al. Proficiency in identifying, managing and communicating medical errors: feasibility and validity study assessing two core competencies. BMC Med Educ. 2016;16(1):233. doi:10.1186/s12909-016-0755-5.

  9. Szklo M, Nieto FJ. Epidemiology beyond the basics. Gaithersburg: Aspen Publishers, Inc.; 2000.

  10. Vierkant RA. A SAS® macro for calculating bootstrapped confidence intervals about a kappa coefficient. Available at Accessed 27 Oct 2016.

  11. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46(5):423–9. doi:10.1016/0895-4356(93)90018-V.

  12. Colliver JA, Morrison LJ, Markwell SJ, Verhulst SJ, Steward DE, Dawson-Saunders E, et al. Three studies of the effect of multiple standardized patients on intercase reliability of five standardized-patient examinations. Teach Learn Med. 1990;2(4):237–45.

  13. Setyonugroho W, Kennedy KM, Kropmans TJ. Reliability and validity of OSCE checklists used to assess the communication skills of undergraduate medical students: a systematic review. Patient Educ Couns. 2015;98:1482–91.



Research reported in this paper was supported by the National Institute of General Medical Sciences of the National Institutes of Health under linked Award Numbers RL5GM118969, TL4GM118971, and UL1GM118970. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information



Corresponding author

Correspondence to Marian Manciu.


About this article


Cite this article

Manciu, M., Trevino, R., Mulla, Z.D. et al. Detection of Biased Rating of Medical Students by Standardized Patients: Opportunity for Improvement. Med Sci Educ. 27, 497–502 (2017).



Keywords

  • Standardized patients
  • Inter-rater agreement