Inter-rater reliability and generalizability of patient note scores using a scoring rubric based on the USMLE Step-2 CS format

Park, Yoon Soo; Hyderi, Abbas; Bordage, Georges; Xing, Kuan; Yudkowsky, Rachel

doi:10.1007/s10459-015-9664-3

Published: 12 January 2016

Volume 21, pages 761–773, (2016)

Advances in Health Sciences Education

Yoon Soo Park¹, Abbas Hyderi², Georges Bordage¹, Kuan Xing¹ & Rachel Yudkowsky¹

Abstract

Recent changes to the patient note (PN) format of the United States Medical Licensing Examination have challenged medical schools to improve the instruction and assessment of students taking the Step-2 clinical skills examination. The purpose of this study was to gather validity evidence regarding response process and internal structure, focusing on inter-rater reliability and generalizability, to determine whether a locally developed PN scoring rubric and scoring guidelines could yield reproducible PN scores. A randomly selected subsample of historical data (post-encounter PNs from 55 of 177 medical students) was rescored by six trained faculty raters in November–December 2014. Inter-rater reliability (% exact agreement and kappa) was calculated for five standardized patient cases administered in a local graduation competency examination. Generalizability studies were conducted to examine the overall reliability. Qualitative data were collected through surveys and a rater-debriefing meeting. The overall inter-rater reliability (weighted kappa) was .79 (Documentation = .63, Differential Diagnosis = .90, Justification = .48, and Workup = .54). The majority of score variance was due to case specificity (13%) and case-task specificity (31%), indicating differences in student performance by case and by case-task interactions. Variance associated with raters and their interactions was modest (<5%). Raters felt that justification was the most difficult task to score and that having case- and level-specific scoring guidelines during training was most helpful for calibration. The overall inter-rater reliability indicates a high level of confidence in the consistency of note scores. Designs for scoring notes may optimize reliability by balancing the number of raters and cases.
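The two agreement statistics named in the abstract, % exact agreement and weighted kappa, can be illustrated with a minimal sketch. The abstract does not state the weighting scheme or the software used, so the code below assumes two raters scoring the same notes on an ordinal scale and defaults to quadratic weights; the function names and the scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def exact_agreement(r1, r2):
    """Proportion of notes on which both raters gave the same score."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def weighted_kappa(r1, r2, categories, power=2):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    power=2 gives quadratic disagreement weights (an assumption here),
    power=1 gives linear weights. `categories` lists the scale in order.
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("rating lists must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two ordered categories are required")
    n = len(r1)
    index = {c: i for i, c in enumerate(categories)}

    # Joint and marginal counts of the two raters' scores.
    joint = Counter((index[a], index[b]) for a, b in zip(r1, r2))
    marg1 = Counter(index[a] for a in r1)
    marg2 = Counter(index[b] for b in r2)

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            observed += w * joint[(i, j)] / n
            expected += w * (marg1[i] / n) * (marg2[j] / n)

    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical scores from two raters on a 1-4 rubric level scale.
    rater_a = [1, 2, 3, 4, 2, 3, 4, 1]
    rater_b = [1, 2, 3, 3, 2, 4, 4, 2]
    print("exact agreement:", exact_agreement(rater_a, rater_b))
    print("weighted kappa:", weighted_kappa(rater_a, rater_b, [1, 2, 3, 4]))
```

Weighted kappa gives partial credit for near misses, so a one-level disagreement lowers it less than a three-level disagreement, whereas exact agreement treats every mismatch the same.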

Acknowledgments

The authors thank the following faculty raters who participated in rescoring the patient notes for this study: Ananya Gangopadhyaya, MD, Nimmi Rajagopal, MD, Olga Garcia-Bedoya, MD, Alexandra Van Meter, MD, and Asra Khan, MD. The authors also thank Robert Kiser for creating an online scoring system to compile rater scores.

Author information

Authors and Affiliations

Department of Medical Education (MC 591), College of Medicine, University of Illinois at Chicago, 808 South Wood Street, 963 CMET, Chicago, IL, 60612-7309, USA
Yoon Soo Park, Georges Bordage, Kuan Xing & Rachel Yudkowsky
Department of Family Medicine (MC 785), College of Medicine, University of Illinois at Chicago, 1819 West Polk Street, 150 CMW, Chicago, IL, 60612-7309, USA
Abbas Hyderi

Corresponding author

Correspondence to Yoon Soo Park.

Ethics declarations

Conflict of interest

None.

Ethical standards

This study was approved by the institutional review board of the University of Illinois at Chicago.

About this article

Cite this article

Park, Y.S., Hyderi, A., Bordage, G. et al. Inter-rater reliability and generalizability of patient note scores using a scoring rubric based on the USMLE Step-2 CS format. Adv in Health Sci Educ 21, 761–773 (2016). https://doi.org/10.1007/s10459-015-9664-3

Received: 01 July 2015
Accepted: 22 December 2015
Published: 12 January 2016
Issue Date: October 2016
DOI: https://doi.org/10.1007/s10459-015-9664-3
