Comparison study of judged clinical skills competence from standard setting ratings generated under different administration conditions

Advances in Health Sciences Education

Abstract

When the safety of the public is at stake, it is particularly important for licensing and credentialing agencies to use defensible standard setting methods to categorize candidates into competence categories (e.g., pass/fail). The aim of this study was to gather evidence to support a change to the Comprehensive Osteopathic Medical Licensing Examination-USA (COMLEX-USA) Level 2-Performance Evaluation standard setting design and administrative process. Twenty-two video recordings of candidates assessed for clinical competence were randomly selected from the 2014–2015 Humanistic domain test score distribution, spanning the highest to lowest quintiles of performance. Nineteen panelists convened at the same site for training and practice before generating judgments of qualified or not qualified performance for each of the twenty videos. At the end of training, one panel remained onsite to complete its judgments; the second panel was released and given one week to view the same twenty videos and complete its judgments offsite. The two one-sided tests procedure established equivalence between the panel group means at the 0.05 significance level, controlling for rater errors within each panel group. From a cost and administrative-resource perspective, the results suggest it is possible to move away from the typical design, in which panelists are sequestered onsite for the entire process, toward larger panels whose members make their judgments offsite, with little impact on judged samples of qualified performance. Standard setting designs in which panelists train together and then provide their judgments offsite yield equivalent ratings and, ultimately, similar cut scores.
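
The equivalence claim above rests on the two one-sided tests (TOST) procedure. The sketch below illustrates the general logic of TOST for comparing two independent panel means; it is not the authors' analysis (which also controlled for rater errors via Rasch-based adjustments), and the equivalence margin, panel sizes, and rating values are hypothetical assumptions introduced only for illustration.

    import numpy as np
    from scipy import stats

    def tost_equivalence(group_a, group_b, margin, alpha=0.05):
        """Two one-sided tests (TOST) for equivalence of two independent group means.

        Equivalence at level alpha is concluded when both one-sided Welch t-tests
        reject, i.e. when max(p_lower, p_upper) < alpha.
        """
        a, b = np.asarray(group_a, float), np.asarray(group_b, float)
        diff = a.mean() - b.mean()
        var_a, var_b = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
        se = np.sqrt(var_a + var_b)
        # Welch-Satterthwaite degrees of freedom
        df = (var_a + var_b) ** 2 / (var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1))
        p_lower = 1 - stats.t.cdf((diff + margin) / se, df)  # H0: diff <= -margin
        p_upper = stats.t.cdf((diff - margin) / se, df)      # H0: diff >= +margin
        p_tost = max(p_lower, p_upper)
        return diff, p_tost, p_tost < alpha

    # Hypothetical illustration: each value is one panelist's proportion of
    # "qualified" judgments across the rated videos (invented numbers).
    onsite_panel = [0.55, 0.60, 0.50, 0.65, 0.45, 0.55, 0.60, 0.50, 0.55]
    offsite_panel = [0.50, 0.60, 0.55, 0.45, 0.65, 0.55, 0.50, 0.60, 0.55, 0.50]
    diff, p, equivalent = tost_equivalence(onsite_panel, offsite_panel, margin=0.10)
    print(f"mean difference = {diff:.3f}, TOST p = {p:.3f}, equivalent at 0.05: {equivalent}")

The choice of equivalence margin is the substantive decision in any TOST analysis: it defines how large a difference between onsite and offsite panel means would still be considered practically negligible, and it must be justified before the data are examined.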



Author information

Corresponding author

Correspondence to William L. Roberts.

About this article

Cite this article

Roberts, W.L., Boulet, J. & Sandella, J. Comparison study of judged clinical skills competence from standard setting ratings generated under different administration conditions. Adv in Health Sci Educ 22, 1279–1292 (2017). https://doi.org/10.1007/s10459-017-9766-1

