Comparison study of judged clinical skills competence from standard setting ratings generated under different administration conditions

Advances in Health Sciences Education

Abstract

When the safety of the public is at stake, it is particularly important for licensing and credentialing agencies to use defensible standard setting methods to categorize candidates into competence categories (e.g., pass/fail). The aim of this study was to gather evidence to support a change to the Comprehensive Osteopathic Medical Licensing Examination-USA (COMLEX-USA) Level 2-Performance Evaluation standard setting design and administrative process. Twenty-two video recordings of candidates assessed for clinical competence were randomly selected from the 2014–2015 Humanistic domain test score distribution, spanning the highest to lowest quintiles of performance. Nineteen panelists convened at the same site for training and practice before generating judgments of qualified or not qualified performance for each of the twenty videos. At the end of training, one panel remained onsite to complete its judgments; the second panel was released and given one week to view the same twenty videos and complete its judgments offsite. The two one-sided tests procedure established equivalence between the panel group means at the 0.05 significance level, controlling for rater errors within each panel group. From a cost and administrative-resource perspective, the results suggest it is possible to move away from the typical design, in which panelists are sequestered onsite for the entire process, toward larger panels whose members make their judgments offsite, with little impact on judged samples of qualified performance. Standard setting designs in which panelists train together and then provide their judgments offsite yield equivalent ratings and, ultimately, similar cut scores.
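
The equivalence claim above rests on the two one-sided tests (TOST) procedure. The sketch below illustrates the general logic of TOST for comparing two independent panel means; it is not the authors' analysis (which also controlled for rater errors via Rasch-based adjustments), and the equivalence margin, panel sizes, and rating values are hypothetical assumptions introduced only for illustration.

    import numpy as np
    from scipy import stats

    def tost_equivalence(group_a, group_b, margin, alpha=0.05):
        """Two one-sided tests (TOST) for equivalence of two independent group means.

        Equivalence at level alpha is concluded when both one-sided Welch t-tests
        reject, i.e. when max(p_lower, p_upper) < alpha.
        """
        a, b = np.asarray(group_a, float), np.asarray(group_b, float)
        diff = a.mean() - b.mean()
        var_a, var_b = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
        se = np.sqrt(var_a + var_b)
        # Welch-Satterthwaite degrees of freedom
        df = (var_a + var_b) ** 2 / (var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1))
        p_lower = 1 - stats.t.cdf((diff + margin) / se, df)  # H0: diff <= -margin
        p_upper = stats.t.cdf((diff - margin) / se, df)      # H0: diff >= +margin
        p_tost = max(p_lower, p_upper)
        return diff, p_tost, p_tost < alpha

    # Hypothetical illustration: each value is one panelist's proportion of
    # "qualified" judgments across the rated videos (invented numbers).
    onsite_panel = [0.55, 0.60, 0.50, 0.65, 0.45, 0.55, 0.60, 0.50, 0.55]
    offsite_panel = [0.50, 0.60, 0.55, 0.45, 0.65, 0.55, 0.50, 0.60, 0.55, 0.50]
    diff, p, equivalent = tost_equivalence(onsite_panel, offsite_panel, margin=0.10)
    print(f"mean difference = {diff:.3f}, TOST p = {p:.3f}, equivalent at 0.05: {equivalent}")

The choice of equivalence margin is the substantive decision in any TOST analysis: it defines how large a difference between onsite and offsite panel means would still be considered practically negligible, and it must be justified before the data are examined.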



Author information

Corresponding author

Correspondence to William L. Roberts.

About this article

Cite this article

Roberts, W.L., Boulet, J. & Sandella, J. Comparison study of judged clinical skills competence from standard setting ratings generated under different administration conditions. Adv in Health Sci Educ 22, 1279–1292 (2017). https://doi.org/10.1007/s10459-017-9766-1

