Advances in Health Sciences Education, Volume 20, Issue 5, pp 1263–1289

Constructing and evaluating a validity argument for the final-year ward simulation exercise

  • Hettie Till
  • Jean Ker
  • Carol Myford
  • Kevin Stirling
  • Gary Mires


Abstract

The authors report final-year ward simulation data from the University of Dundee Medical School. Faculty who designed this assessment intend for the final score to represent an individual senior medical student’s level of clinical performance. The results are included in each student’s portfolio as one source of evidence of the student’s capability as a practitioner, professional, and scholar. Our purpose in conducting this study was to illustrate how assessment designers who are creating assessments to evaluate clinical performance might develop propositions and then collect and examine various sources of evidence to construct and evaluate a validity argument. The data were from all 154 medical students who were in their final year of study at the University of Dundee Medical School in the 2010–2011 academic year. To the best of our knowledge, this is the first report on an analysis of senior medical students’ clinical performance while they were taking responsibility for the management of a simulated ward. Using multi-facet Rasch measurement and a generalizability theory approach, we examined various sources of validity evidence that the medical school faculty have gathered for a set of six propositions needed to support their use of scores as measures of students’ clinical ability. Based on our analysis, the propositions appear to be sound, the evidence seems to support the proposed score interpretation, and, given the body of evidence collected thus far, the intended interpretation seems defensible.
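The abstract names two analytic approaches: generalizability (G) theory, which partitions score variance into components attributable to students, raters, and their interaction, and multi-facet Rasch measurement (MFRM), which models the probability of each rating category from student ability, rater severity, and item difficulty. The sketch below illustrates both ideas with entirely hypothetical numbers (four students rated by three examiners on an invented scale); it is not the authors' analysis, and the function names `g_study` and `mfrm_category_probs` are inventions for illustration.

```python
import math

# Illustrative sketch only: toy versions of the two analytic ideas the
# study names. All numbers are hypothetical, not the Dundee ward-
# simulation data.

def g_study(scores):
    """Variance components for a fully crossed persons x raters design
    (one score per cell), estimated from the two-way ANOVA mean squares."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ms_res = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))
    var_pr = ms_res                                # interaction + error
    var_p = (ss_p / (n_p - 1) - ms_res) / n_r      # student (true) variance
    var_r = (ss_r / (n_r - 1) - ms_res) / n_p      # rater severity variance
    g_rel = var_p / (var_p + var_pr / n_r)         # relative G coefficient
    return var_p, var_r, var_pr, g_rel

def mfrm_category_probs(theta, severity, difficulty, taus):
    """Rating-scale form of the many-facet Rasch model: probability of each
    rating category given student ability (theta), rater severity, item
    difficulty, and category thresholds (taus), all in logits."""
    etas, cum = [0.0], 0.0
    for tau in taus:
        cum += theta - severity - difficulty - tau
        etas.append(cum)
    exps = [math.exp(e) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical ratings: 4 students, each scored by the same 3 examiners
scores = [[7, 8, 6], [5, 6, 4], [9, 9, 8], [4, 5, 3]]
var_p, var_r, var_pr, g_rel = g_study(scores)

# Hypothetical MFRM facet estimates, 4 rating categories (3 thresholds)
probs = mfrm_category_probs(theta=1.0, severity=0.2, difficulty=0.3,
                            taus=[-1.0, 0.0, 1.0])
```

In practice both models are fitted with dedicated software rather than by hand; the sketch only shows what the estimated quantities mean: a high relative G coefficient indicates that most score variance reflects differences among students rather than rater disagreement, while the MFRM probabilities show how a rater's severity shifts the expected rating for a student of given ability.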


Keywords: Validity argument · Validity propositions · Validity evidence · Score interpretation · Clinical performance · Clinical ability · Simulated ward · Multi-facet Rasch measurement · Generalizability theory · Simulation in assessment of clinical performance · Evaluating medical students’ fitness to practice


Ethical standard

The manuscript was submitted to the University of Dundee Research Ethics Committee (UREC), which determined that the study met ethical standards and did not require formal ethical approval.



Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Hettie Till (1, 5)
  • Jean Ker (2)
  • Carol Myford (3)
  • Kevin Stirling (4)
  • Gary Mires (2)

  1. Centre for Medical Education, School of Medicine, University of Dundee, Dundee, UK
  2. School of Medicine, University of Dundee, Dundee, UK
  3. Department of Educational Psychology, College of Education, University of Illinois at Chicago, Chicago, USA
  4. Clinical Skills Centre, School of Medicine, University of Dundee, Dundee, UK
  5. Franschhoek, South Africa
