Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement
Health outcomes researchers are increasingly applying Item Response Theory (IRT) methods to questionnaire development, evaluation, and refinement efforts.
The objectives of this paper are to provide a brief overview of IRT, to review some of the critical issues associated with IRT applications, and to demonstrate the basic features of IRT with an example.
Example data come from 6,504 adolescent respondents in the National Longitudinal Study of Adolescent Health public use data set who completed the 19-item Feelings Scale for depression. The sample was split into development and validation samples. Scale items were calibrated in the development sample with the Graded Response Model, and the results were used to construct a 10-item short form. The short form was evaluated in the validation sample by examining the correspondence between IRT scores from the short form and the original form, and by comparing the proportion of respondents identified as depressed according to the original and short form observed cut scores.
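As a rough illustration of the model named above, the following minimal sketch computes response category probabilities under the logistic form of Samejima's graded response model, where each cumulative curve is P(X ≥ k | θ) = 1 / (1 + exp(−a(θ − b_k))). The function name `grm_category_probs` and the item parameters are hypothetical and chosen only for illustration; they are not estimates from the Feelings Scale calibration.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the logistic graded response model.

    theta: latent trait level (e.g., depression)
    a: item slope (discrimination) parameter
    thresholds: increasing location parameters b_1 < ... < b_{m-1}
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (lowest category)
    # and 0 (beyond the highest category).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical 4-category item; these values are illustrative assumptions only.
probs = grm_category_probs(theta=1.0, a=1.5, thresholds=[-0.5, 1.0, 2.5])
```

With increasing thresholds, the returned probabilities are non-negative and sum to one. A larger slope `a` makes the cumulative curves steeper, which is what "more discriminating" means in this model.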
The 19 items varied in their discrimination (slope parameter range: .86–2.66), and item location parameters reflected a considerable range of depression (−.72–3.39). However, the item set was most discriminating at higher levels of depression. In the validation sample, IRT scores generated from the short and long forms were correlated at .96, and the average difference in these scores was −.01. In addition, nearly 90% of the sample was classified identically as at risk or not at risk for depression using observed score cut points from the short and long forms.
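The three validation summaries reported above (correlation of IRT scores, average score difference, and classification agreement) could be computed along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the difference is taken as short form minus long form (the direction is not stated here), and the toy scores and cut points are made up for illustration and are not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_difference(short_scores, long_scores):
    """Average of (short - long); the sign convention is an assumption."""
    diffs = [s - l for s, l in zip(short_scores, long_scores)]
    return sum(diffs) / len(diffs)

def classification_agreement(long_obs, short_obs, long_cut, short_cut):
    """Share of respondents flagged the same way by both observed cut scores."""
    same = sum((l >= long_cut) == (s >= short_cut)
               for l, s in zip(long_obs, short_obs))
    return same / len(long_obs)

# Toy values, invented for illustration only (not from the validation sample).
long_theta = [-1.0, 0.0, 1.0, 2.0]
short_theta = [-0.9, -0.1, 1.1, 1.9]
long_sum = [5, 12, 25, 30]
short_sum = [2, 7, 10, 16]

r = pearson_r(long_theta, short_theta)
avg_diff = mean_difference(short_theta, long_theta)
# Cut points of 22 and 11 are hypothetical placeholders.
agreement = classification_agreement(long_sum, short_sum, long_cut=22, short_cut=11)
```

Agreement computed this way counts respondents who are at risk on both forms or not at risk on both forms, so it depends on the cut point chosen for each form.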
When used appropriately, IRT can be a powerful tool for questionnaire development, evaluation, and refinement, resulting in precise, valid, and relatively brief instruments that minimize response burden.
Keywords: IRT; Health outcomes; Adolescent depression; Short form