
Psychometrika, Volume 78, Issue 3, pp 417–440

Assessing Item Fit for Unidimensional Item Response Theory Models Using Residuals from Estimated Item Response Functions

  • Shelby J. Haberman
  • Sandip Sinharay
  • Kyong Hee Chon

Abstract

Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large sample distribution of the residual is proved to be standardized normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing the item fit for unidimensional IRT models.
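For orientation, the following is a minimal sketch of the conventional standardized residual in the style of Hambleton et al. (1991), which the abstract names as the comparison benchmark. It is not the residual proposed in this article. The sketch assumes a 2-parameter-logistic item with known parameters, uses simulated true abilities in place of ability estimates, and forms equal-size ability groups. The function names, the grouping rule, and the parameter values are all illustrative assumptions.

```python
import math
import random


def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def standardized_residuals(thetas, responses, a, b, n_groups=10):
    """Standardized residual per ability group: (O - E) / sqrt(E(1 - E) / N).

    Assumption: examinees are sorted by ability and split into equal-size
    groups; E is the mean model probability within the group.
    """
    if len(thetas) < n_groups:
        raise ValueError("need at least one examinee per group")
    # Sort by ability only, so ties keep their original response order.
    pairs = sorted(zip(thetas, responses), key=lambda p: p[0])
    size = len(pairs) // n_groups
    residuals = []
    for g in range(n_groups):
        start = g * size
        # The last group absorbs any leftover examinees.
        end = (g + 1) * size if g < n_groups - 1 else len(pairs)
        group = pairs[start:end]
        n = len(group)
        observed = sum(x for _, x in group) / n
        expected = sum(icc_2pl(t, a, b) for t, _ in group) / n
        se = math.sqrt(expected * (1.0 - expected) / n)
        residuals.append((observed - expected) / se)
    return residuals


# Hypothetical simulation: data generated from the same 2PL item,
# so the residuals should be small in absolute value.
random.seed(0)
a_true, b_true = 1.2, 0.3
thetas = [random.gauss(0.0, 1.0) for _ in range(2000)]
responses = [1 if random.random() < icc_2pl(t, a_true, b_true) else 0
             for t in thetas]
residuals = standardized_residuals(thetas, responses, a_true, b_true)
print([round(r, 2) for r in residuals])
```

When the model generating the data matches the model used for the expected proportions, as in this simulation, large absolute residuals should be rare. Systematic patterns across ability groups would suggest item misfit.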

Key words

2-parameter-logistic model; generalized partial credit model; item characteristic curve; IRT model fit

References

  1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington: American Educational Research Association.
  2. Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 46, 443–459.
  3. Bock, R.D., & Haberman, S.J. (2009). Confidence bands for examining goodness-of-fit of estimated item response functions. Paper presented at the annual meeting of the Psychometric Society. Cambridge, UK.
  4. Box, G.E.P., & Draper, N.R. (1987). Empirical model-building and response surfaces. New York: Wiley.
  5. Chon, K.H., Lee, W., & Dunbar, S.B. (2010). A comparison of item fit statistics for mixed IRT models. Journal of Educational Measurement, 47, 318–338.
  6. Cochran, W.G. (1977). Sampling techniques (3rd ed.). New York: Wiley.
  7. Dodeen, H. (2004). The relationship between item parameters and item fit. Journal of Educational Measurement, 41, 259–268.
  8. du Toit, M. (2003). IRT from SSI. Lincolnwood: Scientific Software International.
  9. Glas, C.A.W., & Suarez-Falcon, J.C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106.
  10. Haberman, S.J. (1976). Generalized residuals for log-linear models. In Proceedings of the ninth international biometrics conference (Vol. 1, pp. 104–172). Boston: International Biometric Society.
  11. Haberman, S.J. (1977a). Log-linear models and frequency tables with small expected cell counts. The Annals of Statistics, 5, 1148–1169.
  12. Haberman, S.J. (1977b). Maximum likelihood estimates in exponential response models. The Annals of Statistics, 5, 815–841.
  13. Haberman, S.J. (1978). Analysis of qualitative data, Vol. I: Introductory topics. New York: Academic Press.
  14. Haberman, S.J. (1979). Analysis of qualitative data, Vol. II: New developments. New York: Academic Press.
  15. Haberman, S.J. (1988). A stabilized Newton-Raphson algorithm for log-linear models for frequency tables derived by indirect observation. Sociological Methodology, 18, 193–211.
  16. Haberman, S.J. (2006). Adaptive quadrature for item response models (Research Rep. No. RR-06-29). Princeton: ETS.
  17. Haberman, S.J. (2009). Use of generalized residuals to examine goodness of fit of item response models (Research Rep. No. RR-09-15). Princeton: ETS.
  18. Haberman, S.J., & Sinharay, S. (2012). Assessing goodness of fit of item response theory models using generalized residuals (Unpublished manuscript).
  19. Hambleton, R.K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: a five step plan and several graphical displays. In W.R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57–78). Washington: Degnon Associates.
  20. Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: principles and applications. Boston: Kluwer Academic.
  21. Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park: Sage.
  22. Holland, P.W. (1990). The Dutch identity: a new tool for the study of item response models. Psychometrika, 55, 5–18.
  23. Kang, T., & Chen, T.T. (2008). Performance of the generalized Sχ² item-fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391–406.
  24. Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.
  25. Li, Y., & Rupp, A.A. (2011). Performance of the Sχ² statistic for full-information bifactor models. Educational and Psychological Measurement, 71, 986–1005.
  26. Liang, T., Han, T.K., & Hambleton, R.K. (2009). ResidPlots-2: computer software for IRT graphical residual analyses. Applied Psychological Measurement, 33, 411–412.
  27. Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B, 44.
  28. Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47.
  29. Mislevy, R.J., & Bock, R.D. (1991). BILOG 3.11 [computer software]. Lincolnwood: Scientific Software International.
  30. Muraki, E. (1997). A generalized partial credit model. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of modern item response theory. New York: Springer.
  31. Muraki, E., & Bock, R.D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [computer program]. Chicago: Scientific Software.
  32. Naylor, J.C., & Smith, A.F.M. (1982). Applications of a method for the efficient computation of posterior distributions. Applied Statistics, 31, 214–225.
  33. Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
  34. Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
  35. Reckase, M.D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25–36.
  36. Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375–394.
  37. Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical & Statistical Psychology, 59, 429–449.
  38. Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.
  39. Stone, C.A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: a comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352.
  40. von Davier, M., Sinharay, S., Beaton, A.E., & Oranje, A. (2006). The statistical procedures used in national assessment of educational progress. In C.R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 205–233). Amsterdam: North-Holland.
  41. Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12, 339–368.
  42. Yen, W. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245–262.

Copyright information

© The Psychometric Society 2012

Authors and Affiliations

  • Shelby J. Haberman (1)
  • Sandip Sinharay (3, corresponding author)
  • Kyong Hee Chon (2)
  1. Research & Development, Educational Testing Service, Princeton, USA
  2. Educational Administration, Leadership, and Research, Western Kentucky University, Bowling Green, USA
  3. CTB/McGraw-Hill, Monterey, USA
