
Assessing Item Fit for Unidimensional Item Response Theory Models Using Residuals from Estimated Item Response Functions



Residual analysis (e.g., Hambleton & Swaminathan, 1985; Hambleton, Swaminathan, & Rogers, 1991) is a popular method for assessing the fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large-sample distribution of the residual is proved to be standard normal when the IRT model fits the data. We compare the performance of our suggested residual to that of the standardized residual of Hambleton et al. (1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing item fit for unidimensional IRT models.
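
As a rough sketch of the residual described above, and not the authors' exact definition, the comparison can be written as a standardized difference. The notation here is an assumption: \(\hat{F}_{j}(\theta)\) is taken to be the maximum-likelihood estimate of the item characteristic curve of item \(j\) at ability \(\theta\), \(\bar{F}_{j}(\theta)\) the alternative ratio estimate, and \(s_{j}(\theta)\) a hypothetical symbol for the estimated standard deviation of their difference. The sign convention is also assumed.

\[
t_{j}(\theta) = \frac{\bar{F}_{j}(\theta) - \hat{F}_{j}(\theta)}{s_{j}(\theta)}
\]

Under this reading, \(t_{j}(\theta)\) is approximately standard normal in large samples when the IRT model fits, so values of \(|t_{j}(\theta)|\) well above 2 would point to possible item misfit.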




  1. By standardization, we refer to dividing the difference of a variable and its expectation by the standard deviation of the difference.

  2. The program is available on request from the authors.

  3. We also computed our suggested residuals for 15 equispaced values between −2.8 and 2.8 to make these values the same as the midpoints of the intervals used to compute the standardized residuals (Hambleton et al. 1991). The results were virtually unchanged.

  4. Sinharay (2010) reported the average disattenuated correlations among subtest scores from 20+ operational tests. The lowest value reported was 0.69.

  5. We imposed this condition because we noticed that, for some easy items, the values of both \(\hat{F}_{j}(\theta)\) and \(\bar {F}_{j}(\theta)\) are larger than 0.99 for \(0<\theta<2\). In such cases the corresponding residual is statistically significant even though the difference is not practically significant.

  6. Note that this reordering was done for convenience. Operationally, the anchor items are interspersed with the operational items.
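
For comparison, the standardized residual of Hambleton et al. (1991) referred to in note 3 can be sketched in code. This is a minimal illustration and not the authors' program. It assumes a 2-parameter-logistic item, ability estimates that are already available, 15 intervals of width 0.4 on [−3, 3] (whose midpoints are the 15 equispaced values from −2.8 to 2.8), and the commonly cited form (observed − expected) / sqrt(expected × (1 − expected) / n), with the expected proportion evaluated at the interval midpoint as a simplification. All function and variable names are hypothetical.

```python
import numpy as np


def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def standardized_residuals(theta_hat, responses, a, b, edges):
    """Standardized residual per ability interval for one dichotomous item.

    theta_hat : ability estimates, one per examinee (assumed given)
    responses : 0/1 item scores, one per examinee
    a, b      : assumed 2PL discrimination and difficulty estimates
    edges     : increasing interval boundaries on the ability scale
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    responses = np.asarray(responses, dtype=float)
    edges = np.asarray(edges, dtype=float)
    midpoints = (edges[:-1] + edges[1:]) / 2.0

    # Interval index for each examinee; values outside the range of
    # `edges` get -1 or len(midpoints) and are ignored below.
    idx = np.digitize(theta_hat, edges) - 1

    resid = np.full(midpoints.shape, np.nan)  # NaN marks empty intervals
    for k, mid in enumerate(midpoints):
        in_bin = idx == k
        n = in_bin.sum()
        if n == 0:
            continue
        observed = responses[in_bin].mean()   # observed proportion correct
        expected = icc_2pl(mid, a, b)         # model-based proportion at midpoint
        resid[k] = (observed - expected) / np.sqrt(expected * (1.0 - expected) / n)
    return midpoints, resid


# 16 boundaries from -3 to 3 give 15 intervals with midpoints -2.8, ..., 2.8.
edges = np.linspace(-3.0, 3.0, 16)
```

Calling `standardized_residuals(theta_hat, responses, a, b, edges)` returns the interval midpoints and one residual per interval. Evaluating the expected proportion at the midpoint keeps the sketch short; averaging the model probabilities over the examinees in each interval is an equally plausible choice.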


  1. American Educational Research Association, American Psychological Association, & National Council for Measurement in Education (1999). Standards for educational and psychological testing. Washington: American Educational Research Association.

  2. Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 46, 443–459.

  3. Bock, R.D., & Haberman, S.J. (2009). Confidence bands for examining goodness-of-fit of estimated item response functions. Paper presented at the annual meeting of the Psychometric Society. Cambridge, UK.

  4. Box, G.E.P., & Draper, N.R. (1987). Empirical model-building and response surfaces. New York: Wiley.

  5. Chon, K.H., Lee, W., & Dunbar, S.B. (2010). A comparison of item fit statistics for mixed IRT models. Journal of Educational Measurement, 47, 318–338.

  6. Cochran, W.G. (1977). Sampling techniques (3rd ed.). New York: Wiley.

  7. Dodeen, H. (2004). The relationship between item parameters and item fit. Journal of Educational Measurement, 41, 259–268.

  8. du Toit, M. (2003). IRT from SSI. Lincolnwood: Scientific Software International.

  9. Glas, C.A.W., & Suarez-Falcon, J.C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106.

  10. Haberman, S.J. (1976). Generalized residuals for log-linear models. In Proceedings of the ninth international biometrics conference (Vol. 1, pp. 104–172). Boston: International Biometric Society.

  11. Haberman, S.J. (1977a). Log-linear models and frequency tables with small expected cell counts. The Annals of Statistics, 5, 1148–1169.

  12. Haberman, S.J. (1977b). Maximum likelihood estimates in exponential response models. The Annals of Statistics, 5, 815–841.

  13. Haberman, S.J. (1978). Analysis of qualitative data, Vol. I: Introductory topics. New York: Academic Press.

  14. Haberman, S.J. (1979). Analysis of qualitative data, Vol. II: New developments. New York: Academic Press.

  15. Haberman, S.J. (1988). A stabilized Newton-Raphson algorithm for log-linear models for frequency tables derived by indirect observation. Sociological Methodology, 18, 193–211.

  16. Haberman, S.J. (2006). Adaptive quadrature for item response models (Research Rep. No. RR-06-29). Princeton: ETS.

  17. Haberman, S.J. (2009). Use of generalized residuals to examine goodness of fit of item response models (Research Rep. No. RR-09-15). Princeton: ETS.

  18. Haberman, S.J., & Sinharay, S. (2012). Assessing goodness of fit of item response theory models using generalized residuals (Unpublished manuscript).

  19. Hambleton, R.K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: a five step plan and several graphical displays. In W.R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57–78). Washington: Degnon Associates.

  20. Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: principles and applications. Boston: Kluwer Academic.

  21. Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park: Sage.

  22. Holland, P.W. (1990). The Dutch identity: a new tool for the study of item response models. Psychometrika, 55, 5–18.

  23. Kang, T., & Chen, T.T. (2008). Performance of the generalized S-χ² item-fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391–406.

  24. Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.

  25. Li, Y., & Rupp, A.A. (2011). Performance of the S-χ² statistic for full-information bifactor models. Educational and Psychological Measurement, 71, 986–1005.

  26. Liang, T., Han, T.K., & Hambleton, R.K. (2009). ResidPlots-2: computer software for IRT graphical residual analyses. Applied Psychological Measurement, 33, 411–412.

  27. Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B, 44.

  28. Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47.

  29. Mislevy, R.J., & Bock, R.D. (1991). BILOG 3.11 [computer software]. Lincolnwood: Scientific Software International.

  30. Muraki, E. (1997). A generalized partial credit model. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of modern item response theory. New York: Springer.

  31. Muraki, E., & Bock, R.D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [computer program]. Chicago: Scientific Software.

  32. Naylor, J.C., & Smith, A.F.M. (1982). Applications of a method for the efficient computation of posterior distributions. Applied Statistics, 31, 214–225.

  33. Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.

  34. Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.

  35. Reckase, M.D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25–36.

  36. Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375–394.

  37. Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical & Statistical Psychology, 59, 429–449.

  38. Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.

  39. Stone, C.A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: a comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352.

  40. von Davier, M., Sinharay, S., Beaton, A.E., & Oranje, A. (2006). The statistical procedures used in National Assessment of Educational Progress. In C.R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 205–233). Amsterdam: North-Holland.

  41. Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12, 339–368.

  42. Yen, W. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245–262.


Author information

Correspondence to Sandip Sinharay.

Additional information

Note: Any opinions expressed in this publication are those of the authors and not necessarily of Educational Testing Service. Sandip Sinharay conducted this study and wrote this report while on staff at Educational Testing Service. He is currently at CTB/McGraw-Hill.


About this article

Cite this article

Haberman, S.J., Sinharay, S. & Chon, K.H. Assessing Item Fit for Unidimensional Item Response Theory Models Using Residuals from Estimated Item Response Functions. Psychometrika 78, 417–440 (2013). https://doi.org/10.1007/s11336-012-9305-1


Key words

  • 2-parameter-logistic model
  • generalized partial credit model
  • item characteristic curve
  • IRT model fit