Assessing Item Fit for Unidimensional Item Response Theory Models Using Residuals from Estimated Item Response Functions

Psychometrika

Abstract

Residual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models. We suggest a form of residual analysis that may be applied to assess item fit for unidimensional IRT models. The residual analysis consists of a comparison of the maximum-likelihood estimate of the item characteristic curve with an alternative ratio estimate of the item characteristic curve. The large sample distribution of the residual is proved to be standardized normal when the IRT model fits the data. We compare the performance of our suggested residual to the standardized residual of Hambleton et al. (Fundamentals of item response theory, Sage, Newbury Park, 1991) in a detailed simulation study. We then calculate our suggested residuals using data from an operational test. The residuals appear to be useful in assessing the item fit for unidimensional IRT models.
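
As a rough sketch of the comparison described in the abstract, the residual at an ability value θ can be written schematically as below. The notation \(\hat{F}_{j}(\theta)\) for the maximum-likelihood estimate and \(\bar{F}_{j}(\theta)\) for the alternative ratio estimate of the item characteristic curve of item j follows Note 5. The names \(t_{j}(\theta)\) and \(s_{j}(\theta)\) and the sign convention are assumptions made here for illustration and are not taken from the article; \(s_{j}(\theta)\) stands for an estimate of the standard deviation of the difference, in the sense of Note 1.

```latex
\[
  t_{j}(\theta) \;=\; \frac{\bar{F}_{j}(\theta) - \hat{F}_{j}(\theta)}{s_{j}(\theta)},
  \qquad
  t_{j}(\theta) \;\xrightarrow{\;d\;}\; N(0,1)
  \quad \text{in large samples when the IRT model fits.}
\]
```

Under this reading, values of \(t_{j}(\theta)\) far from zero, judged against the standard normal distribution, would point to misfit of item j near θ.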

Notes

  1. By standardization, we mean dividing the difference between a variable and its expectation by the standard deviation of that difference.

  2. The program is available from the authors on request.

  3. We also computed our suggested residuals for 15 equispaced values between −2.8 and 2.8 to make these values the same as the midpoints of the intervals used to compute the standardized residuals (Hambleton et al. 1991). The results were virtually unchanged.

  4. Sinharay (2010) reported the average disattenuated correlations among subtest scores from 20+ operational tests. The lowest value reported was 0.69.

  5. We imposed this condition because we noticed that, for some easy items, both \(\hat{F}_{j}(\theta)\) and \(\bar {F}_{j}(\theta)\) exceed 0.99 for \(0<\theta<2\), so the corresponding residual is not practically significant even though it is statistically significant.

  6. Note that this reordering was done for convenience. Operationally, the anchor items are interspersed with the operational items.
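
Notes 1 and 3 can be illustrated with a short Python sketch. It builds the 15 equispaced ability values between −2.8 and 2.8 and checks that they coincide with the midpoints of 15 equal-width intervals. The interval range [−3, 3] is an assumption inferred from the spacing, not stated in the notes. The function names and the numeric inputs in the last line are hypothetical and chosen only for illustration; this is not the authors' program.

```python
def equispaced(lo, hi, n):
    """Return n equally spaced points from lo to hi, inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def standardized_residual(alt_estimate, ml_estimate, std_dev_of_diff):
    """Difference of two estimates divided by the standard deviation
    of that difference (standardization in the sense of Note 1).
    The sign convention (alternative minus ML) is an assumption."""
    if std_dev_of_diff <= 0:
        raise ValueError("standard deviation must be positive")
    return (alt_estimate - ml_estimate) / std_dev_of_diff


# 15 equispaced ability values between -2.8 and 2.8 (Note 3).
theta_grid = equispaced(-2.8, 2.8, 15)

# Assumed: 15 equal-width intervals covering [-3, 3]; their midpoints
# should match theta_grid up to floating-point rounding.
edges = equispaced(-3.0, 3.0, 16)
midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]

# Made-up inputs: estimates 0.62 and 0.60 with standard deviation 0.01
# give a residual of about 2.
example = standardized_residual(0.62, 0.60, 0.01)
```

The grid spacing works out to 0.4, which is why the two sets of points agree under the assumed interval range.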

References

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington: American Educational Research Association.

  • Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 46, 443–459.

  • Bock, R.D., & Haberman, S.J. (2009). Confidence bands for examining goodness-of-fit of estimated item response functions. Paper presented at the annual meeting of the Psychometric Society. Cambridge, UK.

  • Box, G.E.P., & Draper, N.R. (1987). Empirical model-building and response surfaces. New York: Wiley.

  • Chon, K.H., Lee, W., & Dunbar, S.B. (2010). A comparison of item fit statistics for mixed IRT models. Journal of Educational Measurement, 47, 318–338.

  • Cochran, W.G. (1977). Sampling techniques (3rd ed.). New York: Wiley.

  • Dodeen, H. (2004). The relationship between item parameters and item fit. Journal of Educational Measurement, 41, 259–268.

  • du Toit, M. (2003). IRT from SSI. Lincolnwood: Scientific Software International.

  • Glas, C.A.W., & Suarez-Falcon, J.C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106.

  • Haberman, S.J. (1976). Generalized residuals for log-linear models. In Proceedings of the ninth international biometrics conference (Vol. 1, pp. 104–172). Boston: International Biometric Society.

  • Haberman, S.J. (1977a). Log-linear models and frequency tables with small expected cell counts. The Annals of Statistics, 5, 1148–1169.

  • Haberman, S.J. (1977b). Maximum likelihood estimates in exponential response models. The Annals of Statistics, 5, 815–841.

  • Haberman, S.J. (1978). Analysis of qualitative data, Vol. I: Introductory topics. New York: Academic Press.

  • Haberman, S.J. (1979). Analysis of qualitative data, Vol. II: New developments. New York: Academic Press.

  • Haberman, S.J. (1988). A stabilized Newton-Raphson algorithm for log-linear models for frequency tables derived by indirect observation. Sociological Methodology, 18, 193–211.

  • Haberman, S.J. (2006). Adaptive quadrature for item response models (Research Rep. No. RR-06-29). Princeton: ETS.

  • Haberman, S.J. (2009). Use of generalized residuals to examine goodness of fit of item response models (Research Rep. No. RR-09-15). Princeton: ETS.

  • Haberman, S.J., & Sinharay, S. (2012). Assessing goodness of fit of item response theory models using generalized residuals (Unpublished manuscript).

  • Hambleton, R.K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: a five step plan and several graphical displays. In W.R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57–78). Washington: Degnon Associates.

  • Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: principles and applications. Boston: Kluwer Academic.

  • Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park: Sage.

  • Holland, P.W. (1990). The Dutch identity: a new tool for the study of item response models. Psychometrika, 55, 5–18.

  • Kang, T., & Chen, T.T. (2008). Performance of the generalized Sχ² item-fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391–406.

  • Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.

  • Li, Y., & Rupp, A.A. (2011). Performance of the Sχ² statistic for full-information bifactor models. Educational and Psychological Measurement, 71, 986–1005.

  • Liang, T., Han, T.K., & Hambleton, R.K. (2009). ResidPlots-2: computer software for IRT graphical residual analyses. Applied Psychological Measurement, 33, 411–412.

  • Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B, 44.

  • Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47.

  • Mislevy, R.J., & Bock, R.D. (1991). BILOG 3.11 [computer software]. Lincolnwood: Scientific Software International.

  • Muraki, E. (1997). A generalized partial credit model. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook of modern item response theory. New York: Springer.

  • Muraki, E., & Bock, R.D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [computer program]. Chicago: Scientific Software.

  • Naylor, J.C., & Smith, A.F.M. (1982). Applications of a method for the efficient computation of posterior distributions. Applied Statistics, 31, 214–225.

  • Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.

  • Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.

  • Reckase, M.D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25–36.

  • Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375–394.

  • Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical & Statistical Psychology, 59, 429–449.

  • Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.

  • Stone, C.A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: a comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352.

  • von Davier, M., Sinharay, S., Beaton, A.E., & Oranje, A. (2006). The statistical procedures used in National Assessment of Educational Progress. In C.R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 205–233). Amsterdam: North-Holland.

  • Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12, 339–368.

  • Yen, W. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245–262.

Author information

Corresponding author

Correspondence to Sandip Sinharay.

Additional information

Note: Any opinions expressed in this publication are those of the authors and not necessarily of Educational Testing Service. Sandip Sinharay conducted this study and wrote this report while on staff at Educational Testing Service. He is currently at CTB/McGraw-Hill.

About this article

Cite this article

Haberman, S.J., Sinharay, S. & Chon, K.H. Assessing Item Fit for Unidimensional Item Response Theory Models Using Residuals from Estimated Item Response Functions. Psychometrika 78, 417–440 (2013). https://doi.org/10.1007/s11336-012-9305-1

  • DOI: https://doi.org/10.1007/s11336-012-9305-1
