Volume 84, Issue 2, pp 529–553

Restricted Recalibration of Item Response Theory Models

  • Yang Liu
  • Ji Seung Yang
  • Alberto Maydeu-Olivares


In item response theory (IRT), it is often necessary to perform restricted recalibration (RR) of the model: A set of (focal) parameters is estimated holding a set of (nuisance) parameters fixed. Typical applications of RR include expanding an existing item bank, linking multiple test forms, and associating constructs measured by separately calibrated tests. In the current work, we provide full statistical theory for RR of IRT models under the framework of pseudo-maximum likelihood estimation. We describe the standard error calculation for the focal parameters, the assessment of overall goodness-of-fit (GOF) of the model, and the identification of misfitting items. We report a simulation study to evaluate the performance of these methods in the scenario of adding a new item to an existing test. Parameter recovery for the focal parameters as well as Type I error and power of the proposed tests are examined. An empirical example is also included, in which we validate the pediatric fatigue short-form scale in the Patient-Reported Outcome Measurement Information System (PROMIS), compute global and local GOF statistics, and update parameters for the misfitting items.
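The scenario studied in the simulation, adding a new item to an existing test, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a unidimensional two-parameter logistic model in slope-intercept form, a standard normal latent variable, Gauss-Hermite quadrature, and a Bock-Aitkin style EM update for the new item only. The function names (`simulate_2pl`, `restricted_recalibration`) and all numeric settings are hypothetical. The parameters of the existing items are treated as known here, whereas in practice they would be estimates from an earlier calibration; the sketch does not compute the standard errors or fit statistics that the paper develops to account for that.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def simulate_2pl(theta, a, d, rng):
    """Draw 0/1 responses from a 2PL model with slope(s) a and intercept(s) d."""
    z = np.multiply.outer(theta, a) + d
    return (rng.random(z.shape) < 1.0 / (1.0 + np.exp(-z))).astype(float)


def restricted_recalibration(y_old, y_new, a_old, d_old,
                             n_quad=21, n_iter=500, tol=1e-7):
    """Estimate (a, d) of one new item while a_old and d_old stay fixed."""
    nodes, w = hermegauss(n_quad)      # nodes/weights for exp(-x^2 / 2)
    w = w / w.sum()                    # normalize to N(0, 1) weights

    # Log-likelihood of the existing items at each node, shape (N, Q).
    # For a logistic model, log P(y | z) = y * z - log(1 + exp(z)).
    z_old = np.multiply.outer(nodes, a_old) + d_old
    base = y_old @ z_old.T - np.logaddexp(0.0, z_old).sum(axis=1) + np.log(w)

    X = np.column_stack([nodes, np.ones_like(nodes)])
    beta = np.array([1.0, 0.0])        # starting values for (a, d)
    for _ in range(n_iter):
        # E-step: posterior weight of each node for each respondent.
        z = X @ beta
        log_joint = base + y_new[:, None] * z - np.logaddexp(0.0, z)
        post = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        n_q = post.sum(axis=0)         # expected respondents per node
        r_q = post.T @ y_new           # expected endorsements per node

        # M-step: Newton iterations for a weighted logistic regression.
        new_beta = beta.copy()
        for _ in range(25):
            p = np.exp(-np.logaddexp(0.0, -(X @ new_beta)))
            grad = X.T @ (r_q - n_q * p)
            hess = X.T @ (X * (n_q * p * (1.0 - p))[:, None])
            step = np.linalg.solve(hess, grad)
            new_beta = new_beta + step
            if np.max(np.abs(step)) < 1e-10:
                break

        change = np.max(np.abs(new_beta - beta))
        beta = new_beta
        if change < tol:
            break
    return beta[0], beta[1]


# Illustrative data: 10 existing items plus one new item (all values made up).
rng = np.random.default_rng(2019)
theta = rng.standard_normal(5000)
a_old = np.linspace(0.8, 2.0, 10)
d_old = np.linspace(-1.5, 1.5, 10)
y_old = simulate_2pl(theta, a_old, d_old, rng)
y_new = simulate_2pl(theta, 1.5, -0.5, rng)

a_hat, d_hat = restricted_recalibration(y_old, y_new, a_old, d_old)
print(f"new item: slope = {a_hat:.3f}, intercept = {d_hat:.3f}")
```

Because the existing item parameters never change, their contribution to the likelihood is computed once and reused in every iteration; only the two focal parameters are updated.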


Keywords: item response theory, measurement invariance, cross-validation, item calibration, pseudo-maximum likelihood, residual, contingency table, goodness of fit



Copyright information

© The Psychometric Society 2019

Authors and Affiliations

  • Yang Liu (1)
  • Ji Seung Yang (1)
  • Alberto Maydeu-Olivares (2, 3)

  1. Department of Human Development and Quantitative Methodology, University of Maryland, College Park, USA
  2. Department of Psychology, University of South Carolina, Columbia, USA
  3. Department of Psychology, University of Barcelona, Barcelona, Spain
