Volume 79, Issue 1, pp 84–104

Additive Multilevel Item Structure Models with Random Residuals: Item Modeling for Explanation and Item Generation

  • Sun-Joo Cho
  • Paul De Boeck
  • Susan Embretson
  • Sophia Rabe-Hesketh


An additive multilevel item structure (AMIS) model with random residuals is proposed. The model includes multilevel latent regressions of item discrimination and item difficulty parameters on covariates at both item and item category levels with random residuals at both levels. The AMIS model is useful for explanation purposes and also for prediction purposes as in an item generation context. The parameters can be estimated with an alternating imputation posterior algorithm that makes use of adaptive quadrature, and the performance of this algorithm is evaluated in a simulation study.
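To make the structure described above concrete, the following is a minimal simulation sketch, not the authors' specification. It assumes a two-parameter logistic response function, one covariate at the item category (family) level and one at the item level, normally distributed residuals at both levels, and a log link for discrimination to keep it positive. All names, coefficient values, and residual standard deviations (`simulate_amis`, `gamma_b`, `gamma_a`, and so on) are hypothetical choices for illustration.

```python
import numpy as np


def simulate_amis(n_persons=500, n_families=10, items_per_family=5, seed=0):
    """Simulate binary responses from an illustrative AMIS-style model.

    Assumptions (not taken from the paper): 2PL response function,
    one covariate per level, normal residuals, log link for discrimination.
    """
    rng = np.random.default_rng(seed)
    n_items = n_families * items_per_family
    # family[i] is the item category (family) that item i belongs to
    family = np.repeat(np.arange(n_families), items_per_family)

    # Hypothetical covariates at the category level and the item level
    x_family = rng.normal(size=n_families)
    x_item = rng.normal(size=n_items)

    # Hypothetical regression coefficients: (intercept, category slope, item slope)
    gamma_b = (0.0, 0.8, 0.3)  # for item difficulty
    gamma_a = (0.0, 0.2, 0.1)  # for log item discrimination

    # Random residuals at the category level (u) and the item level (e)
    u_b = rng.normal(0.0, 0.3, size=n_families)
    e_b = rng.normal(0.0, 0.2, size=n_items)
    u_a = rng.normal(0.0, 0.1, size=n_families)
    e_a = rng.normal(0.0, 0.1, size=n_items)

    # Additive multilevel regressions with residuals at both levels
    difficulty = (gamma_b[0] + gamma_b[1] * x_family[family]
                  + gamma_b[2] * x_item + u_b[family] + e_b)
    log_discrimination = (gamma_a[0] + gamma_a[1] * x_family[family]
                          + gamma_a[2] * x_item + u_a[family] + e_a)
    discrimination = np.exp(log_discrimination)

    # Person abilities and 2PL response probabilities (persons x items)
    theta = rng.normal(size=n_persons)
    logit = discrimination[None, :] * (theta[:, None] - difficulty[None, :])
    prob = 1.0 / (1.0 + np.exp(-logit))
    responses = rng.binomial(1, prob)

    return {
        "responses": responses,
        "difficulty": difficulty,
        "discrimination": discrimination,
        "family": family,
    }
```

In an item generation setting, the same regressions could in principle be used to predict the parameters of a new item from its covariates alone, with the residual variances indicating how uncertain that prediction is.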

Key words

alternating imputation posterior with adaptive quadrature, item generation, multilevel model, random item parameters



Copyright information

© The Psychometric Society 2013

Authors and Affiliations

  • Sun-Joo Cho (1)
  • Paul De Boeck (2, 3)
  • Susan Embretson (4)
  • Sophia Rabe-Hesketh (5, 6)
  1. Vanderbilt University, Nashville, USA
  2. Ohio State University, Columbus, USA
  3. KU Leuven, Leuven, Belgium
  4. Georgia Institute of Technology, Atlanta, USA
  5. University of California, Berkeley, USA
  6. Institute of Education, University of London, London, UK
