
Addressing model uncertainty in item response theory person scores through model averaging


Item banks are often created in large-scale research and testing settings in the social sciences to predict individuals’ latent trait scores. A common procedure is to fit multiple candidate item response theory (IRT) models to a calibration sample and select a single best-fitting IRT model. The parameter estimates from this model are then used to obtain trait scores for subsequent respondents. However, this model selection procedure ignores model uncertainty stemming from the fact that the model ranking in the calibration phase is subject to sampling variability. Consequently, the standard errors of trait scores obtained from subsequent respondents do not reflect such uncertainty. Ignoring such sources of uncertainty contributes to the current replication crisis in the social sciences. In this article, we propose and demonstrate an alternative procedure to account for model uncertainty in this context—model averaging of IRT trait scores and their standard errors. We outline the general procedure step-by-step and provide software to aid researchers in implementation, both for large-scale research settings with item banks and for smaller research settings involving IRT scoring. We then demonstrate the procedure with a simulated item-banking illustration, comparing model selection and model averaging within sample in terms of predictive coverage. We conclude by discussing ways that model averaging and IRT scoring can be used and investigated in future research.



  1. When computing Eq. 3, software might round all K values summed in the denominator to 0, yielding an undefined solution. The software we provide in the Appendix accounts for this by using an equivalent mathematical reformulation that is not susceptible to this issue.
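The underflow issue described in this note can be illustrated with a minimal R sketch. It assumes that Eq. 3, which is not reproduced in this excerpt, takes the standard information-criterion weight form w_k = exp(-IC_k/2) / sum_j exp(-IC_j/2). The function names `naive_weights` and `stable_weights` and the BIC values are hypothetical.

```r
# Hypothetical BIC values for K = 3 candidate models (illustrative only)
ic <- c(25010.4, 25013.9, 25021.2)

# Naive form: every exp(-ic / 2) underflows to 0, so the ratio is 0 / 0 = NaN
naive_weights <- function(ic) exp(-0.5 * ic) / sum(exp(-0.5 * ic))

# Equivalent form: exponentiate differences between criteria instead of the
# raw values; algebraically identical to the naive form (assumed Eq. 3)
stable_weights <- function(ic) {
  sapply(ic, function(ic_k) 1 / sum(exp(-0.5 * (ic - ic_k))))
}

naive_weights(ic)        # NaN for criteria of this magnitude
w <- stable_weights(ic)
sum(w)                   # weights sum to 1 (up to floating-point error)
```

Working with differences keeps each exponent on a modest scale, so the model with the smallest criterion value always contributes a term equal to 1 in its own denominator.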



Author information



Corresponding author

Correspondence to Jason D. Rights.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Communicated by Ronny Scherer and Marie Wiberg

Appendix: modelavgIRT R function


modelavgIRT R function description

This function reads in person scores (e.g., EAP scores) and their standard errors from the validation sample, together with information criterion values (BIC or AIC) from the calibration sample, for each of a set of candidate IRT models. It outputs model-averaged person scores and standard errors (see manuscript Equations 4 and 5).

modelavgIRT R function input


personscores: A data set consisting of person scores obtained from each candidate model in the validation sample, with rows denoting persons and columns denoting models

personSEs: A data set consisting of person score standard errors obtained from each candidate model in the validation sample, with rows denoting persons and columns denoting models

selectionindex: List of information criterion values (BIC or AIC), one per model, in the order of the columns of personscores and personSEs

rescale: Logical; if set to TRUE (default), prior to averaging, each model's person scores will be rescaled to have a mean of 0 and a variance of 1, and standard errors will be rescaled proportionally
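As a rough illustration of what the rescale option does for a single model, the sketch below standardizes one set of person scores and divides the standard errors by the same scaling constant. The numbers and variable names are made up for illustration and are not taken from the article.

```r
# Hypothetical EAP scores and standard errors from one candidate model
scores <- c(-1.2, 0.3, 0.8, 2.1)
ses    <- c(0.40, 0.35, 0.36, 0.50)

# Store the scaling constant before the scores are overwritten
s <- sd(scores)

scores_rescaled <- (scores - mean(scores)) / s  # mean 0, variance 1
ses_rescaled    <- ses / s                      # same scale change as the scores

mean(scores_rescaled)
var(scores_rescaled)
```

Dividing the standard errors by the original standard deviation keeps each person's ratio of score spread to standard error unchanged, which is what a proportional rescaling implies.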

modelavgIRT R function code

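modelavgIRT <- function(personscores, personSEs, selectionindex, rescale = TRUE) {

  ## rescale personscores to have mean 0 and var 1;
  ## rescale personSEs proportionally
  if (rescale) {
    for (i in seq_len(ncol(personscores))) {
      ## store the SD first: once the scores are standardized their SD is 1
      score.sd <- sd(personscores[, i])
      personscores[, i] <- (personscores[, i] - mean(personscores[, i])) / score.sd
      personSEs[, i] <- personSEs[, i] / score.sd
    }
  }

  ## compute weights
  ## NOTE: this line is truncated in the source after "sum(exp(-"; the rest is
  ## an assumed completion using the difference-based form of the
  ## information-criterion weights (see the note on Eq. 3)
  weights <- rep(NA, length(selectionindex))
  for (i in seq_along(selectionindex)) {
    weights[i] <- sum(exp(-0.5 * (selectionindex - selectionindex[i])))^(-1)
  }

  ## compute averaged person scores
  avg.personscore <- matrix(NA, nrow(personscores), 1)
  for (i in seq_len(nrow(personscores))) {
    avg.personscore[i, ] <- sum(weights * personscores[i, ])
  }

  ## compute averaged person SEs
  ## NOTE: this line is truncated in the source after "(personscores[i,]-";
  ## the deviation from the averaged score is an assumed completion
  avg.personSE <- matrix(NA, nrow(personSEs), 1)
  for (i in seq_len(nrow(personSEs))) {
    avg.personSE[i, ] <- sum(weights * sqrt(personSEs[i, ]^2 +
      (personscores[i, ] - avg.personscore[i, ])^2))
  }

  output <- list(weights, avg.personscore, avg.personSE)
  names(output) <- c("weights", "Average person score", "Average person SE")
  return(output)
}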
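The row-wise loops in this listing can also be expressed with matrix operations. The following is a sketch under stated assumptions, not the authors' code: `modelavg_vectorized` is a hypothetical name, the inputs are made-up numbers, and the weight and standard-error formulas assume the standard information-criterion weights and a weighted sum of sqrt(SE^2 + (score - averaged score)^2), since the corresponding lines of the listing are truncated in this excerpt.

```r
modelavg_vectorized <- function(personscores, personSEs, selectionindex) {
  scores <- as.matrix(personscores)
  ses    <- as.matrix(personSEs)

  # Assumed weight form, computed from differences to avoid underflow
  w <- sapply(selectionindex,
              function(ic_k) 1 / sum(exp(-0.5 * (selectionindex - ic_k))))

  # Weighted average score per person (rows = persons, columns = models)
  avg_score <- as.vector(scores %*% w)

  # Assumed SE form: combine each model's SE with the squared deviation of
  # that model's score from the averaged score, then weight across models
  avg_se <- as.vector(sqrt(ses^2 + (scores - avg_score)^2) %*% w)

  list(weights = w, avg_score = avg_score, avg_se = avg_se)
}

# Toy input: 3 persons scored under 2 candidate models (illustrative only)
toy_scores <- cbind(m1 = c(-0.5, 0.0, 1.0), m2 = c(-0.7, 0.2, 1.1))
toy_ses    <- cbind(m1 = c(0.30, 0.25, 0.35), m2 = c(0.32, 0.27, 0.33))

res <- modelavg_vectorized(toy_scores, toy_ses, selectionindex = c(1000, 1000))
res$weights    # equal criterion values give equal weights
res$avg_score
res$avg_se
```

Under this assumed form, the averaged standard error is never smaller than the weighted mean of the per-model standard errors, because disagreement between models adds to each term under the square root.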



About this article


Cite this article

Rights, J.D., Sterba, S.K., Cho, SJ. et al. Addressing model uncertainty in item response theory person scores through model averaging. Behaviormetrika 45, 495–503 (2018).


  • Item response theory
  • Model uncertainty
  • Model averaging
  • Item banks