Addressing model uncertainty in item response theory person scores through model averaging

Rights, Jason D.; Sterba, Sonya K.; Cho, Sun-Joo; Preacher, Kristopher J.

doi:10.1007/s41237-018-0052-1

Short Note
Published: 04 June 2018

Volume 45, pages 495–503, (2018)
Behaviormetrika

Abstract

Item banks are often created in large-scale research and testing settings in the social sciences to predict individuals’ latent trait scores. A common procedure is to fit multiple candidate item response theory (IRT) models to a calibration sample and select a single best-fitting IRT model. The parameter estimates from this model are then used to obtain trait scores for subsequent respondents. However, this model selection procedure ignores model uncertainty stemming from the fact that the model ranking in the calibration phase is subject to sampling variability. Consequently, the standard errors of trait scores obtained from subsequent respondents do not reflect such uncertainty. Ignoring such sources of uncertainty contributes to the current replication crisis in the social sciences. In this article, we propose and demonstrate an alternative procedure to account for model uncertainty in this context—model averaging of IRT trait scores and their standard errors. We outline the general procedure step-by-step and provide software to aid researchers in implementation, both for large-scale research settings with item banks and for smaller research settings involving IRT scoring. We then demonstrate the procedure with a simulated item-banking illustration, comparing model selection and model averaging within sample in terms of predictive coverage. We conclude by discussing ways that model averaging and IRT scoring can be used and investigated in future research.

Notes

When computing Eq. 3, each of the K values summed in the denominator can be rounded to 0 by the software, yielding an undefined (0/0) result. The software we provide in the Appendix accounts for this by using a mathematically equivalent reformulation that is not susceptible to this issue.
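
The issue described in this note can be reproduced in a few lines of R. The sketch below assumes that Eq. 3 has the usual information-criterion form, w_k = exp(-0.5 IC_k) / sum_j exp(-0.5 IC_j), which is what the weight computation in the Appendix code implies; the BIC values are made-up illustrative numbers, not results from the article.

```r
# Hypothetical BIC values for K = 3 candidate models (illustrative only)
bic <- c(20150.3, 20162.8, 20171.1)

# Naive form: exp(-0.5 * BIC) underflows to 0 for every model, so the
# ratio is 0/0 and R returns NaN
naive <- exp(-0.5 * bic) / sum(exp(-0.5 * bic))

# Reformulated form: only differences between criterion values are
# exponentiated, w_k = 1 / sum_j exp(-0.5 * (BIC_j - BIC_k))
stable <- sapply(bic, function(b) 1 / sum(exp(-0.5 * (bic - b))))

print(naive)
print(stable)
```

Because only differences between criterion values enter the exponent, the reformulated weights stay finite and sum to 1 however large the raw values are.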


Author information

Authors and Affiliations

Quantitative Methods Program, Department of Psychology and Human Development, Vanderbilt University, Peabody #552, 230 Appleton Place, Nashville, TN, 37203, USA
Jason D. Rights, Sonya K. Sterba, Sun-Joo Cho & Kristopher J. Preacher

Corresponding author

Correspondence to Jason D. Rights.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Communicated by Ronny Scherer and Marie Wiberg

Appendix: modelavgIRT R function

1.1 modelavgIRT R function description

For each of a set of candidate IRT models, this function reads in person scores (e.g., EAP scores) and their standard errors from the validation sample, together with information criterion values (BIC or AIC) from the calibration sample. It outputs model-averaged person scores and standard errors (see manuscript equations 4 and 5).
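
Reading the quantities off the function code in this appendix, the two outputs appear to take the form below. The notation is assumed here for illustration (person i, candidate model k of K, information criterion value IC_k), and the correspondence to manuscript equations 4 and 5 is inferred from the code rather than quoted from the article.

```latex
w_k = \frac{\exp(-\tfrac{1}{2}\,\mathrm{IC}_k)}{\sum_{j=1}^{K} \exp(-\tfrac{1}{2}\,\mathrm{IC}_j)},
\qquad
\bar{\theta}_i = \sum_{k=1}^{K} w_k\, \hat{\theta}_{ik},
\qquad
\mathrm{SE}(\bar{\theta}_i) = \sum_{k=1}^{K} w_k
  \sqrt{\mathrm{SE}(\hat{\theta}_{ik})^2 + \bigl(\hat{\theta}_{ik} - \bar{\theta}_i\bigr)^2}
```

The second term under the square root grows when the candidate models disagree about a person's score, so the averaged standard error reflects between-model disagreement as well as within-model uncertainty.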

1.2 modelavgIRT R function input

personscores: A data set consisting of person scores obtained from each candidate model in the validation sample, with rows denoting persons and columns denoting models
personSEs: A data set consisting of person score standard errors obtained from each candidate model in the validation sample, with rows denoting persons and columns denoting models
selectionindex: List of information criterion values (BIC or AIC), one for each model, in the order of the columns of personscores and personSEs
rescale: Logical; if set to TRUE (default), prior to averaging, each model's person scores will be rescaled to have a mean of 0 and a variance of 1, and the standard errors will be rescaled proportionally
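
As a minimal sketch of the expected input layout, the toy objects below hold scores and standard errors for 4 persons under 3 candidate models. All numbers and the column names model1 to model3 are hypothetical, and the checks at the end are an optional suggestion rather than part of the function.

```r
# Toy validation sample: rows are persons, columns are candidate models
personscores <- data.frame(
  model1 = c(-1.20, -0.10, 0.40, 1.10),
  model2 = c(-1.05, -0.20, 0.55, 0.95),
  model3 = c(-1.30,  0.00, 0.35, 1.25)
)
personSEs <- data.frame(
  model1 = c(0.45, 0.38, 0.37, 0.44),
  model2 = c(0.50, 0.41, 0.40, 0.47),
  model3 = c(0.43, 0.36, 0.36, 0.42)
)

# BIC values from the calibration sample, in the same order as the columns
selectionindex <- c(model1 = 20150.3, model2 = 20162.8, model3 = 20171.1)

# Optional layout checks before averaging
stopifnot(
  identical(dim(personscores), dim(personSEs)),
  ncol(personscores) == length(selectionindex),
  identical(names(personscores), names(selectionindex)),
  all(as.matrix(personSEs) > 0)
)
```

Keeping the model order identical across all three inputs matters because the weights are matched to columns by position.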

1.3 modelavgIRT R function Code

modelavgIRT <- function(personscores, personSEs, selectionindex, rescale = TRUE) {
  ## coerce inputs so that the arithmetic below works for data frames and lists
  personscores <- as.matrix(personscores)
  personSEs <- as.matrix(personSEs)
  selectionindex <- unlist(selectionindex)

  ## rescale personscores to have mean 0 and var 1
  ## rescale personSEs proportionally
  if (rescale == TRUE) {
    for (i in seq(ncol(personscores))) {
      ## store the SD first: once the scores are rescaled their SD equals 1
      score.sd <- sd(personscores[, i])
      personscores[, i] <- (personscores[, i] - mean(personscores[, i])) / score.sd
      personSEs[, i] <- personSEs[, i] / score.sd
    }
  }

  ## compute weights (only differences between criterion values are exponentiated)
  weights <- rep(NA, length(selectionindex))
  for (i in seq(length(selectionindex))) {
    weights[i] <- sum(exp(-.5 * selectionindex + .5 * selectionindex[i]))^(-1)
  }

  ## compute averaged person scores
  avg.personscore <- matrix(NA, nrow(personscores), 1)
  for (i in seq(nrow(personscores))) {
    avg.personscore[i, ] <- sum(weights * personscores[i, ])
  }

  ## compute averaged person SEs
  avg.personSE <- matrix(NA, nrow(personSEs), 1)
  for (i in seq(nrow(personSEs))) {
    avg.personSE[i, ] <- sum(weights * sqrt(personSEs[i, ]^2 +
      (personscores[i, ] - avg.personscore[i, ])^2))
  }

  output <- list(weights, avg.personscore, avg.personSE)
  names(output) <- c("weights", "Average person score", "Average person SE")
  return(output)
}
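
For readers who prefer matrix operations to explicit loops, the same averaging steps can be sketched in vectorized form. This is an illustration, not the authors' function: the name modelavg_sketch, its argument names, and the toy numbers are hypothetical, and the weights are stabilized by subtracting the smallest criterion value, a different but equivalent device to the reformulation used in modelavgIRT.

```r
# Vectorized sketch of the averaging steps (hypothetical helper)
modelavg_sketch <- function(scores, ses, ic, rescale = TRUE) {
  scores <- as.matrix(scores)
  ses <- as.matrix(ses)
  ic <- unname(unlist(ic))
  if (rescale) {
    sds <- apply(scores, 2, sd)      # SDs taken before the scores are rescaled
    scores <- scale(scores)          # mean 0 and variance 1 within each model
    ses <- sweep(ses, 2, sds, "/")   # SEs divided by the same SDs
  }
  # Weights from differences to the smallest criterion value
  w <- exp(-0.5 * (ic - min(ic)))
  w <- w / sum(w)
  # Weighted average of the scores across models, one value per person
  avg <- as.vector(scores %*% w)
  # Per-model term sqrt(SE^2 + (score - averaged score)^2), then weighted sum
  avg_se <- as.vector(sqrt(ses^2 + (scores - avg)^2) %*% w)
  list(weights = w, score = avg, se = avg_se)
}

# Toy usage: 4 persons, 2 candidate models, criterion values 2 points apart
scores <- cbind(c(-1.2, -0.1, 0.4, 1.1), c(-1.0, -0.2, 0.5, 0.9))
ses <- cbind(c(0.45, 0.38, 0.37, 0.44), c(0.50, 0.41, 0.40, 0.47))
out <- modelavg_sketch(scores, ses, ic = c(1000, 1002), rescale = FALSE)
print(out)
```

With rescale = FALSE the scores are averaged on their original metric, which is only sensible when all candidate models already place scores on a common scale.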

About this article

Cite this article

Rights, J.D., Sterba, S.K., Cho, SJ. et al. Addressing model uncertainty in item response theory person scores through model averaging. Behaviormetrika 45, 495–503 (2018). https://doi.org/10.1007/s41237-018-0052-1

Received: 12 March 2018
Accepted: 18 May 2018
Published: 04 June 2018
Issue Date: October 2018
DOI: https://doi.org/10.1007/s41237-018-0052-1
