# Addressing model uncertainty in item response theory person scores through model averaging

## Abstract

Item banks are often created in large-scale research and testing settings in the social sciences to predict individuals’ latent trait scores. A common procedure is to fit multiple candidate item response theory (IRT) models to a calibration sample and select a single best-fitting IRT model. The parameter estimates from this model are then used to obtain trait scores for subsequent respondents. However, this model selection procedure ignores *model uncertainty* stemming from the fact that the model ranking in the calibration phase is subject to sampling variability. Consequently, the standard errors of trait scores obtained from subsequent respondents do not reflect such uncertainty. Ignoring such sources of uncertainty contributes to the current replication crisis in the social sciences. In this article, we propose and demonstrate an alternative procedure to account for model uncertainty in this context: *model averaging* of IRT trait scores and their standard errors. We outline the general procedure step by step and provide software to aid researchers in implementation, both for large-scale research settings with item banks and for smaller research settings involving IRT scoring. We then demonstrate the procedure with a simulated item-banking illustration, comparing model selection and model averaging within sample in terms of predictive coverage. We conclude by discussing ways that model averaging and IRT scoring can be used and investigated in future research.
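To make the idea of averaging trait scores across candidate models concrete, the following is a minimal sketch, not the procedure or software described in the article. It assumes each candidate IRT model has already been fitted, that each yields an information-criterion value (smaller is better), and that each gives a trait score and standard error for one respondent. The weighting rule (exponentiated half-differences in the criterion) and the combined standard error (a weighted sum that adds the between-model spread to each model's sampling variance) are common choices in the multimodel-inference literature and are assumptions here. All function names and numbers are hypothetical.

```python
import math


def ic_weights(ic_values):
    """Turn information-criterion values (smaller is better) into weights that sum to 1."""
    best = min(ic_values)
    raw = [math.exp(-0.5 * (ic - best)) for ic in ic_values]
    total = sum(raw)
    return [r / total for r in raw]


def average_score(thetas, ses, weights):
    """Model-averaged trait score and standard error for one respondent.

    The standard error combines each model's own sampling variance with the
    squared distance of that model's score from the averaged score, so
    disagreement between models widens the reported uncertainty.
    """
    theta_bar = sum(w * t for w, t in zip(weights, thetas))
    se_bar = sum(
        w * math.sqrt(s ** 2 + (t - theta_bar) ** 2)
        for w, t, s in zip(weights, thetas, ses)
    )
    return theta_bar, se_bar


# Hypothetical values for three candidate models (illustration only).
ic_values = [10250.0, 10252.0, 10260.0]  # e.g. an information criterion per model
thetas = [0.42, 0.55, 0.31]              # one respondent's score under each model
ses = [0.30, 0.33, 0.28]                 # the matching standard errors

weights = ic_weights(ic_values)
theta_bar, se_bar = average_score(thetas, ses, weights)
print(weights, theta_bar, se_bar)
```

Under this sketch, the averaged standard error is never smaller than the weighted average of the per-model standard errors, which is the sense in which model averaging carries model uncertainty into the reported score precision.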

## Keywords

Item response theory, Model uncertainty, Model averaging, Item banks

## Notes

### Compliance with ethical standards

### Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
