Abstract
Multidimensional-Method A (M-Method A) has been proposed as an efficient and effective online calibration method for multidimensional computerized adaptive testing (MCAT) (Chen & Xin, Paper presented at the 78th Meeting of the Psychometric Society, Arnhem, The Netherlands, 2013). However, a key assumption of M-Method A is that it treats person parameter estimates as their true values; consequently, the method might yield erroneous item calibration when the person parameter estimates contain non-ignorable measurement error. To improve the performance of M-Method A, this paper proposes a new MCAT online calibration method, namely, the full functional MLE-M-Method A (FFMLE-M-Method A). This new method combines the full functional MLE (Jones & Jin in Psychometrika 59:59–75, 1994; Stefanski & Carroll in Annals of Statistics 13:1335–1351, 1985) with the original M-Method A in an effort to correct for the estimation error of the ability vector that might otherwise adversely affect the precision of item calibration. Two correction schemes are also proposed for implementing the new method. A simulation study showed that the new method yielded more accurate item parameter estimates than the original M-Method A in almost all conditions.
Notes
Note that M-Method A, M-OEM, and M-MEM are multidimensional generalizations of the original Method A, OEM and MEM methods in UCAT, respectively.
Because the measurement error model assumes that \({\varvec{\upvarepsilon }}_i \) is independent of \(y_{ij} \), \({{\varvec{\uptheta }} }_i^O \) is independent of \(y_{ij} \).
In classical functional models, the unobserved true values \({{\varvec{\uptheta }} }_i \)’s are regarded as unknown fixed constants or parameters (Carroll et al., 2006, p. 25).
The rationale for selecting these two sample sizes is as follows. Consistent with Stefanski and Carroll (1985), the number of examinees who answer each new item is varied at two levels (\(n_{j}\) = 300 and 600). Because we simulate m = 30 new items (see Section 4.2) and assume that each examinee responds to only D = 6 new items (see Section 4.3), the resulting sample sizes equal \(n_{j}\times \)(m/D) (i.e., N = 1500 and 3000).
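The sample-size arithmetic above can be checked with a short snippet (illustrative only; the variable names are ours, not from the paper):

```python
# Sample-size arithmetic from the design described above:
# each new item is answered by n_j examinees, m = 30 new items are
# simulated, and each examinee answers D = 6 new items,
# so the total number of examinees is N = n_j * (m / D).
m, D = 30, 6

for n_j in (300, 600):
    N = n_j * (m // D)
    print(f"n_j = {n_j} -> N = {N}")
```

Running this prints N = 1500 and N = 3000, matching the two conditions in the simulation design.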
When the expected a posteriori (EAP) method is employed to update the ability vector estimates, the Bayesian version of the D-optimality strategy is used here, in which the prior covariance matrix \({\varvec{\upvarphi }}\) is set to \({\varvec{\Omega }}_\theta \).
For the grid search method, 41 points are taken evenly from [-4, 4] on each coordinate dimension; thus the step size equals 0.2, and a total of 41\(^{3}\) = 68921 ability points are considered.
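The grid described above can be sketched as follows (a minimal illustration of the point count only, not the authors' implementation):

```python
from itertools import product

# 41 evenly spaced points on [-4, 4] in each of the three coordinate
# dimensions, giving a step size of 0.2 and 41**3 candidate ability points.
points = [-4.0 + 0.2 * k for k in range(41)]   # -4.0, -3.8, ..., 4.0
grid = list(product(points, repeat=3))         # all 41**3 = 68921 points

print(len(points))  # 41
print(len(grid))    # 68921
```

A grid search then evaluates the likelihood at each of these 68921 points and returns the maximizer.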
Because each examinee answers six new items and his/her MLE of \({\varvec{\uptheta }}\) is updated once via Equation (15) or Equation (16) for each new item answered, six FFMLE_Individual or FFMLE_Mean estimates are obtained for each examinee; in addition, there are 100 replications. Thus, to assess the \({\varvec{\uptheta }}\) recovery of the two proposed estimators, we compute for each examinee an average \({\varvec{\uptheta }}\) taken over the 100 replications and 6 estimates as his/her new FFMLE estimate before computing the evaluation indicators.
References
Adams, R., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Dekker.
Ban, J.-C., Hanson, B. H., Wang, T. Y., Yi, Q., & Harris, D. J. (2001). A comparative study of on-line pretest item-calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38, 191–212.
Ban, J.-C., Hanson, B. H., Yi, Q., & Harris, D. J. (2002). Data sparseness and online pretest item calibration/scaling methods in CAT (ACT Research Report 02-01). Iowa City, IA: ACT, Inc. Available at http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/19/da/e9.pdf
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 379–479). Reading, MA: Addison-Wesley.
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective (2nd ed.). London: Chapman and Hall.
Chang, H. H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37–52.
Chen, P., Xin, T., Wang, C., & Chang, H. H. (2012). Online calibration methods for the DINA model with independent attributes in CD-CAT. Psychometrika, 77, 201–222.
Chen, P., & Xin, T. (2013). Developing online calibration methods for multidimensional computerized adaptive testing. Paper presented at the 78th Meeting of the Psychometric Society, Arnhem, the Netherlands, July.
Cheng, Y., & Yuan, K. (2010). The impact of fallible item parameter estimates on latent trait recovery. Psychometrika, 75, 280–291.
Debeer, D., Buchholz, J., Hartig, J., & Janssen, R. (2014). Student, school, and country differences in sustained test-taking effort in the 2009 PISA reading assessment. Journal of Educational and Behavioral Statistics, 39, 502–523.
Debeer, D., & Janssen, R. (2013). Modeling item-position effects within an IRT framework. Journal of Educational Measurement, 50, 164–185.
Eggen, T. J. H. M., & Verhelst, N. D. (2011). Item calibration in incomplete testing designs. Psicologica, 32, 107–132.
Folk, V. G., & Golub-Smith, M. (1996). Calibration of on-line pretest data using BILOG. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, April.
Haberman, S. J., & von Davier, A. A. (2014). Considerations on parameter estimation, scoring, and linking in multistage testing. In D. L. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 229–248). Boca Raton, FL: CRC Press.
Haberman, S. J. (2009). Linking parameter estimates derived from an item response model through separate calibrations (Research Report RR-09-40). Princeton, NJ: Educational Testing Service.
Hartig, J., & Höhler, J. (2008). Representation of competencies in multidimensional IRT models with within-item and between-item multidimensionality. Journal of Psychology, 216, 89–101.
Hecht, M., Weirich, S., Siegle, T., & Frey, A. (2015). Modeling booklet effects for nonequivalent group designs in large-scale assessment. Educational and Psychological Measurement,. doi:10.1177/0013164414554219.
Hsu, Y., Thompson, T. D., & Chen, W. (1998). CAT item calibration. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA, April.
Jones, D. H., & Jin, Z. Y. (1994). Optimal sequential designs for on-line item estimation. Psychometrika, 59, 59–75.
Lehmann, E. L., & Casella, G. C. (1998). Theory of point estimation (2nd ed.). New York: Springer.
Lien, D.-H. D. (1985). Moments of truncated bivariate log-normal distributions. Economics Letters, 19, 243–247.
Lord, F. M. (1971). Tailored testing, an application of stochastic approximation. Journal of the American Statistical Association, 66, 707–711.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.
Mislevy, R. J., & Chang, H. (2000). Does adaptive testing violate local independence? Psychometrika, 65, 149–156.
Mulder, J., & van der Linden, W. J. (2009). Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika, 74, 273–296.
Newman, M. E. J., & Barkema, G. T. (1999). Monte Carlo methods in statistical physics. Oxford: Clarendon Press.
Parshall, C. G. (1998). Item development and pretesting in a computer-based testing environment. Paper presented at the colloquium Computer-Based Testing: Building the Foundation for Future Assessments, Philadelphia, PA, September.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). Numerical recipes: The art of scientific computing (3rd ed.). New York: Cambridge University Press.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331–354.
Segall, D. O. (2001). General ability measurement: An application of multidimensional item response theory. Psychometrika, 66, 79–97.
Segall, D. O. (2003). Calibrating CAT pools and online pretest items using MCMC methods. Paper presented at the annual meeting of National Council on Measurement in Education, Chicago, IL, April.
Stefanski, L. A., & Carroll, R. J. (1985). Covariate measurement error in logistic regression. Annals of Statistics, 13, 1335–1351.
Stocking, M. L. (1988). Scale drift in on-line calibration (Research Rep. 88-28). Princeton, NJ: ETS.
van der Linden, W. J., & Ren, H. (2014). Optimal Bayesian adaptive design for test-item calibration. Psychometrika. doi:10.1007/s11336-013-9391-8.
Wainer, H., & Mislevy, R. J. (1990). Item response theory, item calibration, and proficiency estimation. In H. Wainer (Ed.), Computerized adaptive testing: A primer (pp. 65–102). Hillsdale, NJ: Erlbaum.
Wang, C. (2014a). On latent trait estimation in multidimensional compensatory item response models. Psychometrika. doi:10.1007/s11336-013-9399-0.
Wang, C. (2014b). Improving measurement precision of hierarchical latent traits using adaptive testing. Journal of Educational and Behavioral Statistics, 39, 452–477.
Wang, C., & Chang, H. H. (2011). Item selection in multidimensional computerized adaptive testing—gaining information from different angles. Psychometrika, 76, 363–384.
Wang, C., & Chang, H. H. (2012). Reducing bias in MIRT trait estimation. Paper presented at the annual meeting of National Council on Measurement in Education, Vancouver, Canada, April.
Wang, C., Chang, H. H., & Boughton, K. A. (2011). Kullback-Leibler information and its applications in multi-dimensional adaptive testing. Psychometrika, 76, 13–39.
Wang, C., Chang, H. H., & Boughton, K. A. (2013). Deriving stopping rules for multidimensional computerized adaptive testing. Applied Psychological Measurement, 37, 99–122.
Yao, L. H. (2013). Comparing the performance of five multidimensional CAT selection procedures with different stopping rules. Applied Psychological Measurement, 37, 3–23.
Yao, L. H., Pommerich, M., & Segall, D. O. (2014). Using multidimensional CAT to administer a short, yet precise, screening test. Applied Psychological Measurement, 38, 614–631.
Acknowledgments
This study was partially supported by the National Natural Science Foundation of China (Grant No. 31300862), the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20130003120002), the Fundamental Research Funds for the Central Universities (Grant No. 2013YB26), the National Academy of Education/Spencer Fellowship (Grant No. 792269), and KLAS (Grant No. 130026509). Part of the paper was originally presented at the 2014 annual meeting of the National Council on Measurement in Education, Philadelphia, Pennsylvania. The authors are indebted to the editor, associate editor, and four anonymous reviewers for their suggestions and comments on an earlier version of the manuscript.
Additional information
Both authors made equal contributions to the paper, and the order of authorship is alphabetical.
Cite this article
Chen, P., Wang, C. A New Online Calibration Method for Multidimensional Computerized Adaptive Testing. Psychometrika 81, 674–701 (2016). https://doi.org/10.1007/s11336-015-9482-9