“A people however, who are possessed of the spirit of commerce, who see, and who will pursue their advantages, may achieve almost anything.”
-George Washington, 1784, Letter to Benjamin Harrison.
Abstract
This paper assesses the psychometric value of allowing test-takers a choice of items in standardized testing. New theoretical results examine the conditions under which allowing choice improves score precision. A hierarchical framework is presented for jointly modeling the accuracy of cognitive responses and item choices. The statistical methodology is disseminated in the ‘cIRT’ R package. An ‘answer two, choose one’ (A2C1) test administration design is introduced to avoid challenges associated with nonignorable missing data. Experimental results suggest that the A2C1 design and payout structure encouraged subjects to choose items consistent with their cognitive trait levels. Substantively, the experimental data suggest that item choices yielded information and discrimination comparable to that of the cognitive items. Given that there are no clear guidelines for writing more or less discriminating items, one practical implication is that choice can serve as a mechanism to improve score precision.
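To make the abstract’s notion of jointly modeling accuracy and choice concrete, the display below is an illustrative sketch rather than the paper’s exact specification: it assumes a two-parameter normal-ogive item response model for accuracy on cognitive item \(j\) and a Thurstonian comparative-judgment rule for a choice between items \(j\) and \(k\), linked by correlated person-level latent variables. All symbols (\(a_j, b_j, \mu_j, \lambda_{jk}, \rho\)) are illustrative notation, not the paper’s.

\[
\Pr(Y_{ij}=1 \mid \theta_i) = \Phi(a_j \theta_i - b_j), \qquad
\Pr(i \text{ chooses } j \text{ over } k \mid \eta_i) = \Phi\big(\mu_j - \mu_k + \lambda_{jk}\,\eta_i\big),
\]
\[
(\theta_i, \eta_i)^\top \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \qquad
\boldsymbol{\Sigma} = \begin{pmatrix} 1 & \rho \\ \rho & \sigma^2_{\eta} \end{pmatrix}.
\]

In a structure of this kind, a nonzero correlation \(\rho\) is what allows observed item choices to carry information about the cognitive trait, which is the mechanism by which choice can sharpen score precision.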
Acknowledgments
This research was made possible by a grant from the Illinois Campus Research Board. The authors thank undergraduate research assistants Yusheng Feng, Simon Gaberov, Kulsumjeham Siddiqui, and Darren Ward for their assistance with data collection.
Appendix
1.1 Parameter Recovery Monte Carlo Simulation
This section reports results of a Monte Carlo simulation designed to assess the ability of the proposed algorithm to recover item parameters. Specifically, the estimated model parameters reported in Table 3 were used as population values and data for 252 subjects were simulated to assess bias and root mean squared error (RMSE). Furthermore, the Monte Carlo simulation employed the experimental fixed-effects and random-effects design matrices \(\mathbf {X}\) and \(\mathbf {W}\) to generate data from the model.
Figures 9 and 10 report parameter bias and RMSE based upon 1000 replications. Figure 9 provides evidence of minimal bias for a small sample size of 252 participants. Figure 10 plots RMSE for the IRT, Thurstone, and hierarchical model parameters. In particular, RMSE for the structural coefficients (i.e., \(\varvec{\beta }\)) and random-effect variances (i.e., \(\text {diag}\left( \varvec{\Sigma }_{\varvec{\zeta }}\right) \)) was generally smaller than the RMSE for the item slopes and thresholds. Furthermore, the RMSE for the payout condition fixed-effects (i.e., the first six fixed-effects for \(\varvec{\gamma }\)) was smaller than for the remaining 27 item evaluations. The difference in RMSE between the payout condition main-effects and the item evaluations is expected, given that only a subset of the paired comparisons was collected and individual items therefore received fewer evaluations.
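The appendix describes the recovery study in words; the sketch below shows the same bias/RMSE bookkeeping in R under simplifying assumptions. It uses hypothetical 2PL slopes and thresholds rather than the Table 3 estimates, ignores the design matrices \(\mathbf {X}\) and \(\mathbf {W}\), and substitutes marginal maximum likelihood from the mirt package for the paper’s Gibbs sampler in the cIRT package, so it illustrates the procedure rather than reproducing it.

```r
library(mirt)

set.seed(1)
n_subj <- 252   # sample size used in the appendix simulation
n_rep  <- 200   # the appendix uses 1000 replications; reduced here for speed
a_true <- c(1.2, 0.8, 1.5, 1.0, 0.9)    # hypothetical slopes (not the Table 3 values)
b_true <- c(-0.5, 0.0, 0.3, 1.0, -1.0)  # hypothetical thresholds

# Simulate 2PL responses: P(Y = 1) = logistic(a * (theta - b)), item by item
simulate_responses <- function(theta, a, b) {
  Y <- sapply(seq_along(a), function(j) {
    rbinom(length(theta), 1, plogis(a[j] * (theta - b[j])))
  })
  colnames(Y) <- paste0("item", seq_along(a))
  Y
}

est_a <- matrix(NA_real_, n_rep, length(a_true))
est_b <- matrix(NA_real_, n_rep, length(a_true))

for (r in seq_len(n_rep)) {
  theta <- rnorm(n_subj)                        # latent trait for this replication
  Y     <- simulate_responses(theta, a_true, b_true)
  fit   <- mirt(as.data.frame(Y), 1, itemtype = "2PL", verbose = FALSE)
  pars  <- coef(fit, simplify = TRUE)$items     # columns: a1, d, g, u
  est_a[r, ] <- pars[, "a1"]
  est_b[r, ] <- -pars[, "d"] / pars[, "a1"]     # convert intercept to threshold
}

# Bias and RMSE across replications, per item parameter
bias_a <- colMeans(est_a) - a_true
rmse_a <- sqrt(colMeans(sweep(est_a, 2, a_true)^2))
bias_b <- colMeans(est_b) - b_true
rmse_b <- sqrt(colMeans(sweep(est_b, 2, b_true)^2))

round(rbind(bias_a, rmse_a, bias_b, rmse_b), 3)
```

Each replication draws a fresh sample of 252 examinees with the fixed population parameters, refits the model, and the bias and RMSE summaries are then read off per parameter, exactly the quantities plotted in Figures 9 and 10.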
Cite this article
Culpepper, S.A., Balamuta, J.J. A Hierarchical Model for Accuracy and Choice on Standardized Tests. Psychometrika 82, 820–845 (2017). https://doi.org/10.1007/s11336-015-9484-7