Volume 82, Issue 3, pp 820–845

A Hierarchical Model for Accuracy and Choice on Standardized Tests

  • Steven Andrew Culpepper
  • James Joseph Balamuta


This paper assesses the psychometric value of allowing test-takers choice in standardized testing. New theoretical results examine the conditions under which allowing choice improves score precision. A hierarchical framework is presented for jointly modeling the accuracy of cognitive responses and item choices. The statistical methodology is disseminated in the ‘cIRT’ R package. An ‘answer two, choose one’ (A2C1) test administration design is introduced to avoid challenges associated with nonignorable missing data. Experimental results suggest that the A2C1 design and payout structure encouraged subjects to choose items consistent with their cognitive trait levels. Substantively, the experimental data suggest that item choices yielded information and discrimination comparable to those of cognitive items. Given that there are no clear guidelines for writing more or less discriminating items, one practical implication is that choice can serve as a mechanism to improve score precision.


high-stakes testing, item response theory, Thurstonian models, Bayesian statistics, choice



This research was made possible by a grant from the Illinois Campus Research Board. The authors acknowledge undergraduate research assistants Yusheng Feng, Simon Gaberov, Kulsumjeham Siddiqui, and Darren Ward for assistance with data collection.



Copyright information

© The Psychometric Society 2015

Authors and Affiliations

  • Steven Andrew Culpepper (1)
  • James Joseph Balamuta (1)
  1. Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, USA
