Evaluating CTT- and IRT-Based Single-Administration Estimates of Classification Consistency and Accuracy

Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 66)


The percentage of examinees who are classified consistently and accurately into proficiency levels is an important measurement property of tests used to classify candidates. Given suspected discrepancies between classical test theory (CTT)-based and item response theory (IRT)-based single-administration estimates of decision consistency and accuracy (DC/DA), the two approaches were evaluated for accuracy and robustness under simulated conditions that varied test length, ability distribution, and the degree of local item dependence (LID). The CTT-based Livingston–Lewis method was found to underestimate the DC indices across all conditions and to be more sensitive to short tests and skewed ability distributions. The IRT-based Lee method showed small biases in most conditions, except under a high degree of LID. With both methods, the presence of LID had a much greater negative effect on the DA estimate than on the DC estimate.
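To make the two indices concrete, the following minimal sketch simulates DC and DA for a single cut score. It is an illustration of the definitions only, not the Livingston–Lewis or Lee procedure: it assumes a hypothetical Beta(6, 4) true-score distribution, locally independent dichotomous items, and a simulated second parallel form, which is exactly what single-administration methods are designed to avoid needing. The function name and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_dc_da(n_examinees=2000, n_items=40, cut=0.6, seed=0):
    """Return (DC, DA) for one pass/fail cut on the proportion-correct scale.

    Assumptions (not from the paper): true proportion-correct scores follow
    Beta(6, 4), and each form consists of n_items independent Bernoulli items.
    """
    rng = random.Random(seed)
    consistent = 0  # same decision on both simulated parallel forms
    accurate = 0    # decision on form A matches the true-score decision
    for _ in range(n_examinees):
        true_p = rng.betavariate(6, 4)
        # Observed proportion-correct scores on two hypothetical parallel forms.
        score_a = sum(rng.random() < true_p for _ in range(n_items)) / n_items
        score_b = sum(rng.random() < true_p for _ in range(n_items)) / n_items
        pass_true = true_p >= cut
        pass_a = score_a >= cut
        pass_b = score_b >= cut
        consistent += pass_a == pass_b
        accurate += pass_a == pass_true
    return consistent / n_examinees, accurate / n_examinees


if __name__ == "__main__":
    for length in (10, 40, 200):
        dc, da = simulate_dc_da(n_items=length)
        print(f"items={length:3d}  DC={dc:.3f}  DA={da:.3f}")
```

Running the sketch with different values of `n_items` shows the general dependence of both indices on test length, one of the conditions varied in the study. It does not model LID or skewed ability distributions.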


  1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
  2. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley.
  3. Bourque, M. L., Goodman, D., Hambleton, R. K., & Han, N. (2004). Reliability estimates for the ABTE tests in elementary education, professional teaching knowledge, secondary mathematics and English/language arts (Final report). Leesburg, VA: Mid-Atlantic Psychometric Services.
  4. Brennan, R. L. (2004). BB-CLASS: A computer program that uses the beta-binomial model for classification consistency and accuracy (Version 1.0, CASMA Research Report No. 9). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma
  5. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
  6. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
  7. Deng, N. (2011). Evaluating IRT- and CTT-based methods of estimating classification consistency and accuracy indices from single administrations (Unpublished doctoral dissertation). Amherst, MA: University of Massachusetts.
  8. Hambleton, R. K., & Novick, M. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10(3), 159–170.
  9. Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27, 345–359.
  10. Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253–264.
  11. Huynh, H. (1990). Computation and statistical inference for decision consistency indexes based on the Rasch model. Journal of Educational Statistics, 15, 353–368.
  12. Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1–17.
  13. Lee, W., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33, 374–390.
  14. Lee, W., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26, 412–432.
  15. Lee, W., & Kolen, M. J. (2008). IRT-CLASS: A computer program for item response theory classification consistency and accuracy (Version 2.0). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma
  16. Li, S. (2006). Evaluating the consistency and accuracy of proficiency classifications using item response theory (Unpublished dissertation). Amherst, MA: University of Massachusetts.
  17. Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179–197.
  18. Muraki, E., & Bock, R. D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [Computer program]. Chicago, IL: Scientific Software International, Inc.
  19. Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment Research & Evaluation, 7(14). Available online: http://pareonline.net/getvn.asp?v=7&n=14
  20. Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment Research & Evaluation, 10(13). Available online: http://pareonline.net/getvn.asp?v=10&n=13
  21. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika (Monograph Supplement, 17).
  22. Subkoviak, M. J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13, 265–276.
  23. Swaminathan, H., Hambleton, R. K., & Algina, J. (1974). Reliability of criterion referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 11, 263–267.
  24. Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245–269). Amsterdam: Kluwer Academic Publishers.
  25. Wan, L., Brennan, R. L., & Lee, W. (2007). Estimating classification consistency for complex assessments (CASMA Research Report No. 22). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma
  26. Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141–162.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, USA
  2. Center for Educational Assessment, University of Massachusetts Amherst, Amherst, USA
