The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics

Abstract

The primary purpose of this study is to investigate the mathematical characteristics of the test reliability coefficient ρXX as a function of item response theory (IRT) parameters and to present lower and upper bounds for the coefficient. A second purpose is to examine the relative performance of the IRT reliability statistics and two classical test theory (CTT) reliability statistics (Cronbach’s alpha and the Feldt–Gilmer congeneric coefficient) under various testing conditions created by manipulating large-scale real data. For the first purpose, two alternative ways of exactly quantifying ρXX are compared in terms of computational efficiency and statistical usefulness. In addition, the lower and upper bounds for ρXX are presented in line with the assumptions of essential tau-equivalence and congeneric similarity, respectively. The empirical studies conducted for the second purpose showed that, across all testing conditions, (1) the IRT reliability coefficient was higher than the CTT reliability statistics; (2) the IRT reliability coefficient was closer to the Feldt–Gilmer coefficient than to Cronbach’s alpha; and (3) Cronbach’s alpha was close to the lower bound of IRT reliability. Some advantages of the IRT approach to estimating test-score reliability over the CTT approaches are discussed at the end.
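As a rough illustration of how a test-score reliability coefficient can be written as a function of IRT item parameters, the following Python sketch computes the reliability of the number-correct score under a two-parameter logistic (2PL) model, together with the population value of Cronbach’s alpha implied by the same model. It is a minimal sketch under stated assumptions, not the authors’ procedure: the 2PL model, the standard normal ability distribution, the equally spaced quadrature grid, the function names, and the item parameters are all assumptions made for illustration. Reliability is taken as the ratio of true-score variance to observed-score variance, with the true score defined as the sum of the item response probabilities.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric, no 1.7 scaling)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def quadrature(n_points=81, lo=-6.0, hi=6.0):
    """Equally spaced nodes with normalized standard-normal weights (assumed ability distribution)."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]

def reliability_and_alpha(items, n_points=81):
    """Return (rho, alpha) for the number-correct score of dichotomous 2PL items.

    items: list of (a, b) pairs, hypothetical discrimination and difficulty values.
    rho:   Var(true score) / Var(observed score), assuming local independence.
    alpha: population Cronbach's alpha implied by the same model.
    """
    nodes, weights = quadrature(n_points)
    n = len(items)
    # Item response probabilities at each quadrature node.
    probs = [[p_2pl(t, a, b) for (a, b) in items] for t in nodes]
    # True score at each node: sum of item probabilities.
    true_scores = [sum(row) for row in probs]
    mean_true = sum(w * t for w, t in zip(weights, true_scores))
    var_true = sum(w * (t - mean_true) ** 2 for w, t in zip(weights, true_scores))
    # Average conditional error variance: E[sum_i P_i (1 - P_i)].
    var_error = sum(w * sum(p * (1.0 - p) for p in row)
                    for w, row in zip(weights, probs))
    var_obs = var_true + var_error
    rho = var_true / var_obs
    # Marginal item variances pi_i (1 - pi_i), with pi_i = E[P_i].
    pi = [sum(w * row[i] for w, row in zip(weights, probs)) for i in range(n)]
    sum_item_var = sum(p * (1.0 - p) for p in pi)
    alpha = n / (n - 1) * (1.0 - sum_item_var / var_obs)
    return rho, alpha

if __name__ == "__main__":
    # Hypothetical item parameters, chosen only for illustration.
    example_items = [(0.6, -1.0), (0.9, -0.5), (1.2, 0.0), (1.5, 0.5), (1.8, 1.0)]
    rho, alpha = reliability_and_alpha(example_items)
    print(f"IRT reliability: {rho:.4f}, population alpha: {alpha:.4f}")
```

In this sketch alpha cannot exceed the model-implied reliability, and the two coincide when all items share the same parameters, which is consistent with alpha’s familiar role as a lower bound when measurement errors are uncorrelated.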

Author information

Corresponding author

Correspondence to Seonghoon Kim.

About this article

Cite this article

Kim, S., Feldt, L.S. The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Educ. Rev. 11, 179–188 (2010). https://doi.org/10.1007/s12564-009-9062-8

Keywords

  • Test reliability
  • Item response theory (IRT)
  • Lower and upper bounds of reliability coefficient
  • Test score metric versus ability score metric