Psychometrika, 73, 533 (2008)

Random Item IRT Models

  • Paul De Boeck
Presidential Address

Abstract

It is common practice in IRT to consider items as fixed and persons as random. Both continuous and categorical person parameters are most often treated as random variables, whereas for items only continuous parameters are used, and these are commonly fixed, although exceptions occur. The present article shows that random item parameters make sense theoretically, and that in practice the random item approach is promising for handling several issues: the measurement of persons, the explanation of item difficulties, and troubleshooting with respect to DIF. Corresponding to these issues, the article consists of three parts. All three rely on the Rasch model as the simplest model to study, and the same data set is used for all applications. First, it is shown that the Rasch model with fixed persons and random items is an interesting measurement model, both in theory and in terms of its goodness of fit. Second, the linear logistic test model (LLTM) is extended with an error term, so that the explanation of the item difficulties from the item properties does not need to be perfect. Finally, two more models are presented: the random item profile model (RIP) and the random item mixture model (RIM). In the RIP, DIF is not considered a discrete phenomenon, and when a robust regression approach based on the RIP difficulties is applied, quite good DIF identification results are obtained. In the RIM, no prior anchor sets are defined; instead, a latent DIF class of items is used, so that posterior anchoring is realized (anchoring based on the item mixture). Both approaches turn out to be promising for the identification of DIF.
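
The abstract compresses three model families, so a minimal formal sketch may help fix ideas. The notation below (\(\theta_p\), \(\beta_i\), \(\eta_k\), \(X_{ik}\), \(z_i\), \(\delta_{ig}\)) is assumed here for illustration and is not quoted from the article; the RIM line in particular is only one plausible parameterization of a latent DIF class. The Rasch model with fixed person parameters and random item difficulties reads

\[
\Pr(Y_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\qquad \beta_i \sim N(\mu_\beta, \sigma_\beta^2).
\]

The LLTM with an error term regresses the item difficulties on \(K\) item properties, leaving a residual so that the explanation need not be perfect:

\[
\beta_i = \sum_{k=1}^{K} \eta_k X_{ik} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma_\varepsilon^2).
\]

And in the spirit of the RIM, a latent indicator \(z_i\) assigns each item to the DIF class with some probability \(\pi\), so that only items in that class receive a group-specific difficulty shift:

\[
\beta_{ig} = \beta_i + z_i \, \delta_{ig},
\qquad z_i \sim \mathrm{Bernoulli}(\pi).
\]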

Keywords

random effects, generalizability, measurement, LLTM, DIF

Copyright information

© The Psychometric Society 2008

Authors and Affiliations

  1. K.U. Leuven, Leuven, Belgium
