Psychometrika, Volume 83, Issue 1, pp 132–155

Score-Based Tests of Differential Item Functioning via Pairwise Maximum Likelihood Estimation

  • Ting Wang
  • Carolin Strobl
  • Achim Zeileis
  • Edgar C. Merkle

Abstract

Measurement invariance is a fundamental assumption in item response theory models, where the relationship between a latent construct (ability) and observed item responses is of interest. Violations of this assumption can lead to misinterpretation of the scale or to systematic bias against certain groups of persons. While a number of methods have been proposed to detect measurement invariance violations, they typically require the problematic item parameters and the respondent grouping to be specified in advance; in practice, however, this information is usually unknown. As an alternative, this paper focuses on a family of recently proposed tests based on stochastic processes of casewise derivatives of the likelihood function (i.e., scores). These score-based tests require estimation of only the null model (in which measurement invariance is assumed to hold), and they have previously been applied in factor-analytic contexts with continuous data as well as in models of the Rasch family. In this paper, we aim to extend these tests to two-parameter item response models, with a strong emphasis on pairwise maximum likelihood estimation. The tests’ theoretical background and implementation are detailed, and the tests’ ability to identify problematic item parameters is studied via simulation. An empirical example illustrating the tests’ use in practice is also provided.
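
The workflow the abstract describes — fit the null model once, extract casewise score contributions of the pairwise log-likelihood, then cumulate them along an ordering variable — can be sketched in R. The snippet below is a minimal illustration, not the authors’ supplementary code: it assumes a hypothetical data frame dat with binary items y1–y5 and a continuous covariate age along which invariance is tested, and it combines lavaan’s pairwise maximum likelihood estimator with strucchange’s generalized fluctuation tests.

  ## Minimal sketch of a score-based DIF test via pairwise ML, assuming a
  ## hypothetical data frame `dat` with binary items y1-y5 and a continuous
  ## covariate `age` (the ordering variable along which DIF is tested).
  library("lavaan")       # pairwise maximum likelihood (estimator = "PML")
  library("strucchange")  # score-based parameter instability tests

  ## Step 1: Fit the null model once, assuming measurement invariance holds.
  ## A unidimensional factor model for ordinal items corresponds to a
  ## two-parameter IRT model (loadings ~ discriminations, thresholds ~ difficulties).
  model <- "theta =~ y1 + y2 + y3 + y4 + y5"
  fit <- cfa(model, data = dat, ordered = paste0("y", 1:5),
             std.lv = TRUE, estimator = "PML")

  ## Step 2: Extract casewise derivatives of the pairwise log-likelihood
  ## ("scores"): one row per person, one column per model parameter.
  ## (lavScores() is assumed here to support the PML estimator.)
  sc <- lavScores(fit)

  ## Step 3: Cumulate the scores along `age` and apply the double-maximum
  ## (DM) functional; a significant result signals parameter instability,
  ## i.e., DIF along `age`.
  gefp_age <- gefp(fit, fit = NULL, scores = lavScores,
                   order.by = dat$age, sandwich = FALSE)
  sctest(gefp_age, functional = maxBB)

Because only the null model is estimated, the same fitted object can be reused to test along any candidate ordering variable, and the parm argument of gefp() can restrict the test to the parameters of individual items.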

Keywords

Pairwise maximum likelihood · Score-based test · Item response theory · Differential item functioning

Supplementary material

Supplementary material 1: 11336_2017_9591_MOESM1_ESM.zip (212 KB)

Copyright information

© The Psychometric Society 2017

Authors and Affiliations

  1. Department of Psychological Sciences, University of Missouri, Columbia, USA
  2. Department of Psychology, University of Zurich, Zurich, Switzerland
  3. Department of Statistics, Universität Innsbruck, Innsbruck, Austria