Score-Based Tests of Differential Item Functioning via Pairwise Maximum Likelihood Estimation

Wang, Ting; Strobl, Carolin; Zeileis, Achim; Merkle, Edgar C.

doi:10.1007/s11336-017-9591-8

Published: 17 November 2017

Psychometrika, Volume 83, pages 132–155 (2018)

Abstract

Measurement invariance is a fundamental assumption in item response theory models, where the relationship between a latent construct (ability) and observed item responses is of interest. Violation of this assumption would render the scale misinterpreted or cause systematic bias against certain groups of persons. While a number of methods have been proposed to detect measurement invariance violations, they typically require advance definition of problematic item parameters and respondent grouping information. However, these pieces of information are typically unknown in practice. As an alternative, this paper focuses on a family of recently proposed tests based on stochastic processes of casewise derivatives of the likelihood function (i.e., scores). These score-based tests only require estimation of the null model (when measurement invariance is assumed to hold), and they have been previously applied in factor-analytic, continuous data contexts as well as in models of the Rasch family. In this paper, we aim to extend these tests to two-parameter item response models, with strong emphasis on pairwise maximum likelihood. The tests’ theoretical background and implementation are detailed, and the tests’ abilities to identify problematic item parameters are studied via simulation. An empirical example illustrating the tests’ use in practice is also provided.
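The idea of a test built from a stochastic process of casewise scores can be illustrated with a deliberately simplified sketch. The code below does not implement the two-parameter model or the pairwise likelihood studied in the paper. It assumes a one-parameter Bernoulli model with constant success probability as the null model, orders respondents by a hypothetical auxiliary variable (`age`), and accumulates the casewise scores evaluated at the null estimate. The function names, the toy data, and the choice of the maximum absolute excursion as test statistic are all assumptions made for illustration.

```python
import math


def cumulative_score_process(x, order_by):
    """Scaled cumulative score process for a constant-probability Bernoulli model.

    x        : list of 0/1 responses
    order_by : auxiliary variable (e.g., age) used to order the respondents
    """
    n = len(x)
    p_hat = sum(x) / n  # ML estimate under the null model
    if not 0.0 < p_hat < 1.0:
        raise ValueError("responses must contain both 0s and 1s")
    var = p_hat * (1.0 - p_hat)
    info = 1.0 / var  # Fisher information per observation at p_hat
    scale = 1.0 / math.sqrt(n * info)

    # Order cases by the auxiliary variable, then accumulate casewise scores.
    order = sorted(range(n), key=lambda i: order_by[i])
    path, running = [], 0.0
    for i in order:
        running += (x[i] - p_hat) / var  # casewise score: d/dp of the log-likelihood
        path.append(scale * running)
    return path


def double_max(path):
    """Maximum absolute excursion of the (one-dimensional) process."""
    return max(abs(b) for b in path)


# Hypothetical data: success probability jumps from 0.2 to 0.8 at age 50.
age = list(range(100))
shifted = [int(i % 5 == 0) if i < 50 else int(i % 5 != 0) for i in age]
# Hypothetical data with no change along age.
stable = [i % 2 for i in age]

CRITICAL_5PCT = 1.358  # asymptotic 5% critical value for sup |Brownian bridge|

print(double_max(cumulative_score_process(shifted, age)) > CRITICAL_5PCT)
print(double_max(cumulative_score_process(stable, age)) > CRITICAL_5PCT)
```

Only the null model is estimated here: the parameter is fitted once on all cases, and instability shows up as a systematic drift of the cumulative scores away from zero along the ordering variable. In this toy setup the statistic exceeds the critical value for the shifted data and stays below it for the stable data.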

Author information

Authors and Affiliations

Department of Psychological Sciences, University of Missouri, Columbia, MO, USA
Ting Wang & Edgar C. Merkle
Department of Psychology, University of Zurich, Zurich, Switzerland
Carolin Strobl
Department of Statistics, Universität Innsbruck, Innsbruck, Austria
Achim Zeileis

Corresponding author

Correspondence to Ting Wang.

Additional information

Supported by National Science Foundation Grants SES-1061334 and 1460719.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 212 KB)

About this article

Cite this article

Wang, T., Strobl, C., Zeileis, A. et al. Score-Based Tests of Differential Item Functioning via Pairwise Maximum Likelihood Estimation. Psychometrika 83, 132–155 (2018). https://doi.org/10.1007/s11336-017-9591-8

Received: 26 February 2016
Published: 17 November 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s11336-017-9591-8
