Abstract
Differential item functioning (DIF) is a standard analysis for every testing company. Research has demonstrated that DIF can result when test items measure different composites of abilities and the groups being compared exhibit distinct underlying distributions on those composite abilities. In this article, we examine DIF from a two-dimensional multidimensional item response theory (MIRT) perspective. We begin by delving into the compensatory MIRT model, illustrating how items and the composites they measure can be represented graphically. Additionally, we discuss how estimated item parameters can vary with the underlying latent ability distributions of the examinees. Analytical research highlighting the consequences of ignoring dimensionality and applying unidimensional IRT models, in which the two-dimensional latent space is mapped onto a unidimensional scale, is reviewed. Next, we investigate three different approaches to understanding DIF from a MIRT standpoint: 1. Analytically derived uniform and nonuniform DIF: a unidimensional model is estimated when two groups of interest have different two-dimensional ability distributions. 2. Accounting for the complete latent ability space: we emphasize the importance of conditioning on the entire latent ability space in DIF analyses, which mitigates DIF effects. 3. Scenario-based DIF: even when the underlying two-dimensional distributions are identical for two groups, differing problem-solving approaches can still lead to DIF. Modern software programs facilitate routine DIF procedures for comparing response data from two identified groups of interest. The real challenge is to identify why DIF could occur with flagged items. Thus, as a closing challenge, we present four items (Appendix A) from a standardized test and invite readers to identify which group was favored by a DIF analysis.
References
Ackerman, T. A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15, 13–24. https://doi.org/10.1177/014662169101500103
Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91. https://doi.org/10.1111/j.1745-3984.1992.tb00368.x
Ackerman, T. A., & Evans, J. A. (1994). The influence of conditioning scores in performing DIF analyses. Applied Psychological Measurement, 18, 329–342. https://doi.org/10.1177/014662169401800404
Ackerman, T. A., McCallaum, B., & Ngerano, G. (2014). Differential item functioning from a compensatory-noncompensatory perspective. Invited address to the International Congress of Educational Research, Hacettepe University, Ankara, Turkey.
Ackerman, T. A. & Xie, Q. (2019). DIF graphical simulator. Educational Measurement: Issues and Practice, 38(1), 5. https://doi.org/10.1111/emip.12171
Bauer, D. J., Belzak, W. C., & Cole, V. T. (2020). Simplifying the assessment of measurement invariance over multiple background variables: Using regularized moderated nonlinear factor analysis to detect differential item functioning. Structural Equation Modeling: A Multidisciplinary Journal, 27, 43–55. https://doi.org/10.1080/10705511.2019.1642754
Bolt, D. M., & Johnson, T. R. (2009). Addressing score bias and differential item functioning due to individual differences in response style. Applied Psychological Measurement, 33(5), 335–352. https://doi.org/10.1177/0146621608329891
Camilli, G. (1992). A conceptual analysis of differential item functioning in terms of a multidimensional item response model. Applied Psychological Measurement, 16(2), 129–147. https://doi.org/10.1177/014662169201600203
Camilli, G., & Penfield, D. A. (1997). Variance estimation for differential test functioning based on Mantel–Haenszel statistics. Journal of Educational Measurement, 34(2), 123–139. https://doi.org/10.1111/j.1745-3984.1997.tb00510.x
Carlson, J. E. (2017). Unidimensional vertical scaling in multidimensional space. ETS Research Report Series, 2017(1), 1–28. https://doi.org/10.1002/ets2.12157
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276. https://doi.org/10.1207/s15327906mbr0102_10
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items: An NCME instructional module. Educational Measurement: Issues and Practice, 17(1), 31–44. https://doi.org/10.1111/j.1745-3992.1998.tb00619.x
Clauser, B. E., Nungester, R. J., & Swaminathan, H. (1996). Improving the matching for DIF analysis by conditioning on both test score and an educational background variable. Journal of Educational Measurement, 33(4), 454–464. https://doi.org/10.1111/j.1745-3984.1996.tb00501.x
Cohen, A. S., Kim, S. H., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17(4), 335–350. https://doi.org/10.1177/014662169301700402
De Boeck, P. (2008). Random item IRT models. Psychometrika, 73, 533–559. https://doi.org/10.1007/s11336-008-9092-x
Fleishman, J. A., & Lawrence, W. F. (2003). Demographic variation in SF-12 scores: True differences or differential item functioning? Medical Care, 41(7), 75–86. https://doi.org/10.1097/01.MLR.0000076052.42628
Ip, E. H. (2010). Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63, 395–416. https://doi.org/10.1348/000711009x466835
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices. New York: Springer. https://doi.org/10.1007/978-1-4939-0317-7
Lim, H., Choe, E. M., & Han, K. (2022). A residual-based differential item functioning detection framework in item response theory. Journal of Educational Measurement. https://doi.org/10.1111/jedm.12313
Liu, Y., Zumbo, B., Gustafson, P., Huang, Y., Kroc, E., & Wu, A. (2016). Investigating causal DIF via propensity score methods. Practical Assessment, Research & Evaluation, 21(13), 1–24. https://doi.org/10.7275/ewqz-n96
Flowers, C. P., Oshima, T. C., & Raju, N. S. (1999). A description and demonstration of the polytomous-DFIT framework. Applied Psychological Measurement, 23(4), 309–326. https://doi.org/10.1177/01466219922031437
Holland, P. W., & Thayer, D. T. (1988). Differential item functioning detection and the Mantel–Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
Huang, P. H. (2018). A penalized likelihood method for multi-group structural equation modelling. British Journal of Mathematical and Statistical Psychology, 71, 499–522. https://doi.org/10.1111/bmsp.12130
Junker, B., & Stout, W. F. (1991). Robustness of ability estimation when multiple traits are present with one trait dominant. Paper presented at the International Symposium on Modern Theories in Measurement: Problems and Issues, Montebello, Quebec.
Kok, F. (1988). Item bias and test multidimensionality. In R. Langeheine & J. Rost (Eds.), Latent trait and latent class models (pp. 263–275). New York: Plenum Press. https://doi.org/10.1007/978-1-4757-5644-9_12
Li, Y. H., & Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24, 115–138. https://doi.org/10.1177/01466216000242002
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum. https://eric.ed.gov/?id=ED312280
Ma, Y., Ackerman, T., Ip, E., & Chung, J. (2023). The effect of the projective IRT model on DIF detection. IMPS 2023 Annual Meeting, College Park, Maryland, United States.
Magis, D., Beland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862. https://doi.org/10.3758/BRM.42.3.847
Mazor, K. M., Hambleton, R. K., & Clauser, B. E. (1998). Multidimensional DIF analyses: The effects of matching on unidimensional subtest scores. Applied Psychological Measurement, 22(4), 357–367. https://doi.org/10.1177/014662169802200404
McKinley, R. L., & Reckase, M. D. (1982). The use of the general Rasch model with multidimensional item response data.
Muthén, B., & Asparouhov, T. (2018). Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociological Methods & Research, 47, 637–664. https://doi.org/10.1177/004912411770148
Oshima, T. C., Davey, T. C., & Lee, K. (2000). Multidimensional linking: Four practical approaches. Journal of Educational Measurement, 37(4), 357–373. http://www.jstor.org/stable/1435246
Penfield, R., & Algina, J. (2006). A generalized DIF effect variance estimator for measuring unsigned differential test functioning in mixed format tests. Journal of Educational Measurement, 43(4), 295–312. https://doi.org/10.1111/j.1745-3984.2006.00018.x
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502. https://doi.org/10.1007/BF02294403
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353–368. https://doi.org/10.1177/014662169501900405
Ramsay, J. O. (1990). A kernel smoothing approach to IRT modeling. Talk presented at the Annual Meeting of the Psychometric Society at Princeton New Jersey.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer. https://doi.org/10.1007/978-0-387-89976-3
Shealy, R., & Stout, W. F. (1993). An item response theory model for test bias. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197–239). Hillsdale: Erlbaum.
Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159–194. https://doi.org/10.1007/BF02294572
Spray, J., Davey, T., Reckase, M., Ackerman, T. & Carlson, J. (1990). Comparison of two logistic multidimensional item response theory models. ACT Research Report ONR90-8.
Stout, W. F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589–617.
Strachan, T., Ip, E., Fu, Y., Ackerman, T., Chen, S. H., & Willse, J. (2020). Robustness of projective IRT to misspecification of the underlying multidimensional model. Applied Psychological Measurement, 44(5), 362–375. https://doi.org/10.1177/0146621620909894
Strachan, T., Cho, U. H., Ackerman, T., Chen, S.-H., de la Torre, J., & Ip, E. (2022). Evaluation of the linear composite conjecture for unidimensional IRT scale for multidimensional responses. Applied Psychological Measurement, 46(5), 347–360. https://doi.org/10.1177/01466216221084218
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x
Sympson, J. B. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference. Minneapolis: University of Minnesota.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147–169). Hillsdale NJ: Erlbaum.
Wang, M. (1985). Fitting a unidimensional model to multidimensional item response data: The effects of latent space misspecification on the application of IRT. Unpublished manuscript, University of Iowa.
Williams, N. J., & Beretvas, S. N. (2006). DIF identification using HGLM for polytomous items. Applied Psychological Measurement, 30, 22–42. https://doi.org/10.1177/0146621605279867
Wolfram Research, Inc. (2020). Mathematica (Version 12.2) [Computer software]. Champaign, IL.
Zhang, J., & Stout, W. F. (1999). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64(2), 213–249.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Open science statement
Data and the Mathematica code used in the illustration will be made available on the Open Science Framework upon publication.
Appendices
Appendix A
Groups indicated as being favored in the Mantel–Haenszel analysis.
- Item 1: Male examinees
- Item 2: Black examinees
- Item 3: No DIF
- Item 4: Male examinees. This is the only item which has a possible explanation: that males, for the most part, know more about cars than females.
Appendix B
Example illustrating how the unidimensional 2PL model is mapped into a two-dimensional latent ability space:
Assume you want to find the unidimensional 2PL \(\hat{a}\) and \(\hat{b}\) values for a two-item test where the two-dimensional compensatory parameters are given as \(A = [\{1.5,0\}, \{0,1.5\}]\) and \(D = \{.5,.5\}\) and the underlying model is given as \(P\left( u_{ij}=1 \mid \theta _{1},\theta _{2}\right) =\frac{1}{1+e^{-1.7\left( a_{i1}\theta _{1}+a_{i2}\theta _{2}+d_{i}\right) }}\).
It is also given that the underlying two-dimensional distribution is a bivariate normal with a mean vector of \(\{0,0\}\) and the covariance matrix, \(\Omega \), as [{1, .4}, {.4, 1}]. Note these are chosen only for illustration purposes. Item 1 measures only \(\theta _{1}\) and item 2 measures only \(\theta _{2}\).
Following the work of Wang (1985) and Camilli (1992), we first determine the Cholesky decomposition, L, of \(\Omega \): \(L = [\{1., 0.4\}, \{0., 0.91651\}]\). To compute the reference composite, we then calculate the \(L'A'AL\) matrix, which equals \([\{2.25, 0.9\}, \{0.9, 2.25\}]\). The eigenvalues of this matrix are \(\{3.15, 1.35\}\), and the eigenvectors associated with these eigenvalues, w\(_{\textrm{ij}}\), are \(\{0.7071, 0.7071\}\) and \(\{-0.7071, 0.7071\}\).
The reference composite direction is then calculated as the arccosine of the first element of the eigenvector associated with the largest eigenvalue. The arccosine of 0.7071 corresponds to 45\(^\circ \) from the positive \(\theta _{1}\)-axis, which is the reference composite direction. This is the composite that would represent the unidimensional \(\theta \)-scale if the data were fit to the 2PL model.
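As a numerical check, the reference composite computation above can be reproduced with a short NumPy sketch. Only the \(A\) and \(\Omega \) values from this example are assumed; note that `np.linalg.cholesky` returns the lower-triangular factor, whose transpose is the upper-triangular L used here.

```python
import numpy as np

# Item discrimination matrix A and latent covariance matrix Omega from Appendix B.
A = np.array([[1.5, 0.0],
              [0.0, 1.5]])
Omega = np.array([[1.0, 0.4],
                  [0.4, 1.0]])

# numpy returns the lower-triangular Cholesky factor Lc with Omega = Lc Lc';
# the upper-triangular L of the worked example is its transpose.
L = np.linalg.cholesky(Omega).T

# L'A'AL, whose principal eigenvector gives the reference composite direction.
M = L.T @ A.T @ A @ L
evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order

# Angle of the reference composite from the positive theta1-axis.
w1 = evecs[:, -1]                          # eigenvector of the largest eigenvalue
angle = np.degrees(np.arccos(abs(w1[0])))

print(np.round(evals, 2))                  # [1.35 3.15]
print(round(angle, 1))                     # 45.0
```

The eigenvalues {3.15, 1.35} and the 45\(^\circ \) direction match the values derived in the text.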
It should also be noted that the first and second factor scores, \(\upsilon _{1}\) and \(\upsilon _{2}\), are then defined as \(\upsilon _{1}=0.7071\theta _{1}+0.7071\theta _{2}\) and \(\upsilon _{2}=-0.7071\theta _{1}+0.7071\theta _{2}\).
In Fig. 20, the left panel is a contour plot of Item 1 with the reference composite (\(\upsilon _{1}\)) direction indicated with a solid red arrow and the perpendicular \(\upsilon _{2}\) direction indicated with a dotted red arrow. We then substitute \(\upsilon _{1}\) and \(\upsilon _{2}\) in for \(\theta _{1}\) and \(\theta _{2}\) in the compensatory model to get \(P\left( u_{1j}=1 \mid \upsilon _{1},\upsilon _{2}\right) =\frac{1}{1+e^{-1.7\left( 1.0607\upsilon _{1}-1.0607\upsilon _{2}+0.5\right) }}\).
To determine \(G(\upsilon _{2}\vert \upsilon _{1})\), we must first rotate the bivariate normal distribution 45\(^\circ \) and then determine the conditional distribution. Assuming \(\Sigma \) is the original covariance matrix, \({\Sigma }=\left[ {\begin{array}{*{20}c} \sigma _{1}^{2} &{} \rho \sigma _{1}\sigma _{2}\\ \rho \sigma _{1}\sigma _{2} &{} \sigma _{2}^{2}\\ \end{array} } \right] \), and \(R_{\theta }\) is the rotation matrix,
then the rotated mean vector, \(\mu \)’, and rotated covariance matrix, \(\Sigma \)’, are given by \( \mu ^{'}=R_{\theta }\mu =\frac{\sqrt{2} }{2} \left[ {\begin{array}{*{20}c} \mu _{1}-\mu _{2}\\ \mu _{1}+\mu _{2}\\ \end{array} } \right] \) and \({\Sigma }^{'}=R_{\theta }{\Sigma }R_{\theta }^{T}= \frac{1}{2}\left[ {\begin{array}{*{20}c} \sigma _{1}^{2}+\sigma _{2}^{2}-2\rho \sigma _{1}\sigma _{2} &{} \sigma _{1}^{2}-\sigma _{2}^{2}\\ \sigma _{1}^{2}-\sigma _{2}^{2} &{} \sigma _{1}^{2}+\sigma _{2}^{2}+2{\rho \sigma }_{1}\sigma _{2}\\ \end{array} } \right] ,\) where \(\sigma _{1}^{2}\), \(\sigma _{2}^{2}\) and \(\rho \) are the original variances and correlation of the original random variables.
The rotated mean vector is \([0, 0]\) and the rotated covariance matrix, \({\Sigma }^{'}\), is [{.6, 0}, {0, 1.4}]. The formula for the conditional distribution \(G(\upsilon _{2}\vert \upsilon _{1})\) equals (Fig. 21)
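The rotation step can be verified numerically; this sketch uses the standard 2×2 rotation matrix with the 45\(^\circ \) angle and the \(\Sigma \) values from the example.

```python
import numpy as np

theta = np.radians(45)                      # rotation by the reference composite angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])

mu_rot = R @ mu                             # rotated mean vector
Sigma_rot = R @ Sigma @ R.T                 # rotated covariance matrix

print(np.round(mu_rot, 6))                  # zero mean vector
print(np.round(Sigma_rot, 6))               # diag(0.6, 1.4), zero off-diagonals
```

The rotated covariance matrix is diagonal, so after rotation the two composites are uncorrelated, which is what makes the conditional distribution straightforward to compute.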
In Fig. 22, on the left are the conditional normal distributions, \(G\left( \upsilon _{2}\vert \upsilon _{1}\right) \), for \(\upsilon _{1} = -2, -1, 0, 1, 2\). On the right are the conditional ICCs, \(p\left( {u_{ij}=1}\vert {\upsilon _{1},\upsilon _{2}}\right) \), for \(\upsilon _{1} = -2, -1, 0, 1, 2\).
Using the formula \(\hat{P}\left( \upsilon _{1}\right) =\sum p\left( u_{ij}=1 \mid \upsilon _{1},\upsilon _{2}\right) G\left( \upsilon _{2}\vert \upsilon _{1}\right) \textrm{d}\upsilon _{2}\), where \(\textrm{d}\upsilon _{2}=.001\), we can estimate the unidimensional ICC for \(\upsilon _{1}\) values of \(-2, -1, 0, 1, 2\). These values are .13, .34, .64, .86, and .95, respectively. Using Camilli's derivational formulas,
and
we obtain the 2PL item parameter estimates \(\hat{a}= .73\) and \(\hat{b}=-.47\). Figure 23 shows the estimated ICC using \(\hat{a}\) and \(\hat{b}\) and the five color-coded estimated \((\upsilon _{1}, p)\) values. Figure 24 illustrates three different perspectives of all the elements of Camilli's formulation, including the M2PL response surface and corresponding contour plot, the reference composite \(\upsilon _{1}\) (which represents the estimated unidimensional scale), \(\upsilon _{2}\) (the orthogonal second principal component), the RC plane, the underlying conditional latent ability distribution, \(G\left( \upsilon _{2}\vert \upsilon _{1}\right) \), and the estimated unidimensional ICC.
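As a sanity check, the fitted unidimensional 2PL reproduces the five conditional probabilities above. The sketch below assumes the scaling constant D = 1.7 (consistent with the numbers in this appendix, though the constant is not stated explicitly).

```python
import math

# Fitted unidimensional 2PL parameters from the worked example.
# D = 1.7 is an assumption (Camilli's formulation uses the normal-ogive metric).
D, a_hat, b_hat = 1.7, 0.73, -0.47

def p_2pl(v1):
    """2PL probability of a correct response at ability v1."""
    return 1.0 / (1.0 + math.exp(-D * a_hat * (v1 - b_hat)))

# Agrees with the integrated values .13, .34, .64, .86, .95 to within .01.
for v1 in [-2, -1, 0, 1, 2]:
    print(v1, round(p_2pl(v1), 3))
```

The close agreement between the fitted 2PL and the numerically integrated conditional probabilities is what Fig. 23 displays graphically.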
Appendix C
Ackerman and Xie (2019) created a DIF Graphical Simulator. This simulator enables researchers to modify the underlying two-dimensional latent distributions for the Reference and Focal groups and the M2PL item parameters for a given suspect item. Using the Camilli (1992) analytical derivations, the 2PL unidimensional discrimination (a) and difficulty (b) parameters are estimated and the resulting ICC is illustrated. A mean–mean transformation is used to place the Focal group's estimated parameters onto the scale of the Reference group. The transformed ICCs are then displayed, and the degree of misfit, defined as \(\sum \nolimits _{\theta =-3}^{\theta =3} {({P(\theta )}_{Ref}-{P(\theta )}_{Foc})}^{2} \), is calculated. The DIF Graphical Simulator is shown in Fig. 25.
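A minimal sketch of the misfit computation follows, assuming a uniform \(\theta \) grid on \([-3, 3]\) (the simulator's actual grid step is not stated) and a 2PL with scaling constant D = 1.7.

```python
import numpy as np

def icc(theta, a, b, D=1.7):
    """2PL item characteristic curve; D = 1.7 is an assumed scaling constant."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def misfit(a_ref, b_ref, a_foc, b_foc, step=0.1):
    """Sum of squared ICC differences over an assumed theta grid on [-3, 3]."""
    theta = np.arange(-3, 3 + step, step)
    return float(np.sum((icc(theta, a_ref, b_ref) - icc(theta, a_foc, b_foc)) ** 2))

print(misfit(1.0, 0.0, 1.0, 0.0))      # identical ICCs, zero misfit
print(misfit(1.0, 0.0, 1.2, 0.3))      # differing parameters, positive misfit
```

Identical Reference and Focal parameters yield zero misfit; any divergence between the transformed ICCs produces a positive value, which is the quantity the simulator reports.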
Cite this article
Ackerman, T.A., Ma, Y. Examining Differential Item Functioning from a Multidimensional IRT Perspective. Psychometrika (2024). https://doi.org/10.1007/s11336-024-09965-6