Abstract
Several methods used to examine differential item functioning (DIF) in Patient-Reported Outcomes Measurement Information System (PROMIS®) measures are presented, including effect size estimation. A summary is provided of factors that may affect DIF detection and of challenges encountered in PROMIS DIF analyses, e.g., anchor item selection. One issue in PROMIS was the potential for inadequately modeled multidimensionality to result in false DIF detection. Section 1 presents the unidimensional models used by most PROMIS investigators for DIF detection, together with their multidimensional extensions. Section 2 builds on previous unidimensional analyses of depression and anxiety short forms to examine DIF detection using a multidimensional item response theory (MIRT) model. The item response theory log-likelihood ratio test (IRT-LRT) method was used for a real-data illustration with gender as the grouping variable. The IRT-LRT method is a flexible approach for handling group differences in trait distributions, known as impact in the DIF literature; its performance within the unidimensional IRT (UIRT) and MIRT contexts was compared using both real data and simulations. Different effect size measures were also compared for the data presented in Section 2. In the real-data illustration, the IRT-LRT method flagged more items within the MIRT context than within the UIRT context. The simulations provided some evidence that, while the unidimensional and multidimensional approaches had similar Type I error rates, the multidimensional approach had greater power for DIF detection. The effect size measures presented in Section 1 and applied in Section 2 varied in estimation method, choice of density function, method of equating, and anchor item selection.
Despite these differences, results were considerably consistent, especially for the items showing the largest values. Future work is needed to examine DIF detection in the context of polytomous, multidimensional data. PROMIS standards include the incorporation of effect size measures in determining salient DIF. Integrated methods for examining effect size measures within IRT-based DIF detection procedures are still in the early stages of development.
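The nested-model comparison underlying likelihood-ratio DIF testing can be sketched briefly. The following is a minimal illustration, not the PROMIS implementation: it uses simulated binary responses, treats the simulated trait as observed (real IRT-LRT analyses estimate the latent trait), and compares a compact model (no group terms) against an augmented model with group and trait-by-group terms, capturing uniform and non-uniform DIF respectively. All parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulate: trait theta and group g; the studied item has uniform DIF
# (difficulty shifted by 0.6 logits in the focal group). No impact:
# both groups share the same trait distribution.
n = 2000
g = rng.integers(0, 2, n).astype(float)
theta = rng.normal(0.0, 1.0, n)
a, b, dif = 1.2, 0.0, 0.6
p = 1.0 / (1.0 + np.exp(-a * (theta - (b + dif * g))))
y = (rng.random(n) < p).astype(float)

def nll(beta, X, y):
    """Negative log-likelihood of a logistic response model."""
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def max_loglik(X, y):
    res = minimize(nll, np.zeros(X.shape[1]), args=(X, y), method="BFGS")
    return -res.fun

# Compact model: response depends on the trait only.
X0 = np.column_stack([np.ones(n), theta])
# Augmented model: adds group (uniform DIF) and trait-by-group
# (non-uniform DIF) terms.
X1 = np.column_stack([np.ones(n), theta, g, theta * g])

G2 = 2.0 * (max_loglik(X1, y) - max_loglik(X0, y))  # LR statistic
pval = chi2.sf(G2, df=2)                            # 2 constrained params
print(f"G2 = {G2:.2f}, p = {pval:.4g}")
```

Because DIF was built into the simulated item, the statistic is large and the item is flagged; for a DIF-free item the statistic would follow the chi-square reference distribution under the null.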
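The effect size side can be illustrated in the same spirit. The sketch below, with hypothetical item parameters assumed already equated to a common metric, computes signed and unsigned expected item-score differences between reference- and focal-group item characteristic curves, weighting by a focal-group trait density, an area-type magnitude measure of the kind the abstract contrasts (choices of density, equating, and anchors all change the result).

```python
import numpy as np
from scipy.stats import norm

def icc(theta, a, b):
    """2PL item characteristic curve: P(endorse | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical post-equating item parameters for one studied item.
a_ref, b_ref = 1.2, 0.0      # reference group
a_foc, b_foc = 1.2, 0.4      # focal group (uniform DIF: shifted difficulty)

# Quadrature over the focal-group trait density (standard normal here;
# an empirical focal density is another common choice).
theta = np.linspace(-6.0, 6.0, 1201)
w = norm.pdf(theta)
w /= w.sum()                                  # normalized quadrature weights

diff = icc(theta, a_ref, b_ref) - icc(theta, a_foc, b_foc)
signed = np.sum(w * diff)                     # signed expected difference
unsigned = np.sum(w * np.abs(diff))           # unsigned (area-type) measure

print(f"signed = {signed:.3f}, unsigned = {unsigned:.3f}")
```

With purely uniform DIF the two curves never cross, so the signed and unsigned measures coincide; non-uniform DIF (unequal discriminations) makes the unsigned measure exceed the signed one, which is why both are typically reported.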
References
Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
Ankenmann, R. D., Witt, E. A., & Dunbar, S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36, 277–300. https://doi.org/10.1111/j.1745-3984.1999.tb00558.x.
Baker, F. B. (1995). EQUATE 2.1: Computer program for equating two metrics in item response theory. Madison: University of Wisconsin, Laboratory of Experimental Design.
Bauer, D., Belzak, W., & Cole, V. (2019). Simplifying the assessment of measurement invariance over multiple background variables: Using regularized moderated nonlinear factor analysis to detect differential item functioning. Structural Equation Modeling A: Multidisciplinary Journal,. https://doi.org/10.1080/10705511.2019.1642754.
Belzak, W., & Bauer, D. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods,. https://doi.org/10.1027/met0000253.
Benjamini, Y., & Hochberg, Y. (1995). Controlling for the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300. https://doi.org/10.2307/2346101.
Bjorner, J. B., Rose, M., Gandek, B., Stone, A. A., Junghaenel, D. U., & Ware, J. E. (2014). Difference in method of administration did not significantly impact item response: An IRT-based analysis from the Patient-Reported Outcomes Measurement Information System (PROMIS) initiative. Quality of Life Research, 23, 217–227.
Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113–141. https://doi.org/10.1207/S15324818AME1502_01.
Boorsboom, D. (2006). Commentary: When does measurement invariance matter? Medical Care, 44(11), S176–81.
Boorsboom, D., Mellenbergh, G. J., & van Heerdon, J. (2002). Different kinds of DIF: A distinction between absolute and relative forms of measurement invariance and bias. Applied Psychological Measurement, 26, 433–450.
Bulut, O., & Suh, Y. (2017). Detecting multidimensional differential item functioning with the multiple indicators multiple causes model, the item response theory likelihood ratio test, and logistic regression. Frontiers in Education, 2, 51. https://doi.org/10.3389/feduc.2017.00051.
Byrne, B. M., Shavelson, R. J., & Muthén, B. O. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–566. https://doi.org/10.1037/0033-2909.105.3.456.
Cai, L. (2008). SEM of another flavour: Two new applications of the supplemented EM algorithm. British Journal of Mathematical and Statistical Psychology, 61, 309–329. https://doi.org/10.1348/000711007X249603.
Cai, L. (2013). FlexMIRT version 2: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.
Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT Modeling [Computer software]. Lincolnwood, IL: Scientific Software International Inc.
Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.
Carle, A. C., Cella, D., Cai, L., Choi, S. W., Crane, P. K., Curtis, S. M., et al. (2011). Advancing PROMIS’s methodology: Results of the third PROMIS Psychometric Summit. Expert Review of Pharmacoeconomics & Outcome Research, 11(6), 677–684. https://doi.org/10.1586/erp.11.74.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., Reeve, B., Ader, D., Fries, J. F., Bruce, B., & Rose, M., on behalf of the PROMIS Cooperative Group. (2007). The patient-reported outcomes measurement information system (PROMIS): Progress of an NIH roadmap cooperative group during its first two years. Medical Care, 45(5 Suppl 1), S3–S11. https://doi.org/10.1097/01.mlr.0000258615.42478.55.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of statistical software, 48(6), 1–29.
Chalmers, R. P. (2016). A differential response functioning framework for understanding item, bundle, and test bias. Doctoral Dissertation, York University, Toronto, Ontario. https://pdfs.semanticscholar.org
Chalmers, R. P. (2018). Model-based measures for detecting and quantifying response bias. Psychometrika, 83, 696–732. https://doi.org/10.1007/s11336-018-9626-9.
Chalmers, R. P., Counsell, A., & Flora, D. B. (2016). It might not make a big DIF: Improved differential test functioning statistics that account for sampling variability. Educational and Psychological Measurement, 76, 114–140.
Chang, Y.-W., Hsu, N.-J., & Tsai, R.-C. (2017). Unifying differential item functioning in factor analysis for categorical data under a discretization of a normal variant. Psychometrika, 82(2), 382–406. https://doi.org/10.1007/s11336-017-9562-0.
Chen, J.-H., Chen, C.-T., & Shih, C.-L. (2013). Improving the control of type I error rate in assessing differential item functioning for hierarchical generalized linear models when impact is present. Applied Psychological Measurement, 38, 18–36. https://doi.org/10.1177/0146621613488643.
Cheng, C.-P., Chen, C.-C., & Shih, C.-L. (2020). An exploratory strategy to identify and define sources of differential item functioning. Applied Psychological Measurement, 4, 548–560. https://doi.org/10.1177/014662/620931/90.
Cheng, Y., Shao, C., & Lathrop, Q. N. (2016). The mediated MIMIC model for understanding the underlying mechanisms of DIF. Educational and Psychological Measurement, 76(1), 43–63.
Cheung, G. W., & Rensvold, R. B. (2003). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. https://doi.org/10.1207/S15328007SEM0902_5.
Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8), 1–30. https://doi.org/10.18637/jss.v039.i08.
Choi, S. W., Reise, S. P., Pilkonis, P. A., Hays, R. D., & Cella, D. (2010). Efficiency of static and computer adaptive short forms compared to full-length measures of depressive symptoms. Quality of Life Research, 19, 125–136.
Clauser, B. E., Mazor, K. M., & Hambleton, R. K. (1993). The effects of purification of the matching criterion on the identification of DIF using the Mantel–Haenszel procedure. Applied Measurement in Education, 6, 269–279.
Cohen, A. S., Kim, S.-H., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17, 335–350. https://doi.org/10.1177/014662169301700402.
Cohen, P., Cohen, J., Teresi, J., Marchi, P., & Velez, N. (1990). Problems in the measurement of latent variables in structural equation causal models. Applied Psychological Measurement, 14(2), 183–196. https://doi.org/10.1177/014662169001400207.
Crane, P. K., Gibbons, L. E., Jolley, L., & van Belle, G. (2006). Differential item functioning analysis with ordinal logistic regression techniques: Difdetect and difwithpar. Medical Care, 44, S115–S123. https://doi.org/10.1097/01.mlr.0000245183.28384.ed.
Crane, P. K., Gibbons, L. E., Ocepek-Welikson, K., Cook, K., Cella, D., & Teresi, J. A. (2007). A comparison of three sets of criteria for determining the presence of differential item functioning using ordinal logistic regression. Quality of Life Research, 16, 69–84. https://doi.org/10.1007/s11136-007-9185-5.
Crane, P. K., van Belle, G., & Larson, E. B. (2004). Test bias in a cognitive test: Differential item functioning in the CASI. Statistics in Medicine, 23, 241–256. https://doi.org/10.1002/sim.1713.
Culpepper, S. A., Aguinis, H., Kern, J. L., & Millsap, R. (2019). High-stakes testing case study: A latent variable approach for assessing measurement and prediction invariance. Psychometrika, 84, 285–309. https://doi.org/10.1007/s11336-018-9549-2.
DeMars, C. E. (2010). Type 1 error inflation for detecting DIF in the presence of impact. Educational and Psychological Measurement, 70, 961–972. https://doi.org/10.1177/0013164410366691.
DeMars, C. E. (2015). Modeling DIF for simulations: Continuous or categorical secondary trait? Psychological Test and Assessment Modeling, 57, 279–300.
Edelen, M., Stucky, B., & Chandra, A. (2015). Quantifying “problematic” DIF within an IRT framework: Application to a cancer stigma index. Quality of Life Research, 24, 95–103. https://doi.org/10.1007/s11136-013-0540-4.
Egberink, I. J. L., Meijer, R. R., & Tendeiro, J. N. (2015). Investigating measurement invariance in computer-based personality testing: The impact of using anchor items on effect size indices. Educational and Psychological Measurement, 75, 126–145. https://doi.org/10.1177/0013164414520965.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel–Haenszel, SIBTEST and the IRT likelihood ratio test. Applied Psychological Measurement, 29, 278–295. https://doi.org/10.1177/0146621605275728.
Fleer, P. F. (1993). A Monte Carlo assessment of a new measure of item and test bias (p. 2266, Vol. 54, No. 04B), Illinois Institute of Technology, Dissertation Abstracts International.
Flowers, C. P., Oshima, T. C., & Raju, N. S. (1999). A description and demonstration of the polytomous DFIT framework. Applied Psychological Measurement, 23, 309–32. https://doi.org/10.1177/01466219922031437.
Furlow, C. F., Ross, T. R., & Gagné, P. (2009). The impact of multidimensionality on the detection of differential bundle functioning using simultaneous item bias test. Applied Psychological Measurement, 33(6), 441–464. https://doi.org/10.1177/0146621609331959.
Gelin, M. N., & Zumbo, B. D. (2003). Differential item functioning results may change depending on how an item is scored: An illustration with the center for epidemiologic studies depression scale. Educational and Psychological Measurement, 63(1), 65–74. https://doi.org/10.1177/0013164402239317.
González-Betanzos, F., & Abad, F. J. (2012). The effects of purification and the evaluation of differential item functioning with the likelihood ratio test. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, 130–145. https://doi.org/10.1027/1614-2241/a000046.
Gómez-Benito, J., Dolores-Hidalgo, M., & Zumbo, B. D. (2013). Effectiveness of combining statistical tests and effect sizes when using logistic discriminant function regression to detect differential item functioning for polytomous items. Educational and Psychological Measurement, 73, 875–897. https://doi.org/10.1177/0013164413492419.
Gregorich, S. E. (2006). Do self-report instruments allow meaningful comparisons across diverse population groups?: Testing measurement invariance using the confirmatory factor analysis framework. Medical Care, 44(11), S78–S94.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications Inc.
Herrel, F. E. (2009). Design; design package. R package version 2:3.0. Retrieved from http://CRANR-project.org/package=Design
Hidalgo, M. D., Gomez-Benito, J., & Zumbo, B. D. (2014). Binary logistic regression analysis for detecting differential item functioning: Effectiveness of \(\text{ R}^{{2}}\) and delta log odds ratio effect size measures. Educational and Psychological Measurement, 74, 927–949. https://doi.org/10.1177/0013164414523618.
Houts, C. R., & Cai, L. (2013). FlexMIRT user’s manual version 2: Flexible multilevel multidimensional item analysis and test scoring. Chapel Hill, NC: Vector Psychometric Group.
Jensen, R. E., Moinpour, C. M., Keegan, T. H. M., Cress, R. D., Wu, X.-C., Paddock, L. A., et al. (2016a). The Measuring Your Health Study: Leveraging community-based cancer registry recruitment to establish a large, diverse cohort of cancer survivors for analyses of measurement equivalence and validity of thepatient-reported Outcomes Measurement Information System®(PROMIS®) short form items. Psychological Test and Assessment Modeling, 58(1), 99–117.
Jensen, R. E., King-Kallimanis, B. L., Sexton, E., Reeve, B. B., Moinpour, C. M., Potosky, A. L., et al. (2016b). Measurement properties of the PROMIS\(^{\textregistered }\) Sleep Disturbance short form in a large, ethnically diverse cancer cohort. Psychological Test and Assessment Modeling, 58(2), 353–370.
Jin, K. Y., Chen, H. F., & Wang, W. C. (2018). Using odds ratios to detect differential item functioning. Applied Psychological Measurement, 42, 613–29.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349. https://doi.org/10.1207/S15324818AME1404_2.
Jones, R. N. (2006). Identification of measurement differences between English and Spanish language versions for the Mini-Mental State Examination: Detecting differential item functioning using MIMIC modeling. Medical Care, 44(11 Suppl 3), S124–S133. https://doi.org/10.1097/01.mlr.0000245250.50114.0f.
Jones, R. N. (2019). Differential item functioning and its relevance to epidemiology. Current Epidemiology Reports,. https://doi.org/10.1007/s40471-019-00194-5.
Jones, R. N., Tommet, D., Ramirez, M., Jensen, R. E., & Teresi, J. A. (2016). Differential item functioning in Patient Reported Outcomes Measurement Information System (PROMIS\(^{\textregistered }\)) Physical Functioning short forms: Analyses across ethnically diverse groups. Psychological Test and Assessment Modeling, 58(2), 371–402.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36(4), 408–426. https://doi.org/10.1007/BF02291366.
Jöreskog, K., & Goldberger, A. (1975). Estimation of a model of multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 10, 631–639. https://doi.org/10.2307/2285946.
Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three approaches. Multivariate Behavioral Research, 36(3), 347–387. https://doi.org/10.1207/S15327906347-387.
Jöreskog, K., & Sorbom, D. (1996). LISREL8: Analysis of linear structural relationships: Users Reference Guide. Lincolnwood: Scientific Software International Inc.
Junker, B. W. (1991). Essential independence and likelihood-based ability estimation for polytomous items. Psychometrika, 56, 255–278. https://doi.org/10.1007/BF02294462.
Kahraman, N., DeBoeck, P., & Janssen, R. (2009). Modeling DIF in complex response data using test design strategies. International Journal of Testing, 8, 151–166. https://doi.org/10.1080/15305050902880744.
Kim, E. S., & Yoon, M. (2011). Testing measurement invariance: A comparison of multiple group categorical CFA and IRT. Structural Equation Modeling, 18, 212–228. https://doi.org/10.1080/10705511-2011.557337.
Kim, E. S., Yoon, M., & Lee, T. (2012). Testing measurement invariance using MIMIC: Likelihood ratio test with a critical value adjustment. Educational and Psychological Measurement, 72, 469–492. https://doi.org/10.1177/0013164411427395.
Kim, S.-H., & Cohen, A. S. (1998). Detection of differential item functioning under the graded response model with the likelihood ratio test. Applied Psychological Measurement, 22, 345–355. https://doi.org/10.1177/014662169802200403.
Kim, S.-H., Cohen, A. S., Alagoz, C., & Kim, S. (2007). DIF detection and effect size measures for polytomously scored items. Journal of Educational Measurement, 44(2), 93–116. https://doi.org/10.1111/j.1745-3984.2007.00029.x.
Kleinman, M., & Teresi, J. A. (2016). Differential item functioning magnitude and impact measures from item response theory models. Psychological Test and Assessment Modeling, 58, 79–98.
Kopf, J., Zeileis, A., & Stobl, C. (2015a). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39, 83–103. https://doi.org/10.1177/0146621614544195.
Kopf, J., Zeileis, A., & Stobl, C. (2015b). Anchor selection strategies for DIF analysis: Review, assessment and new approaches. Educational and Psychological Measurement, 75, 22–56. https://doi.org/10.1177/0013164414529792.
Langer, M. M. (2008). A re-examination of Lord’s Wald test for differential item functioning using item response theory and modern error estimation (Doctoral dissertation, University of North Carolina at Chapel Hill library). http://search.lib.unc.edu/search?R=UNCb5878458.
Lee, S., Bulut, O., & Suh, Y. (2017). Multidimensional extension of multiple indicators multiple causes models to detect DIF. Educational and Psychological Measurement, 77(4), 545–569.
Li, Y., Brooks, G. P., & Johanson, G. A. (2012). Item discrimination and Type I error in the detection of differential item functioning. Educational and Psychological Measurement, 72, 847–861. https://doi.org/10.1177/0013164411432333.
Liu, Y., Magnus, B. E., & Thissen, D. (2016). Modeling and testing differential item functioning in unidimensional binary item response models with a single continuous covariate: A functional data analysis approach. Psychometrika, 81, 371–398.
Lopez Rivas, G. E., Stark, S., & Chernyshenko, O. S. (2009). The effects of referent item parameters on differential item functioning detection using the free baseline likelihood ratio test. Applied Psychological Measurement, 33, 251–265. https://doi.org/10.1177/0146621608321760.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Lord, F. M., Novick, M. R., & (with contributions by A. Birnbaum). (1968). Statistical theories of mental test scores. Reading Massachusetts: Addison-Wesley Publishing Company Inc.
Mazor, K. M., Hambleton, R. K., & Clauser, B. E. (1998). Multidimensional DIF analyses: The effects of matching on unidimensional subtest scores. Applied Psychological Measurement, 22, 357–367. https://doi.org/10.1177/014662169802200404.
McDonald, R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24, 99–114. https://doi.org/10.1177/01466210022031552.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of IRT and CFA methodologies for establishing measurement equivalence. Organizational Research Methods, 7, 361–388. https://doi.org/10.1177/1094428104268027.
Meade, A., Lautenschlager, G., & Johnson, E. (2007). A Monte Carlo examination of the sensitivity of the differential functioning of items and tests framework for tests of measurement invariance with Likert data. Applied Psychological Measurement, 31, 430–455. https://doi.org/10.1177/0146621606297316.
Meade, A. W., & Wright, N. A. (2012). Solving the measurement invariance anchor item problem in item response theory. Journal of Applied Psychology, 97, 1016–1031. https://doi.org/10.1037/a0027934.
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127–143. https://doi.org/10.1016/0883-0355(89)90002-5.
Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115, 302–307. https://doi.org/10.1037/0033-2909.115.2.300.
Meredith, W. (1964). Notes on factorial invariance. Psychometrika, 29, 177–185. https://doi.org/10.1007/BF02289699.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543. https://doi.org/10.1007/BF02294825.
Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invariance. Medical Care, 44(Suppl 3), S69–S77. https://doi.org/10.1097/01.mlr.0000245438.73837.89.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334. https://doi.org/10.1177/014662169301700401.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195. https://doi.org/10.1007/BF02293979.
Montoya, A. K., & Jeon, M. (2020). MIMIC models for uniform and nonuniform DIF as moderated mediation models. Applied Psychological Measurement, 44(2), 118–136.
Mukherjee, S., Gibbons, L. E., Kristjansson, E., & Crane, P. K. (2013). Extension of an iterative hybrid ordinal logistic regression/item response theory approach to detect and account for differential item functioning in longitudinal data. Psychological Test and Assessment Modeling, 55(2), 127–147.
Muthén, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115–132. https://doi.org/10.1007/BF02294210.
Muthén, B. (1989). Latent variable modeling in heterogeneous populations. Meetings of Psychometric Society (1989, Los Angeles, California and Leuven, Belgium). Psychometrika, 54(4), 557–585.
Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 81–117.
Muthén, B., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus (p 16). Los Angeles: University of California.
Muthén, L. K. & Muthén, B. O. (1998–2019). M-PLUS Users Guide. Sixth Edition. Los Angeles, California: Authors Muthén and Muthén.
Muthén, B., du Toit, S.H.C. & Spisic, D. (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Unpublished Technical Report. Available at https://www.statmodel.com/wlscv.shtml.
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20, 257–274.
Oort, E. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling, 5, 107–124.
Orlando-Edelen, M., Stuckey, B. D., & Chandra, A. (2015). Quantifying ‘problematic’ DIF within an IRT framework: Application to a cancer stigma index. Quality of Life Research, 24, 95–103. https://doi.org/10.1007/s11136-013-0540-4.
Orlando-Edelen, M., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006). Identification of differential item functioning using item response theory ad the likelihood-based model comparison approach: Applications to the Mini-Mental State Examination. Medical Care, 44, S134–S142. https://doi.org/10.1097/01.mlr.0000245251.83359.8c.
Oshima, T. C., Kushubar, S., Scott, J. C., & Raju, N. S. (2009). DFIT8 for Window User’s Manual: Differential functioning of items and tests. St. Paul MN: Assessment Systems Corporation.
Oshima, T. C., Raju, N. S., & Nanda, A. O. (2006). A new method for assessing the statistical significance of the differential functioning of items and tests (DFIT) framework. Journal of Educational Measurement, 43, 1–17. https://doi.org/10.1111/j.1745-3984.2006.00001.x.
Paz, S. H., Spritzer, K. L., Morales, L., & Hays, R. D. (2013). Evaluation of the Patient-Reported outcomes Information System (PROMIS) Spanish-language physical functioning items. Quality of Life Research, 22, 1819–1830. https://doi.org/10.1007/s11136-012-0292-6.
Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., & Cella, D. (2011). Item banks for measuring emotional distress from the patient-reported outcomes measurement information system (PROMIS): Depression, Anxiety and Anger. Assessment, 18, 263–283.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502. https://doi.org/10.1007/BF02294403.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207. https://doi.org/10.1177/014662169001400208.
Raju, N. S. (1999). DFITP5: A Fortran program for calculating dichotomous DIF/DTF [Computer program]. Chicago: Illinois Institute of Technology.
Raju, N. S., Fortmann-Johnson, K. A., Kim, W., Morris, S. B., Nering, M. L., & Oshima, T. C. (2009). The item parameter replication method for detecting differential functioning in the polytomous DFIT framework. Applied Psychological Measurement, 33, 133–147. https://doi.org/10.1177/0146621608319514.
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–528. https://doi.org/10.1037//0021-9010.87.3.517.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353–368. https://doi.org/10.1177/014662169501900405.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: DenmarksPaedagogiskeInstitut (Danish Institute of Educational Research).
Raykov, T., Marcoulides, G. A., Menold, N., & Harrison, M. (2019). Revisiting the bi-factor model: Can mixture modeling help assess its applicability? Structural Equation Modeling, 26, 110–118.
Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15, 361–373.
Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., et al. (2007). Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the Patient-Reported Outcome Measurement Information System (PROMIS). Medical Care, 45(5 Suppl 1), S22–S31. https://doi.org/10.1097/01.mlr.0000250483.85507.04.
Reeve, B. B., & Teresi, J. A. (2016). Overview to the two-part series: Measurement equivalence of the Patient Reported Outcomes Measurement Information System\(^{@}\) (PROMIS)\(^{@}\) short forms. Psychological Test and Assessment Modeling, 58(1), 31–35.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667–696. https://doi.org/10.1080/00273171.2012.715555.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566. https://doi.org/10.1037/0033-2909.114.3.552.
Rikis, D. R. J., & Oshima, T. C. (2017). Effect of purification procedures on DIF analysis in IRTPRO. Educational and Psychological Measurement, 77, 415–428.
Rizopoulus, D. (2006). Ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software, 17, 1–25. https://doi.org/10.18637/jss.v017.i05.
Rizopoulus, D. (2009). Ltm: Latent Trait Models under IRT. http://cran.rproject.org/web/packages/ltm/index.html.
Rouquette, A., Hardouin, J. B., Vanhaesebrouck, A., Véronique Sébille, V., & Coste, J. (2019). Differential item functioning (DIF) in composite health measurement scale: Recommendations for characterizing DIF with meaningful consequences within the Rasch model framework. PLoS ONE, 14(4), e0215073. https://doi.org/10.1371/journal.pone.0215073.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34, 100–114. https://doi.org/10.1007/BF02290599.
Schalet, B. D., Pilkonis, P. A., Yu, L., Dodds, N., Johnston, K. L., Yount, S., et al. (2016). Clinical validity of PROMIS depression, anxiety and anger across diverse clinical groups. Journal of Clinical Epidemiology, 73, 119–127. https://doi.org/10.1016/j.jclinepi2015.08.036.
Setodji, C. M., Reise, S. P., Morales, L. S., Fongwam, N., & Hays, R. D. (2011). Differential item functioning by survey language among older Hispanics enrolled in Medicare Managed Care a new method for anchor item selection. Medical Care, 49, 461–468. https://doi.org/10.1097/MLR.0b013e318207edb5.
Seybert, J., & Stark, S. (2012). Iterative linking with the differential functioning of items and tests (DFIT) Method: Comparison of testwide and item parameter replication (IPR) critical values. Applied Psychological Measurement, 36, 494–515. https://doi.org/10.1177/0146621612445182.
Shealy, R. T., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159–194.
Shih, C.-L., Liu, T.-H., & Wang, W.-C. (2014). Controlling type 1 error rates in assessing DIF for logistic regression method with SIBTEST regression correction procedure and DIF-free-then-DIF strategy. Educational and Psychological Measurement, 74, 1018–1048. https://doi.org/10.1177/0013164413520545.
Shih, C.-L., & Wang, W.-C. (2009). Differential item functioning detection using multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33, 184–199. https://doi.org/10.1177/0146621608321758.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89, 497–508. https://doi.org/10.1037/0021-9010.89.3.497.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1292–1306. https://doi.org/10.1037/0021-9010.91.6.1292.
Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11, 402–415. https://doi.org/10.1037/1082-989X.11.4.402.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210.
Stout, W. F. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589–617.
Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensional assessment and ability estimation. Psychometrika, 55, 293–326.
Stout, W., Li, H., Nandakumar, R., & Bolt, D. (1997). MULTISIB—A procedure to investigate DIF when a test is intentionally multidimensional. Applied Psychological Measurement, 21, 195–213.
Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: A new method for detecting differential item functioning in the Rasch model. Psychometrika, 80, 289–316. https://doi.org/10.1007/s11336-013-9388-3.
Suh, Y., & Cho, S.-J. (2014). Chi-square difference tests for detecting differential functioning in a multidimensional IRT model: A Monte Carlo study. Applied Psychological Measurement, 38(5), 359–375.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393–408. https://doi.org/10.1007/BF02294363.
Taple, B. J., Griffith, J. W., & Wolf, M. S. (2019). Interview administration of PROMIS depression and anxiety short forms. Health Literacy Research and Practice, 3, e196–e204. https://doi.org/10.3928/24748307-20190626-01.
Teresi, J. A. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44(Suppl. 11), S152–S170. https://doi.org/10.1097/01.mlr.0000245142.74628.ab.
Teresi, J. A. (2019). Applying and Acting on DIF. Moderator at the 2019 PROMIS Psychometric Summit, Northwestern University, Chicago, IL.
Teresi, J. A., & Jones, R. N. (2013). Bias in psychological assessment and other measures. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology: Vol. 1. Test theory and testing and assessment in industrial and organizational psychology (pp. 139–164). Washington, DC: American Psychological Association. https://doi.org/10.1037/14047-008.
Teresi, J. A., & Jones, R. N. (2016). Methodological issues in examining measurement equivalence in patient reported outcomes measures: Methods overview to the two-part series, “Measurement Equivalence of the Patient Reported Outcomes Measurement Information System (PROMIS) Short Form Measures”. Psychological Test and Assessment Modeling, 58(1), 37–78.
Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2000). Modern psychometric methods for detection of differential item functioning: Application to cognitive assessment measures. Statistics in Medicine, 19, 1651–1683.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Cook, K. F., Crane, P. K., Gibbons, L. E., et al. (2007). Evaluating measurement equivalence using the item response theory log-likelihood ratio (IRTLR) method to assess differential item functioning (DIF): Applications (with illustrations) to measures of physical functioning ability and general distress. Quality of Life Research, 16, 43–68. https://doi.org/10.1007/s11136-007-9186-4.
Teresi, J., Ocepek-Welikson, K., Kleinman, M., Eimicke, J. E., Crane, P. K., Jones, R. N., et al. (2009). Analysis of differential item functioning in the depression item bank from the Patient Reported Outcome Measurement Information System (PROMIS): An item response theory approach. Psychology Science Quarterly, 51(2), 148–180. PMCID: PMC2844669. NIHMSID: 136951.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Ramirez, M., & Kim, G. (2016a). Psychometric properties and performance of the Patient Reported Outcomes Measurement Information System® (PROMIS®) depression short forms in ethnically diverse groups. Psychological Test and Assessment Modeling, 58(1), 141–181.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Ramirez, M., & Kim, G. (2016b). Measurement equivalence of the Patient Reported Outcomes Measurement Information System® (PROMIS®) anxiety short forms in ethnically diverse groups. Psychological Test and Assessment Modeling, 58(1), 183–219.
Teresi, J. A., Ramirez, M., Jones, R. N., Choi, S., & Crane, P. K. (2012). Modifying measures based on Differential Item Functioning (DIF) impact analyses. Journal of Aging & Health, 24(6), 1044–1076. https://doi.org/10.1177/0898264312436877.
Teresi, J. A., & Reeve, B. B. (2016). Epilogue to the two-part series: Measurement equivalence of the Patient Reported Outcomes Measurement Information System (PROMIS) short forms. Psychological Tests and Assessment Modeling, 58(2), 423–433.
Thissen, D. (1991). MULTILOG™ user's guide: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software Inc.
Thissen, D. (2001). IRTLRDIF v.2.0b: Software for the computation of the statistics involved in item response theory likelihood ratio tests for differential item functioning. Unpublished manual from the L.L. Thurstone Psychometric Laboratory: University of North Carolina at Chapel Hill.
Thissen, D., Steinberg, L., & Kuang, D. (2002). Quick and easy implementation of the Benjamini–Hochberg procedure for controlling the false discovery rate in multiple comparisons. Journal of Educational and Behavioral Statistics, 27, 77–83. https://doi.org/10.3102/10769986027001077.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. Braun (Eds.), Test validity (pp. 147–169). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ: Lawrence Erlbaum Inc.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70. https://doi.org/10.1177/109442810031002.
Wainer, H. (1993). Model-based standardization measurement of an item’s differential impact. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning (pp. 123–135). Hillsdale NJ: Lawrence Erlbaum Inc.
Wang, T., Strobl, C., Zeileis, A., & Merkle, E. C. (2018). Score-based tests of differential item functioning via pairwise maximum likelihood estimation. Psychometrika, 83, 132–155. https://doi.org/10.1007/s11336-017-9591-8.
Wang, W. (2004). Effects of anchor item methods on detection of differential item functioning within the family of Rasch models. Journal of Experimental Education, 72, 221–261. https://doi.org/10.3200/JEXE.72.3.221-261.
Wang, W.-C., & Shih, C.-L. (2010). MIMIC methods for assessing differential item functioning in polytomous items. Applied Psychological Measurement, 34, 166–180. https://doi.org/10.1177/0146621609355279.
Wang, W.-C., Shih, C.-L., & Sun, G.-W. (2012). The DIF-free-then DIF strategy for the assessment of differential item functioning (DIF). Educational and Psychological Measurement, 72, 687–708. https://doi.org/10.1177/0013164411426157.
Wang, W.-C., Shih, C.-L., & Yang, C.-C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69, 713–731. https://doi.org/10.1177/0013164409332228.
Wang, W. C., & Yeh, Y. L. (2003). Effects of anchor item methods on differential item functioning detection with likelihood ratio test. Applied Psychological Measurement, 27, 479–498. https://doi.org/10.1177/0146621603259902.
Wang, M., & Woods, C. M. (2017). Anchor selection using the Wald test anchor-all-test-all procedure. Applied Psychological Measurement, 41, 17–29. https://doi.org/10.1177/0146621616668014.
Woods, C. M. (2009a). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33, 42–57. https://doi.org/10.1177/0146621607314044.
Woods, C. M. (2009b). Evaluation of MIMIC-model methods for DIF testing with comparison of two group analysis. Multivariate Behavioral Research, 44, 1–27. https://doi.org/10.1080/00273170802620121.
Woods, C. M. (2011). DIF testing for ordinal items with Poly-SIBTEST, the Mantel and GMH tests and IRTLRDIF when the latent distribution is nonnormal for both groups. Applied Psychological Measurement, 35, 145–164. https://doi.org/10.1177/0146621610377450.
Woods, C. M., Cai, L., & Wang, M. (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73, 532–547. https://doi.org/10.1177/0013164412464875.
Woods, C. M., & Grimm, K. J. (2011). Testing for nonuniform differential item functioning with multiple indicator multiple cause models. Applied Psychological Measurement, 35, 339–361. https://doi.org/10.1177/0146621611405984.
Woods, C. M., & Harpole, J. (2015). How item residual heterogeneity affects tests for differential item functioning. Applied Psychological Measurement, 39, 251–263. https://doi.org/10.1177/0146621614561313.
Yost, K. J., Eton, D. T., Garcia, S. F., & Cella, D. (2011). Minimally important differences were estimated for six PROMIS cancer scales in advanced-stage cancer patients. Journal of Clinical Epidemiology, 64(5), 507–516.
Yu, Q., Medeiros, K. L., Wu, X., & Jensen, R. E. (2018). Nonlinear predictive models for multiple mediation analysis with an application to explore ethnic disparities in anxiety and depression among cancer survivors. Psychometrika, 83, 991–1006.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved from http://www.educ.ubc.ca/faculty/zumbo/DIF/index.html.
Zwitser, R. J., Glaser, S. F., & Maris, G. (2017). Monitoring countries in a changing world: A new look at DIF in international surveys. Psychometrika, 82(1), 210–232. https://doi.org/10.1007/s11336-016-9543-8.
Funding
U01AR057971 (PIs: Potosky, Moinpour), NCI P30CA051008, UL1TR000101 (previously UL1RR031975) from the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through the Clinical and Translational Science Awards Program (CTSA). Analyses of these data were supported by the Mount Sinai Claude D. Pepper Older Americans Independence Center (National Institute on Aging, 1P30AG028741, Siu) and the Columbia University Alzheimer’s Disease Resource Center for Minority Aging Research (National Institute on Aging, 1P30AG059303, Manly, Luchsinger). This research was also supported by the Eunice Kennedy Shriver National Institutes of Child Health and Human Development of the National Institutes of Health under Award Number R01HD079439 to the Mayo Clinic in Rochester Minnesota through subcontracts to the University of Minnesota and the University of Washington. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors thank Katja Ocepek-Welikson, M.Phil., for analytic assistance and Ruoyi Zhu, a doctoral student in the College of Education, University of Washington for assistance in conducting the simulation study.
Teresi, J.A., Wang, C., Kleinman, M. et al. Differential Item Functioning Analyses of the Patient-Reported Outcomes Measurement Information System (PROMIS®) Measures: Methods, Challenges, Advances, and Future Directions. Psychometrika 86, 674–711 (2021). https://doi.org/10.1007/s11336-021-09775-0