Abstract
Several methods used to examine differential item functioning (DIF) in Patient-Reported Outcomes Measurement Information System (PROMIS®) measures are presented, including effect size estimation. A summary is provided of factors that may affect DIF detection and of challenges encountered in PROMIS DIF analyses, e.g., anchor item selection. One issue in PROMIS was the potential for inadequately modeled multidimensionality to result in false DIF detection. Section 1 presents the unidimensional models used by most PROMIS investigators for DIF detection, together with their multidimensional extensions. Section 2 builds on previous unidimensional analyses of depression and anxiety short forms to examine DIF detection using a multidimensional item response theory (MIRT) model. The item response theory log-likelihood ratio test (IRT-LRT) method was used for a real-data illustration with gender as the grouping variable. The IRT-LRT method is a flexible approach for handling group differences in trait distributions, known as impact in the DIF literature; its performance within the unidimensional IRT (UIRT) and MIRT contexts was compared using both real data and simulations. Different effect size measures were also compared for the data presented in Section 2. In the real-data illustration, the IRT-LRT method flagged more items within the MIRT context than within the UIRT context. The simulations provided some evidence that, while the unidimensional and multidimensional approaches had similar Type I error rates, the multidimensional approach had greater power for DIF detection. The effect size measures presented in Section 1 and applied in Section 2 varied in estimation method, choice of density function, method of equating, and anchor item selection.
Despite these differences, results were considerably consistent, especially for the items showing the largest values. Future work is needed to examine DIF detection in the context of polytomous, multidimensional data. PROMIS standards include the incorporation of effect size measures in determining salient DIF. Integrated methods for examining effect size measures within IRT-based DIF detection procedures are still in the early stages of development.
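The nested-model comparison underlying likelihood-ratio DIF testing can be sketched briefly. The following is a minimal illustration, not the PROMIS implementation: it uses simulated binary responses, treats the simulated trait as observed (real IRT-LRT analyses estimate the latent trait), and compares a compact model (no group terms) against an augmented model with group and trait-by-group terms, capturing uniform and non-uniform DIF respectively. All parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulate: trait theta and group g; the studied item has uniform DIF
# (difficulty shifted by 0.6 logits in the focal group). No impact:
# both groups share the same trait distribution.
n = 2000
g = rng.integers(0, 2, n).astype(float)
theta = rng.normal(0.0, 1.0, n)
a, b, dif = 1.2, 0.0, 0.6
p = 1.0 / (1.0 + np.exp(-a * (theta - (b + dif * g))))
y = (rng.random(n) < p).astype(float)

def nll(beta, X, y):
    """Negative log-likelihood of a logistic response model."""
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def max_loglik(X, y):
    res = minimize(nll, np.zeros(X.shape[1]), args=(X, y), method="BFGS")
    return -res.fun

# Compact model: response depends on the trait only.
X0 = np.column_stack([np.ones(n), theta])
# Augmented model: adds group (uniform DIF) and trait-by-group
# (non-uniform DIF) terms.
X1 = np.column_stack([np.ones(n), theta, g, theta * g])

G2 = 2.0 * (max_loglik(X1, y) - max_loglik(X0, y))  # LR statistic
pval = chi2.sf(G2, df=2)                            # 2 constrained params
print(f"G2 = {G2:.2f}, p = {pval:.4g}")
```

Because DIF was built into the simulated item, the statistic is large and the item is flagged; for a DIF-free item the statistic would follow the chi-square reference distribution under the null.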
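The effect size side can be illustrated in the same spirit. The sketch below, with hypothetical item parameters assumed already equated to a common metric, computes signed and unsigned expected item-score differences between reference- and focal-group item characteristic curves, weighting by a focal-group trait density, an area-type magnitude measure of the kind the abstract contrasts (choices of density, equating, and anchors all change the result).

```python
import numpy as np
from scipy.stats import norm

def icc(theta, a, b):
    """2PL item characteristic curve: P(endorse | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical post-equating item parameters for one studied item.
a_ref, b_ref = 1.2, 0.0      # reference group
a_foc, b_foc = 1.2, 0.4      # focal group (uniform DIF: shifted difficulty)

# Quadrature over the focal-group trait density (standard normal here;
# an empirical focal density is another common choice).
theta = np.linspace(-6.0, 6.0, 1201)
w = norm.pdf(theta)
w /= w.sum()                                  # normalized quadrature weights

diff = icc(theta, a_ref, b_ref) - icc(theta, a_foc, b_foc)
signed = np.sum(w * diff)                     # signed expected difference
unsigned = np.sum(w * np.abs(diff))           # unsigned (area-type) measure

print(f"signed = {signed:.3f}, unsigned = {unsigned:.3f}")
```

With purely uniform DIF the two curves never cross, so the signed and unsigned measures coincide; non-uniform DIF (unequal discriminations) makes the unsigned measure exceed the signed one, which is why both are typically reported.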
References
Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
Ankenmann, R. D., Witt, E. A., & Dunbar, S. B. (1999). An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. Journal of Educational Measurement, 36, 277–300. https://doi.org/10.1111/j.1745-3984.1999.tb00558.x.
Baker, F. B. (1995). EQUATE 2.1: Computer program for equating two metrics in item response theory. Madison: University of Wisconsin, Laboratory of Experimental Design.
Bauer, D., Belzak, W., & Cole, V. (2019). Simplifying the assessment of measurement invariance over multiple background variables: Using regularized moderated nonlinear factor analysis to detect differential item functioning. Structural Equation Modeling A: Multidisciplinary Journal,. https://doi.org/10.1080/10705511.2019.1642754.
Belzak, W., & Bauer, D. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods,. https://doi.org/10.1027/met0000253.
Benjamini, Y., & Hochberg, Y. (1995). Controlling for the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300. https://doi.org/10.2307/2346101.
Bjorner, J. B., Rose, M., Gandek, B., Stone, A. A., Junghaenel, D. U., & Ware, J. E. (2014). Difference in method of administration did not significantly impact item response: An IRT-based analysis from the Patient-Reported Outcomes Measurement Information System (PROMIS) initiative. Quality of Life Research, 23, 217–227.
Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113–141. https://doi.org/10.1207/S15324818AME1502_01.
Boorsboom, D. (2006). Commentary: When does measurement invariance matter? Medical Care, 44(11), S176–81.
Boorsboom, D., Mellenbergh, G. J., & van Heerdon, J. (2002). Different kinds of DIF: A distinction between absolute and relative forms of measurement invariance and bias. Applied Psychological Measurement, 26, 433–450.
Bulut, O., & Suh, Y. (2017). Detecting multidimensional differential item functioning with the multiple indicators multiple causes model, the item response theory likelihood ratio test, and logistic regression. Frontiers in Education, 2, 51. https://doi.org/10.3389/feduc.2017.00051.
Byrne, B. M., Shavelson, R. J., & Muthén, B. O. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–566. https://doi.org/10.1037/0033-2909.105.3.456.
Cai, L. (2008). SEM of another flavour: Two new applications of the supplemented EM algorithm. British Journal of Mathematical and Statistical Psychology, 61, 309–329. https://doi.org/10.1348/000711007X249603.
Cai, L. (2013). FlexMIRT version 2: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.
Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT Modeling [Computer software]. Lincolnwood, IL: Scientific Software International Inc.
Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.
Carle, A. C., Cella, D., Cai, L., Choi, S. W., Crane, P. K., Curtis, S. M., et al. (2011). Advancing PROMIS’s methodology: Results of the third PROMIS Psychometric Summit. Expert Review of Pharmacoeconomics & Outcome Research, 11(6), 677–684. https://doi.org/10.1586/erp.11.74.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., Reeve, B., Ader, D., Fries, J. F., Bruce, B., & Rose, M., on behalf of the PROMIS Cooperative Group. (2007). The patient-reported outcomes measurement information system (PROMIS): Progress of an NIH roadmap cooperative group during its first two years. Medical Care, 45(5 Suppl 1), S3–S11. https://doi.org/10.1097/01.mlr.0000258615.42478.55.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of statistical software, 48(6), 1–29.
Chalmers, R. P. (2016). A differential response functioning framework for understanding item, bundle, and test bias. Doctoral Dissertation, York University, Toronto, Ontario. https://pdfs.semanticscholar.org
Chalmers, R. P. (2018). Model-based measures for detecting and quantifying response bias. Psychometrika, 83, 696–732. https://doi.org/10.1007/s11336-018-9626-9.
Chalmers, R. P., Counsell, A., & Flora, D. B. (2016). It might not make a big DIF: Improved differential test functioning statistics that account for sampling variability. Educational and Psychological Measurement, 76, 114–140.
Chang, Y.-W., Hsu, N.-J., & Tsai, R.-C. (2017). Unifying differential item functioning in factor analysis for categorical data under a discretization of a normal variant. Psychometrika, 82(2), 382–406. https://doi.org/10.1007/s11336-017-9562-0.
Chen, J.-H., Chen, C.-T., & Shih, C.-L. (2013). Improving the control of type I error rate in assessing differential item functioning for hierarchical generalized linear models when impact is present. Applied Psychological Measurement, 38, 18–36. https://doi.org/10.1177/0146621613488643.
Cheng, C.-P., Chen, C.-C., & Shih, C.-L. (2020). An exploratory strategy to identify and define sources of differential item functioning. Applied Psychological Measurement, 4, 548–560. https://doi.org/10.1177/014662/620931/90.
Cheng, Y., Shao, C., & Lathrop, Q. N. (2016). The mediated MIMIC model for understanding the underlying mechanisms of DIF. Educational and Psychological Measurement, 76(1), 43–63.
Cheung, G. W., & Rensvold, R. B. (2003). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. https://doi.org/10.1207/S15328007SEM0902_5.
Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8), 1–30. https://doi.org/10.18637/jss.v039.i08.
Choi, S. W., Reise, S. P., Pilkonis, P. A., Hays, R. D., & Cella, D. (2010). Efficiency of static and computer adaptive short forms compared to full-length measures of depressive symptoms. Quality of Life Research, 19, 125–136.
Clauser, B. E., Mazor, K. M., & Hambleton, R. K. (1993). The effects of purification of the matching criterion on the identification of DIF using the Mantel–Haenszel procedure. Applied Measurement in Education, 6, 269–279.
Cohen, A. S., Kim, S.-H., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17, 335–350. https://doi.org/10.1177/014662169301700402.
Cohen, P., Cohen, J., Teresi, J., Marchi, P., & Velez, N. (1990). Problems in the measurement of latent variables in structural equation causal models. Applied Psychological Measurement, 14(2), 183–196. https://doi.org/10.1177/014662169001400207.
Crane, P. K., Gibbons, L. E., Jolley, L., & van Belle, G. (2006). Differential item functioning analysis with ordinal logistic regression techniques: Difdetect and difwithpar. Medical Care, 44, S115–S123. https://doi.org/10.1097/01.mlr.0000245183.28384.ed.
Crane, P. K., Gibbons, L. E., Ocepek-Welikson, K., Cook, K., Cella, D., & Teresi, J. A. (2007). A comparison of three sets of criteria for determining the presence of differential item functioning using ordinal logistic regression. Quality of Life Research, 16, 69–84. https://doi.org/10.1007/s11136-007-9185-5.
Crane, P. K., van Belle, G., & Larson, E. B. (2004). Test bias in a cognitive test: Differential item functioning in the CASI. Statistics in Medicine, 23, 241–256. https://doi.org/10.1002/sim.1713.
Culpepper, S. A., Aguinis, H., Kern, J. L., & Millsap, R. (2019). High-stakes testing case study: A latent variable approach for assessing measurement and prediction invariance. Psychometrika, 84, 285–309. https://doi.org/10.1007/s11336-018-9549-2.
DeMars, C. E. (2010). Type 1 error inflation for detecting DIF in the presence of impact. Educational and Psychological Measurement, 70, 961–972. https://doi.org/10.1177/0013164410366691.
DeMars, C. E. (2015). Modeling DIF for simulations: Continuous or categorical secondary trait? Psychological Test and Assessment Modeling, 57, 279–300.
Edelen, M., Stucky, B., & Chandra, A. (2015). Quantifying “problematic” DIF within an IRT framework: Application to a cancer stigma index. Quality of Life Research, 24, 95–103. https://doi.org/10.1007/s11136-013-0540-4.
Egberink, I. J. L., Meijer, R. R., & Tendeiro, J. N. (2015). Investigating measurement invariance in computer-based personality testing: The impact of using anchor items on effect size indices. Educational and Psychological Measurement, 75, 126–145. https://doi.org/10.1177/0013164414520965.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel–Haenszel, SIBTEST and the IRT likelihood ratio test. Applied Psychological Measurement, 29, 278–295. https://doi.org/10.1177/0146621605275728.
Fleer, P. F. (1993). A Monte Carlo assessment of a new measure of item and test bias (p. 2266, Vol. 54, No. 04B), Illinois Institute of Technology, Dissertation Abstracts International.
Flowers, C. P., Oshima, T. C., & Raju, N. S. (1999). A description and demonstration of the polytomous DFIT framework. Applied Psychological Measurement, 23, 309–32. https://doi.org/10.1177/01466219922031437.
Furlow, C. F., Ross, T. R., & Gagné, P. (2009). The impact of multidimensionality on the detection of differential bundle functioning using simultaneous item bias test. Applied Psychological Measurement, 33(6), 441–464. https://doi.org/10.1177/0146621609331959.
Gelin, M. N., & Zumbo, B. D. (2003). Differential item functioning results may change depending on how an item is scored: An illustration with the center for epidemiologic studies depression scale. Educational and Psychological Measurement, 63(1), 65–74. https://doi.org/10.1177/0013164402239317.
González-Betanzos, F., & Abad, F. J. (2012). The effects of purification and the evaluation of differential item functioning with the likelihood ratio test. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, 130–145. https://doi.org/10.1027/1614-2241/a000046.
Gómez-Benito, J., Dolores-Hidalgo, M., & Zumbo, B. D. (2013). Effectiveness of combining statistical tests and effect sizes when using logistic discriminant function regression to detect differential item functioning for polytomous items. Educational and Psychological Measurement, 73, 875–897. https://doi.org/10.1177/0013164413492419.
Gregorich, S. E. (2006). Do self-report instruments allow meaningful comparisons across diverse population groups?: Testing measurement invariance using the confirmatory factor analysis framework. Medical Care, 44(11), S78–S94.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications Inc.
Herrel, F. E. (2009). Design; design package. R package version 2:3.0. Retrieved from http://CRANR-project.org/package=Design
Hidalgo, M. D., Gomez-Benito, J., & Zumbo, B. D. (2014). Binary logistic regression analysis for detecting differential item functioning: Effectiveness of \(\text{ R}^{{2}}\) and delta log odds ratio effect size measures. Educational and Psychological Measurement, 74, 927–949. https://doi.org/10.1177/0013164414523618.
Houts, C. R., & Cai, L. (2013). FlexMIRT user’s manual version 2: Flexible multilevel multidimensional item analysis and test scoring. Chapel Hill, NC: Vector Psychometric Group.
Jensen, R. E., Moinpour, C. M., Keegan, T. H. M., Cress, R. D., Wu, X.-C., Paddock, L. A., et al. (2016a). The Measuring Your Health Study: Leveraging community-based cancer registry recruitment to establish a large, diverse cohort of cancer survivors for analyses of measurement equivalence and validity of thepatient-reported Outcomes Measurement Information System®(PROMIS®) short form items. Psychological Test and Assessment Modeling, 58(1), 99–117.
Jensen, R. E., King-Kallimanis, B. L., Sexton, E., Reeve, B. B., Moinpour, C. M., Potosky, A. L., et al. (2016b). Measurement properties of the PROMIS\(^{\textregistered }\) Sleep Disturbance short form in a large, ethnically diverse cancer cohort. Psychological Test and Assessment Modeling, 58(2), 353–370.
Jin, K. Y., Chen, H. F., & Wang, W. C. (2018). Using odds ratios to detect differential item functioning. Applied Psychological Measurement, 42, 613–29.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349. https://doi.org/10.1207/S15324818AME1404_2.
Jones, R. N. (2006). Identification of measurement differences between English and Spanish language versions for the Mini-Mental State Examination: Detecting differential item functioning using MIMIC modeling. Medical Care, 44(11 Suppl 3), S124–S133. https://doi.org/10.1097/01.mlr.0000245250.50114.0f.
Jones, R. N. (2019). Differential item functioning and its relevance to epidemiology. Current Epidemiology Reports,. https://doi.org/10.1007/s40471-019-00194-5.
Jones, R. N., Tommet, D., Ramirez, M., Jensen, R. E., & Teresi, J. A. (2016). Differential item functioning in Patient Reported Outcomes Measurement Information System (PROMIS\(^{\textregistered }\)) Physical Functioning short forms: Analyses across ethnically diverse groups. Psychological Test and Assessment Modeling, 58(2), 371–402.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36(4), 408–426. https://doi.org/10.1007/BF02291366.
Jöreskog, K., & Goldberger, A. (1975). Estimation of a model of multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 10, 631–639. https://doi.org/10.2307/2285946.
Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three approaches. Multivariate Behavioral Research, 36(3), 347–387. https://doi.org/10.1207/S15327906347-387.
Jöreskog, K., & Sorbom, D. (1996). LISREL8: Analysis of linear structural relationships: Users Reference Guide. Lincolnwood: Scientific Software International Inc.
Junker, B. W. (1991). Essential independence and likelihood-based ability estimation for polytomous items. Psychometrika, 56, 255–278. https://doi.org/10.1007/BF02294462.
Kahraman, N., DeBoeck, P., & Janssen, R. (2009). Modeling DIF in complex response data using test design strategies. International Journal of Testing, 8, 151–166. https://doi.org/10.1080/15305050902880744.
Kim, E. S., & Yoon, M. (2011). Testing measurement invariance: A comparison of multiple group categorical CFA and IRT. Structural Equation Modeling, 18, 212–228. https://doi.org/10.1080/10705511-2011.557337.
Kim, E. S., Yoon, M., & Lee, T. (2012). Testing measurement invariance using MIMIC: Likelihood ratio test with a critical value adjustment. Educational and Psychological Measurement, 72, 469–492. https://doi.org/10.1177/0013164411427395.
Kim, S.-H., & Cohen, A. S. (1998). Detection of differential item functioning under the graded response model with the likelihood ratio test. Applied Psychological Measurement, 22, 345–355. https://doi.org/10.1177/014662169802200403.
Kim, S.-H., Cohen, A. S., Alagoz, C., & Kim, S. (2007). DIF detection and effect size measures for polytomously scored items. Journal of Educational Measurement, 44(2), 93–116. https://doi.org/10.1111/j.1745-3984.2007.00029.x.
Kleinman, M., & Teresi, J. A. (2016). Differential item functioning magnitude and impact measures from item response theory models. Psychological Test and Assessment Modeling, 58, 79–98.
Kopf, J., Zeileis, A., & Stobl, C. (2015a). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39, 83–103. https://doi.org/10.1177/0146621614544195.
Kopf, J., Zeileis, A., & Stobl, C. (2015b). Anchor selection strategies for DIF analysis: Review, assessment and new approaches. Educational and Psychological Measurement, 75, 22–56. https://doi.org/10.1177/0013164414529792.
Langer, M. M. (2008). A re-examination of Lord’s Wald test for differential item functioning using item response theory and modern error estimation (Doctoral dissertation, University of North Carolina at Chapel Hill library). http://search.lib.unc.edu/search?R=UNCb5878458.
Lee, S., Bulut, O., & Suh, Y. (2017). Multidimensional extension of multiple indicators multiple causes models to detect DIF. Educational and Psychological Measurement, 77(4), 545–569.
Li, Y., Brooks, G. P., & Johanson, G. A. (2012). Item discrimination and Type I error in the detection of differential item functioning. Educational and Psychological Measurement, 72, 847–861. https://doi.org/10.1177/0013164411432333.
Liu, Y., Magnus, B. E., & Thissen, D. (2016). Modeling and testing differential item functioning in unidimensional binary item response models with a single continuous covariate: A functional data analysis approach. Psychometrika, 81, 371–398.
Lopez Rivas, G. E., Stark, S., & Chernyshenko, O. S. (2009). The effects of referent item parameters on differential item functioning detection using the free baseline likelihood ratio test. Applied Psychological Measurement, 33, 251–265. https://doi.org/10.1177/0146621608321760.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Lord, F. M., Novick, M. R., & (with contributions by A. Birnbaum). (1968). Statistical theories of mental test scores. Reading Massachusetts: Addison-Wesley Publishing Company Inc.
Mazor, K. M., Hambleton, R. K., & Clauser, B. E. (1998). Multidimensional DIF analyses: The effects of matching on unidimensional subtest scores. Applied Psychological Measurement, 22, 357–367. https://doi.org/10.1177/014662169802200404.
McDonald, R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24, 99–114. https://doi.org/10.1177/01466210022031552.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of IRT and CFA methodologies for establishing measurement equivalence. Organizational Research Methods, 7, 361–388. https://doi.org/10.1177/1094428104268027.
Meade, A., Lautenschlager, G., & Johnson, E. (2007). A Monte Carlo examination of the sensitivity of the differential functioning of items and tests framework for tests of measurement invariance with Likert data. Applied Psychological Measurement, 31, 430–455. https://doi.org/10.1177/0146621606297316.
Meade, A. W., & Wright, N. A. (2012). Solving the measurement invariance anchor item problem in item response theory. Journal of Applied Psychology, 97, 1016–1031. https://doi.org/10.1037/a0027934.
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127–143. https://doi.org/10.1016/0883-0355(89)90002-5.
Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115, 302–307. https://doi.org/10.1037/0033-2909.115.2.300.
Meredith, W. (1964). Notes on factorial invariance. Psychometrika, 29, 177–185. https://doi.org/10.1007/BF02289699.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543. https://doi.org/10.1007/BF02294825.
Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invariance. Medical Care, 44(Suppl 3), S69–S77. https://doi.org/10.1097/01.mlr.0000245438.73837.89.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334. https://doi.org/10.1177/014662169301700401.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195. https://doi.org/10.1007/BF02293979.
Montoya, A. K., & Jeon, M. (2020). MIMIC models for uniform and nonuniform DIF as moderated mediation models. Applied Psychological Measurement, 44(2), 118–136.
Mukherjee, S., Gibbons, L. E., Kristjansson, E., & Crane, P. K. (2013). Extension of an iterative hybrid ordinal logistic regression/item response theory approach to detect and account for differential item functioning in longitudinal data. Psychological Test and Assessment Modeling, 55(2), 127–147.
Muthén, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115–132. https://doi.org/10.1007/BF02294210.
Muthén, B. (1989). Latent variable modeling in heterogeneous populations. Meetings of Psychometric Society (1989, Los Angeles, California and Leuven, Belgium). Psychometrika, 54(4), 557–585.
Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 81–117.
Muthén, B., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus (p 16). Los Angeles: University of California.
Muthén, L. K. & Muthén, B. O. (1998–2019). M-PLUS Users Guide. Sixth Edition. Los Angeles, California: Authors Muthén and Muthén.
Muthén, B., du Toit, S.H.C. & Spisic, D. (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Unpublished Technical Report. Available at https://www.statmodel.com/wlscv.shtml.
Narayanan, P., & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20, 257–274.
Oort, E. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling, 5, 107–124.
Orlando-Edelen, M., Stuckey, B. D., & Chandra, A. (2015). Quantifying ‘problematic’ DIF within an IRT framework: Application to a cancer stigma index. Quality of Life Research, 24, 95–103. https://doi.org/10.1007/s11136-013-0540-4.
Orlando-Edelen, M., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006). Identification of differential item functioning using item response theory ad the likelihood-based model comparison approach: Applications to the Mini-Mental State Examination. Medical Care, 44, S134–S142. https://doi.org/10.1097/01.mlr.0000245251.83359.8c.
Oshima, T. C., Kushubar, S., Scott, J. C., & Raju, N. S. (2009). DFIT8 for Window User’s Manual: Differential functioning of items and tests. St. Paul MN: Assessment Systems Corporation.
Oshima, T. C., Raju, N. S., & Nanda, A. O. (2006). A new method for assessing the statistical significance of the differential functioning of items and tests (DFIT) framework. Journal of Educational Measurement, 43, 1–17. https://doi.org/10.1111/j.1745-3984.2006.00001.x.
Paz, S. H., Spritzer, K. L., Morales, L., & Hays, R. D. (2013). Evaluation of the Patient-Reported outcomes Information System (PROMIS) Spanish-language physical functioning items. Quality of Life Research, 22, 1819–1830. https://doi.org/10.1007/s11136-012-0292-6.
Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., & Cella, D. (2011). Item banks for measuring emotional distress from the patient-reported outcomes measurement information system (PROMIS): Depression, Anxiety and Anger. Assessment, 18, 263–283.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502. https://doi.org/10.1007/BF02294403.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207. https://doi.org/10.1177/014662169001400208.
Raju, N. S. (1999). DFITP5: A Fortran program for calculating dichotomous DIF/DTF [Computer program]. Chicago: Illinois Institute of Technology.
Raju, N. S., Fortmann-Johnson, K. A., Kim, W., Morris, S. B., Nering, M. L., & Oshima, T. C. (2009). The item parameter replication method for detecting differential functioning in the polytomous DFIT framework. Applied Psychological Measurement, 33, 133–147. https://doi.org/10.1177/0146621608319514.
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–528. https://doi.org/10.1037//0021-9010.87.3.517.
Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353–368. https://doi.org/10.1177/014662169501900405.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: DenmarksPaedagogiskeInstitut (Danish Institute of Educational Research).
Raykov, T., Marcoulides, G. A., Menold, N., & Harrison, M. (2019). Revisiting the bi-factor model: Can mixture modeling help assess its applicability? Structural Equation Modeling, 26, 110–118.
Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15, 361–373.
Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., et al. (2007). Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the Patient-Reported Outcome Measurement Information System (PROMIS). Medical Care, 45(5 Suppl 1), S22–S31. https://doi.org/10.1097/01.mlr.0000250483.85507.04.
Reeve, B. B., & Teresi, J. A. (2016). Overview to the two-part series: Measurement equivalence of the Patient Reported Outcomes Measurement Information System\(^{@}\) (PROMIS)\(^{@}\) short forms. Psychological Test and Assessment Modeling, 58(1), 31–35.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667–696. https://doi.org/10.1080/00273171.2012.715555.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566. https://doi.org/10.1037/0033-2909.114.3.552.
Rikis, D. R. J., & Oshima, T. C. (2017). Effect of purification procedures on DIF analysis in IRTPRO. Educational and Psychological Measurement, 77, 415–428.
Rizopoulus, D. (2006). Ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software, 17, 1–25. https://doi.org/10.18637/jss.v017.i05.
Rizopoulus, D. (2009). Ltm: Latent Trait Models under IRT. http://cran.rproject.org/web/packages/ltm/index.html.
Rouquette, A., Hardouin, J. B., Vanhaesebrouck, A., Véronique Sébille, V., & Coste, J. (2019). Differential item functioning (DIF) in composite health measurement scale: Recommendations for characterizing DIF with meaningful consequences within the Rasch model framework. PLoS ONE, 14(4), e0215073. https://doi.org/10.1371/journal.pone.0215073.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34, 100–114. https://doi.org/10.1007/BF02290599.
Schalet, B. D., Pilkonis, P. A., Yu, L., Dodds, N., Johnston, K. L., Yount, S., et al. (2016). Clinical validity of PROMIS depression, anxiety and anger across diverse clinical groups. Journal of Clinical Epidemiology, 73, 119–127. https://doi.org/10.1016/j.jclinepi2015.08.036.
Setodji, C. M., Reise, S. P., Morales, L. S., Fongwam, N., & Hays, R. D. (2011). Differential item functioning by survey language among older Hispanics enrolled in Medicare Managed Care a new method for anchor item selection. Medical Care, 49, 461–468. https://doi.org/10.1097/MLR.0b013e318207edb5.
Seybert, J., & Stark, S. (2012). Iterative linking with the differential functioning of items and tests (DFIT) Method: Comparison of testwide and item parameter replication (IPR) critical values. Applied Psychological Measurement, 36, 494–515. https://doi.org/10.1177/0146621612445182.
Shealy, R. T., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159–194.
Shih, C.-L., Liu, T.-H., & Wang, W.-C. (2014). Controlling type 1 error rates in assessing DIF for logistic regression method with SIBTEST regression correction procedure and DIF-free-then-DIF strategy. Educational and Psychological Measurement, 74, 1018–1048. https://doi.org/10.1177/0013164413520545.
Shih, C.-L., & Wang, W.-C. (2009). Differential item functioning detection using multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33, 184–199. https://doi.org/10.1177/0146621608321758.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89, 497–508. https://doi.org/10.1037/0021-9010.89.3.497.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91, 1292–1306. https://doi.org/10.1037/0021-9010.91.6.1292.
Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11, 402–415. https://doi.org/10.1037/1082-989X.11.4.402.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210.
Stout, W. F. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589–617.
Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensional assessment and ability estimation. Psychometrika, 55, 293–326.
Stout, W., Li, H., Nandakumar, R., & Bolt, D. (1997). MULTISIB—A procedure to investigate DIF when a test is intentionally multidimensional. Applied Psychological Measurement, 21, 195–213.
Strobl, C., Kopf, J., & Zeileis, A. (2015). Rasch trees: A new method for detecting differential item functioning in the Rasch model. Psychometrika, 80, 289–316. https://doi.org/10.1007/s11336-013-9388-3.
Suh, Y., & Cho, S.-J. (2014). Chi-square difference tests for detecting differential functioning in a multidimensional IRT model: A Monte Carlo study. Applied Psychological Measurement, 38(5), 359–375.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393–408. https://doi.org/10.1007/BF02294363.
Taple, B. J., Griffith, J. W., & Wolf, M. S. (2019). Interview administration of PROMIS depression and anxiety short forms. Health Literacy Research and Practice, 3, e196–e204. https://doi.org/10.3928/24748307-20190626-01.
Teresi, J. A. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44(Suppl. 11), S152–S170. https://doi.org/10.1097/01.mlr.0000245142.74628.ab.
Teresi, J. A. (2019). Applying and Acting on DIF. Moderator at the 2019 PROMIS Psychometric Summit, Northwestern University, Chicago, IL.
Teresi, J. A., & Jones, R. N. (2013). Bias in psychological assessment and other measures. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology: Vol. 1. Test theory and testing and assessment in industrial and organizational psychology (pp. 139–164). Washington, DC: American Psychological Association. https://doi.org/10.1037/14047-008.
Teresi, J. A., & Jones, R. N. (2016). Methodological issues in examining measurement equivalence in patient reported outcomes measures: Methods overview to the two-part series, “Measurement Equivalence of the Patient Reported Outcomes Measurement Information System (PROMIS) Short Form Measures”. Psychological Test and Assessment Modeling, 58(1), 37–78.
Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2000). Modern psychometric methods for detection of differential item functioning: Application to cognitive assessment measures. Statistics in Medicine, 19, 1651–1683.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Cook, K. F., Crane, P. K., Gibbons, L. E., et al. (2007). Evaluating measurement equivalence using the item response theory log-likelihood ratio (IRTLR) method to assess differential item functioning (DIF): Applications (with illustrations) to measures of physical functioning ability and general distress. Quality of Life Research, 16, 43–68. https://doi.org/10.1007/s11136-007-9186-4.
Teresi, J., Ocepek-Welikson, K., Kleinman, M., Eimicke, J. E., Crane, P. K., Jones, R. N., et al. (2009). Analysis of differential item functioning in the depression item bank from the Patient Reported Outcome Measurement Information System (PROMIS): An item response theory approach. Psychology Science Quarterly, 51(2), 148–180. PMCID: PMC2844669. NIHMSID: 136951.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Ramirez, M., & Kim, G. (2016a). Psychometric properties and performance of the Patient Reported Outcomes Measurement Information System® (PROMIS®) depression short forms in ethnically diverse groups. Psychological Test and Assessment Modeling, 58(1), 141–181.
Teresi, J. A., Ocepek-Welikson, K., Kleinman, M., Ramirez, M., & Kim, G. (2016b). Measurement equivalence of the Patient Reported Outcomes Measurement Information System® (PROMIS®) anxiety short forms in ethnically diverse groups. Psychological Test and Assessment Modeling, 58(1), 183–219.
Teresi, J. A., Ramirez, M., Jones, R. N., Choi, S., & Crane, P. K. (2012). Modifying measures based on Differential Item Functioning (DIF) impact analyses. Journal of Aging & Health, 24(6), 1044–1076. https://doi.org/10.1177/0898264312436877.
Teresi, J. A., & Reeve, B. B. (2016). Epilogue to the two-part series: Measurement equivalence of the Patient Reported Outcomes Measurement Information System (PROMIS) short forms. Psychological Tests and Assessment Modeling, 58(2), 423–433.
Thissen, D. (1991). MULTILOG™ user's guide: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software Inc.
Thissen, D. (2001). IRTLRDIF v.2.0b: Software for the computation of the statistics involved in item response theory likelihood ratio tests for differential item functioning. Unpublished manual from the L.L. Thurstone Psychometric Laboratory: University of North Carolina at Chapel Hill.
Thissen, D., Steinberg, L., & Kuang, D. (2002). Quick and easy implementation of the Benjamini–Hochberg procedure for controlling the false discovery rate in multiple comparisons. Journal of Educational and Behavioral Statistics, 27, 77–83. https://doi.org/10.3102/10769986027001077.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. Braun (Eds.), Test validity (pp. 147–169). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ: Lawrence Erlbaum Inc.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70. https://doi.org/10.1177/109442810031002.
Wainer, H. (1993). Model-based standardization measurement of an item’s differential impact. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning (pp. 123–135). Hillsdale NJ: Lawrence Erlbaum Inc.
Wang, T., Strobl, C., Zeileis, A., & Merkle, E. C. (2018). Score-based tests of differential item functioning via pairwise maximum likelihood estimation. Psychometrika, 83, 132–155. https://doi.org/10.1007/s11336-017-9591-8.
Wang, W. (2004). Effects of anchor item methods on detection of differential item functioning within the family of Rasch models. Journal of Experimental Education, 72, 221–261. https://doi.org/10.3200/JEXE.72.3.221-261.
Wang, W.-C., & Shih, C.-L. (2010). MIMIC methods for assessing differential item functioning in polytomous items. Applied Psychological Measurement, 34, 166–180. https://doi.org/10.1177/0146621609355279.
Wang, W.-C., Shih, C.-L., & Sun, G.-W. (2012). The DIF-free-then DIF strategy for the assessment of differential item functioning (DIF). Educational and Psychological Measurement, 72, 687–708. https://doi.org/10.1177/0013164411426157.
Wang, W.-C., Shih, C.-L., & Yang, C.-C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69, 713–731. https://doi.org/10.1177/0013164409332228.
Wang, W. C., & Yeh, Y. L. (2003). Effects of anchor item methods on differential item functioning detection with likelihood ratio test. Applied Psychological Measurement, 27, 479–498. https://doi.org/10.1177/0146621603259902.
Wang, M., & Woods, C. M. (2017). Anchor selection using the Wald test anchor-all-test-all procedure. Applied Psychological Measurement, 41, 17–29. https://doi.org/10.1177/0146621616668014.
Woods, C. M. (2009a). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33, 42–57. https://doi.org/10.1177/0146621607314044.
Woods, C. M. (2009b). Evaluation of MIMIC-model methods for DIF testing with comparison of two group analysis. Multivariate Behavioral Research, 44, 1–27. https://doi.org/10.1080/00273170802620121.
Woods, C. M. (2011). DIF testing for ordinal items with Poly-SIBTEST, the Mantel and GMH tests and IRTLRDIF when the latent distribution is nonnormal for both groups. Applied Psychological Measurement, 35, 145–164. https://doi.org/10.1177/0146621610377450.
Woods, C. M., Cai, L., & Wang, M. (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73, 532–547. https://doi.org/10.1177/0013164412464875.
Woods, C. M., & Grimm, K. J. (2011). Testing for nonuniform differential item functioning with multiple indicator multiple cause models. Applied Psychological Measurement, 35, 339–361. https://doi.org/10.1177/0146621611405984.
Woods, C. M., & Harpole, J. (2015). How item residual heterogeneity affects tests for differential item functioning. Applied Psychological Measurement, 39, 251–263. https://doi.org/10.1177/0146621614561313.
Yost, K. J., Eton, D. T., Garcia, S. F., & Cella, D. (2011). Minimally important differences were estimated for six PROMIS cancer scales in advanced-stage cancer patients. Journal of Clinical Epidemiology, 64(5), 507–516.
Yu, Q., Medeiros, K. L., Wu, X., & Jensen, R. E. (2018). Nonlinear predictive models for multiple mediation analysis with an application to explore ethnic disparities in anxiety and depression among cancer survivors. Psychometrika, 83, 991–1006.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved from http://www.educ.ubc.ca/faculty/zumbo/DIF/index.html.
Zwitser, R. J., Glaser, S. F., & Maris, G. (2017). Monitoring countries in a changing world: A new look at DIF in international surveys. Psychometrika, 82(1), 210–232. https://doi.org/10.1007/s11336-016-9543-8.
Funding
U01AR057971 (PIs: Potosky, Moinpour), NCI P30CA051008, UL1TR000101 (previously UL1RR031975) from the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through the Clinical and Translational Science Awards Program (CTSA). Analyses of these data were supported by the Mount Sinai Claude D. Pepper Older Americans Independence Center (National Institute on Aging, 1P30AG028741, Siu) and the Columbia University Alzheimer’s Disease Resource Center for Minority Aging Research (National Institute on Aging, 1P30AG059303, Manly, Luchsinger). This research was also supported by the Eunice Kennedy Shriver National Institutes of Child Health and Human Development of the National Institutes of Health under Award Number R01HD079439 to the Mayo Clinic in Rochester Minnesota through subcontracts to the University of Minnesota and the University of Washington. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors thank Katja Ocepek-Welikson, M.Phil., for analytic assistance and Ruoyi Zhu, a doctoral student in the College of Education, University of Washington for assistance in conducting the simulation study.
Teresi, J.A., Wang, C., Kleinman, M. et al. Differential Item Functioning Analyses of the Patient-Reported Outcomes Measurement Information System (PROMIS®) Measures: Methods, Challenges, Advances, and Future Directions. Psychometrika 86, 674–711 (2021). https://doi.org/10.1007/s11336-021-09775-0