Quality of Life Research, Volume 16, Supplement 1, pp 33–42

Differential item functioning and health assessment

Original Paper


Establishing measurement equivalence is important because inaccurate assessment may lead to incorrect estimates of effects in research and to suboptimal decisions at the individual, clinical level. Examination of differential item functioning (DIF) is a method for studying measurement equivalence. An item (i.e., one question in a longer scale) exhibits DIF if the item response differs across groups (e.g., gender, race) after controlling for an estimate of the construct being measured. Applications in health differ from those in other settings, such as educational and aptitude testing, in that there are many health-related constructs and multiple measures of each, few of which have received much critical evaluation. This article discusses several methods for detecting DIF, including non-parametric methods and parametric methods such as logistic regression and those based on item response theory. Basic definitions and criteria for DIF detection are provided, as are steps in performing the analyses. Recommendations are presented and future directions are discussed.
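The definition of DIF above (a group difference in item response that remains after conditioning on the construct) can be illustrated with a small sketch. The example below is not taken from the article's analyses; it assumes a binary item, uses the observed total score as the estimate of the construct, and computes the Mantel-Haenszel common odds ratio, one widely used non-parametric DIF index. The function name, the record layout, and the toy counts are all hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one binary item across score strata.

    records: iterable of (group, matching_score, response) where
    group is "reference" or "focal" and response is 0 or 1.
    A value near 1 suggests no uniform DIF; values far from 1
    suggest the item favors one group at equal score levels.
    """
    # One 2x2 table per stratum: [ref yes, ref no, focal yes, focal no]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, response in records:
        offset = 0 if group == "reference" else 2
        tables[score][offset + (0 if response == 1 else 1)] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined for these data")
    return numerator / denominator


def expand(group, score, n_yes, n_no):
    """Turn cell counts into individual records (toy-data helper)."""
    return [(group, score, 1)] * n_yes + [(group, score, 0)] * n_no


# Toy data (invented): endorsement rates match within each stratum.
no_dif = (expand("reference", 1, 10, 30) + expand("focal", 1, 5, 15)
          + expand("reference", 2, 20, 20) + expand("focal", 2, 10, 10))

# Toy data (invented): the reference group endorses more often
# than the focal group at the same score level.
with_dif = (expand("reference", 1, 20, 20) + expand("focal", 1, 10, 30)
            + expand("reference", 2, 30, 10) + expand("focal", 2, 20, 20))

print(mantel_haenszel_odds_ratio(no_dif))    # 1.0
print(mantel_haenszel_odds_ratio(with_dif))  # 3.0
```

Stratifying on the matching score is what separates DIF from a true group difference in the construct: the two groups may differ overall, but the odds ratio is computed only among respondents at the same score level. In practice the matching score is usually purified and a significance test and effect-size criterion accompany the odds ratio; those steps are omitted here.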


Keywords: Differential item functioning; Measurement equivalence; Health



Copyright information

© Springer Science+Business Media B.V. 2007

Authors and Affiliations

  1. Research Division, Hebrew Home for the Aged at Riverdale, Riverdale, USA
  2. Center for Financing, Access and Cost Trends, Agency for Healthcare Research and Quality, Rockville, USA
  3. Columbia University Stroud Center and Faculty of Medicine, New York State Psychiatric Institute, New York, USA
