
Evaluating measurement equivalence using the item response theory log-likelihood ratio (IRTLR) method to assess differential item functioning (DIF): applications (with illustrations) to measures of physical functioning ability and general distress

  • Original Paper
  • Quality of Life Research

Abstract

Background: Methods based on item response theory (IRT) that can be used to examine differential item functioning (DIF) are illustrated. An IRT-based approach to the detection of DIF was applied to physical function and general distress item sets. DIF was examined with respect to gender, age and race. The method used for DIF detection was the item response theory log-likelihood ratio (IRTLR) approach. DIF magnitude was measured using differences in the expected item scores, expressed as unsigned probability differences and calculated using the non-compensatory DIF index (NCDIF). Finally, impact was assessed using expected scale scores, expressed as group differences in the total test (measure) response functions.

Methods: The example for the illustration of the methods came from a study of 1,714 patients with cancer or HIV/AIDS. The measure contained 23 items measuring physical functioning ability and 15 items addressing general distress, scored in the positive direction.

Results: The substantive findings were of relatively small magnitude DIF. In total, six physical function items showed relatively larger magnitude DIF (expected item score differences greater than the cutoff) across the three comparisons: “trouble with a long walk” (race), “vigorous activities” (race, age), “bending, kneeling, stooping” (age), “lifting or carrying groceries” (race), “limited in hobbies, leisure” (age), and “lack of energy” (race). None of the general distress items evidenced high magnitude DIF, although “worrying about dying” showed some DIF with respect to both age and race after adjustment.

Conclusions: The fact that many physical function items showed DIF with respect to age, even after adjustment for multiple comparisons, indicates that the instrument may be performing differently for these groups. While the magnitude and impact of DIF at the item and scale level were minimal, caution should be exercised in the use of subsets of these items, as might occur with selection for clinical decisions or computerized adaptive testing. The selection of anchor items and the criteria for DIF detection, including the integration of significance and magnitude measures, remain issues requiring investigation. Further research is needed on the criteria and guidelines appropriate for DIF detection in the context of health-related items.



References

  1. Crane, P. K., Gibbons, L. E., Ocepek-Welikson, K., Cook, K., Cella, D., Narasimhalu, K., Hays, R., & Teresi, J. (2007). A comparison of two sets of criteria for determining the presence of differential item functioning using ordinal logistic regression (this issue).

  2. Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.


  3. Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense. Retrieved from http://www.educ.ubc.ca/faculty/zumbo/DIF/index.html.

  4. Crane, P. K., van Belle, G., & Larson, E. B. (2004). Test bias in a cognitive test: Differential item functioning in the CASI. Statistics in Medicine, 23, 241–256.


  5. Teresi, J. A., Stewart, A. L., Morales, L., & Stahl, S. (2006). Measurement in a multi-ethnic society. Special Issue of Medical Care, 44(Suppl. 3), S1–S210.


  6. Teresi, J. A. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44, S152–S170.


  7. Cole, S. R., Kawachi, I., Maller, S. J., Munoz, R. F., & Berkman, L. F. (2000). Test of item-response bias in the CES-D scale: Experiences from the New Haven EPESE study. Journal of Clinical Epidemiology, 53, 285–289.


  8. Gallo, J. J., Cooper-Patrick, L., & Lesikar, S. (1998). Depressive symptoms of Whites and African-Americans aged 60 years and older. Journal of Gerontology, 53B, 277–285.


  9. Mui, A. C., Burnette, D., & Chen, L. M. (2001). Cross-cultural assessment of geriatric depression: A review of the CES-D and GDS. Journal of Mental Health and Aging, 7, 137–164.


  10. Fleishman, J. A., & Lawrence, W. F. (2003). Demographic variation in SF-12 scores: True differences or differential item functioning? Medical Care, 41(Suppl), 75–86.


  11. Gelin, M. N., Carleton, B. C., Smith, M. A., & Zumbo, B. D. (2004). The dimensionality and gender differential item functioning of the Mini-Asthma Quality-of-Life Questionnaire (MINIAQLQ). Social Indicators Research, 68, 81.


  12. Roorda, L. D., Roebroeck, M. E., van Tilburg, T., Lankhorst, G. J., Bouter, L. M., & Measuring Mobility Study Group (2004). Measuring activity limitations in climbing stairs: Development of a hierarchical scale for patients with lower-extremity disorders living at home. Archives of Physical Medicine and Rehabilitation, 85, 967–971.


  13. Thissen, D. (1991). MULTILOG™ user’s guide: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software Inc.


  14. Thissen, D. (2001). IRTLRDIF v2.0b; Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. Available on Dave Thissen’s web page.

  15. Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, California: Sage Publications.


  16. Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353–368.


  17. Flowers, C. P., Oshima, T. C., & Raju, N. S. (1999). A description and demonstration of the polytomous DFIT framework. Applied Psychological Measurement, 23, 309–326.


  18. Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Lawrence Erlbaum Associates.


  19. Dorans, N. J., & Kulick, E. (2006). Differential item functioning on the MMSE: An application of the Mantel-Haenszel and standardization procedures. Medical Care, 44 (Suppl. 3), S107–S114.


  20. Simpson, E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society (Series B), 13, 238–241.


  21. Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.


  22. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications Inc.


  23. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Lawrence Erlbaum.


  24. Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley Publishing Co.


  25. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17. Richmond, VA: William Byrd Press.

  26. Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, 3–62.


  27. Williams, V. S. L., Jones, L. V., & Tukey, J. W. (1999). Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. Journal of Educational and Behavioral Statistics, 24, 42–69.


  28. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.


  29. Steinberg, L. (2001). The consequences of pairing questions: Context effects in personality measurement. Journal of Personality and Social Psychology, 81, 332–342.


  30. Thissen, D., Steinberg, L., & Kuang, D. (2002). Quick and easy implementation of the Benjamini-Hochberg procedure for controlling the false discovery rate in multiple comparisons. Journal of Educational and Behavioral Statistics, 27, 77–83.


  31. Orlando-Edelen, M., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006). Identification of differential item functioning using item response theory and the likelihood-based model comparison approach: Application to the Mini-mental status examination. Medical Care, 44, S134–S142.


  32. Wang, W. C., Yeh, Y. L., & Yi, C. (2003). Effects of anchor item methods on differential item functioning detection with likelihood ratio test. Applied Psychological Measurement, 27, 479–498.


  33. Orlando, M., & Marshall, G. N. (2002). Differential item functioning in a Spanish Translation of the PTSD Checklist: Detection and evaluation of impact. Psychological Assessment, 14, 50–59.


  34. Teresi, J., Kleinman, M., & Ocepek-Welikson, K. (2000). Modern psychometric methods for detection of differential item functioning: Application to cognitive assessment measures. Statistics in Medicine, 19, 1651–1683.


  35. Chang, H. -H., & Mazzeo, J. (1994). The unique correspondence of the item response function and item category response functions in polytomously scored item response models. Psychometrika, 39, 391–404.


  36. Fleer, P. F. (1993). A Monte Carlo assessment of a new measure of item and test bias. [dissertation] Illinois Institute of Technology. Dissertation Abstracts International 54-04B, 2266.

  37. Flowers, C. P., Oshima, T. C., & Raju, N. S. (1995). A Monte Carlo assessment of DFIT with dichotomously-scored unidimensional tests. [dissertation] Atlanta, GA: Georgia State University.

  38. Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 123–135). Hillsdale, NJ: Lawrence Erlbaum Inc.


  39. Raju, N. S. (1999). DFITP5: A Fortran program for calculating dichotomous DIF/DTF [computer program]. Chicago: Illinois Institute of Technology.


  40. Collins, W. C., Raju, N. S., & Edwards, J. E. (2000). Assessing differential item functioning in a satisfaction scale. Journal of Applied Psychology, 85, 451–461.


  41. Morales, L. S., Flowers, C., Gutiérrez, P., Kleinman, M., & Teresi, J. A. (2006). Item and scale differential functioning of the Mini-Mental Status Exam assessed using the DFIT methodology. Medical Care, 44, S143–S151.


  42. Baker, F. B. (1995). EQUATE 2.1: Computer program for equating two metrics in item response theory [Computer program]. Madison: University of Wisconsin, Laboratory of Experimental Design.

  43. Cohen, A. S., Kim, S.-H., & Baker, F. B. (1993). Detection of differential item functioning in the graded response model. Applied Psychological Measurement, 17, 335–350.


  44. Oshima, T. C., Raju, N. S., & Nanda, A. O. (2006). A new method for assessing the statistical significance in the differential functioning of items and tests (DFIT) framework. Journal of Educational Measurement, 43, 1–17.



Acknowledgements

The authors thank Douglas Holmes, Ph.D. for his review of several versions of this manuscript. The authors also thank three anonymous reviewers and the editor for their helpful comments related to an earlier version of this manuscript. These analyses were conducted on behalf of the Statistical Coordinating Center to the Patient Reported Outcomes Information System (PROMIS) (AR052177). Funding for analyses was provided in part by the National Institute on Aging, Resource Center for Minority Aging Research at Columbia University (AG15294), and by the National Cancer Institute through the Veteran’s Administration Measurement Excellence and Training Resource Information Center (METRIC). An earlier version of this paper was presented at the National Institutes of Health Conference on Patient Reported Outcomes, Bethesda, June, 2004.

Author information


Corresponding author

Correspondence to Jeanne A. Teresi.

Appendix

Illustration: Calculation of Boundary and Category Response Functions, Expected Item and Scale Scores and Non-compensatory Differential Item Functioning (DIF) Indices: Polytomous items with ordinal response categories

This appendix is an illustration of the calculation of several indices used in determining the presence, magnitude and impact of DIF using item response theory. Illustrations can also be found in Collins et al. [40]; Orlando-Edelen et al. [31]; Thissen et al. [38]; and Thissen [13, 14].

Boundary Response Functions: The Samejima [25] graded response model, which assumes ordinal categories, can be used to model polytomous items (see also Cohen et al. [43]). The model is based on the calculation of a series of cumulative dichotomies, resulting in cumulative probabilities of responding in a given category or higher. One models the probability that a randomly selected individual with a specific level of physical functioning will respond in category k or higher. The boundary response function defines the cumulative probability of scoring in category k or higher:

$$ P_{ik}(\theta) = 1/\{1+\exp[-a_{i}(\theta - b_{ik})]\}. $$

For an item with k categories there are k−1 such dichotomies. For a three-category item, scored 0, 1, 2, the first cumulative dichotomy is between people who selected response 0 vs those who selected 1 or 2; the second cumulative dichotomy is between those who selected 0 or 1 vs those who selected 2. To illustrate using the example shown in Fig. 2, calculations are presented for θ = −1.0.

The probability of responding in category 0 or higher is 1 (because everyone scores 0 or higher).

For category 1: P(x = 1 or higher) = 1/{1 + exp[−a(θ − b1)]}

For Whites: P(x = 1 or higher) =  1/{1 + exp[−3.53((−1)−(−.90))]} =  .4127

For African-Americans: P(x = 1 or higher) =  1/{1 + exp[−2.64((−1)−(−1.19))]} =  .6228

For category 2 (there is no higher category):

P(x = 2 or higher) = 1/{1 + exp[−a(θ − b2)]}

For Whites: P(x = 2 or higher) =  1/{1 + exp[−3.53((−1)−(−.07))]} =  .0360

For African-Americans: P(x = 2 or higher) =  1/{1 + exp[−2.64((−1)−(−.01))]} =  .0683
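For readers who wish to reproduce these boundary probabilities, a minimal Python sketch of the calculation is given below, using the parameter values quoted above (a = 3.53, b = −.90, −.07 for Whites; a = 2.64, b = −1.19, −.01 for African-Americans). The function name boundary_prob is illustrative rather than taken from any software used in the paper, and small rounding differences from the hand calculations are expected.

```python
import math

def boundary_prob(theta, a, b):
    # Graded response boundary function: cumulative probability
    # of responding in category k or higher.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

theta = -1.0
groups = {
    "Whites": (3.53, [-0.90, -0.07]),
    "African-Americans": (2.64, [-1.19, -0.01]),
}
for label, (a, bs) in groups.items():
    cum = [round(boundary_prob(theta, a, b), 4) for b in bs]
    print(label, cum)
# Whites [0.4127, 0.0362]
# African-Americans [0.6228, 0.0683]
```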

Category response functions: The above formula does not give the probability of responding in a specific category; to obtain this probability, the adjacent cumulative probability is subtracted. Boundary response functions are usually used because they have a consistent form across levels; category response functions do not, and are more difficult to compare and interpret. Note also that when an item is binary, the category response function for the higher category (coded 1 for an item coded 0, 1) is the same as the item response (boundary response) function.

To obtain the category response probability Pik(θ) = P(x = k), subtract the boundary (cumulative) probability of the next higher category from that of category k: Pik(θ) = P*ik(θ) − P*i(k+1)(θ), where P*ik(θ) denotes the boundary probability of scoring in category k or higher and P*i0(θ) = 1.

For k = 0   Pi0(θ) = P*i0(θ) − P*i1(θ) = 1 − P*i1(θ)

For k = 1   Pi1(θ) = P*i1(θ) − P*i2(θ)

For k = 2   Pi2(θ) = P*i2(θ) − 0

For θ = −1.0:

For Whites: Pi0 =  1 − .4127 =  .5873

For African-Americans: Pi0 =  1 − .6228 =  .3772

For Whites: Pi1(θ = −1.0) = .4127 − .0360 =  .3767

For African-Americans: Pi1(θ = −1.0) = .6228 − .0683 =  .5545

For Whites: Pi2(θ = −1.0) = .0360

For African-Americans: Pi2(θ = −1.0) = .0683
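The subtraction step can be sketched in Python as follows, using the boundary probabilities just computed; category_probs is a hypothetical helper name, and the cumulative list is padded with 1 at the bottom (everyone scores 0 or higher) and 0 at the top (there is no category above the highest).

```python
def category_probs(cum):
    # cum holds the boundary probabilities P(x >= 1), P(x >= 2), ...;
    # P(x >= 0) is 1 by definition and the probability above the top category is 0.
    full = [1.0] + list(cum) + [0.0]
    return [full[k] - full[k + 1] for k in range(len(full) - 1)]

print([round(p, 4) for p in category_probs([0.4127, 0.0360])])  # Whites: [0.5873, 0.3767, 0.036]
print([round(p, 4) for p in category_probs([0.6228, 0.0683])])  # African-Americans: [0.3772, 0.5545, 0.0683]
```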

The shapes of the CRFs are not the same, because they are no longer cumulative. A person with a specific theta level will have a separate probability of response for each response category. This category response function provides the probability that a randomly selected individual at, say, θ = 0 (average ability) will respond in category k.

Computing expected item and test scores: Expected item and test (scale) scores can be computed for both dichotomous and polytomous items. Lord and Novick, and Birnbaum [24, p. 386], introduced the notion of true scores in the context of IRT. A person’s true score is their expected score, expressed in terms of probabilities for binary items and in terms of weighted probabilities for polytomous items. The test characteristic curve described in Lord and Novick relates the true score, or averaged expected test score, to theta.

For a dichotomous item scored 0 and 1, the expected or true score is simply the probability of scoring in the ‘1’ category, given an individual’s estimated ability: Pi(θs), where Pi(θs) is the probability of scoring a ‘1’ on item i for subject s.

$$ P_{i}(\theta_{s}) = 1/\{1+\exp[-a_{i}(\theta_{s}-b_{i})]\}. $$

(Here it is assumed that c (guessing) parameters are estimated at 0).

For a polytomous item, taking a graded response form, the expected score is the sum of the weighted probabilities of scoring in each of the possible categories for the item. For an item with 5 response categories, coded 0 to 4, for example, this sum would be:

$$ 0 \ast P_{i0}(\theta_{s}) + 1 \ast P_{i1}(\theta_{s}) + 2 \ast P_{i2}(\theta_{s}) + 3 \ast P_{i3}(\theta_{s}) + 4 \ast P_{i4}(\theta_{s}) $$

For a graded response item with k categories, there are k−1 estimated b parameters, so in the example above four boundary probabilities will be computed.

Recall that in the graded response model, the boundary response function defines the cumulative probability of scoring in category k or higher:

$$ P_{ik}(\theta_{s}) = 1/\{1+\exp[-a_{i}(\theta_{s}-b_{ik})]\}. $$

Thus, the probability of scoring above category k must be subtracted out in order to obtain the probability of scoring in the category. (This was illustrated in the previous section.)

Note that Pi1(θs) = 1/{1 + exp[−ai(θs − bi1)]} − 1/{1 + exp[−ai(θs − bi2)]}.

As an example, for the item above (with categories coded 0, 1, 2), the expected or true score at a given θ for a member of the African-American group is the sum of the weighted probabilities of scoring in each of the possible categories:

$$ 0 \ast P_{i0}(\theta) + 1 \ast P_{i1}(\theta) + 2 \ast P_{i2}(\theta) $$

A true or expected score for an individual of mild disability (θ = −1.0) as a member of the white group would be:

$$ 0(.5873) + 1(.3767) + 2(.0360) = .4487 $$

A true or expected score for an individual of mild disability (θ = −1.0) as a member of the African-American group would be:

$$ 0 + 1 (.5545) + 2 (.0683) = .6905 $$

The expected test score for a subject with estimated ability θ is simply the sum of the expected item scores for that individual. Plots of expected scale scores against theta can then be constructed. Individual expected scores are used in the calculation of the magnitude and impact indices discussed below.
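The expected item score calculation can be sketched in Python as below; expected_item_score is an illustrative name, the parameters are those of the worked example, and the results differ from the hand-rounded values above only in the last decimal place. The expected test score is then simply the sum of such expected item scores over the items in the scale.

```python
import math

def expected_item_score(theta, a, bs):
    # Expected (true) item score under the graded response model:
    # the category-weighted sum  sum_k k * P_ik(theta).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return sum(k * (cum[k] - cum[k + 1]) for k in range(len(cum) - 1))

print(round(expected_item_score(-1.0, 3.53, [-0.90, -0.07]), 4))  # 0.4488, Whites
print(round(expected_item_score(-1.0, 2.64, [-1.19, -0.01]), 4))  # 0.6911, African-Americans
```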

Computing Non-Compensatory Differential Item Functioning (NCDIF): Two measures developed by Raju and colleagues [16] are based on IRT (see also Flowers et al. [17]). These measures are compensatory and non-compensatory DIF, or CDIF and NCDIF. NCDIF is more like DIF indices such as the area statistics and Lord’s chi-square, in that all other items in the test are assumed to be unbiased except for the studied item; CDIF does not make this assumption. The advantage of these measures over Lord’s chi-square and the area statistics, such as Raju’s signed and unsigned area statistics, is that they are based on the actual distribution of the ability estimates within the group for which bias is to be estimated, rather than on the entire theoretical range of theta. If, for instance, most members of the focal group fall within the range of theta from −1 to 0, rather than between −1 and +1 on the continuum, the area statistics will give an inaccurate estimate of DIF.

NCDIF is computed exactly like the unsigned probability difference of Camilli and Shepard [15]. For each subject in the focal group, two expected scores are computed: one based on the subject’s ability estimate and the estimated a, b and c parameters for the focal group, and the other based on the same ability estimate and the estimated a, b and c parameters for the reference group. Each subject’s difference score (d) is squared, and these squared differences are averaged over all subjects (s = 1, …, n) to obtain NCDIF.

$$ \mathrm{NCDIF}_{i} = \frac{1}{n}\sum_{s=1}^{n} (\mathrm{ES}_{siF} - \mathrm{ES}_{siR})^{2} $$

As an example, NCDIF for item i is the average squared difference between the true or expected scores for an individual (s) as a member of the focal group (F) and as a member of the reference group (R). Using the example shown above for a person at θ = −1.0,

$$ d = (.6905 - .4487)^{2} = .0585 $$

This quantity is then summed across people and averaged (∑d/n). This value is the NCDIF, which (as mentioned above) is also the unsigned probability difference (UPD) illustrated by Camilli and Shepard [15]. NCDIF is thus the average squared difference between the true (expected) scores under the two sets of group parameters, and provides a measure of DIF magnitude. New methods for determining NCDIF cutoffs for binary items have recently been described [44].
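Putting the pieces together, NCDIF for an item can be sketched as the mean squared difference in expected item scores computed under the two sets of item parameters. In the sketch below, ncdif and expected_item_score are illustrative names; with only the single θ = −1.0 case from the worked example, the result is simply that person’s squared difference, close to the .0585 above (the small gap reflects rounding of the intermediate values).

```python
import math

def expected_item_score(theta, a, bs):
    # Graded response expected item score (same calculation as in the sketch above).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return sum(k * (cum[k] - cum[k + 1]) for k in range(len(cum) - 1))

def ncdif(focal_thetas, focal_params, reference_params):
    # Average, over focal-group members, of the squared difference between the
    # expected item score under focal-group and reference-group item parameters.
    a_f, b_f = focal_params
    a_r, b_r = reference_params
    d2 = [(expected_item_score(t, a_f, b_f) - expected_item_score(t, a_r, b_r)) ** 2
          for t in focal_thetas]
    return sum(d2) / len(d2)

# African-Americans as the focal group and Whites as the reference group,
# using the single worked-example ability estimate theta = -1.0.
print(round(ncdif([-1.0], (2.64, [-1.19, -0.01]), (3.53, [-0.90, -0.07])), 4))  # 0.0587
```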

Cite this article

Teresi, J.A., Ocepek-Welikson, K., Kleinman, M. et al. Evaluating measurement equivalence using the item response theory log-likelihood ratio (IRTLR) method to assess differential item functioning (DIF): applications (with illustrations) to measures of physical functioning ability and general distress. Qual Life Res 16 (Suppl 1), 43–68 (2007). https://doi.org/10.1007/s11136-007-9186-4

