
Differential item functioning was negligible in an adaptive test of functional status for patients with knee impairments who spoke English or Hebrew


Abstract

Objective

We examined the presence and impact of differential item functioning (DIF) in a set of knee-specific functional status (FS) items administered using computerized adaptive testing (CAT) among English (United States) and Hebrew (Israel) speaking patients receiving therapy for knee impairments. DIF is present in an item if the probabilities of endorsing its responses differ across groups after controlling for the level of FS being measured.

Methods

We analyzed data from 28,320 patients (14,160 U.S., 14,160 Israel) who completed the knee-specific CAT. Items were assessed for DIF by gender, age, symptom acuity, surgical history, exercise history, and language spoken using a hybrid technique that combines multiple ordinal logistic regression and item response theory FS estimates.

Results

Several items had non-uniform DIF for covariates including language, but unadjusted and DIF-adjusted functional status estimates were in strong concordance [ICC(2,1) values ≥0.97], and differences between unadjusted and adjusted FS scores represented <0.4% of the unadjusted FS standard deviation.

Conclusions

Statistically significant DIF was identified in some items but represented negligible clinical impact. Results suggested no need to adjust items for DIF when assessing FS outcomes across groups of patients with knee impairments who answer the knee CAT items in English in the United States or Hebrew in Israel. These findings suggest negligible differences in cultural perceptions between English and Hebrew wording of these knee-specific CAT FS items.


Abbreviations

CAT: Computerized adaptive testing
CI: Confidence interval
df: Degrees of freedom
DIF: Differential item functioning
FOTO: Focus On Therapeutic Outcomes, Inc.
FS: Functional status
GRM: Graded response model
ICC: Intraclass correlation coefficient
IER: Item exposure rate
IRT: Item response theory
P: Probability
PRO: Patient-reported outcomes
RSM: Rating scale model
SD: Standard deviation
SE: Standard error
T: t-test statistic

References

  1. American Physical Therapy Association. (2001). Guide to physical therapist practice. Physical Therapy, 81(1), 1–768.

  2. Swinkels, I. C., Hart, D. L., Deutscher, D., van den Bosch, W. J., Dekker, J., de Bakker, D. H., et al. (2008). Comparing patient characteristics and treatment processes in patients receiving physical therapy in the United States, Israel and the Netherlands. Cross sectional analyses of data from three clinical databases. BMC Health Services Research, 8(1), 163.

  3. Carter, S. K., & Rizzo, J. A. (2007). Use of outpatient physical therapy services by people with musculoskeletal conditions. Physical Therapy, 87(5), 497–512.

  4. MedPAC. (2006). Toward better value in purchasing outpatient therapy services (Chap. 6). Report to the Congress: Increasing the value of Medicare. Washington, DC: Medicare Payment Advisory Commission, pp. 117–141.

  5. World Confederation for Physical Therapy. (2007). Declarations of Principle and Position Statements. Available at: www.wcpt.org/policies/principles/index.php. Accessed August 11, 2008.

  6. Hahn, E. A., Holzner, B., Kemmler, G., Sperner-Unterweger, B., Hudgens, S. A., & Cella, D. (2005). Cross-cultural evaluation of health status using item response theory: FACT-B comparisons between Austrian and U.S. patients with breast cancer. Evaluation & the Health Professions, 28(2), 233–259.

  7. Tennant, A., Penta, M., Tesio, L., Grimby, G., Thonnard, J. L., Slade, A., et al. (2004). Assessing and adjusting for cross-cultural validity of impairment and activity limitation scales through differential item functioning within the framework of the Rasch model: The PRO-ESOR project. Medical Care, 42(1 Suppl), I37–I48.

  8. Petersen, M. A., Groenvold, M., Bjorner, J. B., Aaronson, N., Conroy, T., Cull, A., et al. (2003). Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Quality of Life Research, 12(4), 373–385.

  9. Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the 21st century. Medical Care, 38(9 Suppl), II28–II42.

  10. Patrick, D. L., & Chiang, Y. P. (2000). Convening health outcomes methodologists. Medical Care, 38(9 Suppl), II3–II6.

  11. van der Linden, W., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York, NY: Springer.

  12. Hart, D. L., Mioduski, J. E., & Stratford, P. W. (2005). Simulated computerized adaptive tests for measuring functional status were efficient with good discriminant validity in patients with hip, knee, or foot/ankle impairments. Journal of Clinical Epidemiology, 58(6), 629–638.

  13. Hart, D. L., Wang, Y. C., Stratford, P. W., & Mioduski, J. E. (2008). Computerized adaptive test for patients with knee impairments produced valid and responsive measures of function. Journal of Clinical Epidemiology, 61(11), 1113–1124.

  14. Wainer, H. (Ed.). (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

  15. Hart, D. L., Wang, Y. C., Stratford, P. W., & Mioduski, J. E. (2008). Computerized adaptive test for patients with foot or ankle impairments produced valid and responsive measures of function. Quality of Life Research, 17(8), 1081–1091.

  16. Jette, A. M., Haley, S. M., Tao, W., Ni, P., Moed, R., Meyers, D., et al. (2007). Prospective evaluation of the AM-PAC-CAT in outpatient rehabilitation settings. Physical Therapy, 87(4), 385–398.

  17. Rose, M., Bjorner, J. B., Becker, J., Fries, J. F., & Ware, J. E. (2008). Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS). Journal of Clinical Epidemiology, 61(1), 17–33.

  18. Steinberg, L., Thissen, D., & Wainer, H. (2000). Validity. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 185–229). Mahwah, NJ: Lawrence Erlbaum Associates.

  19. Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

  20. Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 287–334.

  21. Custers, J. W., Hoijtink, H., van der Net, J., & Helders, P. J. (2000). Cultural differences in functional status measurement: Analyses of person fit according to the Rasch model. Quality of Life Research, 9(5), 571–578.

  22. Lundgren-Nilsson, A., Grimby, G., Ring, H., Tesio, L., Lawton, G., Slade, A., et al. (2005). Cross-cultural validity of functional independence measure items in stroke: A study using Rasch analysis. Journal of Rehabilitation Medicine, 37(1), 23–31.

  23. Swinkels, I. C. S., van den Ende, C. H. M., de Bakker, D., van der Wees, J., Hart, D. L., Deutscher, D., et al. (2007). Clinical databases in physical therapy. Physiotherapy Theory and Practice, 23(3), 153–167.

  24. Hart, D. L., & Connolly, J. B. (2006). Pay-for-performance for physical therapy and occupational therapy: Medicare Part B services. Grant #18-P-93066/9-01. Health & Human Services/Centers for Medicare & Medicaid Services.

  25. Deutscher, D., Hart, D. L., Dickstein, R., Horn, S. D., & Gutvirtz, M. (2008). Implementing an integrated electronic outcomes and electronic health record process to create a foundation for clinical practice improvement. Physical Therapy, 88(2), 270–285.

  26. Binkley, J. M., Stratford, P. W., Lott, S. A., & Riddle, D. L. (1999). The Lower Extremity Functional Scale (LEFS): Scale development, measurement properties, and clinical application. North American Orthopaedic Rehabilitation Research Network. Physical Therapy, 79(4), 371–383.

  27. Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573.

  28. Thissen, D., & Mislevy, R. J. (2000). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 101–134). Mahwah, NJ: Lawrence Erlbaum Associates.

  29. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

  30. Linacre, J. M. (1998). Estimating measures with known polytomous item difficulties. Rasch Measurement Transactions, 12(2), 638.

  31. Sands, W. A., Waters, B. K., & McBride, J. R. (Eds.). (1997). Computerized adaptive testing. From inquiry to operation. Washington, D.C.: American Psychological Association.

  32. World Health Organization. (2001). International classification of functioning, disability and health. Geneva: World Health Organization.

  33. Lewin-Epstein, N., Sagiv-Schifter, T., Shabtai, E. L., & Shmueli, A. (1998). Validation of the 36-item short-form Health Survey (Hebrew version) in the adult population of Israel. Medical Care, 36(9), 1361–1370.

  34. Samejima, F. (1969). Estimation of ability using a response pattern of graded responses. Psychometrika, Monograph 17.

  35. Linacre, J. M. (2008). A user’s guide to WINSTEPS. Chicago, IL: MESA.

  36. Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model. Mahwah, NJ: Lawrence Erlbaum Associates.

  37. Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

  38. Bjorner, J. B., Kreiner, S., Ware, J. E., Damsgaard, M. T., & Bech, P. (1998). Differential item functioning in the Danish translation of the SF-36. Journal of Clinical Epidemiology, 51(11), 1189–1202.

  39. Dodd, B. G., Koch, W. R., & De Ayala, R. J. (1989). Operational characteristics of adaptive testing procedures using the Graded Response Model. Applied Psychological Measurement, 13(2), 129–143.

  40. Fliege, H., Becker, J., Walter, O. B., Bjorner, J. B., Klapp, B. F., & Rose, M. (2005). Development of a computer-adaptive test for depression (D-CAT). Quality of Life Research, 14(10), 2277–2291.

  41. Deutscher, D., Horn, S. D., Dickstein, R., Hart, D. L., Smout, R. J., Gutvirtz, M., et al. (2009). Associations between treatment processes, patient characteristics and outcomes in outpatient physical therapy practice. Archives of Physical Medicine and Rehabilitation. (in press).

  42. Crane, P. K., Cetin, K., Cook, K. F., Johnson, K., Deyo, R., & Amtmann, D. (2007). Differential item functioning impact in a modified version of the Roland-Morris Disability Questionnaire. Quality of Life Research, 16(6), 981–990.

  43. Crane, P. K., Gibbons, L. E., Jolley, L., & van Belle, G. (2006). Differential item functioning analysis with ordinal logistic regression techniques DIFdetect and difwithpar. Medical Care, 44(11 Suppl 3), S115–S123.

  44. Crane, P. K., Gibbons, L. E., Narasimhalu, K., Lai, J. S., & Cella, D. (2007). Rapid detection of differential item functioning in assessments of health-related quality of life: The functional assessment of cancer therapy. Quality of Life Research, 16(1), 101–114.

  45. Crane, P. K., Gibbons, L. E., Ocepek-Welikson, K., Cook, K., Cella, D., Narasimhalu, K., et al. (2007). A comparison of three sets of criteria for determining the presence of differential item functioning using ordinal logistic regression. Quality of Life Research, 16(Suppl 1), 69–84.

  46. Crane, P. K., Hart, D. L., Gibbons, L. E., & Cook, K. F. (2006). A 37-item shoulder functional status item pool had negligible differential item functioning. Journal of Clinical Epidemiology, 59(5), 478–484.

  47. Crane, P. K., van Belle, G., & Larson, E. B. (2004). Test bias in a cognitive test: differential item functioning in the CASI. Statistics in Medicine, 23(2), 241–256.

  48. PARSCALE for Windows. (2003). v 4.1 Lincolnwood, IL: Scientific Software International.

  49. Stata Statistical Software. (2007). Release 9.2. College Station, TX: StataCorp LP.

  50. Hart, D. L., Mioduski, J. E., Werneke, M. W., & Stratford, P. W. (2006). Simulated computerized adaptive test for patients with lumbar spine impairments was efficient and produced valid measures of function. Journal of Clinical Epidemiology, 59(9), 947–956.

  51. Hart, D. L., Werneke, M. W., George, S. Z., Matheson, J. W., Wang, Y. C., Cook, K. F., et al. (2009) Screening for elevated levels of fear-avoidance beliefs regarding work or physical activities in patients receiving outpatient therapy. Physical Therapy, 89(8), 770–785.

  52. Wang, Y. C., Hart, D. L., Stratford, P. W., & Mioduski, J. E. (2009). Clinical interpretation of computerized adaptive test generated outcomes measures in patients with knee impairments. Archives of Physical Medicine and Rehabilitation. (in press).

  53. Jaeschke, R., Singer, J., & Guyatt, G. H. (1989). Measurement of health status. Ascertaining the minimal clinically important difference. Controlled Clinical Trials, 10(4), 407–415.

  54. Cook, K. F., Choi, S. W., Crane, P. K., Deyo, R. A., Johnson, K. L., & Amtmann, D. (2008). Letting the CAT out of the bag: comparing computer adaptive tests and an 11-item short form of the Roland–Morris Disability Questionnaire. Spine, 33(12), 1378–1383.

  55. Elhan, A. H., Oztuna, D., Sutlay, S., Kucukdeveci, A., & Tennant, A. (2008). An initial application of computerized adaptive testing (CAT) for measuring disability in patients with low back pain. BMC Musculoskeletal Disorders, 9, 166.

  56. Nandakumar, R., & Roussos, L. (2001). CATSIB: A modified SIBTEST procedure to detect differential item functioning in computerized adaptive tests. Report no. LSAC-R-97-11. Princeton, NJ: Law School Admission Council.

  57. Muthén, L. K., & Muthén, B. O. (2006). Mplus user’s guide (4th ed.). Los Angeles, CA: Muthén & Muthén.

  58. Hambleton, R. K. (2006). Good practices for identifying differential item functioning. Medical Care, 44(11 Suppl 3), S182–S188.

  59. Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114(3), 552–566.

  60. Lei, P. W., Chen, S. Y., & Yu, L. (2006). Comparing methods of assessing differential item functioning in a computerized adaptive testing environment. Journal of Educational Measurement, 43(3), 245–264.

  61. Maldonado, G., & Greenland, S. (1993). Simulation study of confounder-selection strategies. American Journal of Epidemiology, 138(11), 923–936.

Acknowledgments

The authors would like to thank Alan Tennant, BA, Ph.D. for his insightful comments regarding early statistical analyses and results.

Author information

Corresponding author

Correspondence to Dennis L. Hart.

Appendices

Appendix 1

Primer for using difwithpar: detailed DIF detection methods

Crane et al. [43, 47] have developed an approach to DIF assessment that combines ordinal logistic regression and IRT. Details of this approach are outlined in earlier publications.

Here, we use IRT scores initially to evaluate items for DIF. We examine three ordinal logistic regression models for each item for each demographic category (labeled here as “group”) selected for analysis:

$$ {\text{Logit}}\,P(Y = 1\mid\theta ,\,{\text{group}}) = \beta_{1}\theta + \beta_{2}\,{\text{group}} + \beta_{3}\,\theta \times {\text{group}}\quad ({\text{model 1}}) $$
$$ {\text{Logit}}\,P(Y = 1\mid\theta ,\,{\text{group}}) = \beta_{1}\theta + \beta_{2}\,{\text{group}}\quad ({\text{model 2}}) $$
$$ {\text{Logit}}\,P(Y = 1\mid\theta ) = \beta_{1}\theta \quad ({\text{model 3}}) $$

In these equations, P(Y = 1) is the probability of endorsing an item, θ is the IRT estimate of functional status, and group is the demographic category.

Two types of DIF are identified in the literature. In items with non-uniform DIF, the degree to which group membership interferes with the relationship between FS level and item responses differs across levels of functional status. In items with uniform DIF, this interference is the same across all levels of functional status.

To detect non-uniform DIF, we compare the log likelihoods of models 1 and 2 using a χ2 test (α = 0.05). To detect uniform DIF, we compute the relative difference between the coefficients on θ in models 2 and 3, |(β1(model 2) − β1(model 3))/β1(model 3)|. If the relative difference is large, group membership interferes with the expected relationship between functional status and item responses. The literature provides little guidance on how large the relative difference should be before an item is flagged. A simulation study of confounder-selection strategies by Maldonado and Greenland used a 10% change criterion in a very different context [61], and we adopted that criterion for this data set.
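For readers who want to see what these model comparisons look like outside the difwithpar wrapper, a minimal Stata sketch for a single item follows. The item, FS estimate, and group variable names (dep3, theta_naive, agegroup) are illustrative placeholders, and the sketch assumes a Stata release that supports factor-variable notation; it is not the internal implementation of difwithpar, which automates these steps across all items.

* Model 1: FS estimate, group, and their interaction (tests non-uniform DIF)
ologit dep3 theta_naive i.agegroup c.theta_naive#i.agegroup
estimates store m1
* Model 2: FS estimate and group main effect only
ologit dep3 theta_naive i.agegroup
estimates store m2
local b1_m2 = _b[theta_naive]
* Model 3: FS estimate only
ologit dep3 theta_naive
local b1_m3 = _b[theta_naive]
* Non-uniform DIF: likelihood-ratio test of model 1 against model 2
lrtest m1 m2
* Uniform DIF: relative change in the coefficient on theta (flag if > 0.10)
display abs((`b1_m2' - `b1_m3') / `b1_m3')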

Crane et al. [43, 47] have developed an approach to generate scores that account for DIF. When DIF is found, we create new datasets: items without DIF have item parameters estimated from the whole sample, while items with DIF have demographic-specific item parameters estimated.

We have written Stata code, called difwithpar, that automates these steps. As with runparscale, it can be obtained by typing

  • .ssc install difwithpar

at the Stata prompt. To evaluate items for DIF with respect to age, Stata code might look like this:

  • .difwithpar dep1-dep18, id(studyid) ability(theta_naive) group(age) mul(age) runname(a01) ubch(0.10) nupv(0.001)

That code identifies the items to examine for DIF, a unique study identifier that will be used to merge scores, the ability estimate to use for conditioning, the name of the covariate to examine for DIF, the run name (“a01” here), the value of change in the β coefficient for uniform DIF detection (“ubch,” here set to the default of 0.10), and the P value for non-uniform DIF (“nupv,” here set to 0.001 to accommodate the large data set).

This single line of code will evaluate each of the items in the item list (here, dep1 through dep18) using the three ordinal logistic regression models described earlier. It will apply the criteria the user specifies (e.g., the P value for non-uniform DIF and the change in β value for uniform DIF) to flag items as having or not having DIF. For items flagged with DIF, the code will generate new items. For example, if item 3 was flagged with DIF by age, difwithpar will create items called dep3_age1, dep3_age2, and dep3_age3. The item dep3_age1 has missing values for all participants for whom age = 2 or age = 3, and the original dep3 responses for all participants for whom age = 1. The difwithpar procedure proceeds to create a Parscale dataset and Parscale code that incorporates these new items. In this example, the Parscale run would involve dep1, dep2, dep3_age1, dep3_age2, dep3_age3, dep4, dep5, dep6, dep7, … dep18. The difwithpar program then runs that code in Parscale, and imports the IRT score that accounts for DIF found in dep3, which will be called theta_a01, and its standard error, setheta_a01.
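To make the content of these group-specific items concrete, the following minimal Stata sketch reproduces the item splitting by hand for the hypothetical item dep3 and a three-level agegroup variable; difwithpar performs this bookkeeping, and the subsequent Parscale run, automatically.

* Age-specific copies of dep3; each copy is missing outside its own age group
generate dep3_age1 = dep3 if agegroup == 1
generate dep3_age2 = dep3 if agegroup == 2
generate dep3_age3 = dep3 if agegroup == 3
* dep3 itself is then omitted from the Parscale item list, so each age group
* receives its own item parameters for this item in the next calibration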

Spurious false-positive and false-negative results may occur if the functional status score (θ) used for DIF detection includes many items with DIF [43]. We therefore use an iterative approach for each covariate. We generate IRT scores that account for DIF, and use these as the FS score to detect DIF. If different items are identified with DIF, we repeat the process described earlier, modifying the assignments of items based on the most recent round of DIF detection. If the same items are identified with DIF on successive rounds, we are satisfied that we identified items with DIF (as opposed to spurious findings).

From a practical standpoint, we modify the initial line of difwithpar code, which we repeat here:

  • .difwithpar dep1-dep18, id(studyid) ability(theta_naive) group(age) mul(age) runname(a01) ubch(0.1) nupv(0.001)

The modified code looks like this:

  • .difwithpar dep1-dep18, id(studyid) ability(theta_a01) group(age) mul(age) runname(a02) ubch(0.1) nupv(0.001)

We have changed the ability term for DIF detection to the recently calculated theta_a01, and changed the run name to a02. We look at the items flagged with DIF in this run and compare to those found on earlier runs. We continue this process until the same items are identified with DIF on repeated runs. This may take several runs. In this example, say it took three runs to converge on the same list of items with DIF for age. The variable theta_a02 would be our estimate of functional status that accounted for DIF with respect to age.

We performed two sets of analyses. First, we examined each covariate in turn for DIF, using the initial IRT score that did not account for DIF as the conditioning variable, and then assessed DIF impact. Because DIF impact was negligible in our data, it was not strictly necessary to account for items with DIF across covariates simultaneously. Nevertheless, to demonstrate the effect of controlling for DIF with all covariates at once, our final analyses accounted for multiple sources of DIF simultaneously, regardless of DIF impact. Here we describe further details of the methods for doing this.

“Appendix 5” informs us that we found one item with some form of DIF with respect to age: standing for 1 h. The final IRT estimate of FS levels included separate calibrations of this item in those 18 to 45, >45 to 65, and >65 years old. To then account for DIF with respect to our next covariate, symptom acuity, we modify the data set to evaluate 20 items:

  • 17 items that did not have DIF related to age,

  • Standing among patients 18–45 years,

  • Standing among patients >45–65 years,

  • Standing among patients >65 years.

We have written Stata code called mergevirtual that we run after difwithpar has converged on a final answer for a covariate to bring in the new items—in this case, standing separately in the three levels of age. The code for mergevirtual is automatically loaded when difwithpar is installed. After running difwithpar with respect to age, we enter the following line at the Stata prompt:

  • .mergevirtual studyid “C:\Data\mydataset”

New age-specific items for standing are then added to the data set. We proceed to analyze DIF with respect to acuity for these 20 items, and use as our conditioning variable the IRT score that accounted for DIF with respect to age.

This is operationalized by the following line of code:

  • .difwithpar dep1 dep2 dep3_a1 dep3_a2 dep3_a3 dep4 dep5 dep6 dep7 dep8 dep9 dep10 dep11 dep12 dep13 dep14 dep15 dep16 dep17 dep18, id(studyid) gr(acuitygroup) mul(acuitygroup) run(all01) ab(theta_a02) ubch(0.1) nupv(0.05)

It is often useful to re-name the group-specific items (e.g., rename dep3_age1 as dep3_a1). This code tells difwithpar to use the final IRT score that accounted for DIF with respect to age (theta_a02) as the conditioning variable when detecting DIF with respect to acuity group. The next run would look like this:

  • .difwithpar dep1 dep2 dep3_a1 dep3_a2 dep3_a3 dep4 dep5 dep6 dep7 dep8 dep9 dep10 dep11 dep12 dep13 dep14 dep15 dep16 dep17 dep18, id(studyid) gr(acuitygroup) mul(acuitygroup) run(all02) ab(theta_all01) ubch(0.1) nupv(0.001)

Now, we have modified the previous code to use the new FS estimate (theta_all01 rather than theta_a02) and specified that the run name is all02. These steps continue until the items flagged with DIF have converged. The resulting FS estimate accounts for DIF related to age and symptom acuity. For example, if it took five runs to converge on a common list of items with DIF for acuity, the theta_all05 term would account for DIF related to both age and acuity. Analyses then proceed by merging the new items back into the dataset, again using the code

  • .mergevirtual id(studyid) “C:\Data\mydataset”

Analyses would then proceed by moving to the next covariate. For example, suppose DIF was found with respect to acuity for item dep1 and for the age-specific (18 to 45 years) standing item dep3_a1. Six new variables would be added to the dataset: dep1_acuitygroup1, dep1_acuitygroup2, dep1_acuitygroup3, dep3_a1_acuitygroup1, dep3_a1_acuitygroup2, and dep3_a1_acuitygroup3. After renaming these to dep1_ac1, dep1_ac2, dep1_ac3, dep3_a1_ac1, dep3_a1_ac2, and dep3_a1_ac3, the next difwithpar run would look like this:

  • .difwithpar dep1_ac1 dep1_ac2 dep1_ac3 dep2 dep3_a1_ac1 dep3_a1_ac2 dep3_a1_ac3 dep3_a2 dep3_a3 dep4 dep5 dep6 dep7 dep8 dep9 dep10 dep11 dep12 dep13 dep14 dep15 dep16 dep17 dep18, id(studyid) gr(surgery) run(all06) ab(theta_all05) ubch(0.1) nupv(0.001)

This code tells difwithpar to analyze those items for DIF with respect to surgery group (“surgery”), conditioning on the IRT FS estimate that accounted for DIF with respect to age and acuity (“theta_all05”). Subsequent runs proceed until all covariates of interest have been included.

The difwithpar package is capable of analyzing DIF across more than two groups; our age and acuity terms, for example, each included three groups. The “mul” option tells difwithpar to treat such covariates as categorical group indicators. For a covariate with four levels, as in the illustration below, difwithpar will generate interaction terms between group membership and θ, and model 1 will look like this:

$$ {\text{Logit}}\,P(Y = 1\mid\theta ,\,{\text{agegroup}}) = \beta_{1}\theta + \beta_{2,2}\,{\text{agegroup}}_{2} + \beta_{2,3}\,{\text{agegroup}}_{3} + \beta_{2,4}\,{\text{agegroup}}_{4} + \beta_{3,2}\,\theta \times {\text{agegroup}}_{2} + \beta_{3,3}\,\theta \times {\text{agegroup}}_{3} + \beta_{3,4}\,\theta \times {\text{agegroup}}_{4} \quad ({\text{model 1*}}) $$

Model 2 will then look like this:

$$ {\text{Logit}}\,P(Y = 1\mid\theta ,\,{\text{agegroup}}) = \beta_{1}\theta + \beta_{2,2}\,{\text{agegroup}}_{2} + \beta_{2,3}\,{\text{agegroup}}_{3} + \beta_{2,4}\,{\text{agegroup}}_{4} \quad ({\text{model 2*}}) $$

And model 3 is unchanged:

$$ {\text{Logit}}\,P(Y = 1\mid\theta ) = \beta_{1}\theta \quad ({\text{model 3}}) $$

For non-uniform DIF with categorical groups, twice the difference in log likelihood between models 1* and 2* is compared to a χ2 distribution with degrees of freedom equal to the number of groups minus 1. In this instance, there are four groups and three degrees of freedom, corresponding to the three interaction terms that differ between models 1* and 2*. For uniform DIF, the β1 coefficient from model 2* is compared to the β1 coefficient from model 3, as before.
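Written out, and assuming k groups, the non-uniform DIF test above compares

$$ G^{2} = 2\left[ \ln L({\text{model 1*}}) - \ln L({\text{model 2*}}) \right] $$

to a χ2 distribution with k − 1 degrees of freedom; with four groups this gives 3 degrees of freedom and a critical value of 7.81 at α = 0.05. This is simply a restatement of the procedure described above, not an additional analysis.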

Appendix 2

See Table 3.

Table 3 Item exposure rates of the CAT in United States and Israel

Appendix 3

See Table 4.

Table 4 Item parameters by country

Appendix 4

See Table 5.

Table 5 Presence of differential item functioning related to covariates in the Israeli sample

Appendix 5

See Table 6.

Table 6 Presence of differential item functioning related to covariates in the combined United States and Israeli sample

About this article

Cite this article

Hart, D.L., Deutscher, D., Crane, P.K. et al. Differential item functioning was negligible in an adaptive test of functional status for patients with knee impairments who spoke English or Hebrew. Qual Life Res 18, 1067–1083 (2009). https://doi.org/10.1007/s11136-009-9517-8
