A comparison of methods to address item non-response when testing for differential item functioning in multidimensional patient-reported outcome measures

Quality of Life Research

Abstract

Purpose

Item non-response (i.e., missing data) may mask the detection of differential item functioning (DIF) in patient-reported outcome measures or result in biased DIF estimates. Non-response can be challenging to address in ordinal data. We investigated an unsupervised machine-learning method for ordinal item-level imputation and compared it with commonly used item non-response methods when testing for DIF.

Methods

Computer simulation and real-world data were used to assess several item non-response methods using the item response theory likelihood ratio test for DIF. The methods included: (a) list-wise deletion (LD), (b) half-mean imputation (HMI), (c) full information maximum likelihood (FIML), and (d) non-negative matrix factorization (NNMF), which adopts a machine-learning approach to impute missing values. Control of Type I error rates was evaluated using a liberal robustness criterion for α = 0.05 (i.e., 0.025–0.075). Statistical power was assessed with and without adoption of an item non-response method; differences > 10% were considered substantial.
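
To make the two simplest methods concrete, the following is a minimal Python sketch of list-wise deletion and half-mean imputation for a single subscale. It is not the authors' simulation code. The function names are hypothetical, the half rule is implemented as it is commonly described (impute only when at least half of the items were answered), and rounding the imputed mean to the nearest response category is an assumption made here so that the imputed values stay ordinal.

```python
import numpy as np


def listwise_deletion(X):
    """Drop every respondent (row) with at least one missing item (NaN)."""
    X = np.asarray(X, dtype=float)
    return X[~np.isnan(X).any(axis=1)]


def half_mean_imputation(X, round_to_category=True):
    """Half-mean imputation for one subscale (rows = respondents, columns = items).

    A missing item is replaced by the mean of the respondent's observed items,
    but only when at least half of the items were answered. Respondents below
    that threshold keep their missing values.
    """
    X = np.asarray(X, dtype=float).copy()
    n_items = X.shape[1]
    observed = ~np.isnan(X)
    n_obs = observed.sum(axis=1)

    # Half rule: at least half of the items in the subscale must be observed.
    eligible = 2 * n_obs >= n_items

    # Person-level mean over observed items (NaN when nothing was observed).
    row_sums = np.where(observed, X, 0.0).sum(axis=1)
    row_means = np.divide(
        row_sums, n_obs, out=np.full(X.shape[0], np.nan), where=n_obs > 0
    )

    # Assumption: round to the nearest category to keep the data ordinal.
    # np.round rounds exact halves to the nearest even integer.
    if round_to_category:
        row_means = np.round(row_means)

    fill = ~observed & eligible[:, None]
    X[fill] = np.broadcast_to(row_means[:, None], X.shape)[fill]
    return X
```

For a multidimensional measure, one plausible usage is to call `half_mean_imputation` separately on the columns belonging to each subscale, so that the person mean is taken only over items from the same dimension.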

Results

Type I error rates for detecting DIF using LD, FIML and NNMF methods were controlled within the bounds of the robustness criterion for > 95% of simulation conditions, although the NNMF occasionally resulted in inflated rates. The HMI method always resulted in inflated error rates with 50% missing data. Differences in power to detect moderate DIF effects for LD, FIML and NNMF methods were substantial with 50% missing data and otherwise insubstantial.
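
The two evaluation rules used above can be written down in a few lines. The sketch below assumes that an empirical rate is the proportion of simulation replications in which the DIF test rejects at α, that the liberal criterion is the interval 0.5α to 1.5α (which gives 0.025–0.075 for α = 0.05), and that a difference of "> 10%" means more than 10 percentage points. The function names are hypothetical.

```python
import numpy as np


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose DIF test rejects at level alpha."""
    p = np.asarray(p_values, dtype=float)
    return float(np.mean(p < alpha))


def within_liberal_criterion(rate, alpha=0.05):
    """True if the empirical Type I error rate lies in [0.5*alpha, 1.5*alpha]."""
    return 0.5 * alpha <= rate <= 1.5 * alpha


def substantial_power_difference(power_a, power_b, threshold=0.10):
    """True if two power estimates differ by more than the threshold.

    Assumption: the 10% threshold is read as 10 percentage points.
    """
    return abs(power_a - power_b) > threshold
```

With these definitions, an empirical Type I error rate of 0.05 is classified as controlled and a rate of 0.09 as inflated. Power estimates of 0.90 and 0.75 (illustrative values only, not results from the study) differ substantially.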

Conclusion

The NNMF method demonstrated comparable performance to commonly used non-response methods. This computationally efficient method represents a promising approach to address item-level non-response when testing for DIF.
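
As an illustration of how NNMF can impute item-level missing values, the sketch below factorizes the respondent-by-item matrix using only the observed cells and fills the missing cells from the low-rank reconstruction. It is a generic masked multiplicative-update scheme written for this illustration, not the implementation used in the study. The function name, the default rank, the iteration count, and the final rounding and clipping to the observed category range are all assumptions.

```python
import numpy as np


def nnmf_impute(X, rank=2, n_iter=500, seed=0, eps=1e-9):
    """Impute NaN cells of a non-negative item-response matrix with masked NNMF.

    Minimizes the squared error between X and W @ H over observed cells only,
    using multiplicative updates that keep W and H non-negative.
    """
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)
    if (X[mask] < 0).any():
        raise ValueError("NNMF requires non-negative item responses")

    M = mask.astype(float)          # 1 where observed, 0 where missing
    X0 = np.where(mask, X, 0.0)     # observed values, zeros elsewhere

    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.uniform(0.1, 1.0, size=(n, rank))
    H = rng.uniform(0.1, 1.0, size=(rank, p))

    for _ in range(n_iter):
        # Update W with H fixed; missing cells contribute nothing to either term.
        W *= (X0 @ H.T) / ((M * (W @ H)) @ H.T + eps)
        # Update H with W fixed.
        H *= (W.T @ X0) / (W.T @ (M * (W @ H)) + eps)

    X_hat = W @ H

    # Assumption: map reconstructed values back to ordinal categories by
    # rounding and clipping to the range of the observed responses.
    lo, hi = X[mask].min(), X[mask].max()
    imputed = np.clip(np.round(X_hat), lo, hi)

    # Observed responses are left untouched; only missing cells are filled.
    return np.where(mask, X, imputed)
```

The completed matrix returned by `nnmf_impute` could then be passed to whatever DIF procedure is being used. The rank is a tuning choice, and selecting it (for example by holding out some observed cells) is outside the scope of this sketch.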

Data availability

Study data for the real-world analyses were secondary data. These data were provided under specific data sharing agreements only for approved use for this project. The original source data are not owned by the researchers and as such cannot be provided to a public repository. Where necessary and with appropriate approvals, the original source data for this project may be reviewed with the consent of the data providers and approval by the required privacy and ethical review bodies.

Code availability

Simulation code is provided in the supplementary material.

Funding

Funding for this study was provided by the Canadian Institutes of Health Research (Grant # MOP-142404). OFA is supported by funding from the Visual and Automated Disease Analytics (VADA) Program at the University of Manitoba. LML is supported by a Tier 1 Canada Research Chair in Methods for Electronic Health Data Quality. MJJ acknowledges the research support of the Natural Sciences and Engineering Research Council of Canada (NSERC). RB acknowledges the research support of the Canadian Institutes of Health Research.

Author information

Contributions

All authors conceived the study and contributed to the design of the simulation study and analysis plan for the numeric example. OFA, MJJ and LML conducted the analysis and prepared the draft manuscript. All authors reviewed and approved the final version of the manuscript.

Corresponding author

Correspondence to Lisa M. Lix.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethical approval

This study received ethical approval from the University of Manitoba Health Research Ethics Board.

Consent to participate

Informed written consent was obtained from all participants whose information was used in the analyses of data from the Winnipeg Regional Health Authority Joint Replacement Registry.

Consent for publication

Not applicable. There is no identifying information for participants contained in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 318 KB)

About this article

Cite this article

Ayilara, O.F., Sajobi, T.T., Barclay, R. et al. A comparison of methods to address item non-response when testing for differential item functioning in multidimensional patient-reported outcome measures. Qual Life Res 31, 2837–2848 (2022). https://doi.org/10.1007/s11136-022-03129-8

  • DOI: https://doi.org/10.1007/s11136-022-03129-8
