A general framework and an R package for the detection of dichotomous differential item functioning

Abstract

Differential item functioning (DIF) is an important issue in psychometrics and educational measurement. Several methods have been proposed in recent decades for identifying items that function differently between two or more groups of examinees. Starting from a framework for classifying DIF detection methods and a comparative overview of the most traditional ones, difR, an R package implementing nine of these methods, is presented. Its commands and options are briefly described, and the package is illustrated through the analysis of a data set on verbal aggression.
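
To make concrete what "functioning differently between groups" means for a dichotomous item, here is a minimal Python sketch of the Mantel-Haenszel common odds ratio, one of the traditional DIF statistics. It is not code from difR and does not reproduce the package's interface. The function name, the argument layout, and the toy data are assumptions made purely for illustration. Examinees are matched on a score (for instance the test total), and within each score level the odds of a correct response are compared between the reference and the focal group.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(item, group, score):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    item  : iterable of 0/1 responses to the studied item
    group : iterable of 0 (reference group) or 1 (focal group)
    score : iterable of matching scores used to form the strata
    (All names here are hypothetical, chosen for this sketch.)
    """
    # One 2x2 table per score level, stored as [A, B, C, D]:
    # A = reference correct, B = reference incorrect,
    # C = focal correct,     D = focal incorrect.
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for y, g, s in zip(item, group, score):
        cell = (0 if y == 1 else 1) + (0 if g == 0 else 2)
        tables[s][cell] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables.values():
        total = a + b + c + d
        numerator += a * d / total
        denominator += b * c / total

    if denominator == 0:
        raise ValueError("odds ratio undefined: no stratum has both B > 0 and C > 0")
    return numerator / denominator


# Toy data (invented): two score levels, each with 4 reference and 4 focal
# examinees. The reference group answers correctly 3 times out of 4, the
# focal group 1 time out of 4, at both score levels.
item = [1, 1, 1, 0, 1, 0, 0, 0] * 2
group = [0, 0, 0, 0, 1, 1, 1, 1] * 2
score = [10] * 8 + [11] * 8

alpha_mh = mantel_haenszel_odds_ratio(item, group, score)
print(alpha_mh)
```

A value near 1 suggests that matched examinees from both groups have similar odds of answering the item correctly. Values well above or below 1 point to possible uniform DIF, in favor of the reference or the focal group respectively. A full analysis would add a significance test and an effect-size classification, which this sketch leaves out.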

Author information

Corresponding author

Correspondence to David Magis.

Additional information

This research was financially supported by the Belgian Federal Science Policy (Funds IAP/P6/03), the Research Fund GOA/2005/04 of the K.U. Leuven, Belgium, a doctoral grant “Bourse à la mobilité (hors Québec) pour l’intégration à la communauté scientifique en éducation” of the UQAM, Canada, and a postdoctoral grant “Chargé de recherches” of the National Funds for Scientific Research (FNRS), Belgium.

About this article

Cite this article

Magis, D., Béland, S., Tuerlinckx, F. et al. A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods 42, 847–862 (2010). https://doi.org/10.3758/BRM.42.3.847

Keywords

  • Differential Item Functioning
  • Item Response Theory
  • Item Parameter
  • Item Response Theory Model
  • Item Response Model