Behavior Research Methods, Volume 42, Issue 3, pp. 847–862

A general framework and an R package for the detection of dichotomous differential item functioning

  • David Magis
  • Sébastien Béland
  • Francis Tuerlinckx
  • Paul De Boeck


Differential item functioning (DIF) is an important issue in psychometrics and educational measurement. Several methods have been proposed in recent decades for identifying items that function differently between two or more groups of examinees. Starting from a framework for classifying DIF detection methods and a comparative overview of the most traditional methods, this article presents difR, an R package that implements nine of these methods. The commands and options are briefly described, and the package is illustrated through the analysis of a data set on verbal aggression.


Keywords: Differential Item Functioning; Item Response Theory; Item Parameter; Item Response Theory Model; Item Response Model



Copyright information

© Psychonomic Society, Inc. 2010

Authors and Affiliations

  • David Magis (1, 4)
  • Sébastien Béland (2)
  • Francis Tuerlinckx (1)
  • Paul De Boeck (1, 3)

  1. Katholieke Universiteit Leuven, Leuven, Belgium
  2. University of Quebec, Montreal, Canada
  3. University of Amsterdam, Amsterdam, The Netherlands
  4. Department of Mathematics, University of Liège, Liège, Belgium
