Psychometrika, Volume 80, Issue 2, pp 317–340

A Statistical Test for Differential Item Pair Functioning

  • Timo M. Bechger
  • Gunter Maris


This paper presents an IRT-based statistical test for differential item functioning (DIF). The test is developed for items conforming to the Rasch (1960) model, but we outline its extension to more complex IRT models. It differs from existing procedures in that DIF is defined in terms of the relative difficulties of pairs of items rather than the difficulties of individual items. The argument is that the difficulty of an individual item is not identified from the observations, whereas relative difficulties are. This leads to a test that is closely related to Lord's (1980) test for item DIF, albeit with a different and more correct interpretation. Illustrations with real and simulated data are provided.
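The pairwise logic behind such a test can be illustrated with a small sketch. Under the Rasch model, if a person answers exactly one item of a pair (i, j) correctly, the probability that it is item i depends only on the relative difficulty of the pair, not on the person's ability, so this relative difficulty is identified within each group and can be compared across groups (in the spirit of pairwise conditional estimation, cf. Zwinderman, 1995). The code below is a hypothetical illustration, not the authors' implementation: the function names are ours, and the Wald-type z statistic is a simplified stand-in for the test developed in the paper.

```python
import math
import random


def simulate_rasch(n_persons, betas, theta_mu=0.0, seed=1):
    """Simulate 0/1 response vectors under a Rasch model with
    person abilities drawn from N(theta_mu, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(theta_mu, 1.0)
        data.append([1 if rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))) else 0
                     for b in betas])
    return data


def pair_logit(data, i, j):
    """Estimate the relative difficulty beta_j - beta_i from responses
    where exactly one of items i, j is correct; conditioning on this
    event makes the person's ability drop out."""
    a = sum(1 for r in data if r[i] == 1 and r[j] == 0)  # i right, j wrong
    b = sum(1 for r in data if r[i] == 0 and r[j] == 1)  # j right, i wrong
    est = math.log(a / b)       # log-odds estimates beta_j - beta_i
    var = 1.0 / a + 1.0 / b     # delta-method variance of the log-odds
    return est, var


def pair_dif_z(data_ref, data_foc, i, j):
    """Wald-type z statistic comparing the relative difficulty of the
    item pair (i, j) between a reference and a focal group."""
    d1, v1 = pair_logit(data_ref, i, j)
    d2, v2 = pair_logit(data_foc, i, j)
    return (d1 - d2) / math.sqrt(v1 + v2)
```

Simulating two groups with `simulate_rasch` and shifting one item's difficulty in the focal group yields large |z| only for pairs involving that item, illustrating the paper's point that DIF is a property of item pairs: a group difference in mean ability (impact) leaves all pairwise relative difficulties unchanged.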


Keywords: item response theory · differential item functioning · identifiability · Rasch model



The authors wish to thank Norman Verhelst, Cees Glas, Paul De Boeck, Don Mellenbergh, and our former student Maarten Marsman for helpful comments and discussions.


References

  1. Ackerman, T. (1992). A didactic explanation of item bias, impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
  2. Andersen, E. B. (1972). The numerical solution to a set of conditional estimation equations. Journal of the Royal Statistical Society, Series B, 34, 283–301.
  3. Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123–140.
  4. Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69–81.
  5. Angoff, W. H. (1982). Use of difficulty and discrimination indices for detecting item bias. In R. A. Berk (Ed.), Handbook of methods for detecting item bias (pp. 96–116). Baltimore: Johns Hopkins University Press.
  6. Angoff, W. H. (1993). Perspective on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3–24). Hillsdale, NJ: Lawrence Erlbaum Associates.
  7. Bechger, T. M., Maris, G., & Verstralen, H. H. F. M. (2010). A different view on DIF (R&D Report No. 2010–5). Arnhem: Cito.
  8. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading: Addison-Wesley.
  9. Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113–141.
  10. Camilli, G. (1993). The case against item bias detection techniques based on internal criteria: Do test bias procedures obscure test fairness issues? In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 397–413). Hillsdale, NJ: Lawrence Erlbaum.
  11. Candell, G. L., & Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.
  12. De Boeck, P. (2008). Random item IRT models. Psychometrika, 73(4), 533–559.
  13. DeMars, C. (2010). Type I error inflation for detecting DIF in the presence of impact. Educational and Psychological Measurement, 70, 961–972.
  14. Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale, NJ: Lawrence Erlbaum Associates.
  15. Engelhard, G. (1989). Accuracy of bias review judges in identifying teacher certification tests. Applied Measurement in Education, 3, 347–360.
  16. Finch, H. W., & French, B. F. (2008). Anomalous type I error rates for identifying one type of differential item functioning in the presence of the other. Educational and Psychological Measurement, 68(5), 742–759.
  17. Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to the theory of psychological tests]. Bern: Verlag Hans Huber.
  18. Fischer, G. H. (2007). Rasch models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 515–585). Amsterdam, The Netherlands: Elsevier.
  19. Gabrielsen, A. (1978). Consistency and identifiability. Journal of Econometrics, 8, 261–263.
  20. Gierl, M., Gotzmann, A., & Boughton, K. A. (2004). Performance of SIBTEST when the percentage of DIF items is large. Applied Measurement in Education, 17, 241–264.
  21. Glas, C. A. W. (1989). Contributions to estimating and testing Rasch models. Unpublished doctoral dissertation, Cito, Arnhem.
  22. Glas, C. A. W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8, 647–667.
  23. Glas, C. A. W., & Verhelst, N. D. (1995a). Testing the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 69–95). New York: Springer.
  24. Glas, C. A. W., & Verhelst, N. D. (1995b). Tests of fit for polytomous Rasch models. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments and applications (pp. 325–352). New York: Springer.
  25. Gnanadesikan, R., & Kettenring, J. R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28, 81–124.
  26. Hessen, D. J. (2005). Constant latent odds-ratios models and the Mantel-Haenszel null hypothesis. Psychometrika, 70, 497–516.
  27. Holland, P., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates.
  28. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
  29. Jodoin, G. M., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.
  30. Lord, F. M. (1977). A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology (pp. 19–29). Hillsdale, NJ: Lawrence Erlbaum Associates.
  31. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
  32. Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847–862.
  33. Magis, D., & De Boeck, P. (2012). A robust outlier approach to prevent Type I error inflation in differential item functioning. Educational and Psychological Measurement, 72, 291–311.
  34. Magis, D., & Facon, B. (2013). Item purification does not always improve DIF detection: A counterexample with Angoff's Delta plot. Educational and Psychological Measurement, 73, 293–311.
  35. Mahalanobis, P. C. (1936). On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India (Vol. 2, pp. 49–55).
  36. Maris, G., & Bechger, T. M. (2007). Scoring open ended questions. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 663–680). Amsterdam: Elsevier.
  37. Maris, G., & Bechger, T. M. (2009). On interpreting the model parameters for the three parameter logistic model. Measurement, 7, 75–88.
  38. Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–175.
  39. Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289–311.
  40. Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26). Amsterdam: Elsevier.
  41. Ponocny, I. (2001). Nonparametric goodness-of-fit tests for the Rasch model. Psychometrika, 66, 437–460.
  42. Raju, N. (1988). The area between two item characteristic curves. Psychometrika, 53, 495–502.
  43. Raju, N. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197–207.
  44. Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
  45. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research. (Expanded edition, 1980. Chicago: The University of Chicago Press).
  46. R Development Core Team. (2010). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
  47. San Martín, E., González, J., & Tuerlinckx, F. (2009). Identified parameters, parameters of interest and their relationships. Measurement, 7, 97–105.
  48. San Martín, E., & Quintana, F. (2002). Consistency and identifiability revisited. Brazilian Journal of Probability and Statistics, 16, 99–106.
  49. San Martín, E., & Rolin, J.-M. (2013). Identifiability of parametric Rasch-type models. Journal of Statistical Planning and Inference, 143, 116–130.
  50. San Martín, E., González, J., & Tuerlinckx, F. (2014). On the unidentifiability of the fixed-effects 3PL model. Psychometrika, 78(2), 341–379.
  51. Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58, 159–194.
  52. Soares, T. M., Gonçalves, F. B., & Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3), 348–377.
  53. Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with CFA and IRT: Toward a unified strategy. Journal of Applied Psychology, 91, 1292–1306.
  54. Stout, W. (2002). Psychometrics: From practice to theory and back. Psychometrika, 67, 485–518.
  55. Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
  56. Tay, L., Newman, D. A., & Vermunt, J. K. (2011). Using mixed-measurement item response theory with covariates (MM-IRT-C) to ascertain observed and unobserved measurement equivalence. Organizational Research Methods, 14, 147–155.
  57. Teresi, J. A. (2006). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44, 152–170.
  58. Thissen, D. (2001). IRTLRDIF v2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning [Computer software manual]. Chapel Hill.
  59. Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum Associates.
  60. Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–114). Hillsdale, NJ: Lawrence Erlbaum.
  61. Thurman, C. J. (2009). A Monte Carlo study investigating the influence of item discrimination, category intersection parameters, and differential item functioning in polytomous items. Unpublished doctoral dissertation, Georgia State University.
  62. Van der Flier, H., Mellenbergh, G. J., Adèr, H. J., & Wijn, M. (1984). An iterative bias detection method. Journal of Educational Measurement, 21, 131–145.
  63. Verhelst, N. D. (1993). On the standard errors of parameter estimators in the Rasch model (Measurement and Research Department Reports No. 93–1). Arnhem: Cito.
  64. Verhelst, N. D. (2008). An efficient MCMC-algorithm to sample binary matrices with fixed marginals. Psychometrika, 73, 705–728.
  65. Verhelst, N. D., & Eggen, T. J. H. M. (1989). Psychometrische en statistische aspecten van peilingsonderzoek [Psychometric and statistical aspects of assessment research] (PPON-rapport No. 4). Arnhem: Cito.
  66. Verhelst, N. D., Glas, C. A. W., & van der Sluis, A. (1984). Estimation problems in the Rasch model: The basic symmetric functions. Computational Statistics Quarterly, 1(3), 245–262.
  67. Verhelst, N. D., Glas, C. A. W., & Verstralen, H. H. F. M. (1994). OPLM: Computer program and manual [Computer software manual]. Arnhem: Cito.
  68. Verhelst, N. D., Hatzinger, R., & Mair, P. (2007). The Rasch sampler. Journal of Statistical Software, 20, 1–14.
  69. Verhelst, N. D., & Verstralen, H. H. F. M. (2008). Some considerations on the partial credit model. Psicológica, 29, 229–254.
  70. von Davier, M., & von Davier, A. A. (2007). A unified approach to IRT scale linking and scale transformations. Methodology, 3(3).
  71. Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
  72. Wang, W.-C. (2004). Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models. The Journal of Experimental Education, 72(3), 221–261.
  73. Wang, W.-C., Shih, C.-L., & Sun, G.-W. (2012). The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement, 74, 1–22.
  74. Wang, W.-C., & Yeh, Y.-L. (2003). Effects of anchor item methods on differential item functioning with the likelihood ratio test. Applied Psychological Measurement, 27(6), 479–498.
  75. Williams, V. S. L. (1997). The "unbiased" anchor: Bridging the gap between DIF and item bias. Applied Measurement in Education, 10(3), 253–267.
  76. Wilson, M., & Adams, R. (1995). Rasch models for item bundles. Psychometrika, 60, 181–198.
  77. Wright, B. D., Mead, R., & Draba, R. (1976). Detecting and correcting item bias with a logistic response model (Tech. Rep.). Chicago: University of Chicago, Department of Education, Statistical Laboratory.
  78. Zwinderman, A. H. (1995). Pairwise parameter estimation in Rasch models. Applied Psychological Measurement, 19, 369–375.
  79. Zwitser, R., & Maris, G. (2013). Conditional statistical inference with multistage testing designs. Psychometrika. doi: 10.1007/s11336-013-9369-6.

Copyright information

© The Psychometric Society 2014

Authors and Affiliations

  1. Cito, Arnhem, The Netherlands
  2. University of Amsterdam, Amsterdam, The Netherlands
