Health Services and Outcomes Research Methodology, Volume 9, Issue 3, pp 145–161

A non-parametric Bayesian diagnostic for detecting differential item functioning in IRT models

  • Mark E. Glickman
  • Pradipta Seal
  • Susan V. Eisen


Differential item functioning (DIF) in tests and multi-item surveys occurs when the response to one or more items is not conditionally independent of membership in a particular group, given equal levels of proficiency. We develop an approach to detecting DIF in the context of item response theory (IRT) models based on computing a diagnostic that is the posterior mean of a p-value. IRT models are fit in a Bayesian framework, and simulated proficiency parameters from the posterior distribution are retained. Monte Carlo estimates of the p-value diagnostic are then computed by comparing the fit of nonparametric regressions of item responses on simulated proficiency parameters and group membership. Some properties of our approach are examined through a simulation experiment. We apply our method to the analysis of responses to the BASIS-24, a widely used self-report mental health assessment instrument, from two separate studies, to examine DIF between the English and Spanish-translated versions of the survey.
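The Monte Carlo step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that are not in the abstract: the item is binary, the nonparametric regression is replaced by a crude binned estimate (equal-count proficiency strata), the two fits are compared with a likelihood-ratio chi-square test, and the posterior proficiency draws are taken as given rather than produced by fitting an IRT model. All function names (`chi2_sf`, `dif_p_value`, `posterior_mean_p_value`, `simulate`) are hypothetical.

```python
import math
import random


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    x = stat / 2.0
    if df % 2:                      # odd df: start from Q(1/2, x) = erfc(sqrt(x))
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:                           # even df: start from Q(1, x) = exp(-x)
        a, q = 1.0, math.exp(-x)
    while a < df / 2.0 - 1e-9:      # recurrence Q(a+1, x) = Q(a, x) + x^a e^-x / Gamma(a+1)
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def binom_loglik(successes, n):
    """Maximized Bernoulli log-likelihood of one cell (0 * log 0 taken as 0)."""
    return sum(k * math.log(k / n) for k in (successes, n - successes) if k > 0)


def dif_p_value(theta, y, group, n_bins=4):
    """P-value for 'group adds nothing beyond binned proficiency' (assumed test)."""
    order = sorted(range(len(theta)), key=lambda i: theta[i])
    cells = {}                      # (bin, group) -> [successes, count]
    for rank, i in enumerate(order):
        b = rank * n_bins // len(order)
        cell = cells.setdefault((b, group[i]), [0, 0])
        cell[0] += y[i]
        cell[1] += 1
    lr, df = 0.0, 0
    for b in range(n_bins):
        in_bin = [c for (bb, _), c in cells.items() if bb == b]
        if len(in_bin) < 2:         # only one group present: bin is uninformative
            continue
        pooled = binom_loglik(sum(c[0] for c in in_bin), sum(c[1] for c in in_bin))
        separate = sum(binom_loglik(c[0], c[1]) for c in in_bin)
        lr += 2.0 * (separate - pooled)
        df += len(in_bin) - 1
    return chi2_sf(max(lr, 0.0), df) if df > 0 else 1.0


def posterior_mean_p_value(theta_draws, y, group):
    """Monte Carlo estimate: average the p-value over posterior proficiency draws."""
    return sum(dif_p_value(t, y, group) for t in theta_draws) / len(theta_draws)


def simulate(dif_shift, n=600, n_draws=50, seed=1):
    """Toy data; 'draws' are truth plus noise, a stand-in for real MCMC output."""
    rng = random.Random(seed)
    group = [i % 2 for i in range(n)]
    truth = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [int(rng.random() < 1.0 / (1.0 + math.exp(-(t + dif_shift * g))))
         for t, g in zip(truth, group)]
    draws = [[t + rng.gauss(0.0, 0.3) for t in truth] for _ in range(n_draws)]
    return draws, y, group


if __name__ == "__main__":
    for shift in (0.0, 2.0):        # no DIF versus a large uniform DIF effect
        print(shift, posterior_mean_p_value(*simulate(shift)))
```

Under these assumptions, a small averaged p-value suggests that group membership still predicts the item response after conditioning on proficiency, which is the signature of DIF. A smoother such as a generalized additive model would be closer in spirit to the nonparametric regressions the abstract mentions; the binned version is used here only to keep the sketch dependency-free.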


Keywords: Bayesian modeling; Conditional independence; Mental health outcome; Model diagnostics; Patient surveys



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Mark E. Glickman (1, 2)
  • Pradipta Seal (3)
  • Susan V. Eisen (1, 2)

  1. Department of Health Policy and Management, Boston University School of Public Health, Boston, USA
  2. Center for Health Quality, Outcomes and Economics Research, a Veteran Administration Center of Excellence, Edith Nourse Rogers Memorial Hospital (152), Bedford, USA
  3. Department of Mathematics and Statistics, Boston University, Boston, USA
