A non-parametric Bayesian diagnostic for detecting differential item functioning in IRT models
DOI: 10.1007/s10742-009-0052-4
Cite this article as: Glickman, M.E., Seal, P. & Eisen, S.V. Health Serv Outcomes Res Method (2009) 9: 145.
Abstract
Differential item functioning (DIF) in tests and multi-item surveys occurs when a lack of conditional independence exists between the response to one or more items and membership to a particular group, given equal levels of proficiency. We develop an approach to detecting DIF in the context of item response theory (IRT) models based on computing a diagnostic which is the posterior mean of a p-value. IRT models are fit in a Bayesian framework, and simulated proficiency parameters from the posterior distribution are retained. Monte Carlo estimates of the p-value diagnostic are then computed by comparing the fit of nonparametric regressions of item responses on simulated proficiency parameters and group membership. Some properties of our approach are examined through a simulation experiment. We apply our method to the analysis of responses from two separate studies to the BASIS-24, a widely used self-report mental health assessment instrument, to examine DIF between the English and Spanish-translated version of the survey.
Keywords
Bayesian modeling · Conditional independence · Mental health outcome · Model diagnostics · Patient surveys

1 Introduction
The assessment of differential item functioning (DIF) has become an integral part of determining the validity of standardized tests and multi-item surveys. In the context of tests, DIF occurs when people from different groups with the same ability have systematically different responses to specific test items. If, for example, a math test item has boys answering correctly more often than girls of equal ability because the subject of the item is on a topic more familiar to boys (e.g., sports), then the item is said to exhibit DIF and should be considered for modification or removal from the test. DIF of an item can therefore be understood as a lack of conditional independence between an item response and group membership (often gender or ethnicity) given the same latent ability or trait.
While differential item functioning has traditionally been applied to educational tests, DIF studies are increasingly finding application to health surveys. The focus of this paper is on health surveys, so we will henceforth view a patient’s health as the latent trait in a DIF analysis. In a recent paper, Teresi (2006) has provided a review of statistical issues of DIF in health applications. Perkins et al. (2006) have examined DIF for items in a widely used health status instrument by age, education, race and gender groupings, and found many items to exhibit DIF. Pagano and Gotay (2005) have shown the presence of DIF by ethnic groups in a quality of life survey for cancer patients. Cauffman and MacIntosh (2006), in a recent mental health application, examine DIF by gender and ethnic groups of incarcerated juveniles in an instrument designed to identify mental health problems. As more health-related applications rely on DIF detection to establish the validity of health surveys, the statistical underpinnings of DIF detection become increasingly crucial.
Various methods for detection of DIF have been proposed over the past 25 years. The most commonly used approach is based on a Mantel–Haenszel analysis of the relationship between item responses and group membership conditional on an observed measure of ability (Holland and Thayer 1988), usually, in the context of tests, the total number of correctly answered items. Another common approach to detect DIF is to use log-linear or logistic models, as described in Kok et al. (1985) and Swaminathan and Rogers (1990). Recognizing that these methods involve conditioning on a measured surrogate of a latent trait, DIF detection has been more recently formulated in the context of item response theory (IRT) models. The advantage of the IRT framework is that the latent trait is explicitly modeled as an unknown parameter to be inferred in the fitting process. Thissen et al. (1993) provide an overview of a set of methods that rely on fitting IRT models and then examining lack-of-fit statistics to assess the presence of DIF. These methods, however, require estimation of the latent trait and other model parameters (e.g., through maximum likelihood) on a likelihood surface which can be relatively flat due to the IRT model being highly parameterized. In such cases, the estimated latent trait parameters can be unreliable measures, even when evaluating likelihood-based quantities, and diagnostics based on these estimates can lead to overly optimistic conclusions.
To account for the uncertainty inherent in model inferences, several authors have explored DIF detection in IRT models within a Bayesian framework. One Bayesian approach is to determine the marginal posterior distribution of model parameters indicative of DIF. Wainer et al. (2007, ch 14), for example, suggest fitting an IRT model with a parameterization in which health outcomes depend explicitly on group membership—inferences about a group membership coefficient will reveal whether DIF has been detected for that item. The other main alternative has been to fit a single Bayesian model, and then perform posterior predictive checks (Gelman et al. 1996) as a means to diagnose item-level lack of model fit. Hoijtink (2001), for example, proposes examining the posterior predictive distribution of a standardized lack-of-fit statistic to compare against the statistic evaluated on the observed data. To examine person-specific fit diagnostics in IRT models, Glas and Meijer (2003) propose using posterior predictive checks on a variety of measures. Sinharay (2005) provides a more general discussion of posterior predictive checks for Bayesian IRT models beyond the context of DIF detection. These approaches have promise in allowing some freedom to choose a relevant lack-of-fit measure, but the motivation for choosing particular measures is not always compelling.
This paper proposes a diagnostic method for detecting DIF in a Bayesian IRT model that relies on examining the posterior distribution of an appropriately chosen measure. Our method shares similarities with the approach of posterior predictive checks in that we first fit a Bayesian IRT model to obtain the posterior distribution of all model parameters. We then construct a measure that directly addresses whether a conventional definition of differential item functioning has been satisfied, and subsequently summarize the posterior distribution of this measure. Allowing the measure to be a function of the latent health parameters permits the diagnostic to address both the uncertainty in model inferences and the increased flexibility to specify a measure that appropriately captures DIF. Our method, however, is not a posterior predictive check as our diagnostic is not averaged over the posterior predictive distribution.
The paper is organized as follows. We explain the construction of our DIF diagnostic in Sect. 2. The method is evaluated through a simulation analysis in Sect. 3. The approach is then applied to a study on detecting DIF in items between an English and Spanish version of a commonly used mental health survey, which is presented in Sect. 4. A discussion of the method and its limitations is presented in Sect. 5.
2 A Bayesian method for detecting DIF
Our approach to detecting DIF for item j involves two steps. First, we fit an IRT model that may adjust for covariates \(\mathbf{x}\), but does not adjust for DIF group membership g, within a Bayesian framework via Markov chain Monte Carlo (MCMC) simulation from the posterior distribution, and retain simulated values from the marginal posterior distribution of the \(\tilde{\theta}_i.\) Second, based on the results of the fitted model, we check whether the Y_{j} are conditionally independent of g given \(\tilde{\theta}.\) More specifically, for each simulated vector of health parameters \({\tilde{\varvec{\theta}}} = (\tilde{\theta}_1,\ldots,\tilde{\theta}_n),\) we calculate the p-value for a likelihood ratio test comparing a flexible, possibly non-parametric, regression model for Y_{j} as a function of \({\tilde{\varvec{\theta}}}\) and g to a smaller model for Y_{j} only as a function of \({\tilde{\varvec{\theta}}}.\) We assume that the choice of flexible models results in a likelihood ratio test statistic that is asymptotically χ^{2}-distributed, following classical theory. The average of these p-values across the simulated vectors of \({\tilde{\varvec{\theta}}}\) is a Monte Carlo estimate of the posterior mean p-value. Because each individual likelihood ratio statistic is constructed to have a p-value that is approximately uniform under the model that does not include g, the resulting posterior mean p-value is also calibrated to be approximately uniform. The reason is that the likelihood ratio statistic is being applied to the comparison of two flexible regressions, making it irrelevant that each Monte Carlo draw of \({\tilde{\varvec{\theta}}}\) is simulated from the marginal posterior distribution of the IRT model. This second step of the algorithm is applied repeatedly for each item in the test. We discuss these two separate steps of our approach in detail below.
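The second step can be sketched in code as follows. This is our own illustration, not the authors' implementation: it uses a low-order polynomial basis as a stand-in for a smoothing spline, a hand-rolled logistic fit in place of a GAM, and all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def _fit_logistic_loglik(X, y):
    """Return the maximized log-likelihood of a logistic regression of y on X."""
    def nll(beta):
        eta = X @ beta
        # stable negative log-likelihood: sum log(1 + exp(eta)) - y * eta
        return float(np.sum(np.logaddexp(0.0, eta) - y * eta))
    res = minimize(nll, np.zeros(X.shape[1]), method="BFGS")
    return -res.fun

def _basis(theta, df=4):
    """Polynomial basis in standardized theta, standing in for a spline basis."""
    t = (theta - theta.mean()) / theta.std()
    return np.column_stack([t ** k for k in range(df)])

def lr_pvalue(theta, y, g, df=4):
    """Chi-squared p-value comparing one flexible curve per group to a single curve."""
    B = _basis(theta, df)
    ll_small = _fit_logistic_loglik(B, y)                             # Y ~ s(theta)
    ll_big = _fit_logistic_loglik(np.hstack([B, B * g[:, None]]), y)  # Y ~ s(theta) by group
    return float(chi2.sf(2.0 * (ll_big - ll_small), df))

def posterior_mean_pvalue(theta_draws, y, g, df=4):
    """Monte Carlo estimate of the posterior mean p-value over MCMC draws of theta."""
    return float(np.mean([lr_pvalue(t, y, g, df) for t in theta_draws]))
```

The key design point mirrors the text: the same flexible family is used in both the null and alternative regressions, so the χ² calibration of each per-draw p-value does not depend on which posterior draw of \({\tilde{\varvec{\theta}}}\) is plugged in.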
Bayesian fitting of IRT models is becoming increasingly commonplace arguably due to the increased ease of implementation of the fitting algorithms. In a pair of papers, Patz and Junker (1999a, b) lay out a general approach for implementing an MCMC algorithm for posterior sampling in the context of general IRT models. Other recent examples of Bayesian IRT modeling include Bradlow et al. (1999), Janssen et al. (2000), Beguin and Glas (2001), Fox and Glas (2001), Johnson and Sinharay (2005), and Wainer et al. (2007, ch 14). Rather than determining analytically the conditional posterior distributions necessary for MCMC simulation, publicly available Bayesian software such as WinBUGS (Spiegelhalter et al. 2003) and OpenBUGS (Thomas et al. 2006) allows for straightforward implementation of many IRT models. Recent examples of the use of WinBUGS in fitting IRT models include May (2006) who uses WinBUGS to fit multilevel IRT models, and Kang and Cohen (2007) who use WinBUGS in comparing methods of fit to various IRT models.
Our approach can be contrasted with that of Wainer et al. (2007, ch 14) who also develop a method of diagnosing DIF in a Bayesian model. To identify whether item j evidences DIF, their approach is essentially equivalent to fitting a Bayesian IRT model in (2) in which the parameter β_{j} is replaced by (β_{j}(1−g_{i}) + β_{j}^{*}g_{i}), where β_{j} and β_{j}^{*} are the item difficulty parameters for the reference and focal groups, respectively. Posterior inferences about P(|β_{j} − β_{j}^{*}| > 0) provide evidence of DIF for item j. The authors note that their procedure is computationally intensive, as separate models need to be fit for each DIF analysis of an item. They suggest a screening procedure in which the Mantel–Haenszel test identifies candidate items for DIF study under the Bayesian procedure.
Our approach is also similar to that of Hoijtink (2001), but can also be contrasted in several respects. The approach of Hoijtink more closely follows Gelman et al. (1996) in that the diagnostic DIF statistic (which itself is a standardized fit measure) is a function of observables, and that the posterior predictive p-value is computed based on comparing the statistic evaluated on the observed data to the posterior distribution of the statistic from MCMC posterior predictive simulations.
Our method, in contrast, has several features that make it an appealing alternative to these two. First, unlike the Wainer et al. method, our approach requires fitting only one IRT model rather than one model per item, so that the Bayesian model fitting computation is confined to one analysis. Second, while Wainer et al. consider an additive effect (on the logit scale) of g_{i}, and Hoijtink proposes a fit measure based on a crude surrogate of the latent trait (namely, for detecting DIF on item j, the sum of the scores for all other items), our method recognizes the possibility of a more complicated relationship, for example an interaction between the \(\tilde{\theta}_i\) and g_{i} through a non-parametric relationship with Y_{ij}. Third, our method does not require specifying the focal and reference groups prior to fitting the IRT model. Once posterior simulations of the \(\tilde{\theta}_i\) have been obtained, a number of DIF analyses can be performed depending on dichotomies of interest. Finally, our measure is self-calibrated to have an interpretation as following a uniform distribution, so that the extra computation usually needed to obtain a reference distribution in a posterior predictive check is unnecessary.
3 Examination of method via simulation
To evaluate our approach in detecting DIF, we performed a small simulation experiment. Because each IRT model fit with MCMC posterior simulation and subsequent posterior mean p-value calculation can be computationally prohibitive, our simulation analyses are limited and intend only to provide a modest study of how various factors influence the ability of our method to assess DIF.
Three factors were varied in generating the response data:

1. the number of respondents, N (set to 150, 300, or 900);
2. the number of items, J (set to either 10 or 30);
3. the fraction of items, F, generated to exhibit DIF (set to either 10% or 20%).
Once the response data were generated, we then fit the 2-parameter logistic IRT model in (2) but without the g_{i} as part of the model specification. We assumed a prior distribution that factored into independent densities with components log α_{j} ∼ N(0, σ^{2}), β_{j} ∼ N(0, 100), and θ_{i} ∼ N(0, 1); such a constraint on the θ_{i} has been used previously, as in Wainer et al. (2007, ch 14). We also assumed a uniform prior density on σ between 0 and 100; this type of prior density for standard deviations in hierarchical models has been recommended by Gelman and Hill (2007). Each MCMC sampler, implemented in OpenBUGS (Thomas et al. 2006) called from within R (R Development Core Team 2008) using the R2WinBUGS package, was run with two parallel chains consisting of a burn-in period of 2000 iterations, retaining a subsequent 1000 simulated sets of θ_{i} from each chain for DIF analysis. From initial exploration, 2000 iterations appeared to be a sufficient number for the sampler to converge. Then, for each j, and the vector of the θ_{i} from iteration m, we computed a likelihood ratio χ^{2}-based p-value comparing the fit of a smoothing spline model of Y_{ij} regressed on the simulated θ_{i}, and the fit of a smoothing spline model of the Y_{ij} regressed on the interaction of θ_{i} and g_{i} (essentially one smoothing spline for g_{i} = 0 and a second for g_{i} = 1). Determining the p-value for this comparison is described in Hastie and Tibshirani (1990), and implemented with the “gam” function in R. The average of the 2000 p-values is the Monte Carlo estimate of the posterior mean p-value.
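For concreteness, the data-generating step might look like the following sketch. This is our own illustration, not the paper's code; in particular, we assume one plausible DIF mechanism (an additive difficulty shift for focal-group members on the flagged items), and all names are ours.

```python
import numpy as np

def simulate_2pl(N, J, frac_dif, dif_shift=0.6, seed=0):
    """Generate 0/1 responses from a 2-parameter logistic IRT model.

    A fraction `frac_dif` of items receive an extra difficulty shift for
    focal-group members (g = 1), so those items exhibit uniform DIF.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, N)            # latent health/proficiency
    alpha = rng.lognormal(0.0, 0.3, J)         # discriminations (positive)
    beta = rng.normal(0.0, 1.0, J)             # difficulties
    g = (rng.random(N) < 0.5).astype(float)    # group membership
    dif_items = rng.choice(J, size=max(1, int(round(frac_dif * J))), replace=False)
    shift = np.zeros(J)
    shift[dif_items] = dif_shift
    # logit of a correct/positive response; DIF enters only through `shift`
    logit = alpha[None, :] * (theta[:, None] - beta[None, :]) - shift[None, :] * g[:, None]
    Y = (rng.random((N, J)) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    return Y, theta, g, np.sort(dif_items)
```

Each simulation cell of the experiment would call this once per replication with the corresponding (N, J, F) setting.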
Summaries of the simulation experiment
Columns 4–6 summarize the DIF p-values (mean, 10th and 90th percentiles); columns 7–10 report true and false positive rates.

N | J | F | Mean | 10% | 90% | TPR | FPR | TPR(*) | FPR(*)
---|---|---|---|---|---|---|---|---|---
150 | 10 | 0.1 | 0.1527 | 0.0120 | 0.3040 | 0.5000 | 0.0000 | 0.0000 | 0.0000 |
150 | 10 | 0.2 | 0.0874 | 0.0016 | 0.1955 | 0.5500 | 0.0125 | 0.2000 | 0.0000 |
150 | 30 | 0.1 | 0.1380 | 0.0018 | 0.3016 | 0.5172 | 0.0444 | 0.1034 | 0.0037 |
150 | 30 | 0.2 | 0.1537 | 0.0041 | 0.5423 | 0.4833 | 0.0542 | 0.0833 | 0.0042 |
300 | 10 | 0.1 | 0.0960 | 0.0000 | 0.2097 | 0.7000 | 0.0000 | 0.5000 | 0.0000 |
300 | 10 | 0.2 | 0.0565 | 0.0002 | 0.1891 | 0.7500 | 0.0125 | 0.4500 | 0.0125 |
300 | 30 | 0.1 | 0.0290 | 0.0001 | 0.0796 | 0.8667 | 0.0222 | 0.5333 | 0.0037 |
300 | 30 | 0.2 | 0.0290 | 0.0001 | 0.0807 | 0.8000 | 0.0417 | 0.2833 | 0.0042 |
900 | 10 | 0.1 | 0.0007 | 0.0000 | 0.0010 | 1.0000 | 0.0444 | 0.9000 | 0.0000 |
900 | 10 | 0.2 | 0.0001 | 0.0000 | 0.0001 | 1.0000 | 0.0625 | 1.0000 | 0.0125 |
900 | 30 | 0.1 | 0.0001 | 0.0000 | 0.0002 | 1.0000 | 0.0333 | 0.9667 | 0.0000 |
900 | 30 | 0.2 | 0.0023 | 0.0000 | 0.0001 | 0.9833 | 0.1208 | 0.9833 | 0.0042 |
4 Application to a mental health survey
We applied our method to examine DIF between an English and Spanish version of the Behavior and Symptom Identification Scale (BASIS-24), a commonly used mental health self-report instrument, for two Latino cohorts enrolled in mental health or substance abuse programs. The original 32-item BASIS instrument was developed in 1984, and was designed to be used as a mental health status measure from a patient’s perspective for the outcome of mental health treatment (Eisen et al. 1994). Eisen et al. (2004) developed a revised instrument, the BASIS-24, containing 24 items, which is the focus of the current study. Reliability and validity of the BASIS-24 among Latinos was verified in Eisen et al. (2006) for the English version of the instrument, and in Cortés et al. (2007) and Eisen et al. (2009) for the Spanish translation.
Summaries of patient sample stratified by cohort (entries are percentages of each cohort)

Sample characteristic | Spanish cohort (n = 594) | English cohort (n = 370)
---|---|---
Patient status | ||
Inpatient | 48 | 30 |
Outpatient | 52 | 70 |
Age (years) | ||
Age < 25 | 15 | 22 |
25 ≤ Age < 35 | 24 | 39 |
35 ≤ Age < 45 | 25 | 24 |
45 ≤ Age < 55 | 25 | 13 |
55 ≤ Age | 10 | 3 |
Gender | ||
Male | 44 | 50 |
Female | 56 | 50 |
Educational level | ||
4th grade–8th grade | 31 | 7 |
8th grade–12th grade | 30 | 27 |
High school graduate | 14 | 32 |
Some college | 18 | 23 |
4-year college graduate | 6 | 11 |
Not recorded | 2 | 1 |
Primary diagnosis | ||
Schizophrenia/Schizoaffective disorder | 22 | 11 |
Depressive disorder | 43 | 20 |
Bipolar disorder | 9 | 7 |
Alcohol/drug use disorder | 6 | 24 |
Anxiety disorder and others | 14 | 11 |
Not recorded | 6 | 28 |
BASIS-24 items in English and Spanish
English | Spanish |
---|---|
During the past week, how much difficulty did you have... | Durante la semana pasada, ¿Qué tan difícil fue para usted... |
1. Managing your day-to-day life? | 1. hacerse cargo de su vida diaria? |
2. Coping with problems in your life? | 2. enfrentar los problemas de su vida? |
3. Concentrating? | 3. concentrarse? |
During the past week, how much of the time did you... | Durante la semana pasada, ¿Con cuánta frecuencia... |
4. Get along with people in your family? | 4. se llevó bien con sus familiares? |
5. Get along with people outside your family? | 5. se llevó bien con personas que no son familiares suyos? |
6. Get along well in social situations? | 6. se llevó bien en situaciones sociales? |
7. Feel close to another person? | 7. se sintió cercano(a) a alguna otra persona? |
8. Feel like you had someone to turn to if you needed help? | 8. sintió que tenía alguien con quien contar si necesitaba ayuda? |
9. Feel confident in yourself? | 9. se sintió seguro(a) de sí mismo(a)? |
10. Feel sad or depressed? | 10. se sintió triste o deprimido(a)? |
11. Think about ending your life? | 11. pensó en quitarse la vida? |
12. Feel nervous? | 12. se sintió nervioso(a)? |
During the past week, how often did you... | Durante la semana pasada, ¿Qué tan a menudo... |
13. Have thoughts racing through your head? | 13. pensó muchas cosas muy rápido y todas a la vez? |
14. Think you had special powers? | 14. pensó que tenía poderes especiales que otras personas no tienen? |
15. Hear voices or see things? | 15. oyó voces o vio cosas que otras personas no oyeron o vieron? |
16. Think people were watching you? | 16. creyó que las personas lo/la estaban vigilando? |
17. Think people were against you? | 17. creyó que la gente estaba en contra suya? |
18. Have mood swings? | 18. tuvo cambios inesperados de ánimo? |
19. Feel short-tempered? | 19. se sintió irritable? |
20. Think about hurting yourself? | 20. pensó hacerse daño? |
21. Did you have an urge to drink alcohol or take street drugs? | 21. tuvo muchas ganas de tomar alcohol o de usar drogas? |
22. Did anyone talk to you about your drinking or drug use? | 22. alguien le dijo algo sobre su uso de alcohol o drogas? |
23. Did you try to hide your drinking or drug use? | 23. trató de esconder su uso de alcohol o drogas? |
24. Did you have problems from your drinking or drug use? | 24. tuvo problemas debido a su uso de alcohol o drogas? |
Because our IRT model was highly parameterized, several modeling restrictions and simplifications were made before implementing the MCMC posterior sampler. First, as in the simulation analyses, we assumed exchangeable prior density components, \(\tilde{\theta}_i \sim N(0, 1).\) Second, to properly identify covariate effects and to avoid unnecessary correlations among the covariate parameters, all individual covariate effects were constrained to sum to 0. Furthermore, the conditional posterior distribution of each individual β_{jk} given the remaining parameters was constrained to be sampled from a range limited by the adjacent parameter values, β_{j,k-1} and β_{j,k+1}. Diffuse but proper prior density components were assumed for all model parameters.
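The ordered-threshold constraint on the β_{jk} can be honored inside a Gibbs-type sampler by drawing each threshold from its conditional distribution truncated to the interval between its neighbors. A generic inverse-CDF sketch (our own illustration; the actual conditional distribution in the model need not be normal):

```python
import numpy as np
from scipy.stats import norm

def draw_truncated_normal(mu, sigma, lo, hi, rng):
    """Draw from N(mu, sigma^2) restricted to (lo, hi) via inverse-CDF sampling.

    In the IRT sampler, (lo, hi) would be the neighboring thresholds
    (beta_{j,k-1}, beta_{j,k+1}) so that the beta_{jk} remain ordered.
    """
    a = norm.cdf((lo - mu) / sigma)
    b = norm.cdf((hi - mu) / sigma)
    u = a + (b - a) * rng.random()   # uniform on the truncated CDF range
    return mu + sigma * norm.ppf(u)
```

BUGS-family software imposes the same constraint declaratively (e.g., via interval censoring of the prior), which is one reason the model is straightforward to implement there.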
An MCMC sampler for the IRT model was implemented in OpenBUGS. Two parallel samplers were run for a burn-in period of 2000 iterations, after which the samplers were diagnosed to have converged through the examination of trace plots of various model parameters and through the examination of diagnostics such as the potential scale reduction statistic (Gelman and Rubin 1992). Simulated values of the \(\tilde{\theta}_i\) were saved for the next 1,000 iterations in each chain, resulting in 2,000 simulated sets of parameter values. For the subsequent discussion, let \(\tilde{\theta}_i^{(m)}\) denote the m-th iteration of \(\tilde{\theta}_i, m=1,\ldots,2000,\) from the MCMC sampler.
Results of the DIF analyses of the BASIS-24 responses. Columns 2–4 give posterior mean p-values and columns 5–7 give likelihood-based p-values, each under modeling Scenarios A–C. Boxed values fall below the Bonferroni cutoff 0.05/24 ≈ 0.0021.

BASIS-24 item | Posterior A | Posterior B | Posterior C | Likelihood A | Likelihood B | Likelihood C
---|---|---|---|---|---|---
1. | 0.3211 | 0.2064 | 0.3054 | 0.1234 | 0.1132 | 0.1135 |
2. | 0.0089 | 0.0191 | 0.0031 | 0.5613 | 0.0109 | 0.0044 |
3. | 0.0523 | 0.1861 | 0.1724 | 0.0718 | 0.1866 | 0.3300 |
4. | 0.0454 | 0.0144 | \(\fbox{0.0013}\) | \(\fbox{0.0006}\) | \(\fbox{0.0012}\) | \(\fbox{0.0003}\) |
5. | 0.0062 | 0.3122 | 0.1024 | 0.0107 | 0.1981 | 0.0237 |
6. | 0.0140 | 0.3386 | 0.2127 | \(\fbox{0.0003}\) | \(\fbox{0.0015}\) | 0.0327 |
7. | \(\fbox{0.0005}\) | 0.0605 | 0.0101 | \(\fbox{0.0000}\) | 0.0053 | \(\fbox{0.0006}\) |
8. | 0.0283 | 0.5317 | 0.3751 | 0.0875 | 0.8029 | 0.6692 |
9. | \(\fbox{0.0011}\) | 0.0374 | 0.0198 | \(\fbox{0.0009}\) | 0.0772 | 0.0277 |
10. | 0.3010 | 0.2636 | 0.1038 | 0.0702 | 0.0578 | 0.0754 |
11. | 0.4968 | 0.6595 | 0.6353 | 0.0088 | 0.0420 | 0.2649 |
12. | 0.0395 | 0.1104 | 0.0393 | 0.0504 | 0.0866 | 0.0872 |
13. | 0.1563 | 0.2813 | 0.1734 | 0.1043 | 0.5369 | 0.2123 |
14. | 0.1141 | 0.0618 | 0.0653 | 0.6637 | 0.1106 | 0.1017 |
15. | \(\fbox{0.0000}\) | \(\fbox{0.0000}\) | \(\fbox{0.0000}\) | \(\fbox{0.0000}\) | \(\fbox{0.0000}\) | \(\fbox{0.0000}\) |
16. | \(\fbox{0.0000}\) | 0.0162 | 0.0025 | \(\fbox{0.0000}\) | 0.2330 | 0.0282 |
17. | 0.0030 | 0.0127 | \(\fbox{0.0012}\) | 0.0039 | \(\fbox{0.0007}\) | 0.0022 |
18. | 0.1946 | 0.0695 | 0.0113 | 0.0444 | 0.0047 | \(\fbox{0.0019}\) |
19. | \(\fbox{0.0001}\) | \(\fbox{0.0000}\) | \(\fbox{0.0000}\) | 0.0086 | \(\fbox{0.0000}\) | \(\fbox{0.0000}\) |
20. | \(\fbox{0.0000}\) | 0.1063 | 0.0156 | \(\fbox{0.0000}\) | 0.0054 | 0.0331 |
21. | 0.0034 | 0.1996 | 0.1351 | \(\fbox{0.0014}\) | 0.0600 | 0.0179 |
22. | 0.2738 | 0.2567 | 0.2220 | 0.4978 | 0.2315 | 0.2890 |
23. | 0.1414 | 0.3506 | 0.2861 | 0.8267 | 0.0897 | 0.1657 |
24. | \(\fbox{0.0000}\) | 0.0068 | \(\fbox{0.0003}\) | \(\fbox{0.0000}\) | \(\fbox{0.0002}\) | \(\fbox{0.0000}\) |
In addition to the three analyses above, we performed three likelihood-based analyses that paralleled the Bayesian analyses. Our likelihood analyses for item j involved replacing \(\tilde{\theta}_i^{(m)}\) with \(\bar{Y}_{i(-j)} = {\frac{1}{J-1}}\sum_{\ell \neq j} Y_{i\ell}\) in each instance. Thus, each of our likelihood-based p-values was the result of comparing a model that regressed Y_{ij} on \(\bar{Y}_{i(-j)}\) and g_{i} to one that regressed Y_{ij} on \(\bar{Y}_{i(-j)}\) alone, thereby assessing the significance of g_{i}. The use of \(\bar{Y}_{i(-j)}\) as a proxy for the latent measure is conventional, as in Junker (1993), Zhang and Stout (1999), and Hoijtink (2001). As our likelihood analyses parallel the Bayesian analyses, we examine the results of three sets of models depending on whether both Y_{ij} and \(\bar{Y}_{i(-j)}\) are treated as quantitative, whether both are treated as categorical variables, or whether Y_{ij} is categorical while \(\bar{Y}_{i(-j)}\) is quantitative. The results of these analyses are presented in the final three columns of Table 4.
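The rest-score substitution is simple to sketch. Below, a Gaussian-model likelihood ratio test stands in for the quantitative-by-quantitative comparison (in the spirit of Scenario A); the function names and the working-model choice are our assumptions, not the paper's.

```python
import numpy as np
from scipy.stats import chi2

def rest_score(Y, j):
    """Mean of all items except j -- the conventional proxy for the latent trait."""
    J = Y.shape[1]
    return (Y.sum(axis=1) - Y[:, j]) / (J - 1)

def lr_pvalue_gaussian(y, x, g):
    """LR p-value for adding g to a regression of y on x (Gaussian working model)."""
    n = len(y)
    X0 = np.column_stack([np.ones(n), x])       # y ~ x
    X1 = np.column_stack([np.ones(n), x, g])    # y ~ x + g
    rss0 = np.sum((y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]) ** 2)
    rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
    stat = n * np.log(rss0 / rss1)              # LR statistic for nested Gaussian models
    return float(chi2.sf(stat, df=1))
```

The only change from the Bayesian diagnostic is that the observed rest score replaces each posterior draw of the latent health parameter, so a single p-value results per item rather than a posterior mean.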
The likelihood-based and posterior mean p-values in Table 4 reveal that the Bayesian diagnostic tends to be slightly more conservative than the likelihood-based diagnostic, as the latter tends to produce smaller values. Treating p-values that were significant at the 0.05 level, accounting for a Bonferroni adjustment of 24 items (that is, p-values less than 0.05/24 ≈ 0.0021), as evidence of DIF between the English and Spanish versions of the BASIS-24, a greater number of items were flagged by the likelihood-based method. These p-values are highlighted in Table 4. Not surprisingly, Scenario B generally results in the largest p-values among the three modeling scenarios because both the Y_{ij} and the health measure are treated as categorical variables, while Scenario A tends to produce the most significant p-values as both the Y_{ij} and the health measure are modeled as quantitative variables. Scenario B of the likelihood analyses corresponds to the most common procedure involving log-linear models, and results in identifying six items exhibiting DIF. We suggest that Scenario C of the Bayesian approach, which models the Y_{ij} as multinomial and incorporates the effect of \(\tilde{\theta}_i\) as a smoothing spline relationship, is the most consistent with modeling assumptions. This particular posterior mean p-value identifies five items as evidencing DIF, and these are a subset of the six identified in Scenario B of the likelihood analysis. It is interesting to note that the BASIS-24 item that is identified in the likelihood analysis to exhibit DIF but not in the Bayesian analysis (item 6) has markedly different p-values under the two approaches.
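The flagging rule reduces to a one-liner; with 24 items the cutoff is 0.05/24 ≈ 0.00208 (the function name is ours):

```python
def flag_dif(pvals, alpha=0.05):
    """Bonferroni rule: return indices of items whose p-value is below alpha/m."""
    cutoff = alpha / len(pvals)
    return [i for i, p in enumerate(pvals) if p < cutoff]
```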
Considering the five items exhibiting DIF, several points are noteworthy. Items 15 and 17 (“hear voices or see things,” and “think people were watching you”) are both part of the psychotic symptoms domain. The DIF found for some psychotic symptoms is consistent with other reports in the literature suggesting that psychotic symptoms such as hearing voices or seeing things may reflect Latino cultural or spiritual beliefs rather than signs and symptoms of psychotic disorders (Geltman et al. 2004; Guarnaccia et al. 1992; Vega et al. 2006). One item exhibiting DIF (item 19, feel short-tempered) proved especially difficult to translate. There is no Spanish equivalent to the English term “short-tempered.” The closest approximation was the Spanish word “irritable,” which back-translates to the English “irritable” rather than “short-tempered.” Consequently, DIF may have occurred due to the inadequacy of the translation of this term. Reasons for DIF on the remaining two items (getting along with people in your family and having problems from drinking or drug use) are unclear, as there appeared to be no difficulty with translation, and no obvious cultural influences on the understanding of these areas. Further research is needed to determine whether DIF on these items can be accounted for by other factors such as acculturation, education or other variables.
5 Discussion
The method described in this paper to detect DIF in multi-item health surveys is both a flexible and computationally feasible approach compared with alternative methods. Our method relies on fitting a single Bayesian IRT model and saving Monte Carlo simulated health parameters from the fit, followed by performing a separate analysis that examines whether the DIF grouping variable is predictive of survey responses beyond the health parameters. Each of these steps can be implemented using standard statistical software. An attractive feature of our approach is that it explicitly incorporates the uncertainty in the latent health measure in detecting DIF through repeated evaluations of the likelihood ratio p-value averaged over the Monte Carlo simulated vectors of \({\tilde{\varvec{\theta}}}.\)
An important difference between our approach and more conventional approaches is that, because we fit an IRT model before carrying out DIF diagnosis, inferences about the latent health status are formed using information from all item responses, including the item on which DIF detection is being performed. This is by construction: our method allows DIF to be examined on a variety of groupings after the IRT model has been fit. At worst, incorporating information from all responses in the IRT model might result in slightly more conservative inferences about DIF for each item, but this small loss in efficiency is offset by the gain in computational simplicity of fitting only one IRT model. However, based on the simulation analyses, it appears that larger data sets combined with a large fraction of items exhibiting DIF can increase the false positive rate of DIF detection, because the health parameters are not inferred correctly.
Another notable feature of our two-step approach is that the relationship of the response, Y_{ij}, and the latent health measure, \(\tilde{\theta}_i,\) in the IRT model is patently different from the more flexible relationship assumed in the posterior mean p-value computation. The reason for this approach is that detecting DIF is a diagnostic procedure that uses the \(\tilde{\theta}_i\) as a proxy for latent health rather than specifically as an IRT model parameter, so that the approach to assess conditional independence between the response and DIF grouping can treat \(\tilde{\theta}_i\) in a flexible relationship. In this manner, our approach has connections with the Mantel–Haenszel non-parametric approach.
Because our method separates model fitting and DIF assessment, many extensions to our approach are straightforward to implement. For example, assessing DIF as the comparison among more than two groups (i.e., treating g_{i} as a categorical variable with an arbitrary number of levels) poses no difficulties, as the likelihood ratio computation would simply incorporate g_{i} as a categorical variable appropriately. Differential test functioning, in which some or all items of a multi-item survey or test are combined as a weighted combination (or simply as an unweighted sum) to produce clinically meaningful survey summaries, also poses no difficulties for our approach. After the IRT model is fit to the response data as usual, the likelihood ratio comparison of non-parametric regressions would then involve replacing individual items Y_{j} by subscale scores or entire survey scores, and posterior mean p-values would then be computed in the usual manner. Our method could also be extended to multi-dimensional IRT models (see Gardner et al. 2002, for a multidimensional extension of the Samejima model), in which θ_{i} is a vector-parameter; MCMC-simulated draws of the θ_{i} are retained, and the posterior mean p-value is computed as the comparison of the two non-parametric multiple regressions of the Y_{ij} on the θ_{i} alone and with the g_{i}.
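For the multigroup extension, the only change in the likelihood ratio step is dummy-coding g; with G groups the interaction model then gains (G − 1) extra degrees of freedom per basis column. A sketch (names ours):

```python
import numpy as np

def group_design(labels):
    """Dummy-code a categorical grouping (first sorted level as reference).

    Returns the (n, G-1) indicator matrix and G-1, the number of extra
    degrees of freedom contributed per basis column when the grouping
    enters the flexible regression.
    """
    levels = sorted(set(labels))
    D = np.column_stack([np.array([x == lev for x in labels], dtype=float)
                         for lev in levels[1:]])
    return D, len(levels) - 1
```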
Several limitations of our approach are worth noting. Given the great flexibility in choosing a particular model to assess conditional independence (the choice of categorizing variables, the particular smoother for the \(\tilde{\theta}_i\)), the conclusions about items exhibiting DIF may depend heavily on that choice. In the BASIS-24 analysis, Table 4 shows that treating the responses as quantitative usually yields much lower posterior mean p-values than the categorical response models. The two categorical response DIF analyses have a greater degree of agreement in their conclusions, but in some cases the p-values can differ by a factor of 10 or more (e.g., BASIS-24 items 4 and 17). Also, our method (as with most other IRT approaches) relies heavily on the IRT model being a reasonably correct representation of the data, and being properly specified (e.g., correctly incorporating covariate information, correct parameterization of discrimination and difficulty parameters, etc.). In particular, most DIF diagnostics, including ours, assume that when evaluating a specific item, the other items are free of DIF. This is unlikely ever to hold exactly, so a tacit assumption is that the number of items for which DIF may be problematic is minimal. On the positive side, model misspecification will likely lead to more uncertain posterior inferences about the \(\tilde{\theta}_i,\) so that the diagnostic analyses using posterior samples will in turn lead to insufficient evidence of DIF. Thus our method is protective against false positives in the event that IRT models are inappropriately specified. But with an IRT model that has undergone appropriate model diagnosis and criticism, our method for detecting DIF is worthy of consideration.
Acknowledgments
This research was supported by Grant R01 MH58240 from the National Institute of Mental Health and by the Veterans Administration Health Services Research and Development program.