Assessing the equivalence of Web-based and paper-and-pencil questionnaires using differential item and test functioning (DIF and DTF) analysis: a case of the Four-Dimensional Symptom Questionnaire (4DSQ)

Purpose Many paper-and-pencil (P&P) questionnaires have been migrated to electronic platforms. Differential item and test functioning (DIF and DTF) analysis constitutes a superior research design to assess measurement equivalence across modes of administration. The purpose of this study was to demonstrate an item response theory (IRT)-based DIF and DTF analysis to assess the measurement equivalence of a Web-based version and the original P&P format of the Four-Dimensional Symptom Questionnaire (4DSQ), measuring distress, depression, anxiety, and somatization. Methods The P&P group (n = 2031) and the Web group (n = 958) consisted of primary care psychology clients. Unidimensionality and local independence of the 4DSQ scales were examined using IRT and Yen’s Q3. Bifactor modeling was used to assess the scales’ essential unidimensionality. Measurement equivalence was assessed using IRT-based DIF analysis using a 3-stage approach: linking on the latent mean and variance, selection of anchor items, and DIF testing using the Wald test. DTF was evaluated by comparing expected scale scores as a function of the latent trait. Results The 4DSQ scales proved to be essentially unidimensional in both modalities. Five items, belonging to the distress and somatization scales, displayed small amounts of DIF. DTF analysis revealed that the impact of DIF on the scale level was negligible. Conclusions IRT-based DIF and DTF analysis is demonstrated as a way to assess the equivalence of Web-based and P&P questionnaire modalities. Data obtained with the Web-based 4DSQ are equivalent to data obtained with the P&P version. Electronic supplementary material The online version of this article (10.1007/s11136-018-1816-5) contains supplementary material, which is available to authorized users.


Introduction
Many questionnaires have been developed and validated as paper-and-pencil (P&P) questionnaires. However, over the past few decades, many of these questionnaires have increasingly been administered using electronic formats, in particular as Web-based questionnaires [1]. Advantages of data collection over the Internet include reduced administrative burden, prevention of item nonresponse, avoidance of data entry and coding errors, automatic application of skip patterns, and in many cases cost savings [1]. A Web-based questionnaire that has been adapted from a P&P instrument ought to produce data that are equivalent to the original P&P version [2]. Measurement equivalence means that a Web-based questionnaire measures the same construct in the same way as the original P&P questionnaire, and that, consequently, results obtained with a Web-based questionnaire can be interpreted in the same way as those obtained using the original P&P questionnaire. However, migration of a well-established P&P questionnaire to a Web-based platform does not guarantee that the Web-based instrument preserves the measurement properties of the original P&P questionnaire. Necessary modifications in layout, instructions, and sometimes item wording and response options might alter item response behavior. Therefore, it is recommended that measurement equivalence between a Web-based questionnaire and the original P&P questionnaire be supported by appropriate evidence [2]. Four reviews of such equivalence studies suggested that, in most instances, electronic questionnaires and P&P questionnaires produce equivalent results [3][4][5][6]. However, this is not always the case [4]. In this paper, we will demonstrate the use of modern psychometric methods to assess the equivalence across two modalities of a questionnaire.
This is illustrated by analyzing the Web-based and P&P versions of the Four-Dimensional Symptom Questionnaire (4DSQ), a self-report questionnaire measuring distress, depression, anxiety, and somatization.
In 2009, the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) electronic patient-reported outcomes (ePRO) Good Research Practices Task Force published recommendations on the evidence needed to support measurement equivalence between electronic and paper-based patient-reported outcome (PRO) measures [2]. The task force specifically recommended two types of study designs, the randomized parallel groups design and the randomized crossover design. In the former design, participants are randomly assigned to one of two study arms in which they complete either the P&P PRO or the corresponding ePRO. Mean scores can then be compared between groups. This is a fairly weak design to assess measurement equivalence [7]. In the latter design, participants are randomly assigned to one of two study arms in which they either first complete the P&P PRO and then the ePRO or the other way around. Then, in addition to comparing mean scores, the correlation between the P&P score and the ePRO score can be calculated. The correlation, however, is not very informative about the true extent of equivalence because of measurement error and retest effects. Measurement error attenuates the correlation, making it difficult to assess the true equivalence. Retest effects may further aggravate the problem. Retest effects are thought to be due to memory effects and specific item features eliciting the same response in repeated measurements [8]. Retest effects are assumed to diminish with longer intervals between measurements. However, longer intervals carry the risk of the construct of interest changing in between measurements, leading to underestimation of the true correlation.
The research designs discussed above assess only a small aspect of true measurement equivalence because they fail to address equivalence of item-level responses [7,8]. Contemporary approaches to measurement equivalence employ differential item functioning (DIF) analysis [9]. Addressing equivalence of item-level information, DIF analysis has been used extensively to assess measurement equivalence across different age, gender, education, or ethnicity groups (e.g., [10]), or to assess the equivalence of different translations of a questionnaire (e.g., [11]). Whereas DIF analysis dates back to at least the 1980s [12], the method is relatively new in mode of administration equivalence research. In a non-systematic search, we identified only a dozen such studies (e.g., [7,8,13,14,15]). The ISPOR ePRO Good Research Practices Task Force report briefly mentioned DIF analysis as 'another approach' without giving the method much attention [2]. The recent meta-analysis by Rutherford et al. did not include any studies using DIF analysis [6].
The idea behind DIF analysis is that responses to the items of a questionnaire reflect the underlying dimension (or latent trait) that the questionnaire intends to measure, and that two versions of a questionnaire are equivalent when the corresponding items demonstrate the same item-trait relationships. There are various approaches to DIF analysis including non-parametric (Mantel-Haenszel/standardization) [16,17] and parametric approaches (ordinal logistic regression, item response theory, and structural equation modeling) [18][19][20]. In the present paper, we demonstrate the use of DIF analysis within the item response theory (IRT) framework by assessing measurement equivalence across a Web-based and the original P&P versions of the Four-Dimensional Symptom Questionnaire (4DSQ).

Study samples and design
DIF analysis compares the measurement properties of an instrument across two groups, usually referred to as 'reference group' and 'focal group.' In the present study, both groups consisted of clients, aged 18-80, from primary care psychology practices. The reference group, in which P&P 4DSQ data had been collected between December 2002 and February 2013, consisted of 2031 clients from a single large practice, whereas the focal group, in which Web-based 4DSQ data had been collected between April 2011 and September 2017, comprised 958 clients from 21 practices. In both groups, the data had been collected in the context of routine care.

Measure
The Four-Dimensional Symptom Questionnaire (4DSQ) is a 50-item self-report questionnaire measuring the four most common dimensions of psychopathological and psychosomatic symptoms in primary care settings (see Online Resource 1) [21]. The distress scale (16 items) aims to measure the kind of symptoms people experience when they are 'stressed' as a result of high demands, psychosocial difficulties, daily hassles, life events, or traumatic experiences [22]. The depression scale (six items) measures symptoms that are relatively specific to depressive disorder, notably, anhedonia and negative cognitions [23][24][25]. The anxiety scale (12 items) measures symptoms that are relatively specific to anxiety disorder [25][26][27]. The somatization scale (16 items) measures symptoms of somatic distress and somatoform disorder [28,29]. The 4DSQ employs a time-frame reference of 7 days. The items are answered on a 5-point frequency scale from 'no' to 'very often or constantly.' In order to calculate sum scores, the responses are coded on a 3-point scale: 'no' (0 points), 'sometimes' (1 point), 'regularly,' 'often,' and 'very often or constantly' (2 points) [21]. Collapsing the highest response categories ensures that relatively more weight is put on the number of symptoms experienced than on their perceived frequency. It also prevents the occurrence of sparsely filled, or even empty, response categories, which might cause estimation problems with various statistical procedures.
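The scoring rule described above can be illustrated with a minimal Python sketch. The function and variable names are illustrative only and are not part of the 4DSQ materials; the example answers are invented.

```python
# Recoding of the five response options to the 3-point scoring scale (0-2).
RECODE = {
    "no": 0,
    "sometimes": 1,
    "regularly": 2,
    "often": 2,
    "very often or constantly": 2,
}

def scale_sum_score(responses):
    """Sum the recoded (0-2) item scores of one respondent on one scale."""
    return sum(RECODE[r] for r in responses)

# Invented answers to the six items of the depression scale.
answers = ["no", "sometimes", "often", "no",
           "very often or constantly", "regularly"]
print(scale_sum_score(answers))  # 0 + 1 + 2 + 0 + 2 + 2 = 7
```

Because the three highest options all count as 2 points, a 16-item scale has a maximum sum score of 32 regardless of how frequent the symptoms are reported to be.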
The four-dimensional factor structure of the 4DSQ has been confirmed in different samples [21,30]. However, as the focus of the present study was on the measurement properties of the separate 4DSQ scales, all analyses were conducted scale-wise, ignoring relationships between the scales. The 4DSQ is freely available for non-commercial use at http://www.4dsq.eu.

Initial analyses
We calculated basic descriptive statistics for the groups including gender composition, mean age and standard deviation (SD), and mean 4DSQ scale scores and SDs.
Because some calculations (e.g., model fit) require complete data, we applied single imputation of missing item scores using the 'response function' method that takes into account both differences between respondents and differences between items [31]. The method is superior to less sophisticated methods in recovering the properties of the original complete dataset [32].

Dimensionality and local independence
IRT requires that response data fulfill the assumptions of unidimensionality and local independence [33]. Unidimensionality refers to a scale's item responses being driven by a single factor, i.e., the latent trait that the scale purports to measure. Strict unidimensionality, implying that only the intended dimension underlies the item responses and no other additional dimensions affect these responses, is rare in psychological measurements [34]. However, 'essential unidimensionality' will suffice as long as there is one dominant dimension, whereas other, weaker, dimensions do not impact the item scores too much [35]. Local independence means that responses to one item should be independent from responses to the other items of a scale, conditional on the dimension that the items and the scale purport to measure [36]. Local item dependence (LID) actually results from one or more additional dimensions (beyond the intended dimension) operating on the item responses. Therefore, LID analysis can be used to assess the dimensionality of a scale.
For each scale, we examined its dimensionality by first fitting a unidimensional IRT graded response model. Model fit was assessed by the M2* statistic for polytomous data [37] and various fit indices. Relatively good fit is indicated by the following fit indices: Tucker-Lewis index (TLI) > 0.95, comparative fit index (CFI) > 0.95, standardized root mean square residual (SRMSR) < 0.08, and root mean squared error of approximation (RMSEA) < 0.06 [38]. Note that these benchmarks were developed in the context of structural equation modeling, and that their validity in the context of IRT is not well known. On the other hand, measurement models in IRT and structural equation modeling are formally equivalent [39]. LID was assessed using Yen's Q3 statistic [40]. This statistic represents the correlation between the residuals of two items of a scale after partialling out the dimension (or dimensions in case of a multidimensional model) that the scale purports to measure. The Q3 is not expected to be zero in the absence of LID. Due to 'part-whole contamination,' the expected Q3 proves to be slightly negative [41]. As proposed by Christensen et al. [42], we calculated critical Q3-values by parametric bootstrapping. For each group, for each scale, and for each (uni- or multidimensional) IRT model, we simulated 200 locally independent response data sets based on the item parameters and theta score distribution(s) obtained for a specified IRT model. We recorded the maximum Q3 for each dataset. Across the simulated datasets per group, per scale, and per model, we denoted the 99th percentile of the maximum Q3-values as the critical Q3-value. Observed Q3-values greater than this critical Q3-value were taken as indicating LID.
In order to assess the extent to which the scales can be considered to be essentially unidimensional, we built 'bifactor' models based on the LID information [34]. Bifactor models are characterized by one large general factor, underlying all items of a scale and measuring the intended construct of the scale, and one or more smaller specific factors underlying subsets of items [43]. Every item must load on the general factor and may load on one specific factor. In an iterative process, we tried to capture the LID by defining specific factors affecting items with the largest Q3-values in excess of the critical Q3-value [42]. After defining a specific factor, the bifactor model was assessed for model fit, and the Q3-values and the critical Q3 of that model were reassessed. Remaining LID was handled by assigning item pairs with LID to a new or existing specific factor until the LID was completely resolved, model fit deteriorated instead of improved, or standardized factor loadings < 0.2 emerged (standardized factor loadings < 0.2 represent less than 4% shared variance between the item and the factor). Importantly, our interest was in the 'purified' general factors and not in the minor specific factors.
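As a rough illustration of the Q3 logic, the following Python sketch computes the maximum Q3 from observed and model-expected item scores and derives a critical value from a set of simulated maxima. It is a simplified sketch and not the procedure as implemented in 'mirt': the expected scores are assumed to come from an already fitted IRT model, the simulation of locally independent data sets is not shown, and the nearest-rank percentile rule is an assumption (other percentile definitions give slightly different values).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_q3(observed, expected):
    """Largest Q3 (residual correlation) over all item pairs.

    observed, expected: lists of rows (persons), each row a list of item scores.
    """
    n_items = len(observed[0])
    resid = [[o - e for o, e in zip(orow, erow)]
             for orow, erow in zip(observed, expected)]
    cols = [[row[j] for row in resid] for j in range(n_items)]
    return max(pearson(cols[i], cols[j])
               for i in range(n_items) for j in range(i + 1, n_items))

def critical_q3(simulated_max_q3, percentile=99):
    """Nearest-rank percentile of the maximum Q3 values of simulated data sets."""
    ordered = sorted(simulated_max_q3)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]
```

An observed Q3 above `critical_q3(...)`, computed from the 200 simulated maxima for the same group, scale, and model, would then be flagged as LID.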
In order to assess whether the scales were essentially unidimensional, we calculated the proportion of uncontaminated correlations (PUC) and the explained common variance (ECV), based on the best fitting bifactor models [44]. The PUC is an index of the data structure, i.e., an index of how many inter-item correlations are accounted for by the general factor only. Consider a 10-item scale and a bifactor model with 1 specific factor loading on four items. There are (10 × 9)/2 = 45 unique inter-item correlations among ten items. Within the specific factor, there are (4 × 3)/2 = 6 unique correlations. So, 6 out of 45 correlations among the items are confounded by the specific factor. Thus, the PUC is (45-6)/45 = 0.87. PUC values greater than 0.80 indicate low risk of bias when a multidimensional scale is treated as unidimensional [45]. The ECV is the common variance explained by the general factor divided by the total common variance of a scale, and represents an index of the relative strength of the general factor to the specific factors. ECV values greater than 0.70 are usually indicative of essential unidimensionality [46]. As an illustration of the bias caused by forcing multidimensional scales into a unidimensional model, we compared 2 theta estimations, one derived from the initial unidimensional IRT model ignoring LID, and another derived from the general factor of the best fitting bifactor model. Intra-individual theta differences greater than 0.2 or 0.5 logits (the metric of the theta scale) represent small or moderate differences in terms of effect size [47]. In addition, the Pearson correlation between the estimations was calculated.
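The PUC example above and the ECV definition can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the loadings in the usage example are invented rather than taken from the 4DSQ models.

```python
def puc(n_items, specific_factor_sizes):
    """Proportion of uncontaminated correlations in a bifactor structure.

    specific_factor_sizes: number of items loading on each specific factor.
    """
    total = n_items * (n_items - 1) // 2
    contaminated = sum(k * (k - 1) // 2 for k in specific_factor_sizes)
    return (total - contaminated) / total

def ecv(general_loadings, specific_loadings):
    """Explained common variance of the general factor.

    specific_loadings: flat list of all nonzero specific-factor loadings.
    """
    general = sum(l ** 2 for l in general_loadings)
    specific = sum(l ** 2 for l in specific_loadings)
    return general / (general + specific)

# The 10-item scale with one specific factor on four items from the text.
print(round(puc(10, [4]), 2))  # 0.87

# Invented standardized loadings, for illustration only.
print(round(ecv([0.7] * 10, [0.3] * 4), 3))
```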

Reliability
We calculated Cronbach's alpha as a measure of internal consistency reliability. In addition, we calculated omega-total and omega-hierarchical coefficients based on the standardized factor loadings from the final bifactor models [43]. Omega-hierarchical can be regarded as an indicator of the strength of the general factor, and as such as a benchmark of essential unidimensionality [45].
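For readers unfamiliar with the omega coefficients, the following Python sketch shows one common way to compute them from standardized bifactor loadings. It assumes orthogonal factors and standardized items (so that each item's unique variance is one minus its communality); the function name and the example loadings are hypothetical.

```python
def omegas(general, specifics):
    """Return (omega_total, omega_hierarchical) for a bifactor model.

    general:   standardized general-factor loading of each item.
    specifics: one list per specific factor with a loading for each item
               (0.0 where an item does not load on that factor).
    """
    n_items = len(general)
    general_var = sum(general) ** 2
    specific_var = sum(sum(s) ** 2 for s in specifics)
    # Unique variance per item = 1 - communality (orthogonal factors assumed).
    unique_var = sum(
        1 - general[i] ** 2 - sum(s[i] ** 2 for s in specifics)
        for i in range(n_items)
    )
    total_var = general_var + specific_var + unique_var
    omega_total = (general_var + specific_var) / total_var
    omega_hierarchical = general_var / total_var
    return omega_total, omega_hierarchical

# Hypothetical four-item scale with one specific factor on the first two items.
omega_t, omega_h = omegas([0.7, 0.7, 0.7, 0.7], [[0.3, 0.3, 0.0, 0.0]])
print(round(omega_t, 3), round(omega_h, 3), round(omega_h / omega_t, 3))
```

The ratio of omega-hierarchical to omega-total printed last expresses the share of reliable scale score variance that is attributable to the general factor.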

Differential item functioning (DIF)
We used DIF analysis in the IRT context, in which the probability of endorsing an item response category is modeled as a function of certain item characteristics and the trait levels of respondents [48,49]. In the graded response model for polytomous items, the relationship between items and the underlying trait is defined by two types of item parameters, called 'difficulty' and 'discrimination.' The item-trait relationship can be graphically displayed by the item characteristic curve (Fig. 1). A polytomous item with three response categories is defined by two difficulty parameters (denoted b1 and b2) and one discrimination parameter (denoted a). The difficulty parameters (b1 and b2) are defined by the latent trait (theta) levels indicating the thresholds between response options. For the 4DSQ scales, b1 is located between category 0 and the two higher categories, while b2 is located between the lower categories and category 2 (see Fig. 1). The discrimination parameter a is defined by the slope of the item characteristic curve (at the thresholds b1 and b2), representing the item's ability to discriminate between respondents standing low and high on the trait. Two items (or two versions of the same item) are deemed equivalent when they have the same relationships with the underlying trait, that is, when the items have similar item characteristics (difficulties and discriminations). For more detailed information about IRT, we refer to some excellent introductory papers, which are freely accessible on the Internet [48,49].

[Fig. 1: item characteristic curve of a three-category item (scores 0, 1, and 2), with the item parameters a (discrimination) and b1 and b2 (difficulties) indicated.]

DIF analysis in the IRT framework thus implies testing the equivalence of the item parameters of the corresponding items across two groups, the focal and reference group. This can be done using the Wald test, after appropriately linking the groups, i.e., placing all subjects on a common metric.
Linking is usually accomplished by 'anchor' items with known invariance across the groups. However, in the absence of pre-specified anchor items, we followed a 3-stage procedure to first select appropriate anchor items and then test items for DIF [50,51]. In each stage, a multi-group unidimensional IRT graded response model was fitted to each scale in turn. The first stage constrained the item parameters to be the same across both groups to estimate the latent mean and variance of the focal group relative to the reference group. The second stage then provided preliminary linking between the groups by treating the estimated latent mean and variance as fixed, allowing the item parameters to be freely estimated and preliminarily tested for DIF. This stage was used to select items without DIF (p > 0.05) as anchor items. The third stage used the anchor items to link the groups, allowing the means and variances to be freely estimated and the non-anchor items to be tested for DIF. Items with (Bonferroni-corrected) p values < 0.001 and unsigned item difference in the sample (UIDS) effect sizes (defined below) > 0.1 were deemed to have DIF. A UIDS of 0.1 is comparable with a standardized mean difference in item score of 5% of the item range (which is two points) [17].
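The graded response model for a three-category item, as described above, can be sketched in Python as follows. The sketch assumes the logistic metric and the a/b parameterization used in the text, with b1 < b2; the parameter values in the usage example are invented and the function names are illustrative.

```python
import math

def grm_category_probs(theta, a, b1, b2):
    """Probabilities of item scores 0, 1, and 2 under a graded response model."""
    # Cumulative probabilities of scoring at least 1 and at least 2.
    p_ge1 = 1 / (1 + math.exp(-a * (theta - b1)))
    p_ge2 = 1 / (1 + math.exp(-a * (theta - b2)))
    return (1 - p_ge1, p_ge1 - p_ge2, p_ge2)

def expected_item_score(theta, a, b1, b2):
    """Expected item score (0-2) at a given trait level."""
    _, p1, p2 = grm_category_probs(theta, a, b1, b2)
    return p1 + 2 * p2

# Invented parameters: at theta = b1 the probability of score 0 is exactly 0.5.
print(grm_category_probs(theta=-0.5, a=2.0, b1=-0.5, b2=1.0))
print(expected_item_score(theta=0.0, a=2.0, b1=-0.5, b2=1.0))
```

An item without DIF has the same a, b1, and b2 in both groups, so `expected_item_score` returns the same value for Web and P&P respondents with equal theta.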
To assess the severity of DIF, a final IRT model was fitted in which the parameters of the DIF items were freely estimated, while the parameters of the non-DIF items were constrained to be the same across both groups. The magnitude of DIF was then expressed as effect sizes based on expected item scores calculated twice for each member of the focal group, based on either the item parameters of the reference group or the item parameters of the focal group [52]. The signed item difference in the sample (SIDS) represents the mean difference in expected item scores in the focal group. The unsigned item difference in the sample (UIDS) represents the mean of the absolute difference in expected item scores in the focal group. Unlike the SIDS, the UIDS does not allow for cancelation of differences across respondents. The SIDS and UIDS are expressed in the metric of the scale score. On the other hand, the expected score standardized difference (ESSD) represents the Cohen's d version of the SIDS. ESSD values > 0.2 can be interpreted as representing a small effect and > 0.5 as representing a moderate effect of DIF.
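The item-level effect sizes can be illustrated with a short Python sketch. It takes, for each focal group member, the expected item score under the focal and under the reference item parameters. The direction of the difference (focal minus reference) and the pooled standard deviation used for the ESSD are assumptions of this sketch; the example scores are invented.

```python
import math
from statistics import mean, variance

def item_dif_effect_sizes(es_focal, es_reference):
    """Return (SIDS, UIDS, ESSD) for one item.

    es_focal, es_reference: expected item scores of the same focal group
    members under the focal and the reference item parameters.
    """
    diffs = [f - r for f, r in zip(es_focal, es_reference)]
    sids = mean(diffs)                    # signed: differences may cancel
    uids = mean(abs(d) for d in diffs)    # unsigned: no cancelation
    # Assumed pooling: average of the two sample variances of expected scores.
    pooled_sd = math.sqrt((variance(es_focal) + variance(es_reference)) / 2)
    essd = (mean(es_focal) - mean(es_reference)) / pooled_sd
    return sids, uids, essd

# Invented expected scores for four focal group members.
sids, uids, essd = item_dif_effect_sizes([0.2, 0.9, 1.5, 1.9],
                                         [0.3, 1.0, 1.5, 1.8])
print(round(sids, 3), round(uids, 3), round(essd, 3))
```

In this example the differences partly cancel, so the absolute SIDS is smaller than the UIDS.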

Differential test functioning (DTF)
Differential test functioning (DTF), the scale-level impact of item-level DIF, was expressed by a number of effect size measures based on expected scale scores [52]. The signed test difference in the sample (STDS) represents the sum of all SIDSs across all items of a scale, and allows for cancelation of differences in expected scores across items and persons. The unsigned test difference in the sample (UTDS) represents the sum of all UIDSs across all items of a scale. The UTDS allows no cancelation across items or persons. The unsigned expected test score difference in the sample (UETSDS) represents the average of absolute values of the expected test score differences in persons. The UETSDS reflects the true behavior of DIF on observed scale scores as it allows for cancelation across items but not across persons. The expected test score standardized difference (ETSSD) is the Cohen's d version of the STDS.
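The relationships between the scale-level effect sizes can be made concrete with a Python sketch that works on person-by-item tables of expected item scores. As before, differences are taken as focal minus reference and the pooled standard deviation for the ETSSD is an assumption of this sketch; the data are invented.

```python
import math
from statistics import mean, variance

def dtf_effect_sizes(es_focal, es_reference):
    """Scale-level DTF effect sizes from expected item scores.

    es_focal, es_reference: one row per focal group member, one column per
    item, computed under the focal and the reference item parameters.
    """
    n_persons, n_items = len(es_focal), len(es_focal[0])
    diffs = [[f - r for f, r in zip(frow, rrow)]
             for frow, rrow in zip(es_focal, es_reference)]
    sids = [sum(row[j] for row in diffs) / n_persons for j in range(n_items)]
    uids = [sum(abs(row[j]) for row in diffs) / n_persons for j in range(n_items)]
    ets_f = [sum(row) for row in es_focal]       # expected test scores
    ets_r = [sum(row) for row in es_reference]
    pooled_sd = math.sqrt((variance(ets_f) + variance(ets_r)) / 2)
    return {
        "STDS": sum(sids),    # cancelation across items and persons
        "UTDS": sum(uids),    # no cancelation
        "UETSDS": mean(abs(sum(row)) for row in diffs),  # across items only
        "ETSSD": (mean(ets_f) - mean(ets_r)) / pooled_sd,
    }

# Invented expected item scores: three persons, two items.
result = dtf_effect_sizes([[0.5, 1.0], [1.0, 1.5], [1.5, 0.5]],
                          [[0.6, 0.9], [1.1, 1.4], [1.4, 0.7]])
print({name: round(value, 3) for name, value in result.items()})
```

By construction, the absolute STDS can never exceed the UETSDS, which in turn can never exceed the UTDS.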

Software
SPSS version 22 was used to prepare the data, impute missing item scores, and calculate mean scores and Cronbach's alphas. All other analyses were conducted using the package 'mirt' version 1.21 for multidimensional item response theory [53,54] within the statistical software R 3.2.5 [55].

Initial analyses
The groups were reasonably comparable with respect to gender composition, mean age, and 4DSQ scores (Table 1). Thirty-four clients in the P&P group (1.7%) and six clients in the Web group (0.6%) had one or more missing item scores, which were imputed.

Unidimensionality and local independence
For every 4DSQ scale and for every group, Table 2 shows the fit indices of two models, the initial unidimensional model, and the final bifactor model. The M2* statistic follows a Chi-square distribution and, like the Chi-square statistic, is sensitive to large sample sizes. The fit indices suggested that the distress and somatization scales were not strictly unidimensional in either the P&P or the Web group. On the other hand, the fit indices suggested relatively good fit of the unidimensional models of the depression and anxiety scales. Nevertheless, in all instances, some Q3-values suggested the presence of LID. The LID could not be resolved completely due to deterioration of model fit in case of the distress scale (in both groups), the depression scale (in the P&P group), and the anxiety scale (in the Web group). Standardized factor loadings < 0.2 occurred in case of the somatization scale (in both groups) and the anxiety scale (in the P&P group). The bifactor models demonstrated relatively good fit in all cases. The standardized factor loadings (provided in Online Resource 2) showed that the factor loadings of the unidimensional factors and the loadings of the general factors of the bifactor models of the same scales were very similar, suggesting that the unidimensional models predominantly represented the general factors [34].
The ECV values varied between 0.615 and 0.940, and the PUC values between 0.894 and 0.975, suggesting essential unidimensionality (Table 3). The theta estimations based on the unidimensional models did not differ much from those based on the general factors of the corresponding bifactor models. For the distress, depression, and anxiety scales, the theta differences were negligible (< 0.2) in more than 95% of the participants. Regarding the somatization scale, theta differences were somewhat larger: theta differences > 0.2 (small effect size) occurred in about 30% of the participants, but theta differences > 0.5 (moderate effect size) occurred in less than 2%. The presence of minor specific factors did not cause important bias in the estimation of the trait scores when these factors were ignored and the data were forced into unidimensional IRT models. Table 4 presents an overview of the reliability estimates of the 4DSQ scales in both groups. Cronbach's alpha and omega-hierarchical were greater than 0.80 and omega-total was greater than 0.90 for all scales in both groups. The omega ratios indicated that 88.7-97.9% of the total reliable scale score variance was accounted for by the general factor. This underlined the 4DSQ scales' essential unidimensionality.

Differential item functioning
After linking the groups on the latent mean and variance, suitable anchor items were identified for distress (seven items), depression (three items), anxiety (three items), and somatization (ten items). Ultimately, DIF was identified in three distress items and two somatization items (Table 5). All DIF was negative, indicating that the Web group tended to score a little lower on distress and somatization (conditional on the latent trait) because the DIF items represented somewhat more severe symptoms in the Web-based format (item parameters are provided in Online Resource 3). However, in terms of effect size (ESSD) the effect of DIF was small. For instance, the SIDS value of −0.144 for 'nausea or an upset stomach' means that respondents to a Web-based 4DSQ scored on average 0.144 points lower on that item than respondents to a P&P 4DSQ with similar levels of somatization would do. Table 6 reveals that the impact of DIF on the distress and somatization scores was small in terms of mean difference in expected test scores across items and persons (STDS) and negligible in terms of Cohen's d (ETSSD). Figure 2 displays the expected scale scores as a function of the DIF-free theta score by group, showing that the 4DSQ scale scores obtained by means of a Web-based questionnaire

Discussion
We examined measurement equivalence of the 4DSQ across the traditional P&P version and a modern Web-based version using DIF and DTF analysis. We identified DIF in five items from two scales. In terms of effect size, the DIF was small. The impact of DIF on the scale level (DTF) was negligible. We employed a rigorous method to assess the dimensionality of the 4DSQ scales, using Yen's Q3 [40]. In combination with Christensen's method to determine critical Q3-values [42], the method turned out to be more sensitive to multidimensionality than more traditional fit statistics like the RMSEA. Our results indicate that the 4DSQ scales are essentially unidimensional, i.e., unidimensional enough to be treated as unidimensional in the context of IRT. Interestingly, the 4DSQ scales appeared to be slightly more unidimensional in the Web group than in the P&P group as evidenced by the slightly greater variance explained by the general factors (Online Resource 2). Apparently, the Web-based 4DSQ performs somewhat better, but certainly not worse, than the original P&P version.
DIF analysis is often concerned with inherently different groups (e.g., gender), in which case randomization is not feasible. In theory, DIF analysis is not hindered by group differences in trait levels because comparisons are matched at the trait level so that DIF only emerges when there is measurement bias rather than genuine trait differences. However, when groups differ in more respects than the trait level and the factor of interest (e.g., gender), the interpretation of the source of possible DIF may become problematic. In other words, when DIF is found, any aspect (other than trait level) in which the groups differ can potentially be the source of that DIF. Applied to the field of mode of administration equivalence research, no need for randomization may be an advantage of DIF analysis, but potential problems in the interpretation of DIF constitute a disadvantage. To avoid interpretation problems in this particular field, subjects can be randomly allocated to different mode of administration groups. But then data must specifically be collected for the evaluation of measurement equivalence, whereas data from different groups are often available 'on the shelf,' which is much cheaper.
In conclusion, using IRT-based DIF and DTF analysis to examine measurement equivalence across Web-based and P&P versions of the 4DSQ yielded few items with negligible DIF. Results obtained with the Web-based 4DSQ are equivalent to results obtained using the original P&P version of the questionnaire.
Conflict of interest BT is the copyright owner of the 4DSQ and receives copyright fees from companies that use the 4DSQ on a commercial basis (the 4DSQ is freely available for non-commercial use in health care and research). BT received fees from various institutions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.