Introduction

The job content questionnaire (JCQ), developed originally in the USA [1], has been one of the most utilized instruments to measure psychosocial job characteristics due to its simplicity, reliability, and validity [2, 3]. As of 2008, it has been translated into twenty-three languages (http://www.jcqcenter.org). Considering the present paucity of adequate global surveillance data on work stress risk factors ([4], p. 5; [5], p. 73), existing large JCQ datasets from many countries can be good information sources for assessing psychosocial job hazards in a global economy. However, the comparability of these datasets should be assured in order to interpret them appropriately.

There are a number of methodological issues or requirements for a cross-cultural study using questionnaires [6–11]. One central issue is measurement equivalence across cultures: (a) “whether research instruments elicit the same conceptual frame of reference in culturally diverse groups” and (b) “whether respondents calibrate the intervals anchoring the measurement continuum in the same manner” ([12], p. 644). Measurement non-equivalence between cultures can be a serious threat to the validity of quantitative cross-cultural comparison studies because it becomes hard to tell whether observed mean differences or similarities reflect reality or are simply measurement artifacts.

In this paper, measurement non-equivalence was evaluated through tests of differential item functioning (DIF) [13–17]. An item shows DIF if “all respondents at a given level of the attribute measured (at a given index score) do not have equal probability of scoring positively on the item regardless of subgroup membership” ([17], p. 264).

In general, employing a standardized translation procedure such as translation and back-translation with bilingual translators is effective in reducing conceptual differences in an item between cultures [18], but it cannot rule them out completely because the back-translation procedure is relatively insensitive to the quality of translation [7, 19, 20]. In many cases, translation is not perfect due to culture-bound wordings of the item stem (content) and options (response categories) [9, 12, 20–22]. In addition, ambiguous wording of the original item can be amplified through translation in target cultures. All of these can lead to measurement non-equivalence between the original and target cultures by eliciting subtly or substantially different conceptual frames of reference from respondents [15, 23, 24].

Despite the worldwide use of the JCQ, there have been no international translation validation studies [25]. Few studies have examined cross-language or cross-national DIF of the JCQ statistically and/or qualitatively. Two previous studies [26, 27] suggest that JCQ items, like other measures, may function differently across countries (languages). Karasek et al. [26] raised doubts about the consistency of the meanings of the JCQ “psychological demands” items across the USA, Canada, the Netherlands, and Japan. Choi et al. [27] reported that 16 of 22 tested JCQ items (of the skill discretion, decision authority, psychological demands, supervisor support, coworker support, and physical demand scales) functioned differently between Chinese and Korean nurses.

However, to our knowledge, no study has examined cross-language DIF of the JCQ among European industrialized countries. While two international comparison studies of psychosocial job hazards [28] (Karasek et al., 2003, unpublished manuscript) used existing European JCQ datasets, they did not examine cross-language DIF of JCQ items beyond identifying the same factor structure in each country by exploratory factor analysis. Thus, the robustness of the cross-national or cross-regional mean comparisons of the JCQ scales in those studies remains in question.

However, another question follows: whether or not the impact of DIF items on the scale-level comparisons will be “substantial”. This is a much more practical and important question, since the JCQ international comparisons and most epidemiologic studies with the JCQ have been done at the scale level rather than at the item level. In the aforementioned DIF study [27], despite many DIF items of the JCQ, there were no significant impacts of the DIF items on the scale-level comparisons. Such a result was also reported in other DIF studies [15, 29].

To address these two questions, we revisited one of the aforementioned international comparison studies (Karasek et al., 2003, unpublished manuscript) that used the European JCQ database from Belgium, France, Italy, The Netherlands, and Sweden of the Job Stress, Absenteeism, and Coronary Heart Disease European Cooperative Study (JACE Study) [30].

The objectives of this study were to assess the extent of cross-language DIF of the 27 JCQ items (Table 1) among six research centers (Table 2) of the JACE study and to test whether its effect on the scale-level mean comparisons is substantial. In addition, a judgmental review of translation equivalence between the Flemish and Dutch JCQs was employed to explore possible causes of the statistically identified DIF items. The multi-center and multi-language datasets made it possible to do various types of cross-language DIF analyses, comparing JCQ items in two different languages from the same country, two similar languages from two different countries, and two other languages from two different countries. Since this rich data set permits a large number of potential analyses, a new systematic methodology (Fig. 1) was devised and employed.

Table 1 The 27 items of the job content questionnaire analyzed for the cross-language DIF analysis
Table 2 Socio-demographic characteristics of the six populations for this study from the JACE-JCQ database
Fig. 1 DIF analysis procedure in the JACE-JCQ database

Materials and Methods

The JACE-JCQ Database

The JACE prospective epidemiology study [30] primarily used the original JCQ to measure perceived job stressors in five European countries from 1991 to 1998. However, different psychosocial questionnaires were used in the sample from Malmo, Sweden and two sub-sites (roughly 20%) of the Milan sample [31]. In addition, the Gothenburg 93 sample was comprised of only men aged 50 years. These datasets were therefore excluded from this study, leaving six populations from six research centers in five European countries: Belgium (Ghent and Brussels), France (Lille), Italy (Milan), The Netherlands (Leiden), and Sweden (Gothenburg 95). The number of participants in each study ranged from 884 to 11,405 (Table 2).

Generally, the six samples included broad distributions of detailed occupations identified according to the International Standard Classification of Occupations (ISCO) four-digit codes of the International Labor Office [32], except for the Milan sample. The Swedish center used a general population sample. The other centers recruited a more or less diverse private and public employee population from a broad range of organizations. The Milan sample included only public employees from six departments of the city administration. The ISCO one-digit compositions of each sample differed by sample and gender.

Most of the samples spanned ages 35 to 59. Years of education varied by site and gender, with means ranging from 9 to 14 years. The education variable was missing in the Milan and Leiden samples, so it was replaced by an estimate of years based on the educational level attained (Karasek et al., 2003, unpublished manuscript). For more detailed descriptions of the JACE study, refer to previous publications [28, 30, 33].

Samples and Items for Cross-Language DIF Analyses

In the five samples from Belgium (Brussels, Ghent), France (Lille), Italy (roughly 80% of the Milan sample), and The Netherlands (Leiden), 27 questions of the JCQ (Table 1) were used, while 22 questions were utilized in the Swedish sample (Table 3). All JCQ items used a four-point Likert-type response scale: strongly disagree, disagree, agree, and strongly agree. American English JCQ items were translated into Belgian-Dutch (Flemish), Belgian-French, French, Italian, Dutch, and Swedish and then back-translated into English by each research center to assess semantic equivalence [28]. In the Belgian samples, both Belgian-Dutch (Flemish) and Belgian-French versions of the JCQ were administered: Ghent (Flemish speakers, 100%) and Brussels (Belgian-French speakers, 67.7% and Flemish speakers, 32.3%).

Table 3 The results of the primary and secondary DIF analyses of the 27 JCQ items between Ghent sample (reference) and each focal sample of the JACE-JCQ database

Only the English back-translation of the Flemish JCQ was available for this study. Items Q7, Q19, Q20, and Q52 had been noted to have translation nuances in an a posteriori review of the back-translation of the Flemish JCQ by the JCQ Center.

Weighting Samples

To prevent potential effects of different occupation compositions among the samples on DIF analyses, each sample for both men and women was weighted by the composition percentages of ISCO one-digit codes of the full JACE [33]. The weighting process generally resulted in considerable reduction of standard deviations of the JCQ scale means (27% for men and 46% for women, respectively, across all samples) [33].
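The exact weighting formula is not reproduced here. A minimal sketch, assuming simple post-stratification weights (the target share of an occupational group divided by its share in the sample), might look as follows; the function name, the toy ISCO codes, and the target shares are hypothetical and are not taken from the JACE data.

```python
from collections import Counter

def poststratification_weights(sample_codes, target_shares):
    """Return one weight per respondent so that the weighted occupational
    mix of the sample matches target_shares (shares should sum to 1).

    Assumption: weight = target share / sample share for the respondent's
    ISCO one-digit group. Every code in the sample must be in target_shares.
    """
    counts = Counter(sample_codes)
    n = len(sample_codes)
    weights = []
    for code in sample_codes:
        sample_share = counts[code] / n
        weights.append(target_shares[code] / sample_share)
    return weights

# Hypothetical sample: four respondents in ISCO group 2, two in group 7
codes = [2, 2, 2, 2, 7, 7]
# Hypothetical pooled composition used as the weighting target
target = {2: 0.5, 7: 0.5}
weights = poststratification_weights(codes, target)
```

With these toy inputs the over-represented group is weighted down and the under-represented group is weighted up, while the weights still sum to the sample size.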

DIF Analysis Procedures

First, exploratory factor analyses generally supported the assumption of one factor for each of the JCQ scales in all samples (Table 1). However, Q26 (from the psychological demands scale) had factor loadings of less than 0.30 in most of the samples [33]. Cronbach’s alpha values of the JCQ scales, on average, ranged from 0.59 to 0.86 [33] (Fig. 1).
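For readers less familiar with the reliability statistic mentioned above, Cronbach's alpha can be sketched in a few lines. This is the standard textbook formula, not the SPSS routine used in the study, and the function name and toy scores are hypothetical.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical two-item scale answered by four respondents
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3], [4, 4]])
```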

Second, cross-language DIF analyses were done between the reference and focal samples for each of the 27 JCQ items. The Ghent (Belgian-Dutch, Flemish) sample was chosen as the reference for other language samples (Brussels, Belgian-French/Flemish; Lille, French; Milan, Italian; Leiden, Dutch; Gothenburg 95, Swedish) to enable comparison of two linguistically similar languages (Dutch and Flemish) from two different countries, as well as two different languages from the same country (Belgian-French vs. Belgian-Dutch). The Ghent sample also had the advantages of a large sample size, suitability for the second purpose of this study (impact of DIF items on the scale-level mean comparisons), availability of an English back-translation, and documented psychometric validity [34]. Due to one missing item in the physical demand scale in the Gothenburg 95 sample, a four-item (Q21, Q24, Q25, and Q30) version of the physical demand scale was constructed for the DIF analyses.

Third, the partial gamma coefficient method [15, 17, 24, 35] was used for DIF statistics. The partial gamma coefficient is a variant of Kendall’s τ, which is zero when two observations are as likely to be discordant as concordant given the conditional independence between the item and the variable of interest. Partial gamma coefficients were initially calculated at each score of a scale and finally combined across scale scores. To simplify the interpretation of cross-language DIF and detect the most pronounced differences, we chose the criterion “moderate to large” DIF (category C) over “slight to moderate” DIF (category B) of Bjorner et al. [15]. Category C was defined as items with a partial gamma outside the interval (−0.31 to 0.31) and a 95% confidence interval lying significantly outside the interval (−0.21 to 0.21); category A (no or negligible DIF) as items with a partial gamma within the interval (−0.21 to 0.21) or a 95% confidence interval including zero; and category B as items located between categories A and C.
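As a rough illustration of the statistic described above, the following Python sketch computes a partial gamma by pooling concordant and discordant pairs within each scale-score stratum and then applies the category thresholds. It is not the SPSS procedure used in this study: the function names, the toy records, the pooled-count way of combining strata, and the handling of values lying exactly on a threshold are all assumptions, and the 95% confidence interval is taken as given rather than computed.

```python
from collections import defaultdict

def partial_gamma(records):
    """records: iterable of (scale_score, group, item_response).

    group is 0 (reference sample) or 1 (focal sample); item_response is an
    ordinal code such as 1..4. Pairs are formed only within the same scale
    score, and pairs tied on group or on response are ignored.
    """
    strata = defaultdict(list)
    for score, group, response in records:
        strata[score].append((group, response))

    concordant = discordant = 0
    for members in strata.values():
        for i in range(len(members)):
            g1, r1 = members[i]
            for j in range(i + 1, len(members)):
                g2, r2 = members[j]
                product = (g1 - g2) * (r1 - r2)
                if product > 0:
                    concordant += 1
                elif product < 0:
                    discordant += 1

    if concordant + discordant == 0:
        return None  # no informative pairs, gamma is undefined
    return (concordant - discordant) / (concordant + discordant)

def dif_category(gamma, ci_low, ci_high):
    """Classify an item as A, B, or C from gamma and its 95% CI bounds."""
    if abs(gamma) <= 0.21 or ci_low <= 0.0 <= ci_high:
        return "A"  # no or negligible DIF
    if abs(gamma) > 0.31 and (ci_low > 0.21 or ci_high < -0.21):
        return "C"  # moderate to large DIF
    return "B"      # slight to moderate DIF

# Hypothetical records: two scale-score strata (10 and 11)
toy = [
    (10, 0, 1), (10, 0, 2), (10, 1, 3), (10, 1, 4),
    (11, 0, 2), (11, 1, 2), (11, 0, 3), (11, 1, 1),
]
gamma = partial_gamma(toy)  # 4 concordant and 3 discordant pairs: 1/7
```

The pairwise loop is quadratic in stratum size, which is acceptable for a sketch but would be replaced by a contingency-table computation for samples as large as those analyzed here.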

Fourth, the impact of differential socio-demographic characteristics between Ghent and focal samples on the DIF analyses above (called “primary” DIF analyses hereafter) was examined. The primary DIF analyses were replicated (called “secondary” DIF analyses hereafter) after controlling for both sex and education between Ghent and Lille, between Ghent and Milan, and between Ghent and Leiden; both sex and age between Ghent and Gothenburg 95. There was no need for the secondary DIF analyses between Ghent and Brussels due to their similar sample characteristics. Education and age were both dichotomized (up to vs. greater than 12 years of education; up to vs. greater than 45 years old). The results of the secondary DIF analyses were conservatively preferred to those of the primary DIF analyses, considering the potential confounding effects of age, sex, and education.

Fifth, as an exploration of causes of DIF items (i.e., category C) statistically identified, the translation equivalence of the 27 items between the Flemish and Dutch JCQs was evaluated by two trilingual (English/Flemish/Dutch) researchers (authors EC and MB) who had not been involved in the translation process for either version or the DIF statistical analysis. They were asked independently (a) to evaluate conceptual non-equivalence (e.g., very fast vs. very hard) of each JCQ item between the two versions and (b) to report any differences in terms of missing or adding words (e.g., very fast vs. fast). They then were asked to come up with a final set of agreed evaluations through discussion, particularly on their initial, discrepant evaluations (on eight items). The final evaluation was compared with the result of the statistical DIF analysis between the two versions.

Sixth, the impact of identified DIF items on the mean comparisons of the JCQ scales between Ghent and each of the focal samples was examined in separate analyses for men and women. Three criteria were applied to judge the impact. The means of each full JCQ scale were compared. The comparison was then replicated with the reduced scale consisting of the non-DIF items from the secondary DIF analyses. If the two mean comparisons differed in terms of the rank order of samples (first criterion) with statistical significance (second criterion; alpha = 0.01), it was suspected that DIF items substantially affected the scale-level mean comparison. Finally, as the third criterion, it was considered whether the mean comparison of the reduced scale “with” DIF items was similar to that with the full scale. As a sensitivity test, an effect size measure, Cohen’s d (the difference of means divided by the pooled standard deviation; 0.20, “small”; 0.50, “medium”; and 0.80, “large” [36]), was additionally employed.
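The effect size used for the sensitivity test can be sketched as follows. This is a generic implementation of Cohen's d with a pooled standard deviation, not the authors' SPSS syntax; the function name and the toy scores are hypothetical, and sample weights are ignored for simplicity.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical scale scores for a reference and a focal sample
d_full = cohens_d([2, 4, 6], [1, 3, 5])  # 0.5
```

In the procedure described above, d would be computed once with the full scale and once with the reduced non-DIF scale, and the absolute difference between the two values would be compared against the 0.20 and 0.50 benchmarks.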

Finally, the impact of identified DIF items on the mean comparisons of the JCQ scales among multi-language samples (i.e., Ghent, Brussels, Lille, Milan, Leiden, and Gothenburg 95) was also examined by sex. The means of the JCQ scales with the full items, the best non-DIF item(s), and the worst DIF item were compared with the same criteria as in the sixth procedure above. The multiple comparisons were undertaken with the Student–Newman–Keuls test (alpha = 0.001).

The SPSS (version 16.0) statistical program was used for all statistical analyses.

Results

Cross-Cultural Comparisons of the DIF Items by Site

Fifty-one of the total tested 130 items (39.2%) appeared to be DIF items in the primary DIF analyses between Ghent and the focal samples. The percentages of DIF items varied by center and JCQ scale (Table 3). They were high with the Milan sample (55.6%) and low with the Brussels sample (11.1%). The results of the secondary DIF analyses were very similar to those of the primary DIF analyses, although the number of DIF items substantially decreased in the secondary DIF analysis with the Gothenburg 95 sample. The total number of DIF items slightly decreased to 47 of 130 items (36.2%) in the secondary DIF analyses.

Items Q7 and Q11 of the skill discretion scale, Q48, Q51, and Q52 of the supervisor support scale, Q58 of the coworker support scale, and Q31 of the physical demand scale appeared to be DIF items in half or more of both the primary and secondary DIF analyses. All of the coworker support items in the Gothenburg 95 sample were DIF items in the secondary DIF analyses with the Ghent sample (Table 3).

At the scale level, the decision authority scale had the most DIF-free items (on average, 86.7%) across the five primary DIF analyses, followed by coworker support (75.0%), physical demand (62.5%), psychological demands (60.0%), skill discretion (50.0%), and supervisor support (50.0%). In the case of the five secondary DIF analyses, the decision authority scale was also the best (on average, 86.7% DIF-free), followed by coworker support (75.0%), physical demand (75.0%), psychological demands (62.5%), skill discretion (60.0%), and supervisor support (56.3%). The skill discretion, psychological demands, and supervisor support scales were the most affected by DIF in both primary and secondary analyses.

Comparison Between Judgmental Reviews on Translation Equivalence and Statistical DIF Analyses

In the judgmental review, no translation problems were observed for the four items (i.e., Q7, Q19, Q20, and Q52) that had been noted to have translation nuances in the a posteriori review of the back-translation of the Flemish JCQ. Instead, the review indicated some slight differences between the Flemish and English JCQs in items Q6, Q10, Q26, Q51, and Q52 (Table 4). For instance, the phrase “a lot” in item Q10 (English) was missing in the Flemish version.

Table 4 Independent judgmental review on the translation comparability of the 27 JCQ items between the Flemish and Dutch JCQs

There were various types of translation differences in the 14 JCQ items (Table 4): missing/adding a word (Q10, Q26, Q51, Q52, and Q54); different frequency-or-extent-related adverbs/adjectives (Q9 and Q6); translation nuance (Q4, Q11, Q53, and Q25); translation nuance plus missing/adding of a word (Q30, and Q31); and obvious conceptual difference (Q48).

The above groups of items were identified as category A, B, or C in the statistical DIF analyses, which reflects differential significance of translation differences in their respective item context.

In total, five out of ten DIF items (category C) in statistical analyses were associated with translation differences noted by the reviewers. The other DIF items were not associated with any translation differences: Q3, Q7, Q20, Q22, and Q58; they were all categorized as category C even if they were judged as highly translation equivalent in the independent review.

Impact of DIF Items on the Scale-Level Mean Comparisons with the Ghent Sample

The decision authority, psychological demands, and physical demands scales had no DIF items between the Ghent and Brussels samples. The means of skill discretion with both the full items and the non-DIF items were significantly higher in Ghent men than in Brussels men (Table 5). In contrast, the mean of skill discretion with the DIF item (Q11) was significantly higher in the Brussels sample than in the Ghent sample. Likewise, the mean comparisons of the other two JCQ scales between Ghent and Brussels by sex were not affected substantially by their DIF items. The differences of Cohen’s d values between the scale-mean comparisons with the full items and with the non-DIF items were less than 0.10.

Table 5 The mean comparisons of the skill discretion, supervisor support, and coworker support scales of the JCQ with the respective full items, non-DIF items, and DIF item between Ghent and Brussels samples

Two of the 36 mean comparisons of the JCQ scales (that included at least one DIF item) between Ghent and the other focal samples for both men and women appeared to be substantially affected by DIF. The two were related to the lack of non-DIF items: coworker support between Ghent and Gothenburg 95 for both men and women. The differences of Cohen’s d values between the scale mean comparisons with the full items and with the non-DIF items were not greater than 0.20 (i.e., small [36]) in almost all of the other 34 comparisons. However, the differences of Cohen’s d values were between 0.20 and 0.50 (i.e., small to medium [36]) in two comparisons: the mean differences of skill discretion between the Ghent and Milan samples for both men and women were much smaller when the non-DIF items (Q3 and Q9) were used for the mean comparisons than when the full items were used.

Impact of DIF Items on the Scale-Level Mean Comparisons Among Multi-Language Samples

The rank-orders of the multi-language samples for skill discretion with the full items and the non-DIF item (Q9) were very similar to each other (Table 6). However, Milan had significantly higher skill discretion with the non-DIF item particularly in women: The rank of Milan women substantially changed from one of the lowest with the full items and the DIF item (Q11) to the third highest with the non-DIF item. The ranks of Leiden for decision authority with the non-DIF item (Q8) for both men and women were significantly higher, compared to those with the full items or DIF item.

Table 6 The mean comparisons of the JCQ scales with the respective full items, non-DIF items, and worst DIF item among the six samples of the JACE-JCQ database

There was no DIF-free item for psychological demands across the samples. In addition, the percentages of DIF items across the samples were 40% in all of the five psychological job demands items (see Table 3). Therefore, the two items, Q20 and Q22, were arbitrarily chosen as the best non-DIF items and one item, Q23, as the worst DIF item for the multiple sample comparisons. Milan women had one of the lowest psychological demand means with the full scale and the DIF item (Q23) but not with the non-DIF item (Q22) (Table 6). The Swedish coworker support value was not comparable to those of the other samples, and there was no case of substantial DIF impact in the multi-sample mean comparison for physical demand.

Discussion

This study examined cross-language differences in the meaning of 27 JCQ items and the impact of those differences on the scale mean values in a large dataset from five European countries. Despite the very similar factor structure among the samples, 36–39% of the total tested items showed cross-language DIF. The impacts of the DIF items on the mean comparisons of the JCQ scales among the six multi-language centers were non-trivial: underestimated skill discretion (Milan), underestimated decision authority (Leiden), underestimated psychological demands (Milan women), and incomparable coworker support (Gothenburg 95). Furthermore, a comparison of the JCQ translations into Flemish and Dutch suggested non-equivalence for one half of the DIF items. Cross-language differences, from translation or from cultural norms, at least among European languages, should be considered in any international comparative study using the JCQ scales.

Methodology of Cross-Language DIF Analysis

Item response theory (IRT) models and multi-group confirmatory factor analysis method are known to be the most advanced and sophisticated methods for DIF statistics [37, 38]. However, the applicability of IRT models highly depends on the sample size to obtain stable parameters. More importantly, its applicability to job and occupational analysis data has not yet been fully explored [39]. We think that the partial gamma coefficient method has advantages over multi-group confirmatory factor analysis in terms of simplicity, understandability, and applicability to wide ranges of sample sizes.

The procedure for cross-language DIF analysis in this study was methodologically robust (Fig. 1). Every step of the procedure was necessary and indeed contributed to reducing errors in DIF analyses. However, our DIF analyses might underestimate the extent of cross-language DIF items. First, we focused on “moderate to large” (category C) DIF items, which was a realistic choice considering the multi-language DIF analyses of this study. Applying a lower threshold (category B, “slight to moderate”) [15] would have produced a higher number of DIF items. Second, only the weighted partial gamma across scale scores was used as the criterion of cross-language DIF. However, the weighted partial gamma of an item may not reflect the DIF of the item at a specific range of scale scores, which means that DIF at a specific range of scale scores could be overlooked.

Conducting multi-language DIF analyses involving at least three languages imposed additional analytical difficulty. As the number of languages for comparison increases, more DIF items are likely to be found, and the probability of finding non-DIF items across all the languages decreases. It was inevitable to use the next best non-DIF item(s) across the samples in order to complete the mean comparisons for some JCQ scales (in cases where there was no single DIF-free item across the samples). The three criteria of substantial impacts of DIF items on the mean comparisons of the JCQ scales were considered jointly in this study for the following reasons. First, we think that neither rank-order change nor statistical significance change of the sample means of the JCQ scales is sufficient alone, because the former tends to exaggerate trivial impacts and the latter depends on sample sizes. However, rank-order change needs to be weighted more heavily in small datasets, considering the reduced power of statistical significance tests. Second, we think that the third criterion (i.e., similarity of the mean comparisons of each of the JCQ scales with its full items and with its DIF items) is a necessary condition because a significant discrepancy of the mean comparisons between the full scale and the reduced scale with non-DIF items could occur for other reasons (e.g., multidimensionality of a scale).

Possible Causes of Statistical DIF Items

We cannot determine the specific causes of the cross-language DIF items statistically identified in this study. However, some possible sources and several conspicuous patterns of the DIF items across the JACE samples deserve discussion.

The most distinct source of the cross-language DIF items in the JACE database seems to be translation-related difference. Half of the statistical DIF items between the Flemish and Dutch JCQs were associated with translation differences, ranging from a simple missing or added word to obvious translation non-equivalence. The proportion (50%) was not unusual, compared to those (27–44%) in other cross-language DIF studies [40, 41]. It is also understandable, considering the fact that there was no pre-designed protocol for addressing translation equivalence across the research centers in the JACE study (Houtman et al. 1998).

However, it needs to be remembered that the other half of the statistical DIF items were not related to any translation differences in the judgmental review. In addition, the proportion of DIF items was much smaller in the analysis between two different language samples (Ghent and Brussels) from the same country than in the analysis between the two Dutch-speaking samples (Ghent and Leiden). All these imply that a national-level culture [42], interacting with structures and functions of institutions, might play a role as a source of DIF of the JCQ items in the JACE database.

The items highly vulnerable to cross-language DIF in the JACE database were Q7, Q11, Q48, Q51, Q52, Q58, and Q31. The skill discretion, psychological demands, and supervisor support scales were the most affected. This is consistent with the finding of Karasek et al. [26] with respect to inconsistent meanings of psychological demands items across the populations from industrialized countries. In addition, this study suggests that other JCQ items (particularly, supervisor support items) may also be understood differently among European countries. One reason for that may be that “demands and social support reflect to a great extent local work site conditions and individual perception” ([43], p. 18). Furthermore, the scale-level differential impact of DIF items might provide a clue to the relatively more heterogeneous associations of psychological job demands and social support at work with common mental health across European countries, compared to those of decision authority [44]. The items that are prone to DIF need to be considered both for improving the quality of the existing translated versions of the JCQ and for exploring unique cultural characteristics (“emic” approach, see Peng et al. [9]) among the European countries in the future. In addition, their vulnerability to cross-language DIF needs to be considered seriously in the future version of the JCQ (JCQ 2.0, http://www.jcqcenter.org).

To reduce cross-language DIF of the JCQ in the future, it will be desirable to employ a stricter translation process, as confirmed in the case of the Flemish JCQ: the translation and back-translation procedure was less sensitive to the quality of translation than the independent review, which is consistent with previous studies [7, 19, 20]. Useful techniques also include quantitative DIF analyses and qualitative interviews (e.g., see [45]).

International Comparison of Psychosocial Job Hazards using the Existing JCQ Data

This study suggests that the previous international mean comparison using the JACE-JCQ database (Karasek et al., 2003, unpublished manuscript) needs to be carefully reviewed, considering the DIF impacts on the scale-level mean comparisons identified in this study. It would be wise to use only non-DIF JCQ items for more accurate international comparisons with the JACE-JCQ datasets in the future.

Lastly, we emphasize that this study was undertaken with the JCQ database from five European countries sharing relatively similar cultures. Thus, the measurement equivalence of the global JCQ database from European, North American, Asian, and Latin American countries with significantly different cultures remains to be tested in the future.