Background

The 12-item General Health Questionnaire (GHQ-12) is a self-report measure of psychological morbidity, intended to detect "psychiatric disorders...in community settings and non-psychiatric settings" [1]. It is widely used in clinical practice [2], epidemiological research [3] and psychological research [4].

The GHQ-12 has been extensively evaluated in terms of its validity and reliability as a unidimensional index of the severity of psychological morbidity [5-9]. Many studies, however, have reported that the GHQ-12 is not unidimensional, but instead assesses psychological morbidity in two or three dimensions [10-19]. Several two- and three-dimensional models have been proposed (see Martin & Newell 2005 for a review [20]), and to date no study examining the factor structure of the GHQ-12 has found it to be unidimensional. These factors have been interpreted as substantive psychological constructs such as 'Anxiety', 'Psychological distress', 'Social dysfunction' and 'Positive health'. When these competing models have been compared [10, 20], confirmatory factor analysis suggests that the best fitting model is the three-dimensional model of Graetz [21], which proposes that the GHQ-12 measures three distinct constructs of 'Anxiety', 'Social dysfunction' and 'Loss of confidence'.

The GHQ-12 was validated on the assumption that it is a unidimensional, generic measure of psychiatric morbidity: if it is truly multidimensional, measuring psychological functioning in three specific domains, then the validation data may be called into question, and with them the use of the GHQ-12 in clinical practice and research contexts.

Despite the apparently consistent evidence that the GHQ-12 is multifactorial, the findings of these studies could be interpreted as demonstrations of a phenomenon long known in the psychometric literature: that measures comprising a mixture of positive and negative statements can produce an entirely artefactual division into factors [22-27]. Such artefacts, introduced as they are by the method of measurement, are known as 'method effects'.

The GHQ-12 itself comprises six items that are positive descriptions of mood states (e.g. "felt able to overcome difficulties") and six that are negative descriptions of mood states (e.g. "felt like a worthless person"). For brevity these will be referred to as 'positively phrased items' (PP items) and 'negatively phrased items' (NP items) respectively. Consistent with the hypothesis that these factors derive from method effects, a cursory inspection of the two- and three-factor structures previously reported reveals that they comprise most (or all) of the NP items and most (or all) of the PP items. For example, of the four two-factor solutions included in the review by Martin & Newell [20], two comprise all of the NP items vs. all of the PP items, and two comprise all of the NP items plus one PP item vs. the remaining PP items. Of the three three-factor solutions reviewed, the additional factor comprises either two NP items or two PP items, i.e. the division between NP and PP items is maintained. In addition, when reported, the correlations between dimensions are very high; this is again consistent with the hypothesis that the GHQ-12 is unidimensional and that the apparent two- or three-dimensional structure is artefactual. Hence the question arises of whether the factors identified in these studies truly reflect clinically distinct dimensions of psychological morbidity. If the dimensions identified by these studies are simply due to the measurement bias introduced by method effects, then the constructs identified do not exist (a methodological error known as 'reification' [28]).

The analyses reported to date have employed either exploratory factor analysis (EFA) to derive the factor structure of the GHQ-12 or confirmatory factor analysis (CFA) to confirm a previously derived structure. EFA cannot in principle distinguish between genuine multidimensional structure and spurious factors generated by method effects. CFA, used appropriately, can model such method effects and so test the hypothesis that the apparent factor structure is artefactual. Unfortunately, studies employing CFA have so far tested only the models originally generated by EFA; used in this way, CFA is equally incapable of distinguishing a genuine factor structure from an artefactual one.

The modelling of the proposed method effect is, however, relatively straightforward. Various mechanisms have been proposed for the general phenomenon of NP and PP items separating into factors, focusing on respondents' difficulties in processing negatively phrased items: it has been suggested that such items are more difficult to process because of inattention, variation in education, or an aversion to negative emotional content [22, 24, 25]. Whatever the cause, because response bias introduces variation in responses that is due neither to variation in the measured construct nor to random measurement error (both of which should apply equally to all items), it may be hypothesised that response bias creates additional variation in NP item scores, and that this additional variance is common to the NP items but not the PP items. Two hypotheses follow: (1) the NP items should have significantly higher variances than the PP items, and (2) a unidimensional model incorporating response bias on the NP items should fit the data better than the two- or three-factor models previously proposed.

The aim of this study was therefore to explore the possibility that the previously-reported factor structures of the GHQ-12 were due to method effects, namely a response bias on the negatively phrased items.

Methods

GHQ-12 data were obtained from the 2004 cohort of the Health Survey for England, a longitudinal general population survey conducted in the UK. Sampling and methodological details are in the public domain [29]. For the purposes of this study, a single adult was selected from each household to maintain independence of the data. Of the 4000 such adults, 3705 provided complete data for the GHQ-12. The Likert scoring method was used, with each of the twelve items scored from 1 to 4, where 4 indicates the most negative mood state. For clarity, the six NP items were labelled n1 to n6 and the six PP items p1 to p6.

The variances of all items were computed with 95% confidence limits, the hypothesis being that the variances of the NP items would be uniformly greater than those of the PP items. In addition, Pearson correlation coefficients were calculated between all pairs of items, producing a correlation matrix. Exploratory factor analysis was then conducted to determine whether the data would replicate either the two- or three-factor solutions previously reported; in the context of this study, this was a necessary step, since in the absence of a multifactorial solution there would be no method effect to explain. Consistent with most of the previous EFA analyses, the principal components method was used, with orthogonal (Varimax) rotation.
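
As a rough illustration of this step, the Python sketch below computes item variances with chi-square-based 95% confidence limits, the inter-item Pearson correlation matrix and a two-component principal components solution with Varimax rotation. The original analyses were not carried out in Python, and the `ghq` DataFrame here is filled with placeholder data purely so the sketch runs.

```python
# Illustrative sketch only: the original analyses were not carried out in Python.
import numpy as np
import pandas as pd
from scipy import stats

# Placeholder item data purely so the sketch runs; in practice `ghq` would hold
# the Likert-scored survey responses (columns p1..p6 and n1..n6, scored 1-4).
rng = np.random.default_rng(0)
ghq = pd.DataFrame(
    rng.integers(1, 5, size=(3705, 12)),
    columns=[f"p{i}" for i in range(1, 7)] + [f"n{i}" for i in range(1, 7)],
)

def variance_with_ci(x, alpha=0.05):
    """Sample variance with a chi-square based 95% confidence interval."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s2 = x.var(ddof=1)
    lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, n - 1)
    upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, n - 1)
    return s2, lower, upper

def varimax(loadings, n_iter=100, tol=1e-6):
    """Varimax rotation of a loading matrix (gamma = 1)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p)
        )
        R = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
        d_old = d
    return L @ R

# Item variances with 95% confidence limits (hypothesis 1)
for col in ghq.columns:
    print(col, variance_with_ci(ghq[col]))

# Inter-item Pearson correlation matrix (cf. Table 1)
corr = ghq.corr(method="pearson")

# Principal components of the correlation matrix, retaining two components,
# followed by Varimax rotation (the conventional EFA used in earlier studies)
eigvals, eigvecs = np.linalg.eigh(corr.values)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
loadings = varimax(eigvecs[:, :2] * np.sqrt(eigvals[:2]))
print(pd.DataFrame(loadings, index=ghq.columns, columns=["F1", "F2"]))
```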

Following the EFA, the goodness of fit of four models was tested by CFA using the structural equation modelling package AMOS 6.0 (maximum likelihood estimation):

1. Unidimensional: one factor, i.e. the intended GHQ-12 measurement model

2. Confirmatory analysis of the EFA solution

3. Three dimensional: three correlated factors, derived from the Graetz [21] model of three factors measured by 6 items ('social dysfunction', p1 to p6), 4 items ('anxiety', n1 to n4) and 2 items ('loss of confidence', n5 and n6)

4. Unidimensional with method effects: one factor (essentially Model 1) with correlated error terms on the NP items

Models 1 to 3 are typical of the EFA/CFA approach taken to date. Model 4 is the model hypothesised to provide the best fit to the data. Path diagrams for all models are presented in Figure 1. As recommended by Byrne [30], a range of goodness-of-fit indices was computed for model comparison. These indices differ primarily in their treatment of sample size, model parsimony and the range of values considered to indicate a well-fitting model. The goodness-of-fit index (GFI), adjusted goodness-of-fit index (AGFI), normed fit index (NFI) and comparative fit index (CFI) all increase towards a maximum value of 1.00 for a perfect fit, with values around 0.950 indicating a good fit to the data. In contrast, the values of the root mean square error of approximation (RMSEA) and the expected cross-validation index (ECVI) decrease with increasingly good fit, and are not limited to the range zero to one. The ECVI indicates how well the model would cross-validate to another sample, while for the RMSEA a 'rule of thumb' cutoff of less than 0.08 indicates adequate model fit.

Figure 1

Model specification (models 1 to 4).
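
Purely to illustrate how the four specifications differ, the sketch below expresses them in lavaan-style syntax using the Python package semopy; the original models were fitted in AMOS 6.0, the factor and error covariances are written out explicitly, and the placeholder `ghq` DataFrame from the earlier sketch is assumed.

```python
# Illustrative sketch only: the original models were fitted in AMOS 6.0.
# semopy is used here because it accepts lavaan-style model syntax; `ghq` is
# the same (placeholder) DataFrame of item scores used in the sketch above.
import itertools
import semopy

np_items = ["n1", "n2", "n3", "n4", "n5", "n6"]

# Model 1: unidimensional (the intended GHQ-12 measurement model)
model_1 = "GHQ =~ p1 + p2 + p3 + p4 + p5 + p6 + n1 + n2 + n3 + n4 + n5 + n6"

# Model 2: the two-factor EFA solution (all PP items vs. all NP items)
model_2 = """
PP =~ p1 + p2 + p3 + p4 + p5 + p6
NP =~ n1 + n2 + n3 + n4 + n5 + n6
PP ~~ NP
"""

# Model 3: the Graetz model of three correlated factors
model_3 = """
SocialDysfunction =~ p1 + p2 + p3 + p4 + p5 + p6
Anxiety =~ n1 + n2 + n3 + n4
LossOfConfidence =~ n5 + n6
SocialDysfunction ~~ Anxiety
SocialDysfunction ~~ LossOfConfidence
Anxiety ~~ LossOfConfidence
"""

# Model 4: unidimensional with method effects, i.e. Model 1 plus correlated
# error terms among all pairs of NP items
error_covariances = "\n".join(
    f"{a} ~~ {b}" for a, b in itertools.combinations(np_items, 2)
)
model_4 = model_1 + "\n" + error_covariances

for name, desc in [("1", model_1), ("2", model_2), ("3", model_3), ("4", model_4)]:
    model = semopy.Model(desc)
    model.fit(ghq)  # maximum-likelihood-based estimation (package default)
    print(f"Model {name}")
    print(semopy.calc_stats(model).T)  # fit indices such as RMSEA, CFI, GFI
```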

Results

Exploratory factor analysis (EFA)

Table 1 shows the Pearson correlation coefficients for all items, with correlations between similarly phrased items highlighted in bold. Exploratory factor analysis (principal components analysis with Varimax rotation) suggested a two-factor solution explaining 59.0% of the total item variance (factor 1 eigenvalue = 3.9; factor 2 eigenvalue = 3.1). Consistent with the hypotheses of this study, the first factor comprised all of the NP items and the second factor all of the PP items. Table 2 shows the rotated component matrix for all items, illustrating the division between NP and PP items. This two-factor model was further examined by CFA, below (hypothesis 2, model 2).

Table 1 Pearson correlations between all items (PP items p1 to p6; NP items n1 to n6)
Table 2 Rotated component matrix for exploratory factor analysis

Hypothesis 1: Comparison of the variances of NP and PP items

Table 3 shows the variances of all items with 95% confidence limits (see also Figure 2 for a graphical representation of the variances of the items). It can be readily seen that the variances of the NP items (median 0.515) were uniformly higher than those of the PP items (median 0.215), with no overlap in the 95% confidence limits. This supports the first hypothesis that the NP items would have increased variances due to additional error variance caused by response bias.

Table 3 Variances (95%CL) for PP and NP items
Figure 2

Item variances (PP items p1 to p6; NP items n1 to n6).

Hypothesis 2: Comparison of models

Table 4 shows the goodness-of-fit measures for all models. Of the previously reported models, the best fitting was the Graetz three-dimensional model; indeed, of models 1 to 3 this was the only model with an acceptable fit (RMSEA = 0.073, 90%CL (0.069, 0.077); ECVI = 0.301, 90%CL (0.274, 0.331)). Model 4 was, however, superior to model 3 on all fit measures (RMSEA = 0.068, 90%CL (0.064, 0.073); ECVI = 0.214, 90%CL (0.191, 0.238)), thus confirming the second hypothesis.

Table 4 Fit measures for all models

Exploration of NP response bias

Figure 3 shows frequency histograms for the six PP items. It can be seen that most respondents indicated the presence of positive mood states by responding on point 2 of the four-point Likert-type scale, indicating that these positive mood states were experienced with the same frequency as usual (i.e. a lack of pathology was indicated by a PP item score of 2). In contrast, Figure 4 shows the frequency histograms for the six NP items. The same respondents indicated an absence of negative mood states for items n1 to n4 with a response of either 1 or 2 on the four-point scale, corresponding to 'Not at all' and 'No more than usual'. Hence respondents wishing to indicate an absence of negative mood states were forced to choose between two response options owing to the ambiguity of the response scale, and the distribution was split between these two scale points. Items n5 and n6 correspond to the extreme negative mood states of 'Losing confidence' and 'Feeling worthless', for which the response 'No more than usual' makes little sense in the absence of such mood states. For these items there was therefore less ambiguity, and correspondingly more respondents opted for point 1 of the scale.

Figure 3

Frequency histograms of PP item responses (items p1 to p6).

Figure 4

Frequency histograms of NP item responses (items n1 to n6).

The additional variance in the NP items may therefore be explained: the ambiguous response options for the NP items shift many responses from point 2 to point 1, increasing the item variance. The appearance of two factors in previous studies is explained in the same way, since the shift in the distribution results in the NP items being more highly correlated with each other than with the PP items (see Table 1: median NP inter-item correlation r = 0.58; median PP inter-item correlation r = 0.43; median NP inter-item correlation partialling out the PP items r = 0.43), thus generating a spurious two-factor solution comprising NP and PP items. The Graetz three-factor model may also be attributed to this mechanism, with the additional factor generated by the two least ambiguous NP items (n5 and n6 correspond to the 'Loss of confidence' factor reported by Graetz).
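
For illustration, the sketch below (again using the placeholder `ghq` DataFrame defined earlier) computes the median inter-item correlations and the NP inter-item correlations partialling out the PP items, by residualising the NP items on the PP items with ordinary least squares.

```python
# Illustrative sketch: median inter-item correlations, and NP inter-item
# correlations after partialling out the six PP items (regression residuals).
import numpy as np

pp_scores = ghq[["p1", "p2", "p3", "p4", "p5", "p6"]].to_numpy(dtype=float)
np_scores = ghq[["n1", "n2", "n3", "n4", "n5", "n6"]].to_numpy(dtype=float)

def median_intercorrelation(block):
    """Median of the off-diagonal Pearson correlations within a block of items."""
    r = np.corrcoef(block, rowvar=False)
    return np.median(r[np.triu_indices_from(r, k=1)])

# Residualise each NP item on the PP items (plus an intercept), then correlate
# the residuals: these are the NP inter-item partial correlations.
X = np.column_stack([np.ones(len(pp_scores)), pp_scores])
beta, *_ = np.linalg.lstsq(X, np_scores, rcond=None)
residuals = np_scores - X @ beta

print("median NP inter-item r:", median_intercorrelation(np_scores))
print("median PP inter-item r:", median_intercorrelation(pp_scores))
print("median NP inter-item r partialling PP:", median_intercorrelation(residuals))
```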

Discussion

Simple exploratory factor analysis replicated previous studies, suggesting a two-factor solution comprising exclusively NP and PP items. Confirmatory factor analysis, however, suggested that this model was not a good fit to the data (RMSEA = 0.086, 90%CL (0.082, 0.090)). The best-fitting model was the hypothesised model with response bias on the NP items (RMSEA = 0.068, 90%CL (0.064, 0.073)). Interpreting the simple EFA solution as signalling the existence of substantive constructs (for example, factor 1 as 'depression' and factor 2 as 'social dysfunction') was therefore not warranted by the data, and would represent an example of the methodological error of reification. That the best-fitting model was the one incorporating method effects suggests that the most parsimonious interpretation of the data is that the GHQ-12 is largely or wholly unidimensional, but with substantial response bias on the NP items. While this finding is in accord with the intended function of the GHQ-12 (as a unidimensional, non-specific index of psychiatric morbidity), it should be remembered that the accuracy of measurement depends on low measurement error and the absence of substantial response bias.

One assumption of this study is that random measurement error is approximately equal across all items, and that any difference in variance between NP and PP items is largely due to response bias. Although this assumption is supported by the analysis of the item variances and the pattern of inter-item correlations, the data are of course also consistent with a genuine two- or three-factor solution in which the factors themselves have different variances. This alternative explanation is, however, even less parsimonious, since it hypothesises not only that multiple factors represent substantive constructs, but also that those constructs differ in variance along the same lines as the NP and PP items. Further, since no previous study has hypothesised in advance that (a) multiple factors exist and (b) they differ significantly in their variance, justification of this kind would be ex post facto, in contrast with the a priori hypothesis tested in this study.

In contrast to the causal mechanisms proposed in previous studies, the response bias seems not to result from inattention, lack of education or aversion to negative phrasing, but from the ambiguous response frame used for the NP items. Respondents wishing to indicate an absence of a negative mood state are faced with two possible choices: 'Not at all' or 'No more than usual'. For respondents not suffering from psychiatric problems, either of these options reflects their current status. This ambiguity increases the variance of the NP items and adds a correlated component to the error term for these items, thus generating spurious factors. It remains to be seen if this method effect applies in populations with a higher prevalence of psychiatric disturbance.

Conclusion

While many studies report the factor structure of the GHQ-12 as two- or three-dimensional, it appears that these are spurious findings resulting from a neglect of method effects, specifically a response bias on the items expressing negative mood states. This in turn seems to be due to the ambiguous response options for indicating the absence of negative mood states. While the results presented here are consistent with the GHQ-12 being unidimensional as originally intended, the presence of substantial response bias on the negatively phrased items requires careful attention in the analysis of clinical and epidemiological data. This is especially true of the assessment of the reliability of the GHQ-12, since the computation of reliability coefficients such as Cronbach's alpha assumes no such bias, and the resulting reliability may therefore be over- or under-estimated.