Introduction

There is increasing international interest in the concept of positive mental health and its contribution to all aspects of human life [1, 2]. The term is often used, in both policy and academic literature, interchangeably with the term mental well-being. It is a complex construct, which is generally accepted as covering both affect and psychological functioning as well as the overlapping concepts of hedonic and eudemonic well-being [3]. Positive mental health is recognised as having major consequences for health and social outcomes [4, 5], and has given rise to new therapies that explicitly focus on facilitating positive mental health [6] and to health promotion programmes which aim to develop mental well-being at community level. The field of positive mental health is under-researched partly because of the lack of appropriate measures [7] and there is demand for instruments suitable for use with both individuals and populations.

The Warwick-Edinburgh Mental Well-Being Scale (WEMWBS) was developed to meet this demand [8]. It is an ordinal scale comprising 14 positively phrased Likert-style items. Development was undertaken by an expert panel drawing on the current academic literature, qualitative research with focus groups, and psychometric testing of an existing scale (the Affectometer 2). The new scale was validated on student and representative population samples in the UK using qualitative as well as quantitative methods and performed well against classic criteria for scale development [9]. WEMWBS showed good content validity, moderately high correlations with other mental health scales and lower correlations with scales measuring overall health. Its distribution was near normal and did not show ceiling effects in population samples. It discriminated between population groups in a way that is largely consistent with the results of other population surveys. Test-retest reliability at one week was high (0.83). Social desirability bias was lower than or similar to that of other comparable scales.

The internal scaling properties of WEMWBS were tested by assessing internal construct validity with confirmatory factor analysis. Results were consistent with a single underlying construct. Internal consistency reliability was assessed using Cronbach's Alpha [10], which suggested item redundancy. In the context of testing scales based on ordinal data, it has been argued that both of these approaches are inappropriate, given that factor analysis is parametric and requires interval scaling, and Cronbach's Alpha does not address unidimensionality [11–13].

Recently, modern psychometric approaches have been adopted to provide a more robust interpretation of the internal construct validity of ordinal scales, the most widely applied of which is the Rasch Measurement Model [14]. In this approach, data which include items intended to be summated into an overall ordinal score for a specific scale are tested against the expectations of this measurement model. These expectations are a probabilistic form of Guttman Scaling which operationalises the formal axioms that underpin measurement [15, 16]. Other issues such as category ordering (do the categories of an item work as expected?) and item bias, or Differential Item Functioning (DIF) [17] may also be addressed within the framework of the Rasch model. Finally, when data are found to fit model expectations a linear transformation of the raw ordinal score is obtained, opening up valid parametric approaches, given appropriate distributions [18, 19].

In this report we assess the internal construct validity of the 14-item Warwick-Edinburgh Mental Well-being Scale (WEMWBS) from the perspective of the Rasch Measurement Model using data collected from Wave 12 (Autumn 2006) of the Scottish Health Education Population Survey (HEPS).

Methods

The Warwick-Edinburgh Mental Well-being Scale (WEMWBS)

WEMWBS differs from other scales of mental health in that it covers only positive aspects of mental health and all 14 items are phrased positively (see additional file 1). Items cover a range of aspects of mental well-being, including many which will be familiar from other well known scales (e.g. 'I've been feeling relaxed', 'I've been thinking clearly'). Responses take the form of a five-point Likert scale: 'None of the time'; 'Rarely'; 'Some of the time'; 'Often' and 'All of the time'. Scores range from 14 to 70, with a higher score reflecting a higher level of mental well-being.

The Health Education Population Survey (HEPS)

HEPS was a Scottish population survey in which data were collected on an annual basis in two waves (Spring and Autumn) from a representative sample of the adult population aged 16 to 74. Conducted from 1996 to 2007, HEPS has subsequently been decommissioned and replaced by a module in the Scottish Health Survey 2008. NHS Health Scotland commissioned HEPS and fieldwork was carried out by BMRB International.

Data for this validation study came from Wave 12 (Autumn 2006) of the survey. Allowing for invalid addresses, a response rate of 66% was achieved. Interviews were carried out face to face, in people's homes, using Computer Assisted Personal Interviewing. In this data set 779 respondents completed all or part of WEMWBS, of whom 45.8% were male. The average age of respondents (767 with continuous age information) was 41.9 years (SD 16.05) and the range 16–74 years. As the Rasch analysis (see below) bases person estimates upon the information that is available, estimates can be given where missing values are present. However, the precision of the estimate is reduced to an extent depending on the number of missing items.

The Rasch model

In satisfying the axioms of conjoint measurement [20], the Rasch model shows what is expected of responses to items in a scale if measurement (at the metric level) is to be achieved. Dichotomous [14] and polytomous versions of the model are available [21, 22]. The model assumes that the probability of a given respondent affirming an item is a logistic function of the relative distance between the item location and the respondent location on a linear scale. In other words the probability that a person will affirm an item is a logistic function of the difference between the person's level of, for example, mental well-being, and the level of well-being expressed by the item. The model can be expressed in the form of a logit model:

$$\ln\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \theta_n - b_i$$

where ln is the natural logarithm, P is the probability of person n affirming item i, θ is the person's level of mental well-being, and b is the level of mental well-being expressed by the item.
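For readers who prefer code to notation, the dichotomous form of the model described above can be sketched as follows. The function names are hypothetical and the sketch is not part of the original analysis, which was run in dedicated software.

```python
import math


def rasch_probability(theta, b):
    """Probability that a person at location `theta` affirms an item at
    location `b` under the dichotomous Rasch model (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def logit(p):
    """Log-odds of a probability; inverts the logistic function."""
    return math.log(p / (1.0 - p))
```

When the person and item locations coincide the probability is 0.5, and the log-odds of affirming the item always equal the difference between the two locations.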

The process of Rasch analysis is described in detail elsewhere [23, 24]. Briefly, the analysis is concerned with how far the observed data match those expected by the model, as judged by a number of fit statistics. In this paper, three overall fit statistics are considered. Two are item-person interaction statistics transformed to approximate a z-score, representing a standardised normal distribution. Therefore, if the items and persons fit the model, a mean of approximately zero and a standard deviation of approximately 1 would be expected. The third is an item-trait interaction statistic reported as a Chi-Square, reflecting the property of invariance across the trait. A significant Chi-Square indicates that the hierarchical ordering of the items varies across the trait, so compromising the required property of invariance.

In addition to these overall summary fit statistics, individual person- and item-fit statistics are presented, both as residuals (a summation of individual person and item deviations) and as a Chi-Square statistic. In the former case, residuals within ±2.5 are deemed to indicate adequate fit to the model. To take account of multiple testing, Bonferroni corrections are applied to adjust the Chi-Square p value [25]. The same fit statistics are available to detect person deviation, as a few respondents significantly deviating from model expectation may cause significant misfit at the item level.
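The two item-level criteria can be illustrated with a short sketch. The nominal significance level of 0.05 is an assumption (the text does not state the level used), and the function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


def item_fits(residual, chi_square_p, alpha=0.05, n_items=14, limit=2.5):
    """Flag adequate item fit: the fit residual lies within +/- `limit`
    and the Chi-Square p value is not below the Bonferroni-adjusted level.

    `alpha` = 0.05 is an assumed nominal level, not one stated in the text.
    """
    adjusted = bonferroni_alpha(alpha, n_items)
    return abs(residual) <= limit and chi_square_p >= adjusted
```

Under these assumptions, a 14-item scale tested at a nominal 0.05 level gives a per-item threshold of roughly 0.0036.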

The proper ordering of response categories is also evaluated. Failure to follow an expected increase in response option consistent with an underlying increase in mental well-being would show disordered thresholds across categories within an item. The term threshold refers to the point between two response categories where either response is equally probable. For a given item the number of thresholds is always one less than the number of response options.
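The role of thresholds can be illustrated with the category probabilities of a polytomous Rasch model. The sketch below uses the partial credit parameterisation as an assumption, and the threshold values in the usage note are hypothetical rather than estimates from this study.

```python
import math


def category_probabilities(theta, thresholds):
    """Category probabilities for one item under a partial credit style
    polytomous Rasch model.

    `thresholds` holds one location (in logits) per boundary between
    adjacent categories, so an item with five response options has four.
    """
    # Category 0 has an empty sum; category x accumulates (theta - tau_k).
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_are_ordered(thresholds):
    """True when each threshold lies above the previous one."""
    return all(low < high for low, high in zip(thresholds, thresholds[1:]))
```

For example, with hypothetical thresholds `[-2.0, -0.5, 0.7, 2.1]`, a person located exactly at the second threshold (-0.5) is equally likely to choose the second or third response option, which is the defining property of a threshold.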

Within the framework of Rasch measurement, the scale should also work in the same way irrespective of which group (e.g. gender) is being assessed [26]. For example, in the case of measuring mental well-being, males and females should have the same probability of affirming an item (in the dichotomous case), at the same level of mental well-being. Thus the probability is conditioned on the trait. If for some reason one gender did not display the same probability of affirming the item, then this item would be deemed to display differential item functioning (DIF), and runs the risk of biasing results. For example, if items were biased for gender, then gender could not be used as a predictor variable for mental well-being, as the measurement of mental well-being would be confounded by gender bias. It is important to note that the detection of and, if necessary, the adjustment for DIF, does not remove the effect of gender, but rather ensures that there is no gender bias in the scale so that the effect of gender can be properly understood. In practice adjustments for such bias can be made post-hoc in most circumstances, but items displaying DIF would be prime candidates for removal in any scale revision [27]. Sometimes bias may cancel out in the test, for example, one item may favour males, another females, and their effects may be nullified [28]. In the current analysis, DIF was tested for age, gender, and the presence or not of a long-standing illness.

Strict tests of unidimensionality are undertaken at every stage of analysis [29]. A Principal Component Analysis (PCA) is undertaken on the residuals, that is, the standardised person-item differences between the observed data and what is expected by the model for every person's response to every item. After extracting the 'Rasch factor' there should be no further pattern in the data. This is formally tested by allowing the factor loadings on the first residual component to determine 'subsets' of items and then testing, by independent t-tests, whether the person estimates (the logit of person 'ability' or, in this case, 'mental well-being') derived from these subsets differ significantly from each other [29, 30]. If more than 5% of the independent t-tests are found to be significant, allowing for a Binomial confidence interval for a proportion, this would indicate a breach of the assumption of unidimensionality.
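The final decision step can be sketched as follows. The normal approximation to the Binomial confidence interval is an assumption (the text does not say which interval was used), the function name is hypothetical, and any counts passed to it are for illustration only.

```python
import math


def unidimensionality_check(n_significant, n_tests, z=1.96, limit=0.05):
    """Summarise the proportion of significant person-level t-tests.

    Returns the proportion, an approximate 95% Binomial confidence
    interval (normal approximation, assumed here), and whether the lower
    bound still overlaps the 5% limit, which is read as supporting
    unidimensionality.
    """
    p = n_significant / n_tests
    half_width = z * math.sqrt(p * (1.0 - p) / n_tests)
    lower = max(0.0, p - half_width)
    upper = min(1.0, p + half_width)
    return p, (lower, upper), lower <= limit
```

The interval matters because an observed proportion slightly above 5% may still be compatible with a true rate of 5% or less once sampling error is taken into account.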

An estimate of the internal consistency reliability of the scale is also available, based on the Person Separation Index (PSI) where the estimates on the logit scale for each person are used to calculate reliability. This is equivalent to Cronbach's Alpha [10].
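One common formulation of the PSI, assumed here rather than taken from the text, compares the variance of the person estimates with the average error variance of those estimates. The sketch below is illustrative only and the function name is hypothetical.

```python
import statistics


def person_separation_index(estimates, standard_errors):
    """Person Separation Index under an assumed common formulation:
    (variance of person estimates - mean squared standard error)
    divided by the variance of person estimates.

    Both arguments are on the logit scale, one entry per person.
    """
    observed_variance = statistics.variance(estimates)
    error_variance = statistics.fmean(se ** 2 for se in standard_errors)
    return (observed_variance - error_variance) / observed_variance
```

Read this way, the index approaches 1 when the spread of person locations is large relative to the uncertainty in each individual estimate.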

In order to obtain robust estimates of the internal construct validity of the scale, the total data set is randomised into two further sets of approximately 50% of cases. Final results concerning the validity of the scale should be robust over the full data set, and each random sample.
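A random split of this kind can be sketched as below. The seed, the function name and the use of simple shuffling are assumptions for illustration; the text does not describe how the randomisation was carried out.

```python
import random


def split_half(case_ids, seed=0):
    """Randomly divide cases into two sets of approximately equal size.

    A fixed `seed` (an arbitrary choice here) makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]
```

With 779 cases this yields subsets of 389 and 390 respondents, each of which can then be fitted separately to check that the solution found in the full data set holds.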

The Rasch analysis was undertaken with the RUMM2020 software package [31].

Results

The 779 cases initially displayed no floor or ceiling effects, and thus all were entered into the analysis. The log Likelihood test Chi Square was 143.75 (df 38) with a probability < 0.0001, indicating that the partial credit version of the Rasch model was appropriate. All thresholds were found to be ordered (Figure 1). That is, within each item, the transition from one category to the next represents an increase in the underlying trait of mental well-being.

Figure 1 Threshold map for the 14 item scale. (See additional file 1 for full text of items).

Initial fit to model expectations was poor (Table 1 – Analysis 1). The items 'I've been feeling good about myself', 'I've been interested in new things' and 'I've been feeling cheerful' all showed significant misfit to model expectations, and were deleted. This led to a marginal improvement in fit (Analysis 2). A further two items 'I've been feeling interested in other people' and 'I've had energy to spare' were deleted, resulting in further improvement (Analysis 3).

Table 1 Fit of data to the Rasch model.

Local dependency was then observed for two more items and, after further analysis, a strict unidimensional seven item scale was resolved (Analysis 4), comprising:

Item 1 – I've been feeling optimistic about the future

Item 2 – I've been feeling useful

Item 3 – I've been feeling relaxed

Item 6 – I've been dealing with problems well

Item 7 – I've been thinking clearly

Item 9 – I've been feeling close to other people

Item 11 – I've been able to make up my own mind about things

We have named this shortened scale SWEMWBS (Short Warwick-Edinburgh Mental Well-being Scale) (see additional file 2).

Five out of the seven items discarded showed significant DIF for gender (Table 2). For example, the item 'I've been feeling confident' (item 10) showed that, at any level of mental well-being, males were more likely to report a higher score than females (Figure 2).

Table 2 Differential Item Functioning for gender.

Figure 2 Differential Item Functioning by gender for the item 'I've been feeling confident'.

In the final seven item scale two items also showed DIF for gender, but these were found to cancel out at the test level, and fit improved further (Analysis 5). One further item (item 1) 'I've been feeling optimistic about the future' still displayed marginal DIF for age. None of the items in the 14 item WEMWBS showed DIF by the presence or absence of a long-standing condition. As might be expected with a shorter scale, the level of reliability had fallen from 0.906 (Analysis 1) to 0.845 (Analysis 5), although the original 14 item version is compromised by multidimensionality caused by gender bias.

Given satisfactory fit to the Rasch model for the seven item scale, and confirmation of strict unidimensionality, the robustness of the solution (analysis 5) was tested on the two random samples embedded within the data (Analyses 6 & 7). Both subsets of data showed good fit to model expectations. A linear transformation of the raw score, based upon the seven valid items, was then made. The raw score-logit transformation is given in Table 3. The Spearman's correlation between the raw scores of WEMWBS and SWEMWBS was 0.954.

Table 3 Raw score to metric score conversion table for SWEMWBS.

Finally, given the disturbance in model fit brought about by bias associated with gender, the data from the full 14 item scale was fitted to the Rasch model independently for each gender. Neither the males (Analysis 8) nor the females (Analysis 9) demonstrated fit to model expectations, suggesting that the disturbance to the scale was more than just gender DIF.

Discussion

Increasingly, scales used for measuring health and medical outcomes are being developed to meet the strict criteria associated with additive conjoint measurement as operationalised through the Rasch measurement model [14, 20]. Providing a scientific basis for the construction of linear measurement, this approach is now widely used in the health and social sciences [32, 33]. It remains true, however, that the majority of scales commonly used to measure mental health in trials and population surveys have not been shown to meet these strict criteria.

Our analysis has shown that seven of the original 14 items of WEMWBS, which we have called SWEMWBS (Short Warwick-Edinburgh Mental Well-being Scale), conform to Rasch model expectations and provide a valid raw score to interval level transformation with a correlation of 0.954 to the full scale. Furthermore, SWEMWBS has been shown to be largely free of item bias, and its polytomous response structure works as intended, with higher scores within an item reflecting greater overall mental well-being.

Although confirmatory factor analysis (not shown) had indicated that WEMWBS was consistent with a single underlying factor [8], the scale did not meet the criteria required of the Rasch model. Most of the seven items excluded showed bias for gender. Perhaps because of this DIF (which can be a cause of multidimensionality), it was not possible to construct a second meaningful scale from the seven deleted items. Separate analyses of the 14 item set by gender showed lack of fit to model expectations on both occasions, suggesting an underlying problem over and above the disturbance caused by gender DIF. In order to satisfy the rules for constructing interval scaling, the Rasch model imposes the strictest measurement criteria. The lack of fit of WEMWBS to model expectations may have arisen either because of dimensionality issues, or because of the additional requirements for interval scale measurement over and above those required for ordinal scales.

WEMWBS was developed, in part, to support the evaluation of mental well-being programmes. The latter involve a component of education about the nature of mental well-being, which for many members of the public is a new concept. For this reason it was considered important that WEMWBS presented a full picture of mental well-being including items relating to the majority of aspects proposed in the academic literature. Face validity studies with the general public and its popularity with those practicing mental health promotion and public mental health in the UK suggest that WEMWBS met this goal.

In terms of face validity, the 7 item scale (SWEMWBS) presents a more restricted view of mental well-being than the 14 item scale (WEMWBS), with most items representing aspects of psychological and eudemonic well-being, and few covering hedonic well-being or affect. In terms of measurement properties, however, the 7 item scale (SWEMWBS) was robust to Rasch model expectations, whereas the original 14 item scale (WEMWBS) was not. The lack of measurement validity shown by half the items in the original 14 item scale may be attributable to current levels of knowledge and self-awareness relating to mental well-being among the general public, resulting in responses which are not robust. As knowledge and self-awareness increase, this situation may change.

Given that SWEMWBS is embedded within the larger WEMWBS, it may be appropriate to continue to collect data on the full 14 items to further investigate dimensionality and gender bias in different samples. It would also allow for comparison, at the ordinal level, with earlier studies. However, our results clearly indicate that the 7 item scale is preferable to the 14 item scale where robust interval scale measurement is important, and respondent burden is an issue. To facilitate this, we have been able to provide a raw-score to interval scale transformation of the 7 item scale for use when change scores and other parametric procedures are required.

Conclusion

Although providing a broader view of mental well-being than the shortened version (SWEMWBS), WEMWBS does not meet the strict criteria for measurement demanded by the Rasch model, demonstrating DIF and multidimensionality. The shortened scale, comprising 7 items (SWEMWBS), satisfied all criteria, including strict unidimensionality. A linear transformation of the raw score from SWEMWBS (Table 3) can be used with confidence in parametric analyses, given appropriate distribution. Responses to mental well-being scales may change as knowledge and self-awareness increase at population level. There are, therefore, arguments for continuing to gather data on the 14 item scale (given that the seven item scale is embedded within it) to examine measurement of mental well-being at the ordinal level, to explore item bias in different samples, and to further analyse potential dimensionality.