1 Introduction

Over recent years there has been increasing scientific, popular and political interest in notions of positive mental health and well-being (Dolan and White 2007; Huppert et al. 2005; Kahneman et al. 1999; Seligman 2002; Snyder et al. 2001). This widening interest in positive aspects of mental health has led to the development of new scientific constructs and of questionnaire items designed to measure positive well-being.

One of the most widely used survey instruments is Ryff’s multi-dimensional Psychological Well-being scales (PWB). It was specifically designed to measure positive aspects of psychological functioning along six theoretically-motivated dimensions: independence and self-determination (autonomy); having satisfying, high quality relationships (positive relations with others); the ability to manage one’s life (environmental mastery); being open to new experiences (personal growth); believing that one’s life is meaningful (purpose in life); and a positive attitude towards oneself and one’s past life (self-acceptance) (Ryff 1989a, b; Ryff and Keyes 1995).

Various versions of Ryff’s PWB have been used extensively in a variety of samples and settings. PWB items have been administered in large population-based samples such as the US National Survey of Families and Households (Sweet and Bumpass 1996), Midlife in the United States (Brim et al. 2004), the Canadian Study of Health and Aging (Clarke et al. 2001), and the English Longitudinal Study of Ageing (ELSA). The Ryff items are also used as well-being outcomes in smaller surveys, such as studies of body consciousness (McKinley 1999), life challenges (McGregor and Little 1998) or midlife work aspirations (Carr 1997), as well as outcomes of therapeutic interventions (Fava et al. 2005).

There have been a number of psychometric studies of the multi-dimensional structure of the Ryff PWB (Abbott et al. 2006; Burns and Machin 2009; Cheng and Chan 2005; Clarke et al. 2001; Kafka and Kozma 2002; Lindfors et al. 2006; Ryff and Keyes 1995; Springer and Hauser 2006; Van Dierendonck 2004; van Dierendonck et al. 2008). Cumulative knowledge about the performance of the PWB is hard to glean since several versions are in use, varying in length from 120 to 12 items with varying degrees of item overlap. In general, however, whilst the above psychometric studies provide some support for the a priori six-dimensional structure, high inter-factor correlations are typically reported and some items have been found to cross-load on more than one factor (Clarke et al. 2001; Springer and Hauser 2006).

Several of these studies have sought to model the more problematic aspects of the PWB latent structure in some way to improve understanding of the items, and to improve model fit. For example, models have included method factors to address measurement artifacts due to item wording, or excluded items with low factor loadings or high cross-loadings (Abbott et al. 2006; Burns and Machin 2009; Cheng and Chan 2005; Springer and Hauser 2006).

In a recent study in a UK epidemiological sample of women, Abbott et al. (2006) undertook a detailed psychometric assessment of the 42-item PWB using factor analysis procedures appropriate for the ordinal response format of the Ryff items. Their findings broadly confirmed the six-factor structure, but method factors were necessary to achieve acceptable model fit and to ensure that construct variance was not obscured by common response tendencies to similarly worded items (Abbott et al. 2006). Further, in this sample the modelling supported the notion of a second-order general well-being factor defined by loadings from four of the six first-order dimensions (environmental mastery, personal growth, purpose in life, self-acceptance). The high correlation between these four factors has also been confirmed in subsequent studies (e.g. Burns and Machin 2009).

All psychometric studies to date have been concerned with the construct validity of the PWB scales. Little is known about the precision of measurement of the PWB scales, i.e. the accuracy of the scores that are derived from applications of this measure of well-being in different samples. To our knowledge, the reliability (or precision of measurement) of estimated PWB scores has not previously been investigated. Since many existing scales measure negative aspects of well-being by assessing symptoms of mental disorder, it is particularly important for a scale which purports to be about positive well-being to show high precision of measurement at high values of the scale. Ideally, for population-wide measurement of individual variation in these new constructs, items need to be designed with response wording that enables well-being to be measured accurately across the range from low, through mid-range, to high levels, so that the effective measurement range is as wide as possible.

The aim of this paper is to establish the precision of measurement of the 42-item version of Ryff’s Psychological Well-being scales. We use the parameters of an ordinal factor analysis model to profile the effective measurement range of the six PWB subscales and a second-order general well-being factor. Our conclusions are derived from graphical representations of score accuracy (precision) at each score value, calculated from the Test Information Functions for each subscale. Unlike a classical test theory based analysis, these plots show variations in score reliability, and determine where precision of measurement is highest or lowest, across the measurement range.

2 Methods

2.1 Sample

The Medical Research Council 1946 National Survey of Health and Development (NSHD), the first British birth cohort study, is a stratified sample of singleton births occurring to married parents in England, Scotland and Wales during the week of 3–9 March 1946 (Wadsworth 1991; Wadsworth et al. 2006). The original sample comprised 5,362 individuals (2,547 women). Data have been collected regularly since childhood.

The representativeness of the study sample has been well documented at various points during their adult life follow-up (Wadsworth 1991; Wadsworth et al. 2006). For example, a comparison of the sample retained at age 43 and 53 with population census data has shown that the NSHD survey members are generally representative of the national population of a similar age (Wadsworth et al. 2003).

The sample for this study comprised women who took part in a health survey in midlife. This sub-study of women’s health in midlife, the Women’s Health Survey (WHS), was undertaken annually, by postal questionnaire, between the ages of 47 and 54. This study included 1,778 (70%) of the original cohort of women; the others had died (6%), previously refused to take part (12%) or lived abroad and were not in contact with the study or could not be traced (13%).

2.2 Measures

A 42-item version of Ryff’s Psychological Well-being scales was included in the WHS at age 52 and sent to 1,421 women who had completed at least one WHS survey in the previous 2 years. The 42-item version of the PWB was selected for use in this sample on the personal recommendation of Ryff. The PWB was not administered to men in this cohort. The response format for all PWB items comprised six ordered categories labelled from ‘disagree strongly’ to ‘agree strongly’. Twenty PWB items had positive item content and 22 had negative item content. Prior to analysis, items with negative content were reverse scored so that high values indicated well-being. Full question wording of the 42 items is shown in Table 1.

Table 1 Frequencies of response categories and descriptive statistics, Ryff psychological well-being scales (N = 1,179)

2.3 Factor Analysis Modelling of Ordinal Responses (Latent Variable Probit IRT)

An underlying variable approach to ordinal data factor analysis implements a probit (normal ogive) item response model with thresholds defined on a latent continuum that capture changes in response level. Factor analysis of these underlying variables can then be carried out using traditional linear factor analysis methods applied to polychoric correlations. This approach has been popularised by Muthén (1984) and Jöreskog (1990), with related work by McDonald (1999). Software for binary and ordinal item factor analysis (equivalent to an item response model) is now widely available, for example LISREL (Jöreskog and Sörbom 2004), Mplus (Muthén and Muthén 1998–2004) and NOHARM (binary items only) (Baker 2001).
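The underlying variable formulation can be illustrated with a minimal sketch, assuming a single factor, a standardized loading and a residual variance of one minus the squared loading. The function name and all parameter values below are hypothetical and are not taken from Table 2; the code only shows how thresholds on the latent continuum turn into probabilities for six ordered response categories.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def category_probabilities(theta, loading, thresholds):
    """Return P(response = k | theta) for one ordinal item.

    Assumed model (illustrative): y* = loading * theta + e, with
    e ~ N(0, 1 - loading**2); the observed category is determined by
    where y* falls relative to the increasing thresholds.
    """
    residual_sd = (1.0 - loading ** 2) ** 0.5
    # P(y* > tau_k | theta) for each threshold, a decreasing sequence
    exceed = [
        STD_NORMAL.cdf((loading * theta - tau) / residual_sd)
        for tau in thresholds
    ]
    upper = [1.0] + exceed
    lower = exceed + [0.0]
    # Category k lies between threshold k and threshold k + 1
    return [u - l for u, l in zip(upper, lower)]


# Hypothetical item: loading 0.7 and five thresholds (six categories)
probs = category_probabilities(
    theta=0.0, loading=0.7, thresholds=[-2.0, -1.2, -0.5, 0.3, 1.1]
)
for category, p in enumerate(probs, start=1):
    print(f"category {category}: {p:.3f}")
```

Raising `theta` in this sketch shifts probability mass towards the highest category, which is how ceiling effects on individual items arise.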

Here we applied the ordinal probit item response model to a 42-item version of Ryff’s Psychological Well-being scales using Mplus software, Version 4.2 (Muthén and Muthén 1998–2004). Parameters were estimated using robust weighted least squares (rWLS) methods that rely only on the bivariate associations among items. This method was used rather than one based on full information maximum likelihood (FIML) because a FIML model would have been computationally unmanageable due to the large number of factors (6 construct + 2 method) (see: http://www.statmodel.com/discussion). We would not expect our conclusions to differ if a full information approach had been used, since many other studies have compared results under these approaches with fewer factors (Bartholomew and Knott 1999; Takane and de Leeuw 1987).

Parameter estimates for the factor loadings, which quantify the discriminating power of each item, and the probit thresholds from the Mplus output are used to calculate Test Information Curves (TICs), which represent the accuracy of each estimated score value across the measurement range of the instrument. TICs are usually associated with item response theory perspectives based on FIML estimation procedures, but are also calculated by the ordinal factor analysis procedures in Mplus and can be plotted using graphics commands.

TICs are also commonly referred to as Scale Information Functions (SIFs) and are derived from the amount of Fisher information, i.e. the reciprocal of the square of the posterior standard deviation of the estimated score (posterior mean) (Baker and Kim 2004). The Scale Information Function is generally displayed graphically, along with associated plots such as the conditional standard error of measurement (SEM), calculated as the inverse of the square root of the information function. These plots are important in the interpretation of measurement precision across the latent trait continuum. They identify the point on a scale, or the range of scale values, measured with high precision, i.e. where standard errors of measurement are low and score accuracy is therefore high. The point on the measurement continuum where standard errors start to increase indicates less precise measurement and lower score reliability or accuracy. This can occur at either end and, conceivably (but rarely), in the middle range of scores.
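The relationship between information and the conditional SEM can be sketched as follows. This is an illustration under stated assumptions rather than the Mplus computation: it uses the same single-factor probit parameterization with standardized loadings, the general Fisher information formula for categorical responses, and no prior contribution. The function names and the three items are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def item_information(theta, loading, thresholds):
    """Fisher information of one ordinal probit item at a given theta.

    Uses I(theta) = sum_k (dP_k/dtheta)**2 / P_k over response categories.
    """
    residual_sd = (1.0 - loading ** 2) ** 0.5
    slope = loading / residual_sd
    z = [(loading * theta - tau) / residual_sd for tau in thresholds]
    # Cumulative P(y* > tau_k) and its derivative, padded at both ends
    cum = [1.0] + [STD_NORMAL.cdf(v) for v in z] + [0.0]
    dcum = [0.0] + [slope * STD_NORMAL.pdf(v) for v in z] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        dp = dcum[k] - dcum[k + 1]
        if p > 0.0:
            info += dp * dp / p
    return info


def conditional_sem(theta, items):
    """SEM at theta: inverse square root of the summed item information."""
    total = sum(item_information(theta, lam, taus) for lam, taus in items)
    return 1.0 / total ** 0.5


# Hypothetical three-item subscale: (loading, thresholds) pairs
items = [
    (0.7, [-2.0, -1.2, -0.5, 0.3, 1.1]),
    (0.5, [-1.8, -1.0, -0.3, 0.4, 1.2]),
    (0.4, [-2.2, -1.5, -0.8, 0.0, 0.9]),
]
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta = {theta:+.1f}  SEM = {conditional_sem(theta, items):.2f}")
```

Plotting `conditional_sem` over a grid of `theta` values gives the kind of curve interpreted in the Results: flat where the thresholds provide information, rising where they do not.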

Our modelling of the Ryff PWB scales was based on a previously reported confirmatory factor analysis of the Ryff items which evaluated alternative factor structures of the scale (Abbott et al. 2006). This study suggested that the addition of method factors to address methodological artefacts due to positive versus negative item content significantly improved model fit compared to the a priori six-factor structure. In addition, four of the six dimensions of well-being (environmental mastery, personal growth, purpose in life, and self-acceptance) were sufficiently highly correlated to warrant a second-order general well-being factor. The remaining two factors, autonomy and positive relations with others, were more independent. Other minor model modifications included the omission of two poor-fitting items from personal growth (resulting in a 40-item scale), moving one item from environmental mastery onto positive relations, and allowing correlated residuals between two items from positive relations (see Abbott et al. 2006 for further details). This model is graphically represented in Fig. 1. It should be noted that all six first-order factors are correlated (with residual correlations ranging from 0.25 to 0.85). The method factors are uncorrelated with the latent constructs. Small amounts of missing item-level data were present (957/1,179 had complete data on all items) but not in great enough numbers to influence the results. rWLS estimation in Mplus includes partially missing data under an MCAR assumption.

Fig. 1 Psychological Well-being modified 40-item model with second-order factor and method factors. Residual correlations between the six PWB latent variables ranged from 0.25 to 0.85 (not displayed due to model complexity). Goodness of fit: Chi Square: 2.46 (df = 255), TLI = 0.94, RMSEA = 0.086, WRMR = 2.01

3 Results

The analysis sample included 1,179 respondents who completed at least 85% of PWB items (36 out of 42 questions); 957 had complete data on all items. Table 1 shows the percentage of participants responding to each of the six Likert-style response categories. Negative items have been reverse scored so all items measure well-being in the same direction, i.e. one indicates the lowest level of well-being and six the highest. There were few responses at the lowest levels of well-being, particularly for questions with positive item content. For eight of the items, over 40% of respondents endorsed the highest (most positive) category, indicating ceiling effects on these items. Treating responses as continuous, interval scores, as in traditional factor analysis approaches to psychometric analysis, yields mean scores and corrected item-total correlations (CITC) by subscale; these are also reported in Table 1.
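The two descriptive steps mentioned here, reverse scoring and the corrected item-total correlation, can be sketched in a few lines. The helper names and the tiny response matrix are hypothetical and serve only to show the arithmetic; "corrected" is taken to mean that the item is correlated with the sum of the remaining items in its subscale.

```python
def reverse_score(response, n_categories=6):
    """Reverse a 1..n_categories response so that high means well-being."""
    return n_categories + 1 - response


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def corrected_item_total(rows, item_index):
    """Correlate one item with the sum of the other items in the subscale."""
    item = [row[item_index] for row in rows]
    rest = [sum(row) - row[item_index] for row in rows]
    return pearson(item, rest)


# Made-up responses: 5 respondents x 3 items, third item negatively worded
raw = [[5, 6, 2], [4, 4, 3], [6, 6, 1], [3, 2, 5], [5, 5, 2]]
scored = [[a, b, reverse_score(c)] for a, b, c in raw]
for index in range(3):
    print(f"item {index + 1}: CITC = {corrected_item_total(scored, index):.2f}")
```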

3.1 Item Factor Loadings and Thresholds

Factor loadings, estimated simultaneously for the six first-order factors and then for the general well-being factor, are reported in Table 2. These loadings identify the relative discriminating power of each item as a measure of the intended latent construct. In factor analysis models for Likert-scored items, an item’s discriminating power is captured by the magnitude of the factor loading (the correlation of the factor with the underlying variable).

Table 2 IRT parameters—psychological well-being, 40-item model (N = 1,179)

Only ten of the forty Ryff items included in the analysis were highly discriminating indicators of their latent factor, i.e. loaded at or above 0.70 (Table 2; column 3 ‘standardised’). This indicates that the latent constructs explained around 50% or more of the variation in responses for only a quarter of the items (A7−, R2−, R4−, R5−, E6−, G4−, G5+, P3−, S1−, S3−). We note that these discriminating items were distributed across all six factors and, with the exception of one from personal growth, were characterised by negative item content.

Conversely, seven items displayed low discriminative ability, with factor loadings below 0.40. These items were spread across five of the six dimensions, the exception being purpose in life, where all items loaded above 0.40. Four of the seven low-discriminating items had positive item content (R1+, E4+, E5+, S2+) and three had negative item content (A5−, G1−, G7−).

Overall, items with negative content were found to have higher factor loadings than items with positive content. Three-quarters of the items with negative content had factor loadings of more than 0.50 compared to only half of the items with positive content. Table 2 also shows the five thresholds corresponding to the distinctions between the six ordinal category response options.
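The loading cut-offs used in this section (0.70 and 0.40) can be related to the share of underlying-variable variance explained and to an IRT-style discrimination parameter. The sketch below assumes standardized loadings and the usual probit conversion between the factor-analytic and IRT metrics; the function names are hypothetical.

```python
def variance_explained(loading):
    """Share of underlying-variable variance explained by the factor."""
    return loading ** 2


def probit_discrimination(loading):
    """Normal-ogive discrimination implied by a standardized loading."""
    return loading / (1.0 - loading ** 2) ** 0.5


for loading in (0.40, 0.50, 0.70):
    print(
        f"loading {loading:.2f}: "
        f"variance explained {variance_explained(loading):.2f}, "
        f"discrimination {probit_discrimination(loading):.2f}"
    )
```

A loading of 0.70 corresponds to just under half of the variance explained (0.49), whereas a loading of 0.40 corresponds to 0.16.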

3.2 Conditional Standard Error of Measurement of PWB Subscales

Figure 2 shows the conditional standard error of measurement (SEM) along the y-axis for each a priori subscale of the PWB. Here the x-axis represents the population continuum of estimated latent trait scores in terms of a standardized normal distribution (Mean = 0, SD = 1).

Fig. 2 Conditional Standard Error of Measurement: Psychological Well-being (40-item model). Based on the modified 40-item version of Ryff’s 42-item PWB. N = 1,179

The subscales here comprise between five (personal growth) and eight (positive relations) items. Each first-order PWB factor measured well-being with only modest precision. Within the mid-range of the distribution (between −1.0 and +1.0 SD), which represents 68% of the population, conditional SEM values were relatively flat for all factors and generally below 0.6 in value. Three of the factors displayed slightly higher levels of measurement precision (positive relations, personal growth and purpose in life). Within the mid-range, score accuracy for the self-acceptance subscale was lower than for any of the other factors, with a conditional SEM ranging from 0.65 at the mean to 0.8 at +1.0 SD.

At the negative end of the well-being spectrum (low well-being), reliability of scale scores was lower for all factors compared to scores in the mid-range (average well-being). Measurement precision for positive relations and personal growth diminished rapidly once x-axis values fell below −1.5. At the positive end of the continuum (i.e. high well-being), estimated well-being scores had lower precision than scores across the rest of the range. Measurement precision for positive relations with others, which was relatively high around the mean, diminished rapidly with increasing values on the x-axis. A similar, but slightly less extreme, pattern was apparent for purpose in life, personal growth and environmental mastery. Measurement precision for autonomy and self-acceptance also declined at both ends of the spectrum, but less rapidly than for the other factors. Of the six first-order factors, self-acceptance appeared to be the dimension measuring well-being with the greatest accuracy across the widest range of the population well-being continuum, with conditional SEM values displaying a relatively flat profile from 2.0 SD below to 2.0 SD above the mean (a wide effective measurement range). However, these score SEMs are not small enough for precise statements to be made about individual scores, only about populations or groups.
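To see why SEMs of this size limit statements about individuals, the values read from the figure can be converted into approximate interval widths. This is a rough sketch, assuming normally distributed measurement error around the estimated score; the reliability conversion (one minus the squared SEM for a trait scaled to unit variance) is one common convention and is an assumption here, not something reported in the analysis. The function names are hypothetical.

```python
from statistics import NormalDist


def interval_half_width(sem, level=0.95):
    """Half-width of a normal-theory interval around an estimated score."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z * sem


def approximate_reliability(sem):
    """Assumed convention: reliability = 1 - SEM**2 for a unit-variance trait."""
    return 1.0 - sem ** 2


for sem in (0.6, 0.65, 0.8):
    print(
        f"SEM {sem:.2f}: 95% interval of +/- {interval_half_width(sem):.2f} SD, "
        f"approximate reliability {approximate_reliability(sem):.2f}"
    )
```

With an SEM of 0.6, the 95% interval around an individual's estimated score spans more than one standard deviation in each direction, which is adequate for comparing groups but not for fine distinctions between individuals.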

Finally, we examined the impact of capturing well-being at a more general level of analysis by presenting the conditional SEM based on a second-order well-being factor. It can be seen from Fig. 2 that at the population mean (x-axis = 0) this composite general well-being factor achieved slightly better precision than each of its four component factors. Importantly, the combination of the items from these four factors increases the precision of measurement beyond the mid-range of the well-being continuum, yielding a relatively flat profile from −2.0 to +2.0 SD around the mean. Examination of individual latent trait scores shows that on the general well-being factor only one individual scored at the lowest estimated latent trait level (−1.90), and four individuals at the highest level (+1.71). This suggests that the score distribution estimated from this more general factor model is not compromised by either a floor or a ceiling effect, despite responses to individual Ryff items often being concentrated at the largest score values (see Table 1).

4 Discussion

Ideally, survey instruments used to measure psychological constructs in general population surveys and epidemiological studies should contain item phrasing, response wording and sufficient response categories that enable accurate and precise scores to be estimated across a wide measurement range. New measures designed to evaluate well-being in populations, rather than clinical samples, should differentiate between individuals with high levels of well-being, as well as between those with medium and low levels. The Ryff scales of psychological well-being (Ryff 1989b; Ryff and Keyes 1995) were designed to measure a continuum of positive psychological functioning (Ryff et al. 2006) but no psychometric study has commented on score accuracy or effective measurement range. Psychometric models specifically calculate these properties and our graphs display them for the first time.

We expected that plots of the conditional standard error of measurement would yield relatively flat curves and span a wide range of the x-axis values, indicating precise measurement right across the population continuum. However, our results were not so clear cut. We used item response theory to report the measurement range of the PWB. An ordinal (normal ogive) item response model using an underlying variable approach was applied to the six PWB subscales (autonomy, positive relations with others, environmental mastery, personal growth, purpose in life and self-acceptance) to quantify and plot the conditional standard error of measurement for each score value. We based our IRT analysis on recommended model modifications which have been shown to result in a parsimonious solution for the 42-item version of the PWB (Abbott et al. 2006). This analysis demonstrated that for each subscale, information was concentrated in the middle of the measurement range, i.e. around the population average. Score precision diminished at high and low levels of well-being, but low well-being was measured more reliably than high well-being. Only a quarter of the individual items displayed high discrimination (factor loadings of 0.7 or above). All but one of these items contained negative item content; in other words, the most discriminating items measured well-being through questions about its absence, with respondents disagreeing with such statements.

In previous work in this national epidemiological sample, we have shown that four factors (environmental mastery, personal growth, purpose in life and self-acceptance) were highly inter-related and could be parsimoniously modelled by a second-order factor, with autonomy and positive relations remaining more independent (Abbott et al. 2006). Our IRT modelling of this second-order well-being factor revealed that it had higher measurement precision across a wider range than the component subscales. This is to be expected because the number of items was four times greater, but importantly, measurement precision for the second-order factor only diminished at the extreme ends of the well-being continuum (beyond −2.0 and +2.0 SD) and there were no ceiling or floor effects.

Our second-order measurement model provides some support for claims that there could be fewer than six dimensions underpinning psychological well-being. The items included within the second-order factor cover aspects of goal orientation and self-direction which could be attributed to a more motivational dimension of well-being. The resulting structure of a motivational dimension of well-being (covered by our second-order factor) together with the two independent factors of autonomy and positive relations is in many ways analogous to the work of Deci and Ryan (1985, 2000), which proposes that well-being results from the fulfillment of three basic psychological needs: autonomy, relatedness and competence. However, it should be emphasised that our interpretation of the PWB structure should be seen in terms of the hierarchical organisation of the six factors, which span two conceptual levels, and not as three independent factors.

The NSHD sample used for this analysis was homogeneous for sex (all women) and age (52 years). This sample represents the surviving members of this general population cohort study of health and development who have been studied from birth and also completed postal questionnaires in midlife (Wadsworth et al. 2006). Comparable data were not available for male cohort members since the PWB scales were administered as part of a study of women’s health around the menopause.

4.1 Recommendations

In summary, our findings suggest that for more reliable measurement of ‘high well-being’ the PWB requires the identification and addition of questions which tap into the more positive end of the well-being continuum. We propose that the second-order factor offers a potentially more reliable measure for capturing variations in well-being across the continuum than the measures derived from the component subscales. Further work on the validity of the second-order measure is now warranted, particularly in other population samples which include men and different age groups. Finally, more in-depth theoretical work on the underlying dimensional structure of psychological well-being is required, particularly in light of the similarities identified between the second-order PWB factor model and Deci and Ryan’s three-dimensional well-being model.

5 Conclusion

In light of the growing interest in the measurement of well-being amongst researchers, practitioners and policy makers (Dolan and White 2007; Huppert et al. 2009; Layard 2005; Marks and Shah 2005), there is a pressing need for scales which can measure well-being effectively across the full spectrum. Our analysis has shown that the subscales of Ryff’s six-dimensional PWB adequately measure average levels of well-being, but have low precision of measurement at high levels. Whilst we support the use of the second-order factor as a general measure of well-being, we recommend that future well-being scales should be designed to include items that discriminate more reliably at high levels along the well-being continuum.