1 Introduction

Recent years have seen a proliferation of studies using various measures of happiness and life satisfaction, making this perhaps one of the most stimulating new developments in the social sciences (Frey and Stutzer 2005; Kahneman et al. 2004a). Recent government initiatives have further spurred debate in the scientific community: in France, through the Commission on the Measurement of Economic Performance and Social Progress (Stiglitz et al. 2009); in the United Kingdom, through the Office for National Statistics (Dolan et al. 2011); and in the United States, with Federal Reserve Chairman Ben Bernanke declaring his interest in finding better measurements of Americans’ well-being (Rugaber 2012).

The majority of findings on subjective well-being are based on evidence from global life satisfaction measures used in large scale surveys. Throughout the literature, these findings have raised methodological concerns, as minor events and moods may influence responses to those questions, though there is a lack of consensus regarding the extent of such context effects (Schwarz and Strack 1991, 1999; Schimmack and Oishi 2005; Eid and Diener 2004). Global life satisfaction scales have produced widely conflicting findings. A prominent example is the so-called Easterlin paradox, where some authors found that happiness levels across countries show no relationship with the level of economic development of a country (Easterlin 1974, 1995), while others found a monotonic relationship between economic development and subjective well-being (Deaton 2008; Kahneman and Deaton 2010; Stevenson and Wolfers 2008).

Apart from global life satisfaction, alternative subjective well-being measures have also been proposed in the literature. Although their classification has been somewhat controversial (Kahneman and Riis 2005), most of the psychology literature thus far has conceptualized subjective well-being either as the evaluation of life satisfaction/dissatisfaction (evaluative well-being measures) or as the combination of experienced affect, that is, the range of emotions from joy to misery (experienced well-being measures). These two types of well-being measures are the focus of this paper. We also add a third, ‘eudemonic’ category of measures to our study, following the classification of the United Kingdom’s Office for National Statistics (Dolan et al. 2011), as explained below.

Broadly, the evaluative component of subjective well-being involves eliciting a respondent’s global subjective evaluation of his or her life; the evaluation can also be limited to specific domains of life, such as satisfaction with work, family life, or health (Dolan et al. 2011). Typically, these questions are single item self-reports, formulated for example as “All things considered, how satisfied are you with your life as a whole these days?” or “Taken all together, would you say that you are very happy, pretty happy, or not too happy?” (Krueger and Schkade 2008). More recent surveys, however, have included multiple questions eliciting evaluative well-being. Perhaps the most widely used among the latter is the Satisfaction with Life Scale (SWLS), which measures life satisfaction by asking respondents to report their level of agreement with five statements on a seven-point response scale from strongly disagree to strongly agree (Diener 2000; Diener et al. 1985). Though the response time to single global life satisfaction questions is lower than for multi-item measures, as one would expect, the latter appear to be more reliable. Typically, it is assumed that life satisfaction should not show large variation within short periods of time. When evaluating the reliability of evaluative measurements over time, the SWLS displays an estimated reliability (that is, the correlation across waves) of about 0.80 (Eid and Diener 2004; Krueger and Schkade 2008), compared with single item global life satisfaction measures, which have an estimated reliability of about 0.60 (Andrews and Withey 1976; Krueger and Schkade 2008). Evaluative questions are the most frequently used survey items within the field of subjective well-being (Kahneman and Krueger 2006). For instance, most of the large longitudinal ageing surveys have included this type of life satisfaction measure in their questionnaires. The Health and Retirement Study (HRS) and the English Longitudinal Study of Ageing (ELSA) include Diener’s five-item SWLS (Diener et al. 1985). The HRS and the Survey of Health, Ageing and Retirement in Europe (SHARE) include a single item overall life satisfaction question in their core interviews. Other measures of evaluative well-being often used in studies include Campbell’s domain-specific life satisfaction (Campbell et al. 1976), used in the Gallup Wellbeing Index: Standard of Living and Personal Life, and the Cantril Self-Anchoring Striving Scale (Cantril 1965), often referred to as the Cantril ladder, used by the Gallup poll and the OECD.

While evaluative life satisfaction questions have been widely used, their meaning and research application remain a matter of debate. Life satisfaction is a global retrospective judgment, cognitively demanding, and likely constructed only when asked. Respondents may thus base their answer on heuristics, their current mood and memory (Kahneman and Krueger 2006; Schwarz and Strack 1999). The difficulty of investigating such effects is made obvious by Lucas and Lawless (2013), who note that while weather has often been found to affect the mood and life satisfaction of respondents, this may have been the result of different climates, or time of the year, as they find no effect of weather itself in a large scale study. In contrast to evaluative subjective well-being measures that require an evaluative judgment from respondents, experienced well-being measures focus on how respondents are feeling (positive and negative affect) at a specific point in time. These experienced measures correspond to a rather Benthamite view of well-being, in that the latter depends entirely on individuals’ feelings, though the list of feelings used in surveys is usually not limited to pleasure and pain (Dolan et al. 2011). Experienced well-being is thus based on real-time affect measurements (Kahneman et al. 2006).

Ecological Momentary Assessment (EMA) aims at a “repeated collection of real-time data on subjects’ behavior and experience in their natural environments” (Shiffman et al. 2008, p. 3). The term EMA was coined by Stone and Shiffman (1994). Such data can be collected by a variety of methods, including time-based designs, whereby for instance subjects are prompted at random intervals to record their activities or mood, and event-based designs, whereby subjects are asked to provide information after specific events. Although EMA can be applied in virtually any domain of human activity that one wants to measure in real time and in individuals’ natural environment, our interest here is primarily in the measurement of affect. Frequent measurements permit the detection of variation in affect over time and during particular activities, and thus yield high reliability and validity of measures (Csikszentmihalyi and Hunter 2003). EMA may be costly, however, and may place a high burden on respondents (Kahneman and Riis 2005).

The Day Reconstruction Method (DRM) was developed to offer some of the advantages of EMA while being more practical, by combining a time-use survey with questions about the previous day (Kahneman et al. 2004b). DRM surveys can collect details such as the type of activity, location, presence of other individuals and experienced affect for all activities listed by a respondent in his or her diary, or only for a subset, e.g. three randomly selected times or activities throughout the day, as implemented in the Princeton Affect Time Use Survey (PATS) and the American Time Use Survey (ATUS). While the DRM involves a retrospective report on an emotional state, this survey design targets accurate recall by leading respondents to retrieve specific episodes and emotions from memory (Kahneman et al. 2004a). The DRM is in some sense more complete than EMA, as it attempts full coverage of the day, whereas EMA samples several moments during the day. Studies have validated the results obtained through the DRM by comparing them with experience sampling methods (Kahneman and Krueger 2006). Other surveys, such as the Gallup World and Daily Polls, aim at measuring experienced well-being simply by asking respondents about emotions experienced during the whole previous day; this elicits emotions aggregated over many episodes during a day (Footnote 1).

Throughout the literature, evaluative and experienced measures of well-being are regarded as complementary: the two are likely correlated, yet remain empirically and conceptually different (Kahneman and Riis 2005). However, more research is needed to understand how the concepts captured by experienced well-being measures differ from those captured by evaluative measures. Comparing these two types of measures is one of the objectives of this paper.

Finally, the last category of well-being measures we consider in this paper consists of “eudemonic” survey items. Eudemonic measures refer to the existence of underlying psychological needs, encompassing various dimensions of wellness, such as autonomy, personal growth, or purpose in life, which contribute towards well-being independently of any positive affect they may convey (Dolan et al. 2011; Ryff and Keyes 1995). Ryff presents evidence of a certain degree of convergence between these “theory-guided” eudemonic well-being measures and the commonly used life satisfaction measures (Dolan et al. 2011; Ryff 1989). The question “Overall, to what extent do you feel that the things you do in your life are worthwhile?” is an example of a eudemonic measure currently used by the Office for National Statistics in the UK (Dolan et al. 2011).

Overall, as pointed out by Krueger and Schkade (2008), relatively little attention has been paid to the reliability of experienced measures. While each existing measure of subjective well-being appears to show some evidence of validity (Footnote 2) and to capture distinct dimensions (National Research Council 2013), the differences between the measures of well-being have not been explored systematically. This paper aims to fill these gaps in the literature by analyzing two waves of well-being data we collected in the RAND American Life Panel (ALP). In particular, we designed two experimental modules that were fielded in the ALP, including some of the evaluative and eudemonic well-being measures described above, as well as a number of experienced measures. Our objective when choosing the measures for our questionnaires was to represent common well-being measures that are often used in existing studies and that impose different time requirements on respondents, in order to be able to compare the concepts they capture. Another important comparison concerns the use of different response scales for the elicitation of well-being measures. Although the concepts asked about in the different measures are in some cases the same, the measures differ in the response scales used, so we study the correspondence across these response scales. The results of this analysis will be useful to inform studies that aim to include these different measures.

The remainder of the paper is structured as follows. The next section describes the data we have collected and the experiment we have designed and implemented. Section 3 provides descriptive statistics as well as measures of reliability for various subjective well-being measures. In Sect. 4 we use factor analysis to explore the relation between those measures. Section 5 focuses on the effect of different response scales on the dimensionality of subjective well-being found when applying factor analysis. Section 6 compares how evaluative and experienced well-being measures differ in how they correlate with demographics. Section 7 concludes.

2 Data and Experiment

2.1 The RAND American Life Panel (ALP)

To conduct this research, we use data collected in the RAND ALP. At the time of the survey, the ALP consisted of approximately 5,500 respondents ages 18 and over who were interviewed periodically over the Internet. Respondents do not need Internet access to participate, although the majority of the panel members have their own Internet access. The remaining panel members (approximately 10 % of the sample) have been provided Internet access by RAND through the provision of a laptop or a Microsoft TV2 and/or an Internet subscription, eliminating the bias found in many Internet surveys that include only computer users (Chang and Krosnick 2009; Yeager et al. 2011). The TV2 is an Internet player that allows respondents to open email accounts and browse the Internet. Sampling weights are also provided by the ALP to adjust for sample selection. Upon joining the panel, respondents complete an initial survey collecting individual socio-demographic information, work history and household composition information. They are asked to update their background information every quarter. About once or twice a month, respondents receive an email with a request to fill out a questionnaire. Response rates average 70–80 %. Since January 2006, researchers have fielded over 300 surveys, and published papers using these data on a wide variety of topics, for instance subjective probabilities and expectations (Delavande and Rohwedder 2008; Manski and Molinari 2010), life satisfaction (Kapteyn et al. 2010), financial literacy (Bruine de Bruin et al. 2010; Fonseca et al. 2012; Lusardi and Mitchell 2007), and Presidential election polling (Gutsche et al. 2014).

Apart from its flexibility and cost effectiveness in collecting new data, an important advantage of the ALP is that it also allows for experimentation, e.g. by administering different surveys or different tasks to randomly selected subgroups. We make use of this feature in this paper by designing two experimental modules that were administered in the ALP. The first module was in the field from the beginning of May 2012 until July 2012, while the second module was fielded from the end of May 2012 until early August 2012. Of 5,495 eligible respondents, 4,339 answered our module for the first wave, a response rate of 79 %. Respondents who completed the first wave were then invited to answer questions in the second wave. Of 4,336 eligible respondents (three respondents of the first wave were not available for the second wave), 4,031 answered the module for the second wave, a response rate of about 93 %. The following sections describe the well-being measures collected in these modules as well as the experiment that we designed and implemented.

Finally, in administering the two modules in the ALP, we made sure that there would be at least 2 weeks between the waves. That is, a respondent became eligible for the second wave only 2 weeks or more after responding to the first wave. Respondents can answer questions at a time that is convenient to them, and as a result the time gap between the first and second waves may vary substantially. Figure 1 shows the distribution (in days) of the time gap between waves for the respondents in our sample. As per the protocol, the time gap is at least 14 days, with a very long tail, reflecting the fact that we kept the second wave in the field for a long time. The mean time gap between waves is 26.6 days with a standard deviation of 10.5 days. The peaks at 23 days and 30–39 days reflect email reminders to panel members, which had a noticeable effect on the number of responses.

Fig. 1 Time gap in responses between first and second wave

2.2 Well-Being Measures in Our Questionnaires

In the two modules we fielded in the ALP, we administered four sets of evaluative well-being measures and three sets of experienced well-being measures (Footnote 3). The evaluative well-being measures in our modules include the following: Diener’s five-item SWLS (Diener et al. 1985), in exactly the same form as it is included in the HRS and ELSA; a single item overall life satisfaction question, identical to the one included in SHARE; Campbell’s domain-specific life satisfaction (Campbell et al. 1976), used in the Gallup Wellbeing Index: Standard of Living and Personal Life; and the Cantril Self-Anchoring Striving Scale (Cantril 1965), often referred to as the Cantril ladder, used by the Gallup poll and the OECD. In addition to these, we also included four questions from ELSA based on those collected by the UK Office for National Statistics (ONS), which comprise one evaluative life satisfaction question, one eudemonic question and two experienced well-being questions related to feelings of happiness and anxiety during the previous day. Although two of the ONS–ELSA questions are experienced well-being questions, in our experiment they are included in the evaluative measures group, as we seek to maintain a questionnaire structure as close to the original ONS questionnaire as possible. We will see, however, that in the analyses these questions behave differently from the evaluative measures, as one would expect.

Our ALP modules also included three sets of experienced well-being measures, to be compared with the evaluative well-being measures described above as well as among themselves. Our first set of experienced well-being measures comes from ELSA’s simplified version of the DRM, which collects information about activities during the previous day and how individuals felt while doing these activities. Our second set of experienced questions is based on the Gallup-Healthways well-being index. These questions collect information on a number of measures capturing positive and negative affect experienced yesterday. Third, we included questions from the so-called HWB12, an experienced well-being measure newly developed by Smith and Stone (2011), which has been included in the 2012 wave of the HRS. The HWB12 is a measure of 12 overall experiences of hedonic well-being referring to the previous day. The authors recommend asking wake and sleep times as a minimal check that participants focus attention on remembering the previous day, and we did so as well. Finally, in order to facilitate the crosswalk across different experienced measures, we added different sets of additional questions to each of the experienced measures included in our questionnaire, as explained in more detail in the following sections.

2.3 Experiment

We fielded two waves of the ALP in which we administered four evaluative well-being measures and three sets of experienced well-being measures. All evaluative well-being questions were asked in both waves (Footnote 4). The experienced well-being measures show considerable overlap in the adjectives used in constructing the measures. To avoid contamination of responses within a wave, respondents answer only one randomly assigned set of experienced well-being measures in each wave. Since there are only two waves, no one responds to all three experienced measures. We do make sure, however, that all possible combinations of experienced measures occur across the two waves. More precisely, respondents are randomized into one of nine groups for the experienced well-being measures, where 1 denotes the Gallup questionnaire, 2 the ELSA questionnaire and 3 the HWB12 questionnaire: group 1–1, for example, sees the Gallup questionnaire in both waves, while group 2–3 sees the ELSA questionnaire in the first wave and the HWB12 questionnaire in the second wave. The nine groups are thus 1–1, 2–2, 3–3, 1–2, 1–3, 2–1, 2–3, 3–1 and 3–2.
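The assignment scheme just described can be illustrated with a minimal sketch. It enumerates the nine ordered pairs of questionnaires and draws one at random. The coding 1 = Gallup, 2 = ELSA, 3 = HWB12 follows the examples in the text; the equal assignment probabilities, the seed and the function names are assumptions made for illustration only, not a description of the ALP’s actual randomization routine.

```python
import random
from itertools import product

# Coding taken from the examples in the text: 1 = Gallup, 2 = ELSA, 3 = HWB12.
MEASURES = {1: "Gallup", 2: "ELSA", 3: "HWB12"}

# All ordered (wave 1, wave 2) pairs: 3 x 3 = 9 groups.
GROUPS = list(product(MEASURES, repeat=2))


def assign_group(rng: random.Random) -> tuple[int, int]:
    """Draw one of the nine groups with equal probability (an assumption)."""
    return rng.choice(GROUPS)


def share_seeing(measure: int) -> float:
    """Fraction of groups that answer `measure` in at least one wave."""
    return sum(measure in group for group in GROUPS) / len(GROUPS)


rng = random.Random(2012)  # hypothetical seed, for reproducibility only
wave1, wave2 = assign_group(rng)
print(f"group {wave1}-{wave2}: {MEASURES[wave1]} then {MEASURES[wave2]}")
for code, name in MEASURES.items():
    print(f"{name}: seen by {share_seeing(code):.3f} of groups")
```

Under equal group sizes, each experienced measure is answered in at least one wave by five of the nine groups, and no group can answer all three.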

As shown in the “Appendix”, for each set of experienced measures, respondents get a number of additional questions. The reason for this is as follows. The experienced measures differ in a number of ways. These include differences in the list of included items and differences in response scales. To be able to isolate the effects of differences in items and differences in response scales, we have added items to each of the experienced measures such that in each case a respondent answers exactly the same number of items. This allows us to look at both the effect of response scales (the different measures have different response scales, but the respondent answers the same number of items for every response scale) and the effect of the item choice (we can compare results with and without additional items; the additional items always come after the original set of items).

3 Descriptive Statistics

Table 1 shows the response duration of the different well-being measures we collected in the ALP modules. Since respondents do not have to complete a survey in one sitting, total survey times can sometimes appear extremely long. To exclude such cases we omit observations for which total time exceeds 30 min (a more generous limit, such as 1 h, does not change the results much). The table shows that the 15 concordance items in the experienced well-being measures using the HWB12 or Gallup response scales take less than 2 min on average. The same 15 items using the ELSA response scale take more than 4 min. An experienced question on a seven-point response scale, such as “Yesterday, did you feel happy”, takes about 17 s to answer, while the same question asked on a five-point response scale (as in HWB12) takes about 8 s, and a binary response scale question (as in the Gallup questions) takes about 8 s as well. Not surprisingly, the evaluative measures (Cantril, Diener, SHARE and ONS) take very little time. There is not much difference in duration across the waves.
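As a small illustration of the trimming rule described above, the following sketch drops durations above a cut-off before averaging. The durations are invented for the example and the function name is hypothetical; only the 30 min limit and the 1 h alternative come from the text.

```python
from statistics import mean

MAX_MINUTES = 30.0  # cut-off stated in the text


def mean_duration(minutes: list[float], limit: float = MAX_MINUTES) -> float:
    """Mean duration after dropping observations above `limit` minutes."""
    kept = [m for m in minutes if m <= limit]
    if not kept:
        raise ValueError("no observations at or below the limit")
    return mean(kept)


# Hypothetical durations in minutes; the last mimics an interrupted session.
durations = [1.5, 2.0, 2.5, 600.0]
print(mean_duration(durations))              # trimmed at 30 min
print(mean_duration(durations, limit=60.0))  # more generous 1 h limit
```

In this toy example both limits remove the same interrupted session, which mirrors the remark that a more generous limit changes little.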

Table 1 Duration in minutes of different well-being modules

3.1 Test–Retest Reliability of Measures

An important question of interest when fielding a survey on subjective well-being questions is the reliability of the resulting measures. We follow Krueger and Schkade (2008), and use a classical measurement error model \(y_{i} = y_{i}^{*} + \epsilon_{i}\), where \(y_{i}\) is the observed well-being item measure, \(y_{i}^{*}\) is the true value of the well-being item measure and \(\epsilon_{i}\) is an error term assumed to have expectation zero. This set-up suggests a definition of the reliability ratio as the correlation coefficient of measures across waves, \(r = \mathrm{corr}\left(y_{i}^{1}, y_{i}^{2}\right)\), where the superscripts refer to the waves in which the variables are measured. The reliability is thus measured here as a test–retest correlation between two waves of data, where the interval in our sample is at least 2 weeks and 26.6 days on average, as mentioned with respect to Fig. 1.
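To see why the cross-wave correlation can be read as a reliability ratio, consider a brief sketch under additional assumptions that are standard for the classical model but are not spelled out in the text: the true value \(y_{i}^{*}\) is unchanged between waves, the errors \(\epsilon_{i}^{1}\) and \(\epsilon_{i}^{2}\) are uncorrelated with \(y_{i}^{*}\) and with each other, and they share a common variance \(\sigma_{\epsilon}^{2}\). Writing \(\sigma_{*}^{2}\) for the variance of \(y_{i}^{*}\) (a symbol introduced here for illustration), we have

\[
\mathrm{cov}\left(y_{i}^{1}, y_{i}^{2}\right) = \sigma_{*}^{2}, \qquad \mathrm{var}\left(y_{i}^{1}\right) = \mathrm{var}\left(y_{i}^{2}\right) = \sigma_{*}^{2} + \sigma_{\epsilon}^{2},
\]

so that

\[
r = \mathrm{corr}\left(y_{i}^{1}, y_{i}^{2}\right) = \frac{\sigma_{*}^{2}}{\sigma_{*}^{2} + \sigma_{\epsilon}^{2}}.
\]

Under these assumptions \(r\) is the share of the observed variance that is due to the true value. If the true value itself changes between waves, the correlation falls below this ratio.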

Table 2 shows the reliability ratios for all the evaluative subjective well-being measures. The Diener SWLS shows a reliability of about 0.80, which is very close to the estimate of 0.82 by Diener et al. (1985), who used an interval of 2 months, and the estimate of 0.83 by Alfonso et al. (1996), who used an interval of 2 weeks between measurements. As one would expect, the single item scales for evaluative well-being yield somewhat lower correlations, on the order of 0.67. The two ONS questions about yesterday are, as discussed earlier, really experienced measures, and for these we observe lower correlations between the two waves. This reflects the fact that the specific reference to “yesterday” may pick up real changes in affect between different days. The Gallup measures referring to 5 years ago or 5 years in the future show lower reliability ratios than the one referring to the present, indicating possible error in recall of one’s situation 5 years ago and uncertainty about one’s future.

Table 2 Reliability ratios of the evaluative subjective well-being measures and ONS experienced measures (n = 3,938)

We also looked at correlations across waves for the measures of experienced affect on the previous day, presented in Table 3. As expected, we found lower correlations between waves, as changes may reflect both random measurement error and true changes in affect between the two days. Note that the table shows correlations for all items, i.e. we include both the original items of each scale and the items added from the other scales. The correlations for the added items are underlined. Since we do not use the ELSA limited DRM, but rather the ELSA response scale for all items that are in either Gallup or HWB12, all ELSA items are underlined. Recall that we did this so that we are able to compare response scale effects across a common set of items. Thus, a point of interest is to relate differences in correlations to differences in response scales (both the wording and the number of points on the response scale).
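The point that cross-wave correlations mix measurement error with true day-to-day change can be illustrated by a small simulation. This is a sketch with invented parameters, not an analysis of the ALP data: the standard deviations, the sample size and the function names are all hypothetical.

```python
import random
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_retest(n: int, sd_true: float, sd_error: float,
                    sd_change: float, seed: int = 0) -> float:
    """Test-retest correlation when wave 2 adds a true change in the score."""
    rng = random.Random(seed)
    wave1, wave2 = [], []
    for _ in range(n):
        latent = rng.gauss(0.0, sd_true)        # true value in wave 1
        change = rng.gauss(0.0, sd_change)      # true change before wave 2
        wave1.append(latent + rng.gauss(0.0, sd_error))
        wave2.append(latent + change + rng.gauss(0.0, sd_error))
    return pearson(wave1, wave2)


# Hypothetical parameters: a stable trait versus a score that truly
# varies from day to day, with the same measurement error in both cases.
stable = simulate_retest(20000, sd_true=1.0, sd_error=0.5, sd_change=0.0)
shifting = simulate_retest(20000, sd_true=1.0, sd_error=0.5, sd_change=1.0)
print(f"stable trait: r = {stable:.2f}")
print(f"with true day-to-day change: r = {shifting:.2f}")
```

With these made-up values the stable case has a theoretical correlation of 1/1.25 = 0.8, while adding true change lowers it to about 0.6 even though the measurement error is unchanged.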

Table 3 Reliability ratios across waves of experienced subjective well-being measures

The binary response scale used in the Gallup survey shows somewhat lower correlations across waves overall, between 0.28 and 0.49, than the five- and seven-point response scales used in the HWB12 and ELSA questionnaires, respectively. The ELSA response scale shows correlations ranging from 0.33 to 0.55, while the HWB12 response scale shows correlations between 0.42 and 0.59.

4 The Relation Between Evaluative and Experienced Well-Being Measures

There is a lively debate in the literature on the dimensions of well-being and what different measures are capturing (for a review, see Diener 2000). Uniquely, our data bring together many of the currently used subjective well-being measures and thus allow us to investigate how they are related. To determine the relation between the various measures we conducted a number of different factor analyses.

As noted, we have all evaluative measures for all respondents, but each experienced measure is only available for a randomly chosen five-ninths of the sample. In their original form, the Gallup and HWB12 measures are straightforward to use, since they produce ratings of a number of affect items. The ELSA questionnaire is more complicated to analyze, as it asks for ratings for a number of activities during the previous day, so we use only the ELSA response scale for comparison with the response scales used by Gallup and in HWB12. In the current section the purpose is to consider the items in the original scales, so we concentrate initially on analyses of the Gallup and HWB12 measures. The ELSA response scale will be evaluated when studying the concordance items, which can be found in all three experienced well-being measures. Both analyses cover all evaluative measures as well as their respective experienced measures. We performed a factor analysis using principal components. In all cases we retain factors with eigenvalues greater than one and rotate them orthogonally using the varimax method (Footnote 5).
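The procedure outlined above (principal components, retention of factors with eigenvalues greater than one, orthogonal varimax rotation) can be sketched as follows. This is a minimal illustration on simulated data, assuming NumPy is available; it is not the software or code used for the analyses in this paper, and the data-generating values, item counts and function names are hypothetical.

```python
import numpy as np


def retained_loadings(data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Loadings of principal components whose eigenvalue exceeds one."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                           # eigenvalue-greater-than-one rule
    return eigvecs[:, keep] * np.sqrt(eigvals[keep]), eigvals


def varimax(loadings: np.ndarray, max_iter: int = 100,
            tol: float = 1e-8) -> np.ndarray:
    """Orthogonal varimax rotation via the usual SVD-based iteration."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)        # column sums of squares
        target = rotated ** 3 - rotated @ np.diag(col_ss) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < criterion * (1.0 + tol):      # no further improvement
            break
        criterion = s.sum()
    return loadings @ rotation


# Simulated items (hypothetical): four "evaluative" and three "affect"
# items driven by two correlated latent variables.
rng = np.random.default_rng(0)
n = 2000
evaluative = rng.normal(size=n)
affect = 0.5 * evaluative + np.sqrt(0.75) * rng.normal(size=n)
columns = [evaluative + 0.6 * rng.normal(size=n) for _ in range(4)]
columns += [affect + 0.6 * rng.normal(size=n) for _ in range(3)]
items = np.column_stack(columns)

loadings, eigvals = retained_loadings(items)
rotated = varimax(loadings)
print("eigenvalues:", np.round(eigvals, 2))
print("rotated loadings:\n", np.round(rotated, 2))
```

In the tables that follow, the largest absolute loading in each row is highlighted; the analogue here is `np.abs(rotated).argmax(axis=1)`, which in this simulated example should group the first four items on one factor and the last three on the other.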

Table 4 presents the results for the Gallup case. The evaluative measures are grouped together in the upper part of the table and the Gallup experienced measures at the bottom. Factor loadings represent the direct effects of the factor on the observed variable (Bollen 1989). Large factor loadings (i.e. the largest number in absolute value on each row) are indicated in bold.

Table 4 Factor analysis: evaluative well-being and Gallup (original) experienced well-being (n = 2,724)

Using the criterion of only retaining factors with eigenvalues greater than one (Footnote 6), three factors are retained. The results confirm that evaluative and experienced well-being are distinct concepts. The evaluative measures form one factor, while the Gallup experienced measures appear to represent two factors. The factors representing experienced well-being form one positive and one negative affective dimension, indicating that negative affect is not just the opposite of positive affect. This is consistent with prior findings that positive and negative affect are highly distinctive, orthogonal dimensions, rather than opposites that would be strongly negatively correlated, so that individuals can experience both positive and negative affect simultaneously (Watson et al. 1988; Tuccitto et al. 2010). ONS_happy (“Overall, how happy did you feel yesterday?”) loads mainly on the evaluative first factor. Although the phrasing of the question would squarely put it in the experienced well-being domain, its location in the survey (right after an evaluative question, see “Appendix”) may have induced some respondents to use a global evaluation rather than focusing on yesterday’s affect.

Notably, ONS_worthwhile (“Overall, to what extent do you feel that the things you do in your life are worthwhile?”) does not appear to represent a different factor from the evaluative well-being factor. ONS_anxious loads on the negative affect factor, but with a surprising negative sign.

Table 5 shows the results when including the evaluative measures and the HWB12 experienced measures. In this case, four factors are retained, and their largest loadings in absolute value in each row are shown in bold. Again, the first factor represents evaluative well-being; the second factor now represents negative affect, while the third factor represents positive affect. The fourth factor mainly receives loadings from tired, bored, and pain, and thus represents a dimension related to fatigue rather than negative affect. These are all items that are not included in the Gallup item list. The items happy (“Yesterday, did you feel happy?”) and content (“Yesterday, did you feel content?”) load on all of the first three factors (negatively on the second, negative factor), while lonely (“Yesterday, did you feel lonely?”) loads negatively on factors 1 and 3, and positively on factors 2 and 4. ONS_happy (“Overall how happy did you feel yesterday”) loads on all of the first three factors, but negatively on the negative factor.

Table 5 Factor analysis: evaluative well-being and HWB12 (original) experienced well-being (n = 2,628)

Overall, a theme emerges of evaluative measures having different properties from experienced well-being measures. ONS_happy is somewhat of an exception, but as we observed before, the placement of this experienced well-being question immediately after an evaluative measure may have created confusion among respondents. We find that when conducting factor analyses on the Gallup and on the HWB12 items, evaluative measures form a distinct factor from the experienced measures. We find differences in the number of dimensions for experienced measures, with a positive and a negative factor in both Gallup and HWB12, as well as an additional fatigue factor for the HWB12 items. Our findings are in line with Headey et al. (1993), who conclude that life satisfaction and positive affect appear to measure relatively distinct dimensions (they find only a moderate correlation between the two dimensions). They also find that life satisfaction is highly negatively correlated with depression and moderately with anxiety, and recommend measuring life satisfaction, positive affect, anxiety and depression separately so that we can better understand the causes and consequences of mental health.

There are two main differences between Gallup and HWB12: both the included items and the response scales differ. So without further analysis it is impossible to say if the added dimension is the result of added items or due to the response scale differences. In order to disentangle those two effects, the next section shows the results of factor analyses when including a set of common items, which only differ in the response scales used.

5 The Effect of Response Scales

As noted in Sect. 2, we have added questions at the end of the various experienced well-being modules to allow for crosswalks between different instruments. As a result, respondents who received the HWB12 module, the Gallup module or the ELSA module answered the same items in number and nature, but with different response scales. The ELSA response scale is of the form: “Overall, how did you feel yesterday? Rate each feeling on a scale from 0—did not experience at all—to 6—the feeling was extremely strong”. The response scale in the HWB12 questionnaire is of the form (taking “happy” as an example): “Yesterday, did you feel happy? Would you say: not at all, a little, somewhat, quite a bit or very.” Finally, the Gallup question reads: “Did you experience happiness during a lot of the day yesterday? Yes or no”.

Thus, these items include both the original items of each scale and the items that were taken from the other scales. Tables 6, 7 and 8 therefore include all 15 experienced “concordance” measures—all with different response scales matching the original survey design.

Table 6 Factor analysis: experienced well-being, ELSA response scale (n = 2,703)
Table 7 Factor analysis: experienced well-being, HWB12 response scale (n = 2,690)
Table 8 Factor analysis: experienced well-being, Gallup response scale (n = 2,788)

Table 6 displays the results of the factor analysis for experienced measures using the ELSA response scale. Two factors emerge when keeping factors with eigenvalues greater than one. The first factor, which we call “Troubled/Fatigue” represents negative affect, loading on frustration, sadness, anger, fatigue, stress, loneliness, worry, boredom, pain and depression. The second factor, which we simply call “Positive” groups the positive experienced measures, loading on happiness, interest, enthusiasm, content and joy.
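The retention rule used here (keep factors with eigenvalues greater than one) can be illustrated with a minimal sketch. It assumes that the eigenvalues are those of the item correlation matrix, which the text does not state explicitly, and it runs on simulated data rather than on the ALP responses. The function name, the loadings, the factor correlation and the sample size are all illustrative assumptions.

```python
import numpy as np

def n_factors_kaiser(items):
    """Count eigenvalues of the item correlation matrix that exceed one."""
    corr = np.corrcoef(items, rowvar=False)   # columns are items
    eigenvalues = np.linalg.eigvalsh(corr)    # symmetric matrix, real eigenvalues
    return int(np.sum(eigenvalues > 1.0))

# Simulated data (hypothetical, not the survey data): 10 negative and
# 5 positive items, each driven by one of two correlated latent factors.
rng = np.random.default_rng(0)
n = 2000
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, -0.4], [-0.4, 1.0]], size=n)
negative = 0.8 * latent[:, [0]] + 0.6 * rng.standard_normal((n, 10))
positive = 0.8 * latent[:, [1]] + 0.6 * rng.standard_normal((n, 5))
items = np.hstack([negative, positive])

n_retained = n_factors_kaiser(items)
print(n_retained)
```

With this two-block structure the rule retains two factors, mirroring the negative/positive split described for the ELSA response scale. The sketch says nothing about how coarser response scales change the count.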

We repeated this factor analysis using the HWB12 response scale (see Table 7). This time, three factors remained: a negative factor (factor #1, which we call “Troubled” mainly loading on: frustrated, sad, angry, stressed, worried, depressed), a positive factor (factor #2, which we call “Positive”, mainly loading on: happy, interested, enthusiastic, content, joyful), and a factor grouping items somewhat related to fatigue (factor #3, which we call “Fatigue”, mainly loading on: tired, lonely, bored, and pain).

Finally, when conducting the same analysis using the binary Gallup response scale, three factors remained (Table 8). The first (frustrated, sad, angry, lonely, worried, depressed, which we call “Troubled”) and third (tired, bored, pain, which we call “Fatigue”) are negative, while the second one (happy, interested, enthusiastic, content, joyful, which we call “Positive”) is positive. Note that three original items are dropped, asking whether the respondent smiled or laughed a lot, was treated with respect, or would wish to have more days just like yesterday.

A number of preliminary conclusions emerge. The number of factors retained is sensitive to the response scales used. The binary Gallup response scale yields three factors, the five-point HWB12 response scale yields three factors and the seven-point ELSA response scale yields two factors. This finding appears consistent with the older factor analysis literature where it has been observed that using categorical variables may lead to more factors, particularly if the distributions of the variables are skewed (see for example Lord and Novick 1968, or Olsson 1979). In comparison with Tables 4 and 5, where only original items were included, HWB12 yields the same number of experienced factors (3), but Gallup yielded two experienced factors and one evaluative factor when its original items were included, whereas with the common set of items the Gallup response scale yields three experienced factors. Thus, the smaller number of experienced factors found in Table 4 is most likely due to the limited number of items included: boredom, fatigue, pain and loneliness, for instance, are missing from the original Gallup items, and these contribute substantially to factor 3 in Table 8.

Factor analyses were also conducted on the common set of items, including evaluative measures (not shown here). The results in terms of the number of factors emerging remain quite similar, with one evaluative and two experienced factors when using the ELSA response scale, though it is worthwhile noticing that the ONS anxiety measure loads positively on the negative experienced factors rather than on the evaluative factor. HWB12 generates three experienced factors. In the Gallup case a fourth experienced factor (eigenvalue of 0.98) emerges representing mainly stress and pain. Interpreting the larger number of factors as an artefact of the cruder response scales suggests that it is advisable to use a response scale with a fairly large number of response categories, e.g. seven as in the ELSA response scale. In that case, experienced well-being can be described by two dimensions, one positive and one negative.

6 Relation with Individual Characteristics

While an extensive literature exists on the determinants of evaluative well-being (see for example Dolan et al. 2008), much less is known about the determinants of experienced well-being. We concentrate here on demographic and socio-economic determinants. The motivation for this is that these appear most amenable to policy (e.g. with respect to income, work, education, or childcare), while there is a general interest in exploring how well-being varies with age, family composition (Deaton and Stone 2014) and gender. Furthermore, it is of interest to explore to what extent determinants of evaluative well-being are different from those of experienced well-being and whether the different dimensions of experienced well-being have different determinants. We investigate how the well-being measures are related to demographic variables, including race, gender, education level, age bracket, having a partner, as well as socio-economic variables such as income bracket and working status, while we also include self-reported health and number of children in the household in our model. Our questionnaires also included questions about respondents’ major life events taken from the HRS, which will be analyzed in a separate paper.Footnote 7 Formally, we specify the following model:

$$Y_{it}=\beta X_{it}+\epsilon_{it}$$

where \(X_{it}\) is a vector of covariates, while \(\epsilon_{it}\) represents random error uncorrelated with the observable covariates. The subscript t indicates the wave (1 or 2) and i indexes the respondent. The model is estimated by ordinary least squares, where we allow for correlation of \(\epsilon_{it}\) across the two waves (t = 1 or t = 2) by clustering standard errors on individuals.Footnote 8 The simple equation specified here is not meant to provide a complete model of determinants of well-being and indeed one can imagine that causality sometimes runs from well-being to some of the right hand side variables. It is of interest nevertheless to investigate if the well-being measures covary with other variables in a plausible manner and to see if the relation between well-being and the right hand side variables is the same for each measure.
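To make the estimation step concrete, the following is a minimal sketch of ordinary least squares with standard errors clustered on respondents. It is not the authors' code: the function name, the simulated two-wave panel and the absence of any small-sample correction are assumptions made for illustration only.

```python
import numpy as np

def ols_clustered(y, X, cluster_ids):
    """OLS coefficients with standard errors clustered on cluster_ids.

    No small-sample correction is applied (an assumption of this sketch).
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for cid in np.unique(cluster_ids):
        rows = cluster_ids == cid
        score = X[rows].T @ resid[rows]   # sum of x_it * e_it over the respondent's waves
        meat += np.outer(score, score)
    cov = XtX_inv @ meat @ XtX_inv        # sandwich covariance estimate
    return beta, np.sqrt(np.diag(cov))

# Simulated two-wave panel (hypothetical data): a respondent-level
# component makes the errors correlated across waves.
rng = np.random.default_rng(1)
n_resp = 2000
ids = np.repeat(np.arange(n_resp), 2)                        # two waves per respondent
x = np.repeat(rng.integers(0, 2, n_resp), 2).astype(float)   # time-invariant covariate
person = np.repeat(rng.standard_normal(n_resp), 2)           # shared across both waves
y = 1.0 + 0.5 * x + person + rng.standard_normal(2 * n_resp)
X = np.column_stack([np.ones_like(x), x])

beta, se_cluster = ols_clustered(y, X, ids)

# Conventional OLS standard errors, which ignore the within-respondent correlation
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
print(beta, se_cluster, se_naive)
```

In this simulated setting the covariate does not vary across waves and the errors are positively correlated within respondents, so the clustered standard error of the slope exceeds the conventional one. The point estimates are the same under either choice.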

Table 9 shows the results for the evaluative measures. We have omitted the Gallup measures for 5 years ago and 5 years in the future; similarly for ONS we have only included the one true evaluative measure “Satisfied”. Given the different reference time frame used by those Gallup items and the experienced and eudemonic measures of the ONS scale, we chose to include only items referring to the present and involving evaluative measures. Looking at the effects of gender, we observe that these vary by outcome measure and are mostly insignificant. Men are less likely than women to agree with the statement “If I could live my life again, I would change almost nothing”. There currently is no consensus in the literature on the nature of differences in subjective well-being by sex, as some studies have shown higher levels of happiness for men (Haring et al. 1984) which could be related to higher prevalence of depression in women than men (Diener et al. 1999), while others have found that women report higher happiness (Alesina et al. 2004), and yet other studies have found no evidence of gender effects on subjective well-being (Louis and Zhao 2002; Dolan et al. 2008). Interactions between gender and education, income and having a partner did not yield any statistically significant results. Having a partner increases life satisfaction according to all measures. This result has also been found by others in the literature (see e.g. Dolan et al. 2008; Blanchflower and Oswald 2004). The presence of children in the household does not seem to consistently affect the well-being of the respondent, though as pointed out by Deaton and Stone (2013), this could be a function of controlling for factors associated with having children, such as being married, richer, and healthier. The results also show that by and large Blacks and Hispanics report higher subjective well-being than non-Hispanic Whites. 
For education, the reference category is “graduate education”. Although many coefficients are not statistically different from zero, all significant coefficients are consistent with the positive relationship between education and well-being reported by Blanchflower and Oswald (2004).

Table 9 Regression of evaluative well-being measure on demographic and socioeconomic status variables

Subjective well-being increases monotonically with income according to all evaluative measures. In comparison to the reference category of respondents reporting an income above $100,000, we observe large negative and statistically significant coefficients for most lower income groups. The size of those coefficients suggests an almost linear relationship between income and subjective well-being measures in this income range. A positive relation between income and subjective well-being has been found many times in the literature, with existing research suggesting positive but diminishing returns to income (Dolan et al. 2008).

The reference category for age consists of respondents over 65. Several studies have suggested a “U-shape” in age with the lowest life satisfaction occurring in middle age (Dolan et al. 2008; Blanchflower and Oswald 2004). By and large that pattern is confirmed for the various well-being measures in the table. We observe that self-reported health (coded from 1 for Excellent to 5 for Poor, so that a negative coefficient indicates that better health is associated with higher well-being) is strongly correlated with well-being, which corresponds to general findings in the literature (Diener et al. 1999; Helliwell 2003).

With regard to working status, we used the category “working now” as a reference group, so that the results for individuals who are retired, disabled, unemployed, or in a different working situation (homemakers, or on sick leave, temporarily laid-off or other) represent differences with “working now”. Consistent with the literature, we observe a strong negative effect of being unemployed (see for instance Clark and Oswald 1994; Stutzer 2004; or DiTella et al. 2001). We also find a negative effect for being disabled, which appears in line with studies challenging the theory of hedonic adaptation whereby individuals suffering major changes in life circumstances, such as the onset of a disability, return to baseline levels of happiness (Lucas 2007). We also confirm prior findings (Kim and Moen 2002) of a strong positive relation between being retired and subjective well-being. Being in “Other work” has a positive, though not always significant, effect on subjective well-being.

Finally, the last five rows show the p values of joint significance tests for each category of characteristics. We cannot reject the hypothesis of no difference between the education categories except for the question “So far, I have gotten the important things I want in life”. Virtually all other categories are jointly significant.

The coefficients in Table 9 are not directly comparable across columns as the dependent variables are measured on different scales. However, if the scales were the only difference between the dependent variables, then coefficients in different columns should be fixed multiples of each other.Footnote 9 Table 10 summarizes the results of tests of proportionality of coefficients across the various models in Table 9. The null hypothesis for all the tests is formulated as follows: \(H_{0}: \frac{{\beta_{1,\,model\,1}}}{{\beta_{1,\,model\,2}}} = \frac{{\beta_{2,\,model\,1}}}{{\beta_{2,\,model\,2}}} = \cdots\). The entries in the table are the p values of tests of the null hypothesis for each of the pairs of models that we are considering. We observe that out of all ten possible combinations, the null hypothesis of proportionality of coefficients is rejected at the 5 % level four times. All four rejections involve either the SWLS based on averaging the item scores or the SWLS based on factor analysis.Footnote 10 Inspecting the five items that constitute the SWLS makes it clear that only one item (“I am satisfied with my life”) corresponds with the simple one shot questions of SHARE, ONS, and Gallup. This suggests that the SWLS measures a somewhat broader concept of evaluative well-being than the other three measures. Yet, remarkably in the factor analyses presented earlier, it appeared that the items on the SWLS all loaded on the same factor along with the SHARE, ONS, and Gallup items on an overall evaluative dimension.
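The logic behind the proportionality hypothesis can be illustrated with a small sketch: if two dependent variables differ only by a linear rescaling, the OLS slope coefficients of one regression are a fixed multiple of those of the other. The code below uses simulated data and hypothetical names throughout. It demonstrates the implication of the null hypothesis only and does not implement the statistical test reported in Table 10.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])  # intercept + 3 covariates
y_scale_a = X @ np.array([2.0, 0.8, -0.5, 0.3]) + rng.standard_normal(n)

# Same underlying measure reported on a different scale (hypothetical rescaling)
shift, factor = 1.0, 2.5
y_scale_b = shift + factor * y_scale_a

beta_a = np.linalg.lstsq(X, y_scale_a, rcond=None)[0]
beta_b = np.linalg.lstsq(X, y_scale_b, rcond=None)[0]

# Slope ratios (intercept excluded) are all equal to the rescaling factor
ratios = beta_b[1:] / beta_a[1:]
print(ratios)
```

Because OLS is linear in the dependent variable, the equality of ratios holds exactly in this construction. With real measures the ratios differ by sampling error even under the null, which is why a formal test is needed.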

Table 10 Testing the proportionality of coefficients—evaluative measures (p values)

Table 11 shows the results of regressions for the explanation of experienced well-being measures. The dependent variables are scales based on factor loadings from factor analyses presented in Tables 6, 7 and 8. So in all cases the scales are based on the common set of items. It is of interest to compare not only the experienced scales with one another (they differ only because of differences in response scales), but also the experienced scales with the evaluative scales, for which regressions were presented in Table 9. For both the ELSA and HWB12 scales males score lower on the negative affect (“Troubled”) scale (for the Gallup scale the coefficient is marginally significantly positive). Here again, interactions between gender and education, income and having a partner did not yield any statistically significant results. Having a partner has little effect on experienced well-being (although the HWB12 scale suggests a somewhat lower score on the “Fatigue” scale), in contrast to the findings for the evaluative well-being scales where the presence of a partner has a strong positive effect.

Table 11 Regression of experienced scales on demographic and socioeconomic status variables

The effect of ethnicity is hard to summarize. According to the ELSA scale Hispanics and Blacks experience more positive affect compared to non-Hispanic Whites. According to the Gallup scales Blacks and Hispanics experience less positive affect, while the HWB12 scale shows no significant effects of ethnicity on positive affect. For Blacks we find more negative affect for the Gallup scale. Hispanics are less troubled according to the Gallup scale and more tired according to the HWB12 scale. Education also shows patterns that vary by response scale. The ELSA and Gallup scales show few significant effects. The HWB12 scale suggests that individuals with lower education experience less positive affect, while they are also less troubled, but more tired, bored and suffering from pain.

The most striking contrast between evaluative and experienced well-being is the effect of income. Whereas for evaluative well-being we observe a strong positive relation with income, such a relation is hardly discernible for experienced well-being. This result is somewhat stronger than earlier findings by Kahneman and Deaton (2010), who found that while life evaluation items rise steadily with socio-economic status, experienced measures of well-being do not improve beyond an annual income of approximately $75,000. Here we find very little evidence of a relation with income, although interestingly the Gallup scale, which is also the scale used by Kahneman and Deaton (2010), produces marginally significant effects. Similarly, we observe that the U-shaped relation with age that we observed for evaluative well-being does not show up for experienced well-being. The results for labor market status show few consistent patterns across scales. As with evaluative well-being, health is an important determinant of experienced well-being. Both the ELSA and the HWB12 scale show that better health is associated with more positive affect and less negative affect (recall that Health is coded 1–5, so that a higher number means worse health). However, for the Gallup scale the effects are reversed.

Joint tests of significance for each category of respondent characteristics do not reject the null of no effect for education (with the exception of the HWB12 factors), income, age (with the exception of the ELSA “Troubled/Fatigue” scale and the HWB12 factors), and race (with the exception of ELSA “Positive” and Gallup “Troubled” and “Positive”). Work status shows the strongest effects. Only Gallup “Positive” and HWB12 “Positive” do not show a significant relation.

Table 12 presents results of proportionality tests of coefficients in the various columns of Table 11, analogous to the results presented in Table 10. Since the positive and negative affect scales are assumed to tap different dimensions, we would not expect the proportionality hypothesis to hold for the different affect scales within ELSA, Gallup, and HWB12. For ELSA and HWB12 that is indeed the case; the p values are 0.02 and 0.04 respectively. For Gallup this does not seem to be the case, however: the null of proportionality between the three different affect scales is not rejected. A second relation of interest is whether the positive affect scales across ELSA, Gallup, and HWB12 satisfy proportionality. That is indeed confirmed by the entries in the table; p values are 0.77, 0.59, and 0.92. Thirdly, we consider the negative affect scales. Here the expected patterns are somewhat less clear-cut as the negative affect scales vary somewhat across ELSA, Gallup, and HWB12. We do observe that the null of proportionality between ELSA Troubled/Fatigue and the Gallup and HWB12 Troubled and Fatigue scales is far from being rejected. Similarly we cannot reject the null of proportionality between HWB12 Troubled and Gallup Troubled, and between HWB12 Fatigue and Gallup Fatigue. On the other hand, proportionality is rejected for HWB12 Troubled and Gallup Fatigue, indeed suggesting that these scales measure something different.

Table 12 Testing the proportionality of coefficients—experienced measures (p values)

7 Conclusions

It is increasingly understood that traditional economic measures are necessary, but not sufficient, to measure societal progress (Stiglitz et al. 2009). Accordingly, in recent decades, research interest has been rising to find broader measures of well-being to be used to monitor societal progress and evaluate policy. The literature thus far has conceptualized subjective well-being either as the evaluation of life satisfaction/dissatisfaction (evaluative well-being measures) or as the combination of experienced affect (range of emotions from joy to misery).

In this paper, we conducted an experiment to investigate the relations between a number of evaluative and experienced measures (and one eudemonic measure), using the ALP. This is the first time that all these different types of measures have been collected jointly in a population survey. Although the concepts asked in the different experienced measures included in our experiment are in some cases the same, the measures differ in the response scales of their questions, so we also studied the correspondence across these different scales. The experiment confirms a number of findings in the literature and yields some new results.

We find that all evaluative measures load on the same factor. Although this would suggest that there is not much to choose among them, the test results presented in Table 10 show that the two SWLS scales (the one based on averaging items and the one based on factor analysis) have a different relation with demographics and self-reported health than the other three single item scales. Hence, for analyses of determinants of subjective well-being it does matter which measure one uses. The ONS-flourishing (eudemonic) measure does not seem to represent a separate factor; it mainly loads on the common evaluative factor.

The positive and negative experienced affect measures load on different factors, thus confirming that positive and negative affect are not simply opposite poles on the same scale. Depending on the response scale used, we find that negative affect can be represented by one or two factors. The ONS-happy measure loads both on the evaluative factor and on the positive and negative affect factors. It is not entirely clear why this happens, but one possibility is the design of the ONS questionnaire, which places this experienced measure directly after an evaluative question. Both previous points suggest the need for more work on the structure of questionnaires (response scales, lay-out, question order, etc.).

The relation of evaluative and experienced measures with demographics is markedly different. For instance, evaluative well-being increases monotonically and almost linearly with income; for experienced well-being no such relation with income is found. Evaluative well-being shows a U-shaped relation with age, while for experienced well-being no such relation is found. Also, health and labor market status, which have clear and significant effects on evaluative well-being, do not appear to have much of a consistent influence on experienced well-being. Whether one finds a relation or not appears to depend on the kind of response scale used in eliciting items. In general, it appears that the relation between experienced measures and demographics is much weaker than between evaluative measures and demographics.

This paper pays considerable attention to the effect of the response scales used for the affect measures. The different response scales imply a different number of underlying factors and different relations with demographics. This is clearly undesirable given that they all are based on the same items. The relation between experienced well-being and personal circumstances and demographics should not depend on whether we use a binary response scale, a five-point response scale, or a seven-point response scale. In a number of ways the ELSA seven-point response scale appears to behave better than the other coarser response scales (especially the Gallup response scales). This result is consistent with the view that a larger number of answer categories yields higher data quality, through higher validity and lower residual error (Andrews 1984). Partly this can be ascribed to the fact that with finer response scales, respondents can express their feelings in a more nuanced way, while assumptions of underlying normal distributions (which motivate many of the statistical procedures) will be closer to being satisfied by the data.