1 Introduction

Flourishing is defined as a state of well-being in which individuals thrive in the social, emotional, and psychological areas of their lives [1,2,3]. The construct goes beyond the absence of pathology to include positive outcomes such as emotional regulation, supportive relationships, meaning and purpose, and life satisfaction. Historically, much research on well-being and flourishing has been conducted with adult populations rather than youth [4]. In recent years, however, there has been a push to better understand adolescent well-being.

Well-being research shows that flourishing is associated with positive life outcomes and circumstances for youth and adults, including supportive social networks and relationships, positive work life, higher physical and mental health levels, and improved school performance [3, 5,6,7,8]. Given the prevalence of psychosocial stressors present during adolescence and the nature of adolescence as a critical period for social and emotional growth, it is essential to understand and measure adolescent well-being accurately to help them take steps toward flourishing.

Improving adolescents' well-being also has broader societal implications. Historically, developmental science, psychology, education, and other fields have underestimated adolescents, tending to focus on the problems they face (e.g., learning difficulties, mental illness, low motivation, substance use) rather than the strengths they possess [9, 10]. However, positive youth development research and similar research areas identify adolescents as having unique resources that they can use to contribute meaningfully to their community [9, 11]. Although working to improve adolescents' well-being and enabling them to use their strengths to contribute meaningfully to society is an important task, it may be challenging if it cannot be measured. Thus, the purpose of the present study is to examine the psychometric properties of the Survey on Flourishing (SURF), a measure of subjective well-being, using an adolescent sample from the United States.

1.1 Measuring well-being

Well-being is a broad, multifaceted construct and has historically been challenging to define. For decades, well-being was operationalized as the absence of physical or mental malfunction. However, more recent research indicates that well-being is not merely the absence of problems but includes assets, strengths, values, and other positive characteristics [12]. In a classic paper on subjective well-being, Diener defined subjective well-being as a combination of positive emotion and life satisfaction [13]. Currently, definitions of subjective well-being most commonly include two components: emotional well-being, which consists of the presence of positive emotion and life satisfaction, and positive functioning, which comprises social functioning (e.g., social integration and contribution) and psychological functioning (e.g., autonomy and personal growth) [12].

These components of well-being also apply to adolescents. Researchers have identified specific developmental tasks that may indicate whether a child is doing well. These critical tasks in adolescence include academic achievement, forming close peer relationships, learning to follow rules, participating in extracurricular activities, and forming a sense of self-identity [14]. Critical to healthy adolescent development, these tasks generally align with well-being's social, emotional, and psychological components.

In addition to Diener’s definition of well-being, other models of well-being have been proposed, such as Seligman’s five-factor PERMA model, the Ryff model of psychological well-being, and Keyes’ model of social well-being [2, 15,16,17]. Several measures have been developed using these models, such as the PERMA Profiler, the EPOCH measure of adolescent well-being, the Ryff Scales of Psychological Well-being, and the Keyes Social Well-being Scale [15,16,17,18]. Recent literature reviews have identified many other measures of adolescent well-being, some notable measures including the Youth Quality of Life Instruments (Y-QOL), the Flourishing Scale, the Mental Health Continuum scales (MHC), the Warwick-Edinburgh Mental Well-being Scale (WEMWBS), the Child and Adolescent Wellness Scale (CAWS), the Social-Emotional Health Survey (SEHS), and the World Health Organization 5 well-being index (WHO-5) [19,20,21,22].

Although these measures differ somewhat in their underlying theories and content, researchers generally agree that subjective well-being is a subjectively experienced, multifaceted construct. The substantial overlap among these measures suggests that subjective well-being consists of social (e.g., connection, supportive relationships), psychological (e.g., purpose, achievement), and emotional (e.g., life satisfaction, positive emotion) components. Statistical approaches used to examine some of the most commonly referenced models suggest that a bi-factor approach may be effective for measuring subjective well-being [23, 24]. Specifically, after accounting for general subjective well-being or positive emotion, other components may explain additional (although relatively small) differences in people’s levels of subjective well-being. It is important to note that the literature does not provide a consensus on how many secondary factors exist or what they represent. However, the measures identified above generally examine social, psychological, and emotional well-being domains.

1.2 Limitations to current measures

Although the development of measures of adolescent subjective well-being represents significant progress, some limitations may impact their utility. First, measure content must be considered. Well-being researchers agree that subjective well-being is a broad, multifaceted construct. While three generally agreed-on domains include social, emotional, and psychological well-being, many potentially essential domains, such as gratitude, transcendence, or mindfulness, remain largely untapped. Other measures of subjective well-being focus on a single facet of well-being. Although this narrowing may be intentional, such measures may be too narrow to capture elements essential to broader adolescent subjective well-being. Similarly, Seligman suggests that no single measure can capture the breadth and depth of well-being [25]. Thus, although these measures may provide valuable information, they may be most useful when combined with measures that examine alternate facets of well-being. The depth and breadth of subjective well-being suggest a need for additional measures that expand on the content of current measures.

Second, the generalizability of adolescent well-being measures depends on the sample used to examine their psychometric properties. Several measures, such as the EPOCH, the SEHS, and the Y-QOL, have had their properties thoroughly examined within broad U.S. samples and other populations [16, 26, 27]. Sample diversity is an increasingly important priority in the development of well-being measures. However, other measures relied on samples from outside the United States or did not collect nationally representative data when their psychometric properties were examined. Because it may be inappropriate to expect a measure validated in one setting to perform equally well across cultural groups, collecting a diverse sample is essential for testing performance and validity. Relatedly, sample characteristics are a commonly identified area for improvement in recent well-being literature [20]. Overall, current research suggests a continued need to develop and validate well-being measures across diverse samples.

Finally, other relatively minor limitations may affect the utility of these measures. First, some measures, such as the 100–150 item CAWS, are notably lengthy [28]. Shorter measures may be more time-efficient while still demonstrating good psychometric properties with little item overlap. Second, accessibility determines the extent to which the measures can be used for many practical purposes. Of these measures, the PGI is not available in the public domain, and the CAWS and WEMWBS are free to use with developer permission. The SEHS, EPOCH, MHC-SF, Ryff scales, and the WHO-5 are open-access measures with developer acknowledgment [19]. Although paid measures may be effective, free measures, such as the EPOCH and the SURF, may display similar effectiveness and allow more widespread use.

In summary, although developing measures of adolescent subjective well-being is a step forward, several measures have limitations related to content domain, sample population, accessibility, and length that impact their utility as adolescent subjective well-being outcome measures. Because adolescent well-being is becoming a greater priority in society, accurate measurement tools must be available to help individuals understand and improve it. In this paper, we examine the psychometric properties of the Survey on Flourishing adolescent version (SURF) as a novel measure of subjective well-being that is accessible and quick to administer. We also examine its reliability, validity, and factor structure.

1.3 The Survey on Flourishing

This study uses current research on adolescent well-being to examine the psychometric properties of the Survey on Flourishing (SURF) in an adolescent sample within the United States. The original SURF questionnaire was designed to measure subjective well-being by including items reflecting positive functioning and emotional well-being. In prior studies involving adolescents and adults, the SURF showed good reliability and validity [29], so we expected it to demonstrate similar psychometric properties and structure in the present study. Specifically, this study aimed to examine the utility of the SURF by examining its internal consistency, factor structure, and convergent and discriminant validity.

Regarding the reliability of the SURF, we expected that the SURF would demonstrate good internal consistency, with a Cronbach’s alpha (a function of the number of items and the average inter-item correlation) between 0.80 and 0.90. An alpha in this range indicates strong internal consistency, which is one facet of reliability.
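For concreteness, Cronbach’s alpha can be computed directly from an item-response matrix; the sketch below uses invented toy data, not the study’s data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy example: 2 respondents x 2 items with proportional responses
toy = np.array([[2, 4], [4, 8]])
print(round(cronbach_alpha(toy), 3))  # 0.889
```

Perfectly consistent response patterns drive alpha toward 1.0, while uncorrelated items drive it toward 0.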

Regarding the factor structure of the SURF, we expected the items to load onto a single general factor of adolescent subjective well-being based on a prior study using the SURF in an adult population (see Methods). This past research suggests that the SURF measures a unitary construct, which may represent the “higher level” construct described in bi-factor models.

Regarding the SURF’s validity, we expected the SURF to show a strong positive correlation (r > 0.70) with similar measures of well-being, such as the PANAS Positive Affect subscale (PANAS-Pos) and the Satisfaction With Life Scale (SWLS), while showing a negative correlation (r < − 0.5) with discriminant measures such as the PANAS Negative Affect subscale (PANAS-Neg). These predictions were based on a previous study in which the SURF demonstrated similar psychometric properties when used with adolescents [29]. Good convergence with the PANAS Positive Affect subscale and the SWLS would suggest that the SURF measures a similar construct, whereas a low correlation with the discriminant measure would indicate that the SURF measures a construct distinct from negative affect. Together, these comparisons provide evidence that the SURF measures what it purports to measure.

We planned to examine the test–retest reliability and calculate the SURF test–retest reliability and Reliable Change Index (RCI). However, we could not conduct these analyses due to invalid second-phase data from the data collection site. We discuss this further in the discussion section.
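Although the test–retest analyses could not be run, the planned RCI follows the standard Jacobson–Truax form. The sketch below uses invented inputs, not study values:

```python
import math

def reliable_change_index(x1: float, x2: float, sd1: float, r_xx: float) -> float:
    """Jacobson-Truax Reliable Change Index: the change score divided by the
    standard error of the difference; |RCI| > 1.96 suggests reliable change."""
    se_measurement = sd1 * math.sqrt(1 - r_xx)  # standard error of measurement
    s_diff = math.sqrt(2) * se_measurement      # standard error of the difference
    return (x2 - x1) / s_diff

# Hypothetical example: a 10-point gain with SD = 10 and reliability 0.92
print(round(reliable_change_index(100, 110, 10, 0.92), 2))  # 2.5
```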

2 Method

2.1 Participants

A total of 380 participants completed an online questionnaire using Qualtrics Online Sample from July to October 2021. Before delivering the data, Qualtrics removed 17 individuals who failed an attention check item (i.e., “Please answer ‘Strongly Disagree’ to this item”) and 11 participants who completed the study measures in less time than two standard deviations below the average completion time. To ensure consistent and reliable responses, we created a response validity scale using two matched item pairs; 18 participants were excluded because their average response deviation exceeded two standard deviations from the average. In total, 46 (12.1%) individuals who initiated the survey were removed from the analysis, resulting in a final sample size of 334 participants. Participants ranged in age from 12 to 17 years (M = 14.8), and 176 (52.7%) were female.
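The screening steps described above can be sketched as follows; the simulated data, distributions, and variable names are invented for illustration, though the two-standard-deviation thresholds mirror the description:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
durations = rng.normal(600, 120, n)         # completion time in seconds (simulated)
pair_dev = np.abs(rng.normal(0.0, 1.0, n))  # |difference| across matched item pairs

# 1. Speeding: faster than two SDs below the mean completion time
speed_cut = durations.mean() - 2 * durations.std(ddof=1)
too_fast = durations < speed_cut

# 2. Response validity: matched-pair deviation more than two SDs above the mean
dev_cut = pair_dev.mean() + 2 * pair_dev.std(ddof=1)
inconsistent = pair_dev > dev_cut

keep = ~(too_fast | inconsistent)
print(f"retained {keep.sum()} of {n} respondents")
```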

Participants' gender and race/ethnicity were recorded as follows: 47% identified as male, 52.7% identified as female, and 0.3% identified as transgender, nonbinary, or another gender identity. Regarding race/ethnicity, 63.2% identified as White, 14.1% as Hispanic or Latino, 13.8% as Black or African American, 4.5% as Asian, 0.6% as Native American/American Indian or Alaska Native, 0.3% as Native Hawaiian or Pacific Islander, and 3.4% as another race or multiple races, with 0.3% choosing not to respond. Regarding region, 19.8% of participants lived in the Northeast, 20.7% in the Midwest, 18.9% in the West, and 40.7% in the South. The sample was representative of the nation based on region, race/ethnicity, and sex, with the exception of adolescents who did not speak English, as the survey was conducted in English. Table 1 provides more detailed demographic information and corresponding percentages based on the US 2020 census.

Table 1 Demographic summary and normative data from each sample

2.2 Procedures

The data were collected through an online survey after the Institutional Review Board at Brigham Young University approved the study procedures. All individuals contacted were allowed to participate in the study, though participation was completely voluntary; inclusion required participants to speak English. Parent consent and child assent forms were presented on the first page of the survey, and all participants completed them before participation. Upon completion of the consent forms, participants were either shown a study completion page if they opted out or directed to the first page of the study measures. Participants were asked to complete the measures, lasting approximately 25 min, in one sitting. Participants were also sent identical study measures 2 weeks after the initial measures were completed and were given 1 week to complete the second session. Participants were compensated by Qualtrics Online Sample after each session was completed.

2.3 Measures

2.3.1 Survey on Flourishing (SURF)

The Survey on Flourishing (SURF) measured subjective well-being (see Table 2). The SURF is a 20-item Likert scale designed to measure subjective well-being that is relatively brief, sensitive to change, and representative of the breadth of domains that contribute to human flourishing. To go beyond a generic cognitive self-assessment of hedonic well-being (happiness, satisfaction, feeling good), the SURF also taps into domains of eudaimonic well-being (engagement, growth, living well). Consequently, items were designed to assess important areas such as social connection, purpose, contribution, transcendence, and vitality, in addition to typical subjective well-being items assessing positive emotions and life satisfaction. The intention was to produce a global measure of flourishing that acknowledges this construct's multidimensional nature while being sufficiently brief and appropriately sensitive to change, facilitating its use as a research tool for evaluating well-being interventions. The SURF contains four negatively worded items reflecting the content of four positively worded items. These item pairs were intended to provide a more robust measurement of their respective domains. Including negatively worded items also may protect against some types of response bias. For all questions, respondents rate their agreement on a 7-point scale ranging from “strongly disagree” to “strongly agree.” The final score is calculated by taking the total of all items. The SURF requires approximately 5–10 min to complete.
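Assuming the four negatively worded items are reverse-keyed before summing, as is standard for mixed-keyed scales (the item positions below are placeholders, not the SURF's actual layout), scoring might look like:

```python
import numpy as np

NEG_ITEMS = [4, 9, 14, 19]  # 0-based positions of reverse-keyed items (illustrative)

def score_surf(responses: np.ndarray) -> int:
    """Total score for one respondent's 20 item ratings on a 1-7 scale:
    reverse-key the negatively worded items, then sum all items."""
    r = responses.astype(int).copy()
    r[NEG_ITEMS] = 8 - r[NEG_ITEMS]  # on a 1-7 scale, reverse-keying is 8 - x
    return int(r.sum())

# A respondent answering "strongly agree" (7) to every item
print(score_surf(np.full(20, 7)))  # 16*7 + 4*1 = 116
```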

Table 2 SURF Items and associated well-being construct

In a study examining the psychometric properties of the SURF with adults using multiple samples (manuscript in preparation), the SURF demonstrated high internal consistency (⍺ = 0.93–0.96) [30]. It also demonstrated convergent validity by correlating significantly with other measures of subjective well-being, including the PERMA profiler (r = 0.82), the Satisfaction with Life Scale (r = 0.74), and the PANAS Positive Affect subscale (r = 0.74). It also negatively correlated with discriminant measures, such as the Negative Affect subscale of the PANAS (r = − 0.61). In a previous study involving a small sample of adolescents from a high school in the Mountain West, the SURF also demonstrated high internal consistency (⍺ = 0.94). It also demonstrated convergence with the PERMA Profiler (r = 0.79), the SWLS (r = 0.75), and the PANAS-Pos (r = 0.69) [29].

2.3.2 Positive and negative emotion schedule, short form (PANAS) [31]

The PANAS measured positive and negative affect because affective experience is essential to subjective well-being. The positive affect subscale of the PANAS was used as evidence for convergent validity, and the negative affect subscale was used for discriminant validity. The PANAS-SF contains 20 items, each of which is a positive affect word (e.g., “enthusiastic” or “inspired”) or a negative affect word (e.g., “scared” or “hostile”). Respondents then used a five-point Likert scale to report the extent to which they were currently experiencing each emotion. Previous research has estimated the test–retest reliability after 8 weeks to be 0.54 on the positive affect scale and 0.45 on the negative affect scale [31].

2.3.3 Satisfaction with Life Scale (SWLS)

The Satisfaction with Life Scale was used to measure overall life satisfaction and convergent validity. The SWLS is a five-item Likert-style scale, which sums the item responses to estimate a total subjective well-being score [32]. It is the most commonly used instrument to measure life satisfaction and research has supported its reliability and validity in many populations, including adolescents [33]. The SWLS has demonstrated good test–retest reliability both after 2 weeks (r = 0.83) and after 1 month (r = 0.84) [34, 35]. Those people expected to report low life satisfaction scores (e.g., prison inmates, women experiencing intimate partner violence, and psychiatric patients) demonstrated low scores on the SWLS [35]. The SWLS also demonstrated convergent validity; the SWLS correlated significantly with other subjective well-being measures, including the Andrews/Withey Scale (r = 0.52–0.68) and the Fordyce Global Scale (r = 0.55–0.82), as well as interviewer ratings (r = 0.43–0.66) and informant reports of well-being (r = 0.28–0.58) [32, 35,36,37].

2.4 Data analyses

The data were analyzed using the Stata v16.1 statistical package. Internal consistency was determined through Cronbach’s alpha (α) and Pearson bivariate correlations. To investigate the factor structure of the SURF, we compared the fit of four competing models using confirmatory factor analysis.

The primary model we examined was a one-factor model with the latent variable of “subjective well-being” predicting scores on each item. Our decision to run a one-factor model was grounded in previous research examining the performance of the original SURF in an adult population, which demonstrated that the SURF items loaded onto a single factor identified as subjective well-being. Additionally, research on similar measures of well-being found that a general factor of subjective well-being explained a large portion of the variance in users’ scores [38,39,40]. Although prior research also suggests that a bi-factor model may fit observed data well, we tested a one-factor model of subjective well-being because the SURF was not designed with discrete factors in mind. After we conducted our a priori analysis, we used modification indices in an exploratory fashion to identify other options for improving our primary model’s fit; however, we determined that no modifications were necessary.

The second model we examined was a bi-factor model with all items loading onto a general subjective well-being factor and specific items loading onto secondary factors representing social, emotional, and psychological well-being. These secondary factors were based on conceptual definitions of subjective well-being, and the item assignments were determined by a qualitative examination of each item’s content. We included this model after examining previous literature, which suggested bi-factor models have shown effective fit with alternate measures of well-being, such as the MHC, PERMA, and the WEMWBS discussed above.

In addition to the two models listed above, we ran two other models to investigate the impact of negatively worded items on the SURF. Research suggests that having an unequal amount of negatively and positively worded items may cause an unintended, “negatively worded item” factor to emerge during a CFA due to response bias [41, 42]. The third model we examined was a two-factor model consisting of a broad, subjective well-being factor and a negatively worded item factor. This model examined whether negatively worded items resulted in a statistical artifact. This model type has been used in a similar measurement study to investigate the impact of negatively worded items on the measurement instrument [18].

The final model we examined was a one-factor model of broad, subjective well-being, similar to Model 1. However, with this model, we aimed to account for the effect of negatively worded items by allowing the error variances of the four negatively worded items to covary. We compared this model to the two-factor model to explore whether the negatively worded items in the measure comprised an independent factor or fit better in a one-factor model.
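For illustration, a model of this kind can be written in lavaan/semopy-style syntax; the item numbering and the positions of the reverse-keyed items below are placeholders, and with the semopy package this string would be passed to semopy.Model:

```python
# One general factor with correlated error variances among the four
# negatively worded items (item positions are illustrative).
MODEL_4 = """
SWB =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 + i10 + i11 + i12 + i13 + i14 + i15 + i16 + i17 + i18 + i19 + i20
i5 ~~ i10
i5 ~~ i15
i5 ~~ i20
i10 ~~ i15
i10 ~~ i20
i15 ~~ i20
"""
```

The six `~~` lines enumerate every pairwise error covariance among the four reverse-keyed items.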

Lastly, the SURF mean scores were correlated with the PANAS positive affect subscale, PANAS negative affect subscale, and the Satisfaction with Life Scale total scores to determine the measure’s convergent and discriminant validity.

3 Results

This study aimed to assess the reliability, validity, and factor structure of the SURF. After cleaning the data, we evaluated its internal consistency, convergent and discriminant validity, and factor structure.

3.1 Data preparation

Before running our principal analyses, we conducted preliminary analyses to determine whether our data met the assumptions of normality for our planned statistical tests. We first identified outliers in the mean scores for the SURF, PANAS scales, and SWLS, defining outliers as observations more than two standard deviation units above or below the median score. We then fenced outliers to these outer bounds (median plus or minus two interquartile ranges). Fencing was used, as opposed to removing outliers, to minimize the possibility of skewed results caused by participant responses far beyond the sample's median value. Fencing responses also allowed us to retain participant data, an important consideration given this study’s relatively small sample size. Ultimately, we identified and fenced 11 observations to the lower bound of the SURF total score, 10 observations to the upper bound of the PANAS Negative Affect subscale, and nine observations to the upper bound of the SWLS.
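The fencing step can be sketched as a winsorizing operation; the sketch follows the parenthetical median ± 2 IQR bounds and uses toy data, not the study's scores:

```python
import numpy as np

def fence_outliers(x: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Clamp scores to median +/- k interquartile ranges rather than
    dropping them, preserving the sample size."""
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    lo, hi = med - k * (q3 - q1), med + k * (q3 - q1)
    return np.clip(x, lo, hi)

# The extreme value 100 is pulled in to the upper bound (3 + 2*2 = 7)
print(fence_outliers(np.array([1, 2, 3, 4, 100])))  # [1. 2. 3. 4. 7.]
```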

We then examined the univariate normality of the SURF, PANAS subscale, and SWLS scores. A joint chi-squared probability test for normality indicated that the SURF mean (p > 0.01), the PANAS Negative Affect subscale mean (p > 0.01), the PANAS Positive Affect subscale mean (p > 0.01), and the SWLS mean (p > 0.01) each deviated from normality. However, because of the nature of these measures, we concluded that non-normally distributed data were to be expected; for example, most participants likely reported low levels of negative affect. Thus, we determined that no data transformations were necessary.

We also conducted a chi-square goodness-of-fit test to determine whether our sample was distinguishable from national statistics. The sample did not differ significantly from overall US Census data based on race/ethnicity (χ2 (7, N = 334) = 7.99, p = 0.33), gender (χ2 (2, N = 334) = 0.10, p = 0.95), or region (χ2 (3, N = 334) = 5.84, p = 0.12).
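A goodness-of-fit test of this kind compares observed counts against census-implied expected counts; the counts and census shares below are invented for illustration, not the study's figures:

```python
import numpy as np

observed = np.array([66, 70, 64, 134])         # NE, MW, West, South (illustrative)
census_p = np.array([0.17, 0.21, 0.24, 0.38])  # hypothetical census shares
expected = census_p * observed.sum()           # expected counts under H0

chi2 = ((observed - expected) ** 2 / expected).sum()
# chi-square critical value for df = 3 at alpha = .05 is 7.81
print(f"chi2 = {chi2:.2f}; differs from census: {chi2 > 7.81}")
```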

3.2 Internal consistency

Results demonstrated that the SURF’s internal consistency was high (α = 0.92; see Table 3). The average inter-item correlation for the SURF was 0.36. These results support our expectation that the measure would have good internal consistency. Of note, the SWLS (α = 0.86), PANAS positive affect subscale (α = 0.94), and the PANAS negative affect subscale (α = 0.92) also showed high internal consistency.
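As a consistency check, the reported average inter-item correlation and item count imply roughly the reported alpha via the standardized (Spearman–Brown) form:

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized Cronbach's alpha from the number of items and the
    average inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# 20 items with mean inter-item r = .36, as reported for the SURF
print(round(standardized_alpha(20, 0.36), 2))  # 0.92
```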

Table 3 Cronbach’s alpha and correlations between the SURF and other measures

3.3 Factor structure

Although Cronbach’s alpha provides evidence for internal consistency, it alone does not provide adequate information about the dimensionality or factor structure of the SURF. To obtain information about the factor structure of the test items, we conducted a confirmatory factor analysis (CFA). Previous research examining the factor structure for the original SURF questionnaire found that all test items loaded strongly onto one general factor, which was identified as subjective well-being. Because we expected that the SURF items are broad enough to allow for differences in interpretation between the general population and adolescents yet specific enough to retain good interpretability, we expected a one-factor model would demonstrate a good fit with the observed data. We compared the fit of our primary model with an alternate bi-factor model based on the current literature.

3.3.1 Model 1

Model 1 consisted of a one-factor model, with each item loading onto a latent variable representing subjective well-being (see Fig. 1). This model was identified according to the three-indicator rule [43]: a single-factor model is identified if it has three or more indicators and no correlated error terms. Our primary one-factor model of subjective well-being demonstrated adequate fit to the data (χ2 (170, N = 334) = 528.51, p < 0.001; model fit statistics can be seen in Table 4). The model’s root-mean-square error of approximation (RMSEA) was 0.08, which suggests moderate fit when considering the parsimony of the model (RMSEA values < 0.08 indicate “acceptable” fit, while values < 0.05 indicate “good” fit) [44]. The standardized root mean squared residual (SRMR), which reports the average difference between the observed and implied covariances for the SURF items, was 0.06, suggesting moderate fit. The comparative fit index (CFI), which quantifies how much the identified model improves fit relative to a null model, was 0.87, indicating that the model recovered about 87% of the possible improvement over the null model (a CFI above 0.90 is conventionally considered adequate) [45]. The Bayesian Information Criterion (BIC), which can be compared to the BIC of other models to examine relative fit, was 22,106.87. Together, these fit statistics suggest the model demonstrated adequate fit to the data. Per our a priori specifications, we examined modification indices for possible post hoc changes to the model, although we determined that no changes were theoretically supported.
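The RMSEA and CFI formulas can be applied directly to chi-square values of this kind; the Model 1 values above reproduce the reported RMSEA, while the null-model chi-square used in the CFI check below is hypothetical, since it is not reported:

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root-mean-square error of approximation."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2: float, df: int, chi2_null: float, df_null: int) -> float:
    """Comparative fit index: proportion of the null (baseline) model's
    misfit eliminated by the hypothesized model."""
    d_model = max(chi2 - df, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 - d_model / d_null

# Model 1 values reported above reproduce RMSEA = 0.08
print(round(rmsea(528.51, 170, 334), 2))  # 0.08
```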

Fig. 1
figure 1

Diagram of Model 1 structure

Table 4 Fit statistics for the models tested

3.3.2 Model 2

The second model we examined was a bi-factor model of subjective well-being. In this model, all items loaded onto a broad factor of well-being and onto one of three factors representing social, psychological, and emotional well-being (see Fig. 2). These factors were determined based on previous literature regarding the factor structure of subjective well-being [38,39,40]. Upon running this model, we identified that item 13 had a negative residual variance; to obtain model convergence, we set that item’s error variance to zero and reran the model. This model was theoretically identified using the t-rule (the number of observed values in the covariance matrix exceeded the number of estimated parameters). The bi-factor model demonstrated adequate model fit, with χ2 (151, N = 334) = 433.23, p < 0.001; RMSEA = 0.08; SRMR = 0.06; CFI = 0.90; BIC = 22,122.00. These data suggest that this model demonstrated a similar fit to Model 1.

Fig. 2
figure 2

Diagram of Model 2 structure. General: higher level well-being domain; SocWB: Social Well-being; PsyWB: Psychological Well-being; EWB: Emotional Well-being

3.3.3 Model 3

The third model we examined was a two-factor model of subjective well-being, with one factor representing subjective well-being and a second representing negatively worded items (see Fig. 3). We ran this model to explore whether negatively worded items may have created an artifact in the data. Because the SURF consisted of an unequal number of positively and negatively worded items, we suspected this might cause some item variance due to response bias instead of true scores. This model was also identified according to the three-indicator rule [43]: a two-factor model is identified if each latent variable has three or more indicators, no error terms are correlated, and each indicator loads onto only one factor. Model 3 demonstrated good fit to the data, with χ2 (169, N = 334) = 326.88, p < 0.001; RMSEA = 0.05; SRMR = 0.05; CFI = 0.94; BIC = 21,911.05. This model demonstrated a good fit for the data and accounted for approximately 98% of the variance in SURF total scores.

Fig. 3
figure 3

Diagram of Model 3 structure. SWB: subjective well-being; NWI: negatively worded items

3.3.4 Model 4

Lastly, Model 4 was also a one-factor model of subjective well-being similar to Model 1, although we allowed the error variance terms of the negatively worded items to covary (see Fig. 4). This model was theoretically identified using the t-rule. Model 4 demonstrated good fit to the data, with χ2 (164, N = 334) = 307.09, p < 0.001; RMSEA = 0.05; SRMR = 0.04; CFI = 0.95; BIC = 21,920.32. This model demonstrated a good fit to the data and accounted for approximately 92% of the variance in SURF total scores. Considering the fit statistics together, this model demonstrated the best fit relative to the other models. Item factor loadings and variance explained by each item can be viewed in Table 5.

Fig. 4
figure 4

Diagram of Model 4 structure. SWB: subjective well-being

To summarize, our analysis revealed that both the primary one-factor model and the bi-factor model showed satisfactory fit. Additionally, the two-factor model, which included the latent variables of subjective well-being and negatively worded items, as well as the modified one-factor model, demonstrated a good fit as well. After assessing all the models, we concluded that the modified one-factor model (Model 4) had the best fit with the data. We will discuss these findings in more detail below.

Table 5 SURF items, factor loadings, and variance explained by each item for the modified one-factor model

3.4 Convergent and discriminant validity

To examine the validity of the SURF, we correlated the total scores of the PANAS subscales and the SWLS with the SURF to estimate convergent and discriminant validity. The SURF total scores demonstrated a significant positive correlation with the SWLS (r = 0.70, 95% CI [0.64, 0.75], p < 0.001) and the PANAS positive affect subscale (r = 0.61, 95% CI [0.54, 0.67], p < 0.001). SURF total scores also demonstrated a significant weak negative correlation with the PANAS negative affect subscale, a measure of impaired subjective well-being (r = − 0.20, 95% CI [− 0.30, − 0.09], p < 0.001). All convergent validity correlations can be found in Table 3.
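The reported confidence intervals are consistent with Fisher r-to-z limits; the sketch below reproduces the SURF–SWLS interval from the reported r and N:

```python
import math

def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for a Pearson correlation via the
    Fisher r-to-z transformation."""
    z = math.atanh(r)               # r-to-z transform
    se = 1.0 / math.sqrt(n - 3)     # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = pearson_ci(0.70, 334)
print(round(lo, 2), round(hi, 2))  # 0.64 0.75, matching the reported interval
```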

4 Discussion

4.1 SURF psychometric properties and structure

This study aimed to examine the reliability, factor structure, and validity of the SURF, a measure of subjective well-being, in an adolescent population. Results provided evidence of good internal consistency and convergent/discriminant validity. Regarding factor structure, our primary one-factor model performed similarly to the alternate bi-factor model, while the two-factor and modified one-factor models fit well, suggesting that accounting for bias from the negatively worded items improved fit. We ultimately suggest that Model 4 demonstrates the best fit while also balancing parsimony.

Our comparison of the one-factor model (Model 1) with the bi-factor model (Model 2) showed that they fit the observed data similarly. Additionally, the specific latent variables in the bi-factor model accounted for very little variance in the scores after the variance accounted for by the general subjective well-being factor was extracted. Interestingly, this finding aligns with prior research examining other bi-factor well-being models in which the secondary factors are weak relative to the primary factor, and it suggests that the second-level factors in Model 2 provided little utility beyond what the general factor could account for.

We also examined two additional CFA models in an exploratory manner to better understand the impact of item bias on model fit. For the two-factor model, we expected that bias in participants’ response patterns might produce a negatively worded item factor, and the model’s performance suggests this is likely the case. However, a second possible explanation for this model’s good fit lies in the content of the negatively worded item factor, which may represent a substantive construct such as depression or low mood; its items appear to reflect the absence of positive emotion, supportive relationships, and meaning. Because these negatively worded items were created as counterparts to positively worded items assessing the same content, to provide a more robust measure of those domains, we would expect the positively and negatively worded items to load onto the same factor in the absence of response bias. We therefore conclude that response bias is the most likely explanation for this model’s good fit, although item content may also have contributed.

Model 4 comprised a single factor representing subjective well-being while allowing the error terms of the negatively worded items to covary, and it demonstrated a good fit. Notably, after accounting for the error introduced through item response bias, the negatively worded items still loaded significantly onto the broad subjective well-being factor, suggesting that the SURF items represent a unitary construct. Compared with the two-factor model, Model 4 fit similarly while maximizing parsimony, and we determined that it demonstrated the best fit with the data. These results also converge with prior research supporting a one-factor model.

Overall, the results from our examination of the SURF’s structure suggest that a one-factor model best fits the data and that the SURF items reflect a unitary construct representing subjective well-being. While the two-factor and modified one-factor models demonstrated similar fit, we expect these fit well because they accounted for the presence of item response bias. We retained the modified one-factor model because it appears to be the most parsimonious option while still demonstrating a good fit with the data (see Table 4).

These results have important implications for the SURF’s interpretability and its relationship with other measures. Because the factor analysis results suggest that the SURF measures a unitary construct, its scores can be compared to other broad measures of subjective well-being, such as the SWLS and the PANAS subscales; as reported above, the SURF correlates with these measures in the expected directions. However, the unitary nature of the SURF also limits our ability to compare SURF scores to measures with more specific subscales, such as the EPOCH subscales. Although the SURF was developed to incorporate various domains of well-being, these results suggest it should be interpreted as a unitary construct, and efforts to further establish its convergent and discriminant validity should include other broad-based measures of well-being, or at least measures with a total score representing an overarching construct. Some possibilities include the YOQ-30.2, the Flourishing Scale, the EPOCH, or the Y-QOL.

4.2 Contributions to adolescent well-being measurement

As we describe how the SURF contributes to this body of literature, we first wish to highlight again Seligman’s statement that due to the breadth of well-being, no single measure can fully capture it. The SURF is one of many measures of subjective well-being currently in use. Although there are many quality measures of adolescent well-being, we believe the SURF has a place among these measures as a helpful tool to understand youths’ well-being better. The SURF displays several characteristics that may make it a valuable resource in youth subjective well-being measurement.

First, the SURF is an accessible measure of adolescent subjective well-being: it is quick to administer, and users have open access to the measure with acknowledgment. Many current measures of adolescent subjective well-being fall short in one or more of these areas, limiting their suitability for widespread use. The SURF thus broadens the pool of accessible instruments that clinicians, researchers, and others can use to measure subjective well-being as fits their purposes.

Similarly, the SURF contains content not included in many adolescent subjective well-being measures, including items on mindfulness, transcendence, and gratitude, which are essential to well-being. The items examining these constructs performed well in the factor analysis and accounted for an amount of variance in total scores similar to that of items typically seen in subjective well-being measures (see Table 5). These items’ strong factor loadings suggest that they fit the intended construct (i.e., well-being), and the variance they explain suggests that they contribute meaningfully to the overall SURF score. Because it is difficult to capture the breadth of this construct with a single instrument, multiple measures may be needed to best understand adolescent subjective well-being; the SURF can complement other measures, especially given its inclusion of domains relevant to well-being that are not often assessed by well-being instruments. Overall, the accessibility of the SURF combined with the domains it measures provides an effective, research-supported tool that can help researchers and practitioners better understand adolescent well-being and how to improve it.

The SURF also provides a reliable measurement tool whose psychometric properties have been examined using a broad sample within the United States. Because collecting data from adolescent participants is difficult, many researchers rely on convenience sampling or existing infrastructure (e.g., school systems) to recruit participants. Although these methods ease data collection, they often limit the generalizability of findings. The collection of census-matched data in the SURF’s early development provides promising evidence of the measure’s utility; however, additional data with a larger sample must be collected to accurately characterize national response patterns to SURF items and maximize generalizability.

5 Conclusion

Overall, this study contributes to the current literature by providing a reliable and valid measure of subjective well-being. This study also had several strengths worth noting. First, it employed a sample matched to 2020 US Census data, which adds to the generalizability and utility of the measure. This strength must be tempered by our small sample size: although we met Clark and Watson’s suggestion that researchers collect at least 10 responses per scale item (with an ideal ratio of 15:1 or 20:1), a larger sample would provide a greater understanding of the SURF’s performance as described here and of how it performs across various populations [31, 32]. A second strength relates to our adherence to recommendations for transparency in research methods and planned analyses [46, 47]. Although we did not pre-register the analyses for this study, we specified a priori, given our research questions, which statistical procedures we planned to run before conducting any analyses.

Additionally, we specified which exploratory analyses were performed after running our primary analyses. Having a data-analysis plan reduces bias that can arise from questionable research practices, which research has shown to be extremely common among social scientists [48, 49]. Thus, our commitment to adhering to our data-analysis plan and transparently reporting the results increases replicability and provides evidence for the robustness of our findings.

5.1 Limitations

Although our study has notable strengths, it is also important to recognize limitations that may have impacted our results. Many of these limitations relate to our decision to use Qualtrics Online Sample to distribute the study and collect the data. First, although working with online panels allowed us to collect a sample matched to national proportions, research on Amazon’s Mechanical Turk (MTurk), a similar online data collection platform, suggests that response quality from online data collection agencies can often be low. In addition to the data screening methods provided by Qualtrics, we also introduced safeguards (e.g., validity metrics) to ensure high response quality [50].

Second, some research suggests that online data collection sites may saturate the responses with participants who are not representative of the intended sample despite responses to the demographic questions [50]. Although we did not detect any abnormalities that might suggest participants were not from the intended population, this still represents a risk and is a limitation of our study.

Third, because the online panel oversaw distribution of the study measures, the second administration was sent to participants at an incorrect time that did not follow our outlined methods (see Methods section). This invalidated the retest responses and prevented us from analyzing important results such as test–retest reliability.

5.2 Future directions

This study highlights several future directions for improving the psychometric properties of the SURF. First, we recommend collecting additional samples to replicate these results. Although this study provides a first step, the robustness of these findings would be strengthened if other adolescent samples displayed similar results. Future studies may also consider collecting higher-quality samples that are less influenced by sampling bias.

In addition to replication, several steps would extend our understanding of the SURF’s psychometric properties. First, prioritizing the collection of test–retest data would provide additional evidence for the reliability and stability of SURF scores over time. Future studies might also include more convergent and discriminant validity measures, possibly including objective measures such as those that examine the physical environment, access to resources, and physiology. Due to funding constraints and our desire to limit questionnaire length to maximize participant completion rates, measures such as the EPOCH or the Flourishing Scale could not be included in this study. Using other similar and dissimilar measures would provide greater evidence that the SURF measures what it purports to measure.

Regarding the sample, the current study provides good initial data on the psychometrics of the SURF. However, gathering data from a larger sample would provide further information about the robustness of the measure and greater support for its broad use. As more data are collected across demographics (e.g., age ranges, gender, race/ethnicity), measurement invariance analyses may help clarify whether the SURF performs as expected across various populations.

Lastly, we recommend that researchers include the SURF in studies of interventions designed to improve adolescents' subjective well-being. One goal in developing the SURF was to provide a way for researchers to track changes in adolescents' subjective well-being over time and in response to intervention. By including the SURF as an outcome measure in future research, researchers may calculate its reliable change index (RCI) and determine whether the measure is an appropriate tool for that use. Addressing these issues will help further establish the SURF as a valid and reliable measure of adolescent subjective well-being.
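For reference, the RCI is commonly computed with the Jacobson–Truax formulation; a minimal sketch, in which the baseline standard deviation and reliability are hypothetical placeholders rather than SURF estimates:

```python
import math

def reliable_change_index(pre, post, sd_baseline, reliability):
    """Jacobson-Truax reliable change index: the pre-post difference
    divided by the standard error of the difference. |RCI| > 1.96
    suggests change beyond measurement error (p < .05)."""
    se_measurement = sd_baseline * math.sqrt(1.0 - reliability)
    se_difference = math.sqrt(2.0) * se_measurement
    return (post - pre) / se_difference

# Hypothetical values: baseline SD = 10, reliability = .90 (placeholders,
# not estimates from the present study)
rci = reliable_change_index(pre=50, post=59, sd_baseline=10, reliability=0.90)
# An |RCI| above 1.96 would count as reliable change for this participant
```

Once test–retest data for the SURF are available, its observed reliability and baseline standard deviation could be substituted into this formula to classify individual adolescents' change as reliable or within measurement error.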