1 Introduction

Moral elevation is a positive emotional state described as feeling inspired or moved after witnessing another person perform a remarkable act of virtue (Haidt, 2003). Research interest in this topic has increased over the past 15 years, leading to important discoveries about what facilitates moral elevation and the positive correlates of those experiences (Pohling & Diessner, 2016; Thomson & Siegel, 2017). However, a significant problem within this area of study is the inconsistency in measurement, particularly when assessing state-level elevation. As researchers strive to expand the elevation literature, a reliable state-level measure will be essential for future studies that target important gaps such as (a) assessing specific contexts and elicitors of elevation, (b) determining within-person variability over time or in daily life, and (c) measuring potential benefits to well-being that can be attributed to state elevation induction. In this paper, we review the distinctive features of state elevation and the utility of measuring this construct, and then describe the development and validation of a novel measurement tool for state-level elevation.

1.1 Distinctive Features

Before considering measurement, it is important to note that elevation is a distinct positive emotion that differs from related other-praising emotions such as gratitude and admiration. Specifically, elevation is distinguished by its theorized trigger, emotional and physical response, and action tendency or motivations. In a series of studies, Algoe and Haidt (2009) investigated and described differences among these domains across the other-praising emotions. First, elevation is elicited by seeing another person engage in a behavior that is perceived as virtuous, whereas gratitude is triggered when someone perceives themselves as the recipient or beneficiary of some positive outcome, often initiated by another person (McCullough et al., 2001). Admiration is elicited by witnessing someone demonstrate remarkable skill or talent. Notably, some researchers have described admiration as stemming from either skill or virtue, and there are parallels between admiration of virtue and elevation (Onu et al., 2016). However, consistent with the work of Algoe and Haidt, we describe how elevation is distinct from the aspect of admiration that concerns skills or talents, which is independent of the virtuous nature of a behavior.

Although admiration and elevation may share some similar subjective feelings, such as awe, elevation also includes subjective feelings of closeness to others (Erickson & Abelson, 2012), self-transcendence (Van Cappellen et al., 2013), and love towards humanity (Diessner et al., 2013). Elevation is also associated with a cluster of physical responses that are not equally shared across admiration and gratitude, including warmth in the chest, a lump in the throat, piloerection (i.e., goosebumps), and tears in the eyes (Algoe & Haidt, 2009; Landis et al., 2009; Oliver et al., 2012).

Lastly, the motivational response to elevation is a strong desire to emulate the witnessed virtue and engage in behavior similar to that of the moral exemplar (Algoe & Haidt, 2009). Gratitude distinctly elicits a motivation to repay the person responsible for the beneficiary's positive outcome or, sometimes, to pay it forward to another person, whereas admiration is linked with motivations to pursue one's personal goals and strive to succeed. Overall, these unique features support the operationalization of elevation as a separate positive emotion and highlight aspects of elevation that should be considered in a state-level measure.

1.2 Previous Elevation Measures

In a review of the literature, we identified 50 studies published between 2006 and 2021 that included some form of elevation measurement. One commonly used measure is the Engagement with Beauty Scale, which includes a moral beauty subscale. This scale has been validated in previous studies and demonstrates strong psychometric properties (Diessner et al., 2008). However, the moral beauty subscale measures a trait-like disposition or general tendency to feel inspired when exposed to moral beauty, which is distinct from understanding the experience of elevation in a given moment. Past work on measures of discrete, state-level experiences has demonstrated the presence of within-person variability that is not captured by aggregate or trait assessments (e.g., Fleeson, 2004). Researchers have also identified within-person variability in similar positive emotional states such as gratitude (e.g., McGuire et al., 2019a; Nezlek et al., 2019). Thus, reliance on trait-like measures could miss valuable information regarding the experience of elevation. Similarly, another frequently cited method of measuring elevation is based on a scale originated by Landis and colleagues (2009). Although this measure has strong face validity and is rooted in consultation with founding researchers in this area, it is not a stringent assessment of state elevation. Instead, this measure instructs participants to aggregate experiences and rate how frequently each item typically occurs when witnessing a virtuous act. This method does not account for context and is distinct from the level of elevation one might experience in a given moment following a specific event or exposure to targeted stimuli. To date, the assessment that appears closest to a state-level measure of elevation is the Chinese Version of the Moral Elevation Scale developed by Ding and colleagues (2014). However, the initial psychometric study was conducted in Chinese, and it is unclear whether this measure has been tested in other populations or whether an English translation would demonstrate the same reliability and validity in other samples.

Separate from the distinction between trait and state measures, it is important to note that there is no clear, commonly used scale in the field of elevation research. We examined all items used across the 50 studies and identified 83 distinct items (see Supplemental Material Table S1). Even assuming general agreement on the targeted construct of elevation, one legitimate concern is that this variability in assessment gives reason for pause when interpreting the validity and reliability of results within this body of literature as a whole. To address this concern and to bolster this area of inquiry moving forward, access to a valid measurement tool that can be used widely is imperative.

1.3 Correlates and Benefits of Elevation

The continued study of elevation with proper assessment is important because elevation has been linked with several desirable outcomes. For example, elevation is associated with a wide range of enhanced social functioning, including higher levels of compassion (Erickson et al., 2018; Siegel et al., 2014), prosocial behavior (Cox, 2010; Sparks et al., 2019; Van de Vyver & Abrams, 2015), increased desire to connect with or help other people (Schnall & Roper, 2012; Schnall et al., 2010; Siegel & Thomson, 2017), as well as greater cooperation with and reduced stigma toward perceived outgroups (Freeman et al., 2009; Lai et al., 2014; Oliver et al., 2015). Previous work has also found links between elevation and other markers of wellbeing, including satisfaction with life, purpose in life, gratitude, hope, and personal growth (Diessner et al., 2008; Landis et al., 2009).

The potential psychosocial benefits of elevation are relevant to the population at large, but this construct should also be of interest to clinical populations experiencing psychological distress, and to their clinicians, because the mental health field aims to cultivate these same benefits as treatment outcomes or as mechanisms leading to outcomes. Although the vast majority of research on elevation has examined the general population or nonclinical samples, select studies have identified specific benefits in clinical populations. For example, daily elevation has been associated with lower levels of depression, anxiety, and hostility in a clinical sample with moderate depressive symptoms (Erickson & Abelson, 2012). Additionally, early research suggests elevation might benefit those with trauma exposure, as indicated by correlations with higher posttraumatic growth and compassion motives (Tingey et al., 2019), along with greater treatment engagement and lower avoidance symptoms in veterans (McGuire, Nosen, et al., 2019). Thus, more research is warranted to explore the potential benefits of elevation for those with significant psychological distress; however, any advances in this area of inquiry will be significantly limited, and potentially undermined, until a reliable measurement tool for state elevation is established.

1.4 Considerations for Measurement in General and Clinical Populations

Despite the potential benefits to both general and clinical populations, it is important to consider group differences that could impact measurement validity when assessing a positively valenced construct like elevation. Specifically, symptoms of psychological disorders—such as anxiety, depressive, and posttraumatic stress disorder (PTSD) symptoms—can impact how people perceive various state-level experiences and how they respond to stimuli in their environment. For example, depressive symptoms can be associated with reduced emotional disclosure (Kahn & Garrison, 2009), reduced cognitive processing, and negatively biased memory recall (Cláudio et al., 2012). PTSD is also associated with negative alterations in cognition, which often result in strong negative beliefs about the self, others, or the world (American Psychiatric Association, 2013). Additionally, PTSD can lead to emotional numbing and difficulties experiencing positive emotions (e.g., Kashdan et al., 2006). These underlying symptoms within certain clinical populations are clearly relevant to the possible perceptions, experiences, and self-report of state elevation; thus, any scale development will need to assess measurement invariance across clinical and nonclinical samples prior to future use across both groups.

1.5 Current Study

The purpose of this study is to develop a state-level measure of elevation—the State Moral Elevation Scale (SMES)—and examine the reliability and validity of this measure in both clinical and nonclinical populations. To examine the psychometric properties of this newly developed measure, we conducted exploratory and confirmatory factor analyses (CFA) using three subsamples with a total of 930 participants. Specifically, we aimed to include a minimum of 300 nonclinical participants for exploratory factor analysis (EFA) to determine the factor structure and assess reliability. Next, we aimed to conduct a CFA with two additional subsamples of a minimum of 300 participants each: a clinical and a nonclinical subsample. Participants were assigned to a given subsample based on positive (clinical) or negative (nonclinical) screens from a set of screening questionnaires for mental health symptoms. Lastly, we aimed to conduct a series of analyses to assess construct validity of this measure across both clinical and nonclinical subsamples.

2 Method

2.1 Scale Development

First, a pool of 83 items was generated by identifying items used in previous studies that attempted to measure the experience of elevation (see Table S1). Next, the authors categorized each item into one of three primary categories: Emotional Reaction, Physical Reaction, and Motivation. These categories were drawn from previous work by Algoe and Haidt (2009) that investigated the features and distinctions of other-praising emotions, as described above. Any items that did not fit these domains were placed in an Other category.

Lastly, we aimed to simplify this list and reduce participant burden by selecting six items from each primary category, for a total of 18 candidate items (see Table 1). Item selection was based on (a) the frequency of use in previous studies (see Table S1) and (b) relevance to the theoretical framework of elevation as determined by the authors. The order of candidate items was randomized. To capture state-level experiences of these items, the following instructions were added: “To what extent did you experience each of the following statements while [insert time/context specifier]?” A Likert-type scale was used with ratings from 0 (not at all) to 4 (extremely) for each item.

Table 1 Initial 18 candidate items used for state moral elevation scale development

2.2 Participants

Participants were recruited through Amazon Mechanical Turk (MTurk)—an online platform that allows researchers to recruit participants to complete study tasks. Eligibility criteria included (a) being at least 18 years old; (b) residing in the United States; (c) having completed and been approved for at least 100 previous MTurk tasks (e.g., online surveys or MTurk Human Intelligence Tasks); and (d) having at least a 95% approval rating from previous MTurk tasks or surveys. Criteria (c) and (d) were included to select participants who had successfully completed substantial MTurk tasks in the past and to exclude people with a history of rejected payments due to inconsistent responses or poor effort, thereby increasing our confidence in the quality of responses from this pool of participants. All study procedures were reviewed and approved by the local institutional review board. Participants provided consent online rather than written informed consent because they were not asked to provide identifiable information; therefore, participants remained anonymous throughout the study.

Participants were separated into three subsamples at the completion of the study, prior to analysis: two nonclinical subsamples and one clinical subsample. The first nonclinical subsample was used for the preliminary EFA and reliability assessment. The second nonclinical subsample and the clinical subsample, which was determined based on responses to clinical symptom screening measures, were used to assess measurement invariance and validity of the SMES. All participants across all subsamples completed the same study procedures as outlined below. For a summary of demographic characteristics across the entire sample and subsamples, see Table 2.

Table 2 Demographic characteristics by subsamples

2.3 Procedures

Participation in this study included two separate online surveys (see Fig. 1). The first survey (part one) assessed demographic information and characteristic traits, and it included screeners for both clinical symptoms and attention/effort. The second survey (part two) asked participants to watch a 5-minute video clip intended to induce elevation, followed by state-level measurements including the elevation items. The video clip showcased a father who competed in marathons and triathlons with his son, who had paraplegia, by carrying, pushing, and pulling his son during those races. This video was used in previous studies that examined elevation and was validated as a demonstration of virtuous behavior in one study (Erickson et al., 2018).

Fig. 1 Study design including screening procedures, descriptions of surveys part one and two, and analyses associated with the generated subsamples

Both surveys were advertised as “a survey about personal experiences,” estimated to last approximately 10–15 minutes each, and included a $1.50 payment for each survey. In the final sample, the median completion time was 8.56 min for part one and 10.03 min for part two, which resulted in a median rate of over $9.74 per hour across both surveys. This rate is well above the median hourly wage for tasks performed on MTurk, which has been estimated at approximately $2.00 per hour (Hara et al., 2018).

2.3.1 Part One Survey

A total of 1,304 participants were recruited for part one of this study. Part one included two attention screeners (see Supplemental Material) placed after approximately one-third and two-thirds of the questionnaires, which assessed whether participants carefully read instructions and responded appropriately. Participants who failed both attention screeners were denied payment and excluded from the study. To further screen for poor responding, we also excluded participants who provided impossible answers to count questions about how many days they engaged in specific behaviors during the past week (e.g., endorsing the statement “in the past seven days, I interacted with others on ten days”). The final sample size for part one was 1,049 after a total of 255 participants were excluded: 66 did not provide consent or complete the study (i.e., did not submit their responses), 43 failed two attention screeners, 3 did not complete the attention screeners, 142 provided impossible responses to count questions, and 1 participant was excluded for being under the age of 18.

2.3.2 Part Two Survey

Within three days of completing part one, eligible participants received a message notifying them of the opportunity to complete part two of the study. A total of 1,010 participants completed this follow-up survey. The majority of participants (60.43%) completed part two within two days of part one (M = 3.01 days, SD = 2.00). Part two contained an additional attention screener that asked participants to provide a text response to a question about the elevation video: “What exercise were the people in the video seen doing the most?” Participants who reported inappropriate answers (e.g., “not sure”; “good”) or exercises not demonstrated in the video (e.g., “jumping jacks”; “yoga”) were excluded from data analysis. The final sample size for part two was 930 after a total of 80 participants were excluded: 72 failed the video-specific attention screen and 8 were excluded for having more than 20% missing data (7 participants with 67% missingness, 1 participant with 27% missingness).

2.4 Clinical Sample Screening Measures

The Patient Health Questionnaire-9 (Kroenke et al., 2001) was used to screen for depressive symptoms in the past two weeks and has demonstrated adequate reliability and validity in past work. Participants rated the nine items on a 0 (not at all) to 3 (nearly every day) scale. The items were summed, with higher scores indicating greater symptom severity (ωcategorical = 0.95 [0.94, 0.95]).

The Generalized Anxiety Disorder-7 (Spitzer et al., 2006) was used to screen for anxiety symptoms in the past two weeks and has demonstrated validity and reliability in past studies. Participants rated the seven items on a 0 (not at all) to 3 (nearly every day) scale. The items were summed, with higher scores indicating greater symptom severity (ωcategorical = 0.94 [0.93, 0.95]).

To screen for trauma history, participants completed the first item of the Primary Care PTSD Screener for DSM-5 (Prins et al., 2016), which asks whether the respondent has ever experienced an event that was frightening, horrible, or traumatic. Those who answered yes also completed the PTSD Checklist for the DSM-5 (Weathers et al., 2013) to screen for PTSD symptoms in the past month; this measure has demonstrated validity and reliability in past work (Blevins et al., 2015). Participants rated the twenty items on a 0 (not at all) to 4 (extremely) scale. We summed the items, with higher scores indicating greater symptom severity (ωcategorical = 0.99 [0.99, 0.99]).

2.5 Measures used in Construct Validity Analyses

2.5.1 Part One Survey Measures

The Freiburg Mindfulness Inventory-Short Form (Walach et al., 2006) assessed trait mindfulness. Participants rated the 14 items on a 1 (rarely) to 4 (almost always) scale, and items were summed, with higher scores indicating higher trait mindfulness. The initial study demonstrated validity and reliability in a nonclinical sample, and subsequent studies found similar psychometric properties in clinical samples (e.g., Kocovski et al., 2013). For this and all remaining measures, see Table 6 for a summary of reliability estimates.

The Gratitude Questionnaire-6 (McCullough et al., 2002) assessed trait gratitude. Participants rated the six items on a 0 (very unlike me) to 6 (very much like me) scale, which were summed, with higher scores indicating higher trait gratitude. The Gratitude Questionnaire-6 was developed and validated in a nonclinical sample with strong psychometric properties. Other studies also found evidence of construct validity and high internal consistency in clinical samples (Ghalesefidi et al., 2018; Ruini & Vescovelli, 2013).

The Engagement with Beauty Scale (EBS; Diessner et al., 2008) assessed trait-like tendencies to be inspired by three different kinds of beauty: natural, artistic, and moral beauty. Participants rated a total of 14 items on a 1 (very unlike me) to 7 (very much like me) scale, summed to create three subscale scores with higher scores indicating greater tendency to be inspired in each domain. The initial psychometric study demonstrated validity and reliability with a nonclinical sample. The authors are not aware of other studies that assessed the psychometric properties of EBS in a clinical population; however, this study found adequate internal consistency for all three subscales (see Table 6 for estimates of composite reliability).

The Marlowe-Crowne Social Desirability Scale-Short Form (Strahan & Gerbasi, 1972) assessed social desirability, or a tendency to respond in a way that might be perceived as socially acceptable. Participants responded true or false to ten statements coded as 0 or 1 with five reversed items. After reverse-scoring, items were summed with higher scores indicating greater social desirability responding.

The Ten-Item Personality Inventory (Gosling et al., 2003) assessed the Big Five personality traits with items rated on a 1 (disagree strongly) to 7 (strongly agree) scale and summed to create five subscale scores: agreeableness, conscientiousness, emotional stability (i.e., low neuroticism), extraversion, and openness to experience. Higher scores indicated higher levels of each personality domain. Past work has examined the Ten-Item Personality Inventory and found additional support for its construct validity (Jonason et al., 2011).

2.5.2 Part Two Survey Measures

The International Positive and Negative Affect Schedule-Short Form (Thompson, 2007) assessed state-level positive and negative affect after viewing the elevation stimulus. Participants rated ten items on a 0 (not at all) to 4 (extremely) scale, which were summed to create subscale scores for positive and negative affect. For this study, the positive affect sum score excluded inspired and was limited to the remaining four items because inspired was also included as an SMES item. Higher scores indicated higher levels of affect. Initial measurement validation used a nonclinical sample, but previous studies have also demonstrated the validity and reliability of this measure in clinical samples (e.g., Serafini et al., 2016).

The State Shame and Guilt Scale (Marschall et al., 1994) assessed state-level shame and guilt after viewing the elevation stimulus. Participants rated ten items on a 1 (Not feeling this way at all) to 5 (Feeling this way very strongly) scale, which were summed to create subscale scores for state guilt and shame. Higher scores indicated higher guilt or shame. Previous studies examined the State Shame and Guilt Scale and found evidence of validity and reliability in clinical (e.g., Levinson et al., 2016) and nonclinical samples (e.g., Fedewa et al., 2005).

The Negative Cognitions about the World subscale of the Posttraumatic Cognitions Inventory (Foa et al., 1999) assessed current negative beliefs about others. Participants rated the seven subscale items on a 1 (totally disagree) to 7 (totally agree) scale, which were summed to create a total score, with higher scores indicating more severe negative beliefs about others. Although this measure was designed for and tested with clinical samples, select studies have noted strong reliability when assessing nonclinical or subclinical populations (e.g., Contractor et al., 2020).

2.6 Data Analytic Plan

2.6.1 Identifying Subsamples

All data management and analyses were conducted with R (R Core Team, 2021). First, scores on the mental health screeners were reviewed across the entire sample. Participants who screened positive on the Generalized Anxiety Disorder-7 (score ≥ 10; Spitzer et al., 2006), Patient Health Questionnaire-9 (score ≥ 10; Levis et al., 2019), or PTSD Checklist for the DSM-5 (score ≥ 31; Bovin et al., 2016) were placed in the clinical subsample (n = 321). The remaining participants were randomly separated into two nonclinical subsamples using the base R sample function: one subsample for the EFA (n = 304) and one for the CFA (n = 305).
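
To make the subsample assignment concrete, the following sketch illustrates the screening-and-splitting logic in R. The data frame dat and the column names gad7, phq9, and pcl5 are hypothetical placeholders, not the study's actual code.

```r
# Hypothetical data frame `dat`: one row per participant, with summed screener
# scores in columns gad7, phq9, and pcl5 (pcl5 is NA for participants who
# reported no trauma exposure and therefore skipped the PCL-5).
set.seed(1)  # arbitrary seed for a reproducible random split

screened_positive <- with(
  dat,
  gad7 >= 10 | phq9 >= 10 | (!is.na(pcl5) & pcl5 >= 31)
)

clinical    <- dat[screened_positive, ]   # clinical subsample
nonclinical <- dat[!screened_positive, ]  # remaining participants

# Randomly split the nonclinical participants into two subsamples with the
# base R sample() function: one for the EFA and one for the CFA
efa_rows <- sample(nrow(nonclinical), size = floor(nrow(nonclinical) / 2))
efa_sub  <- nonclinical[efa_rows, ]   # first nonclinical subsample (EFA)
cfa_sub  <- nonclinical[-efa_rows, ]  # second nonclinical subsample (CFA)
```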

2.6.2 Factor Analysis

First, we used the psych package (Revelle, 2020) to conduct a parallel analysis to help determine how many factors to extract from the SMES items (Horn, 1965). We then used the lavaan package (Rosseel, 2012) to conduct an EFA with the first nonclinical subsample. EFA model fit was assessed using fit indices including comparative fit index (CFI), Tucker-Lewis Index (TLI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). A model was indicated as having acceptable fit with CFI > 0.90, TLI > 0.90, RMSEA < 0.10, and SRMR < 0.08 (Brown, 2015). Following recommendations from Kenny et al. (2015), who found the RMSEA has an upward bias for structural equation models with small degrees of freedom, we de-emphasized the RMSEA in models with fewer than 100 degrees of freedom. In addition to fit indices, we also examined the magnitudes of the loadings, preferring items with primary loadings above the conventional 0.40 threshold. Standardized factor loadings are reported. We hypothesized the results would indicate a one-factor model because items were selected from three domains that are theorized to collectively distinguish elevation from other positive emotions (Algoe & Haidt, 2009).
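
As a concrete illustration of this step, the sketch below shows how the parallel analysis and a batch of one- to three-factor EFAs might be specified with these packages. The item names (smes1–smes18) and data frame (efa_sub) are hypothetical placeholders, and the lavaan efa() convenience function assumed here requires a recent lavaan version.

```r
library(psych)   # fa.parallel() for parallel analysis
library(lavaan)  # efa() for exploratory factor analysis

items <- paste0("smes", 1:18)  # hypothetical names for the 18 candidate items

# Parallel analysis on a polychoric correlation matrix using a weighted
# least squares factoring method and 1,000 iterations
fa.parallel(efa_sub[, items], fm = "wls", cor = "poly", n.iter = 1000)

# Fit one-, two-, and three-factor EFA solutions in one batch; declaring the
# items as ordered makes lavaan default to the WLSMV estimator
efa_fits <- efa(data = efa_sub[, items], nfactors = 1:3, ordered = TRUE)
summary(efa_fits)  # standardized loadings for each solution

# Fit indices for each solution (each element is a fitted lavaan model)
lapply(efa_fits, fitMeasures, c("chisq", "df", "cfi", "tli", "rmsea", "srmr"))
```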

Once an acceptable factor structure was established, we used the lavaan package to conduct a multigroup CFA (MGCFA), which assessed the SMES for measurement invariance between the clinical subsample and the second nonclinical subsample (i.e., participants not included in the EFA). Specifically, we assessed nested models by imposing greater restrictions in successive models in the following order: equivalence of gross factor structure across groups (configural invariance), factor loading equivalence (metric invariance), threshold equivalence (scalar invariance), latent variance equivalence, and latent mean equivalence. Following Cheung and Rensvold (2002), we continued to impose greater restrictions unless the CFI decreased by more than 0.01. Because the SMES items were based on five ordinal response options, many of which had asymmetric distributions in our subsamples, we used the weighted least square mean and variance adjusted (WLSMV) estimator in all EFAs and CFAs (see Rhemtulla et al., 2012).
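
The sketch below outlines how this sequence can be specified in lavaan, shown only for the configural and the most restrictive steps; the model syntax, grouping variable, and data frame dat2 are again hypothetical.

```r
library(lavaan)

# Hypothetical names for the nine retained SMES items
smes_items <- paste0("smes", 1:9)
model <- paste("elevation =~", paste(smes_items, collapse = " + "))

# dat2 is assumed to hold both groups, with dat2$group coded as
# "clinical" or "nonclinical"

# Configural model: same factor structure, all parameters free across groups
fit_configural <- cfa(model, data = dat2, group = "group",
                      estimator = "WLSMV", ordered = smes_items)

# Most restrictive model: loadings, thresholds, latent variances, and
# latent means held equal across groups
fit_strict <- cfa(model, data = dat2, group = "group",
                  estimator = "WLSMV", ordered = smes_items,
                  group.equal = c("loadings", "thresholds",
                                  "lv.variances", "means"))

# Retain the more restrictive model unless the CFI drops by more than .01
fitMeasures(fit_configural, "cfi") - fitMeasures(fit_strict, "cfi")
```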

2.6.3 Reliability

Given the criticisms contemporary methodologists have levied against Cronbach's α (e.g., Kelley & Pornprasertmanit, 2016; McNeish, 2018), we estimated all composite reliabilities using coefficient ω. When using item-level data to compute scale reliability estimates, it is important to consider how the items are distributed, primarily with respect to whether they are grossly asymmetric and whether they are best described as continuous or categorical. Taking cues from Rhemtulla et al. (2012), we treated items as categorical when they contained four or fewer response categories, or when they had five categories and markedly asymmetric distributions. We treated items as continuous when they had six or more categories, or when they had five categories and roughly symmetric distributions. Accordingly, we estimated composite reliability with ωcategorical for measurements of clinical symptoms (anxiety, depression, and PTSD), state guilt and shame, state positive and negative affect, mindfulness, social desirability, and state elevation. The items for measurements of engagement with beauty, gratitude, personality, and posttraumatic cognitions were sufficiently continuous that we estimated their composite reliabilities with ω. Following recommendations from Kelley and Pornprasertmanit (2016), we used the MBESS package (Kelley, 2020) to accompany all reliability estimates with 95% confidence intervals.
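
For instance, a categorical omega with a bootstrapped 95% confidence interval might be requested as in the sketch below, reusing the hypothetical item columns from above; the percentile interval and number of bootstrap samples are illustrative choices, not the study's settings.

```r
library(MBESS)

# Categorical omega with a bootstrap percentile 95% CI for the nine SMES
# items (column names are hypothetical; B kept modest for illustration)
ci.reliability(data = dat2[, smes_items], type = "categorical",
               conf.level = 0.95, interval.type = "perc", B = 1000)
```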

2.6.4 Construct Validity Analyses

Additionally, we examined construct validity across the clinical and second nonclinical subsamples, combined. To assess concurrent validity, we examined the correlations between state elevation scores and trait mindfulness, trait gratitude, and state positive affect. Given that state elevation requires an awareness and recognition of another person's virtuous behavior, we hypothesized that greater dispositional awareness of one's surroundings, as indicated by trait mindfulness, would be positively associated with state elevation. Similarly, trait gratitude involves a greater predisposition to be aware of others' positive or prosocial behaviors (e.g., acknowledging one is the beneficiary of another's good deed) and to be thankful for those behaviors; thus, we hypothesized state elevation would be positively correlated with trait gratitude. Lastly, since elevation is a positive emotion, we hypothesized state elevation would be positively correlated with state positive affect (i.e., general positive emotion) experienced after watching the video.

We also assessed convergent validity by examining the correlations between state elevation scores and the Engagement with Beauty Scale subscales: artistic, natural, and moral beauty. All three subscales represent a predisposition to experience a similar emotional response of feeling inspired and/or moved by a stimulus that is perceived as remarkable; however, the moral beauty subscale specifically measures responses to virtuous behavior (i.e., acts of moral beauty). Thus, we expected state elevation scores to correlate with all subscale scores given a likely underlying predisposition to be moved in general; however, we hypothesized the association would be strongest for moral beauty.

To assess divergent validity, we examined the correlations between state elevation scores and other state-level measures of negative emotional and cognitive responses to the same video stimulus. Specifically, we hypothesized that scores on the positive emotional experience of state elevation would be negatively correlated with general negative affect, guilt, and shame. Similarly, we hypothesized that SMES scores would be negatively correlated with negative thoughts or beliefs about others.
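
A minimal sketch of these validity analyses follows, using cor.test() to obtain Pearson correlations with 95% confidence intervals. The total-score column names are hypothetical, and in practice the correlations would be computed separately for each subsample rather than on the combined data frame shown here.

```r
# Hypothetical total-score columns on the combined data frame dat2
validity_vars <- c("mindfulness", "gratitude", "positive_affect",
                   "negative_affect", "guilt", "shame", "ptci_world")

# Pearson correlation and 95% CI between SMES scores and each measure
for (v in validity_vars) {
  ct <- cor.test(dat2$smes_total, dat2[[v]])
  cat(sprintf("%s: r = %.2f [%.2f, %.2f]\n",
              v, ct$estimate, ct$conf.int[1], ct$conf.int[2]))
}
```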

3 Results

3.1 Exploratory Factor Analysis

In the first nonclinical subsample, the SMES items contained no missing data. Across all items, all five response options were endorsed, with some items showing roughly symmetrical distributions (e.g., 8 and 16), whereas others were markedly asymmetrical (e.g., 1 and 5). On the whole, the initial set of 18 items was highly correlated, with a median inter-item polychoric correlation of 0.68 and an interquartile range of 0.55 to 0.77. The sole standout was item 16 (muscles feel relaxed; see Fig. 2), which demonstrated small to near-zero correlations with the other items. The parallel analysis used the weighted least squares estimator to evaluate a polychoric correlation matrix in 1,000 iterations. It indicated we could extract as many as three factors with eigenvalues greater than would be expected from random data. The eigenvalues for the first three factors were 11.7, 0.8, and 0.6. Accordingly, all EFA models were estimated in batches of three, allowing us to contrast one-, two-, and three-factor solutions for each combination of items.

Fig. 2 Polychoric correlation matrix for the initial item pool of 18 items for the State Moral Elevation Scale in the first nonclinical subsample (n = 304)

The first three EFAs were based on all 18 items, with Model 1.1 containing one factor, Model 1.2 containing two factors, and Model 1.3 containing three factors (see Table 3 for summary of all model fit statistics). All three models showed poor fit by the model χ2, but acceptable fit by the CFI, TLI, and SRMR. The RMSEA was poor for both Model 1.1 and Model 1.2, but acceptable for Model 1.3. In Model 1.1, the single-factor model, all items loaded well with the exception of item 16 (λ = 0.37, 95% confidence interval [0.27, 0.46]), which was foreshadowed by the correlation matrix in Fig. 2. The pattern of loadings in Model 1.2 was not readily interpretable. Though one may have expected the items in Model 1.3 to separate by their theorized categories of Emotional Reaction, Physical Reaction, and Motivation, they did not.

Table 3 Exploratory factor analysis: Model comparisons with single factor model (n = 304)

We fit the next batch of EFAs after dropping item 16, the one item showing poor performance in Model 1.1. The fit statistics for Model 2.1 (the single-factor model) and Model 2.2 (the two-factor model) were comparable to those of their counterparts above: their χ2 and RMSEA values were poor, but they had acceptable CFI, TLI, and SRMR values. As before, the RMSEA was only acceptable for the three-factor model, Model 2.3. Also as with the first batch of models, Model 2.1 was characterized by strong factor loadings, but the pattern of loadings was difficult to interpret in both Model 2.2 and Model 2.3.

For the third batch of EFAs, we aimed to achieve a simple structure by eliminating additional items. Using the three categories that guided initial item selection—Emotional Reaction, Physical Reaction, and Motivation—we retained the three items with the highest factor loadings in Model 2.1 for each category, resulting in nine items. One exception was made for the Motivation items based on theoretical rationale; specifically, we did not retain item 10 (like I want to do something good too) because of redundancy with other Motivation items. Once again, we fit three models with these items, ranging from one factor (Model 3.1) to three factors (Model 3.3). As before, all models failed the χ2 test of exact fit and all showed acceptable CFI, TLI, and SRMR values (see Table 3). Given its positive bias in small-df models (Kenny et al., 2015), we placed less weight on the RMSEA when comparing the models in this batch. Whereas Model 3.2 and Model 3.3 did not contain a readily interpretable pattern of item loadings, the single-factor model (Model 3.1) contained nine strong loadings (see Table 4), with an average loading of 0.86 [0.84, 0.88]. The estimated scale reliability was high, ωcategorical = 0.97 [0.95, 0.98].

Table 4 Exploratory factor analysis-based factor loadings for the single-factor, 9-item version of the State Moral Elevation Scale (n = 304)

3.2 Multigroup Confirmatory Factor Analysis

Next, we used CFA and MGCFA to replicate the single-factor, nine-item model derived from the previous EFA (Model 3.1) and to assess measurement invariance between the clinical subsample (n = 321) and the second nonclinical subsample (n = 305). First, separate single-group CFAs for each subsample indicated acceptable fit for both the clinical and nonclinical subsamples across all indices except the RMSEA, which we continued to de-emphasize due to its known upward bias in models with small degrees of freedom (Kenny et al., 2015). For a full listing of their fit statistics, see the first two rows in Table 5. The initial MGCFA model—the configural model, which specified gross factor structure equivalence but allowed all parameters to differ by subsample—continued to show adequate fit (see row three in Table 5). We proceeded with more restrictive models and terminated with the most restrictive model, which held the item loadings, item thresholds, latent variances, and latent means constant between the two subsamples. From model to model, the CFI never decreased by more than 0.01, and the most restrictive model had acceptable fit statistics for the CFI, TLI, SRMR, and even the RMSEA. Therefore, we concluded the single-factor, nine-item SMES fit well in both the clinical subsample and the second nonclinical subsample, and there was evidence of measurement invariance between the two subsamples. In addition, the SMES scale reliability was high in both subsamples (clinical ωcategorical = 0.93 [0.90, 0.94]; second nonclinical ωcategorical = 0.96 [0.94, 0.97]), as well as in the combination of the two (ωcategorical = 0.94 [0.93, 0.95]). The standardized factor loadings in the final MGCFA model were similar to those in the EFA displayed in Table 4 (see Supplemental Material for details).

Table 5 Results from multigroup confirmatory factor analysis of 9-item State Moral Elevation Scale

3.3 Construct Validity Analyses

Validity analyses were conducted using Pearson correlations with the same subsamples used for the MGCFA. Descriptive statistics for all validity measures are reported for both subsamples in Table 6. Results from the correlational analyses are presented in Fig. 3 and are reported for both the clinical and nonclinical subsamples. Exact numeric summaries are available in the Supplemental Materials (see Table S2).

Table 6 Descriptive statistics for clinical subsample and second nonclinical subsample used for analysis of invariance and validity
Fig. 3 Results from correlation analyses between State Moral Elevation Scale (SMES) scores and variables assessing construct validity. Dots and lines represent Pearson correlations and their 95% confidence intervals for each subsample, separately. The background is shaded according to conventional effect-size regions of small (.1 to .3), medium (.3 to .5), and large (.5 and up)

3.3.1 Concurrent Validity

As hypothesized, there were moderate positive associations between SMES scores and trait mindfulness and trait gratitude, and a strong positive association with positive affect experienced immediately after viewing the same video clip.

3.3.2 Convergent Validity

Results indicated moderate positive associations between SMES scores and all three subscales of trait-like engagement with beauty in both subsamples. As hypothesized, the effects were strongest for the association between state elevation and engagement with moral beauty.

3.3.3 Divergent Validity

As hypothesized, SMES scores were negatively associated with negative views about the world or other people after viewing the video clip in both subsamples; however, the effect size was surprisingly small. Unexpectedly, state elevation did not demonstrate strong associations with guilt, shame, or negative affect. These correlations were in the direction opposite to our hypotheses, but the effect sizes were near zero, ranging from 0.06 to 0.11, with some confidence intervals that contained or approached zero. For example, the correlation between the SMES and negative affect in the overall sample was 0.06 [− 0.02, 0.14].

4 Discussion

The purpose of this study was to develop a measure of state-level elevation and assess psychometric properties, including measurement invariance across clinical and nonclinical subsamples.

First, results from the EFA supported the expected single-factor structure with nine items. Additionally, findings indicated strong internal consistency and provided preliminary evidence of validity with adequate fit statistics. The nine retained items represent an equal distribution of elevation-related motivations, emotional reactions, and physical reactions. These domains and items are consistent with theory regarding elevation's core and distinctive features compared to other positive emotions. Many of the retained items were also among the most commonly used items in the elevation literature to date. For example, inspired was the most frequently used item among the previous studies reviewed prior to data collection (53.1%; see Table S1). Warm or glowing feeling in my chest (36.7%) and want to become a better person (32.7%) were also in the top five most frequently used items.

CFA results for each group demonstrated acceptable fit statistics, but more importantly, results from the MGCFA suggested measurement invariance across clinical and nonclinical subsamples at the most restricted levels. Model fit did not degrade at any step of invariance testing, and the final model—which included constrained factor loadings, item thresholds, latent variances, and latent means—indicated adequate fit with regard to the CFI, TLI, SRMR, and RMSEA. In sum, these results suggest total scores on the SMES have the same meaning in both clinical and nonclinical subsamples. This finding is particularly important for any future research efforts that aim to assess whether the experience of state elevation can benefit those suffering from psychological distress.

Consistent with hypotheses, findings from correlation analyses in both subsamples provided evidence of concurrent validity with state positive affect, trait gratitude, and trait mindfulness. Additionally, SMES scores were significantly related to all subscales of the EBS, but the strongest associations were with self-reported tendencies to be moved by moral beauty (i.e., trait elevation), as expected. This finding provides initial support for the SMES as a state-level assessment of the same construct captured, in trait-like form, by the established engagement with moral beauty subscale. Overall, results from correlation analyses with relevant positive constructs, measured at both the state and trait levels across both subsamples, were consistent with hypotheses and provided support for concurrent and convergent validity.

In support of divergent validity, SMES scores were negatively correlated with negative views of others after watching the video, as expected; however, hypotheses regarding negative correlations with negative emotions (i.e., negative affect, guilt, shame) were not supported. One possible explanation is restricted variance in state-level negative emotions following a positive, inspirational video. This study design was not intended to induce any emotions prior to watching the video, but perhaps participants would report a more diverse set of negative emotions if they were primed to be in a negative mood beforehand. Nevertheless, the negative correlation with negative views of others is consistent with the theorized cognitive appraisal of elevation and provides initial support for divergent validity.

4.1 Limitations and Future Directions

Results should be considered in light of several limitations. First, this study used one stimulus to elicit state elevation for all participants; however, it is unknown whether the selected video could be considered a universal stimulus, or if there is significant variability in the type of content or stimulus that elicits elevation across people. Addressing this gap is an important area for future research that will expand our understanding of how elevation is experienced and improve study designs moving forward.

Second, these data were collected with an online survey using a crowdsourcing platform, which involves uncertainty regarding participants' effort and attention during study tasks. Nonetheless, we aimed to increase confidence in the quality of results by offering fair compensation, screening for participants with a history of successful task completions, and using multiple attention screeners. Relatedly, there are also limitations to relying on screening tools for mental health symptoms, and it is possible that the identified clinical subsample is not representative of a clinical population with significant distress. Therefore, future research is needed to replicate these results using alternative recruitment methods, including clinical samples with more comprehensive assessment procedures.

Lastly, these data were collected in late March 2020, during the early stages of the COVID-19 pandemic; therefore, it is possible that responses to survey questions were affected by changes in participants' life circumstances at the time, or by the subsequent impact on their psychosocial wellbeing. Thus, attempts to replicate these results in future studies could also address potential chronology bias or any unexpected influence associated with the pandemic.

5 Conclusion

Although past work has identified several psychosocial benefits of elevation and provided reason for further exploration, a significant limitation to date is inconsistency in the measurement of state-level elevation experiences. To our knowledge, no English-language state measure has been developed and thoroughly tested for reliability and validity, particularly with respect to measurement invariance with clinical populations, despite increased interest in studying the potential therapeutic benefits. Overall, results from this study indicate that the SMES is a psychometrically valid and reliable tool for assessing state elevation in both clinical and nonclinical populations. This development is important for critical future directions in the field of elevation research, such as studies that aim to assess specific contexts and elicitors of elevation (i.e., experimental designs), determine within-person variability over time or in daily life (i.e., longitudinal designs), and measure potential benefits to wellbeing that can be attributed to state elevation induction (i.e., intervention studies). It is our hope that this measure facilitates robust research on elevation moving forward and contributes to our understanding of its full potential for psychosocial wellbeing.