Introduction

Psychological stress has been linked to numerous health problems worldwide, including cardiovascular disease, hypertension, and depression1,2. Traditionally, psychological stress has been monitored via patient-reported questionnaires such as the Perceived Stress Scale (PSS). The PSS is a well-established instrument for measuring stress, with high reliability and validity3,4,5,6. It has been widely used as a reference for studying other modalities of stress measurement (e.g., cortisol concentration7,8,9) and for measuring the effectiveness of stress management techniques10. More recently, there has been growing interest in AI-based digital health tools for the assessment of stress, depression, and anxiety11,12. The Cigna StressWaves Test (CSWT) is a publicly available, proprietary AI tool that estimates psychological stress from a user-provided speech sample, using acoustic features of the speech and semantic features of the words spoken13,14. To our knowledge, no validation data have been published for the CSWT, despite its wide availability and its integration into a broader stress and anxiety management offering from a global health services company. This paper presents independent validation data for the CSWT.

Speech-based artificial intelligence (AI) models have been proposed to monitor a speaker’s stress level, but their validation remains limited compared to standard instruments. While there is a body of scientific literature on speech production under stress15, comparatively little has been done in terms of model validation12. In contrast, established scales like the PSS demonstrate high internal consistency, temporal stability, and construct validity, as evidenced by high intra-class correlations and correlations with other psychometric scales3,4,5,6.

Despite the lack of independent validation, the CSWT asserts “clinical-grade” performance16, utility as a “stress diagnostic tool”, and suitability for “regular check-ins to retake the test”17; all of this connotes high reliability and validity18. In this paper, we assess these claims by examining the CSWT’s test–retest reliability and its validity relative to the PSS.

Results

Sixty participants (36 F, 24 M) completed the CSWT twice during the same session (for the reliability analysis) and the PSS once (for the validity analysis). The administration order of the PSS and the CSWT was counterbalanced across participants. Table 1 shows descriptive statistics for the stress scales and the participants’ age.

Table 1 Descriptive statistics of the study sample (N = 60, 36 F, 24 M).

Repeatability results

The test–retest reliability, as measured by the intra-class correlation between the full-scale outputs of the two CSWT administrations, indicated that the test was not repeatable (ICC = −0.106, p > 0.05). This is shown in Fig. 1, which displays the full-scale outputs from the two CSWT administrations along with the identity line (y = x). The reliability results did not change when the ordinal outputs were compared: Cohen’s Kappa between the CSWT ordinal ratings showed no significant agreement between the two administrations of the test (κ = −0.176, p > 0.05).
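For illustration, a test–retest ICC of this kind can be computed with the irr package used in our analysis (see Statistical analysis). The sketch below uses toy values and hypothetical column names (cswt_1, cswt_2); the two-way, single-rater, absolute-agreement ICC shown is a common choice for test–retest designs, not a confirmed detail of our pipeline.

```r
# Minimal sketch: test–retest ICC between two administrations of the same test.
# Scores are toy placeholders; column names are hypothetical.
library(irr)

scores <- data.frame(
  cswt_1 = c(62, 45, 71, 38, 55, 49),  # first administration
  cswt_2 = c(40, 68, 52, 60, 47, 63)   # second administration
)

# Two-way, single-rater, absolute-agreement ICC; the appropriate variant
# depends on the study design.
res <- icc(scores, model = "twoway", type = "agreement", unit = "single")
res$value    # ICC point estimate
res$p.value  # test against ICC = 0
```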

Figure 1. The test–retest plot for the Cigna StressWaves Test. Each pair of samples was measured during the same session. The intra-class correlation of the test is ICC = −0.106, p > 0.05.

Validity results

Convergent validity was assessed by examining the correlation between the CSWT and PSS full-scale scores and Cohen’s Kappa between the ordinal ratings. The CSWT score (the average of the two test administrations) was not significantly correlated with the PSS (r = 0.200, p > 0.05). This is shown in Fig. 2, which displays the averaged full-scale outputs from the CSWT against the PSS. The validity results did not change when the ordinal outputs were compared: Cohen’s Kappa between the PSS ordinal ratings and each CSWT administration’s ordinal ratings showed no significant agreement (PSS vs. CSWT (1): κ = 0.127, p > 0.05; PSS vs. CSWT (2): κ = 0.12, p > 0.05).
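For illustration, the ordinal agreement and full-scale correlation reported above can be computed as follows; all values and column names are toy placeholders.

```r
# Minimal sketch: Cohen's Kappa between two sets of ordinal ratings and
# Pearson correlation between full-scale scores. Values are toy placeholders.
library(irr)

levels_3 <- c("low", "moderate", "high")
ratings <- data.frame(
  pss  = factor(c("low", "moderate", "high", "moderate", "low", "high"),
                levels = levels_3),
  cswt = factor(c("moderate", "low", "moderate", "high", "low", "moderate"),
                levels = levels_3)
)
kappa2(ratings)  # unweighted Cohen's Kappa by default

# Pearson correlation between the PSS and the averaged CSWT scores
pss_scores <- c(18, 25, 31, 22, 15, 27)
cswt_avg   <- c(52, 47, 66, 58, 43, 60)
cor.test(pss_scores, cswt_avg)
```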

Figure 2. Convergent validity plot for the Cigna StressWaves Test relative to the Perceived Stress Scale. The correlation between the two scores is r = 0.200, p > 0.05.

We further assessed convergent validity by using both CSWT administrations to predict the PSS via multiple linear regression. The regression indicated a statistically significant collective effect of the two CSWT administrations on the PSS (F(2, 57) = 3.184, p < 0.05, adjusted R² = 0.069); that is, when both CSWT administrations are used to predict the PSS, the model explains only 6.9% of the variance in the PSS. Taken together, these results suggest poor convergent validity of the CSWT relative to the PSS.
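The regression can be sketched in base R as follows; the data frame and its column names are hypothetical stand-ins for our dataset.

```r
# Minimal sketch: predicting the PSS from both CSWT administrations via
# multiple linear regression. Values and column names are toy placeholders.
df <- data.frame(
  pss    = c(18, 25, 31, 22, 15, 27, 29, 20),
  cswt_1 = c(52, 47, 66, 58, 43, 60, 55, 50),
  cswt_2 = c(49, 55, 60, 50, 46, 63, 58, 44)
)
fit <- lm(pss ~ cswt_1 + cswt_2, data = df)
summary(fit)  # reports the F-statistic, its p-value, and the adjusted R-squared
```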

Discussion

The CSWT is presented as a clinical-grade tool and offered as part of a broader stress management toolkit. The results herein fail to support the claim of clinical-grade performance and raise questions as to whether the tool is effective at all. This external validation study found that the CSWT has poor test–retest reliability and poor validity. The convergent validity results suggest that the CSWT has limited agreement with the PSS; even when both test administrations were used to predict the PSS via linear regression, the model explained only 6.9% of the variance in the PSS. Our findings align with previously highlighted concerns that widespread adoption of AI technologies is being prioritized over ensuring that these devices work12. The wide availability of this tool for stress and anxiety management, particularly through a large insurance company, may lead users to rely on it when assessing their psychological stress levels and making healthcare decisions. As a result, misleading or inaccurate results can contribute to a variety of negative consequences, such as inappropriate treatment, wasted resources, increased anxiety, or false reassurance.

Additionally, the CSWT’s interpretations of a respondent’s results are not limited to state psychological stress (acute, transient) that the respondent may be feeling at the time they complete the test; rather, the interpretations extend to trait psychological stress (e.g., “you’re under a balanced level of pressure day-to-day”). Extrapolating trait psychological stress from a single 1-minute speech sample is unlikely to be feasible, even if the CSWT scores were valid and reliable for assessing state psychological stress.

The results of this study serve as an example of the fallacy of AI functionality19, whereby companies deploy AI tools under the assumption that they work, but without the requisite validation data. In healthcare, the mechanisms for verifying claims about a device’s functionality are well established20,21. Online digital health tools should not be exempt from this level of scrutiny. Any deployed digital health tool should be grounded in verifiable claims with published evidence of functionality; in the absence of such data, these tools should not be made widely available.

The results of this study further highlight the previously documented challenges associated with building speech-based measures of health22. The within-subject and between-subject variability of speech production makes robust cross-sectional prediction challenging. The lack of transparency surrounding the CSWT (in terms of validation data, functionality, and contact information) also makes it difficult to evaluate model quality. While the CSWT does not make public information regarding the underlying model (i.e., which acoustic and semantic features are used), the most common approach to building clinical speech models is supervised learning23, in which high-dimensional models are trained to predict a clinical variable of interest. It has been documented that models trained under this paradigm are less likely to generalize22,23, which can be partially attributed to the variability of commonly used features in the clinical speech literature24. We posit that this feature variability imposes inherent limits on any algorithm’s ability to accurately predict complex health constructs (e.g., psychological stress, depression, anxiety) directly from speech. Importantly, this limitation cannot be overcome by collecting more training data or using more complex models, as it is a property of the variability associated with human speech production.

Method

Participants

Our study included 60 participants over the age of 18, recruited at Arizona State University. The research was approved by the institutional review board of Arizona State University (IRB #00016588). The methods were carried out in accordance with the approved IRB protocol, and informed consent was collected from all participants via an online form prior to the start of the experiment. The inclusion criteria were broad: any English speaker over the age of 18 could participate. The Cigna StressWaves website indicates that the tool can be used by all English speakers, even if English is not their primary language13.

Test setting

All participants used the same equipment (i.e., Logitech H390 Wired Headset connected to a Dell computer) and conducted the experiment in a quiet laboratory environment. Participants were not shown their CSWT stress scores.

The Cigna StressWaves Test

The CSWT is presented as a clinical-grade tool for assessing a patient’s psychological stress level based on analysis of their speech. The user is prompted to select a question and provide a spoken response lasting at least 60 seconds. In this study, we asked participants to take the test twice to evaluate test–retest reliability. Each participant responded to one of eight prompts on two consecutive administrations of the test during the same session (all sessions lasted 10 min or less). Participants were free to choose any of the eight prompts for each of the two administrations; only one participant chose the same prompt twice. The tool provides an ordinal output (low, moderate, or high) and a full-scale score presented on a gradient scale. Each participant also completed the 10-question PSS, which is likewise scored on a full scale (numerical range 0 to 40) and on a three-level ordinal scale (low, moderate, high)25. The order of the PSS and the CSWT was randomized across participants.

Statistical analysis

The primary analysis in this study is the test–retest reliability of the CSWT, measured via the intra-class correlation (ICC) between the first and second administrations. The secondary analysis evaluates the validity of the CSWT relative to the PSS, measured via the correlation between the PSS score and the average of the two CSWT scores; we average the scores of the two administrations to reduce CSWT variability. We use the PSS as a comparison because it produces a full-scale score on the same range as the CSWT. Both tests also provide ordinal ratings (low, moderate, high); for these, we use Cohen’s Kappa to assess repeatability and validity relative to the PSS. Statistical analyses were conducted in RStudio with the irr package26.

Power analysis

Sample size estimates are based on the primary analysis (test–retest reliability), using the method in27. We assume an expected ICC of 0.75, per the definition of a clinical-grade test18, and set our threshold for an acceptable ICC at the moderate level of 0.5. We use this lower threshold as a criterion because the CSWT is a novel test that relies on speech, and acoustic speech features inherently exhibit considerable variability, which we take into account when establishing the lower performance benchmark24. For a significance level of 0.05 and a power of 80%, the required sample size is 55 subjects; we add 5 subjects to account for potential dropouts, missing data, or issues during data collection. For the secondary analysis, a sample size of 55 subjects allows us to detect a correlation of at least 0.33 between the CSWT and PSS at a significance level of 0.05 and a power of 80%28.
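For reference, the secondary (correlation) calculation can be reproduced with the pwr package; pwr is used here for illustration and was not necessarily part of the original analysis, and the one-sided alternative is an assumption that recovers a detectable correlation of roughly 0.33 at n = 55.

```r
# Minimal sketch: smallest detectable correlation at n = 55, alpha = 0.05,
# power = 0.80. A one-sided alternative is assumed; this yields r of ~0.33.
library(pwr)
pwr.r.test(n = 55, sig.level = 0.05, power = 0.80, alternative = "greater")
```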