Internalizing disorder symptoms such as depression and anxiety are among the leading causes of disability worldwide (Kotov et al., 2017; Whiteford et al., 2013). Most individuals with internalizing disorder symptoms prefer treatment with psychological interventions to treatment with medications (Löwe et al., 2006; McHugh et al., 2013) and are more likely to adhere to it (Swift et al., 2017). The most well-researched psychological interventions available for internalizing disorders are cognitive–behavioral therapies (CBTs; Lorenzo-Luaces, 2018; Lorenzo-Luaces et al., 2021a; Cuijpers et al., 2019). CBTs are a family of interventions that aim to change behavior, cognition, and meta-cognition via the use of cognitive, behavioral, and more recently acceptance and mindfulness-based interventions (DeRubeis & Lorenzo-Luaces, 2017; Hofmann & Hayes, 2019). Despite the existence of CBTs and other effective interventions, the burden of internalizing disorders has not decreased in the past four decades and most individuals still do not receive treatment (Jorm et al., 2017; Kazdin & Blase, 2011).

To increase the dissemination of psychological interventions, researchers have developed self-help approaches to the treatment of internalizing disorder symptoms. Self-help provides much of the same content as do individual psychological interventions (e.g., psychoeducation about stress, coping skills), but the content is delivered via websites or internet/phone applications (Lorenzo-Luaces et al., 2018a, b; Wasil et al., 2021b), books (i.e., bibliotherapy; De Jesús-Romero et al., 2022), other formats such as group lessons (Dolan et al., 2021), or some combination of these. In unguided self-help, an individual uses the self-help material by themselves. In guided self-help (GSH), an individual uses self-help material while a professional or paraprofessional promotes adherence to the material and provides emotional support. GSH-CBT appears to be roughly comparable to face-to-face CBT in its efficacy, and both GSH-CBT and individual face-to-face CBT are more effective than unguided self-help (Cuijpers et al., 2019). Given that GSH-CBT does not require the presence of a highly trained professional, it may be more scalable than face-to-face psychotherapy and therefore may have greater potential to reduce the public health burden of psychopathology (Kazdin & Blase, 2011). Despite the promise of GSH-CBT approaches, there are numerous barriers to their widespread adoption. Many popular books and apps have little to no research supporting their efficacy (Wasil et al., 2020). Those that are empirically supported are usually not available to the general public or to most clinicians (Wasil et al., 2021b), and clinicians do not frequently use GSH approaches (Peipert et al., 2022a, b). Even those interventions that have been studied and are freely available are often not very accessible (e.g., they are written at a very high reading level; Martinez et al., 2008).

Beyond its scalability, the acceptability of an intervention may be a factor determining its successful dissemination (Michie et al., 2011). Wolf (1978) was one of the first to define acceptability, although he used the terms “social validity” and “social importance,” referring to the extent to which an intervention introduces changes consistent with a client’s or society’s goals, is deemed appropriate to use, and is satisfactory to the client. Although GSH-CBT appears to have roughly equal outcomes to face-to-face individual CBT, it tends to be less acceptable to patients, as evidenced by higher rates of dropout after treatment initiation (Cuijpers et al., 2019). Researchers have tried a number of strategies to increase the acceptability of GSH-CBT while maintaining its scalability, including changing the modality of the guidance (e.g., using text messages or e-mail instead of video calls), adjusting the dose of guidance, using eCoaches with varying qualifications (e.g., paraprofessionals versus experienced psychotherapists), and comparing synchronous versus asynchronous communication modes (Baumeister et al., 2014). By and large, the literature supports the importance of guidance in self-help CBT for achieving better outcomes and engagement, but does not provide strong support for other ways of increasing acceptability (Bur et al., 2022; Furukawa et al., 2021).

Doing What Matters in Times of Stress (DWM), previously called Self-Help Plus, is a GSH-CBT developed by the World Health Organization (WHO) based on principles of acceptance and commitment therapy (ACT), a form of CBT (Hofmann & Hayes, 2019). Originally, DWM consisted of five group-based psychoeducation sessions delivered through pre-recorded audio materials with a corresponding print guide. Two small trials supported the efficacy and feasibility of DWM (Tol et al., 2018a, b). In a subsequent cluster randomized controlled trial (RCT) of Sudanese refugees in Uganda (Tol et al., 2020), 694 women from 14 villages were randomized to DWM or enhanced care as usual (eCAU), which consisted of brief psychoeducation plus the provision of referrals. After 6 weeks, individuals in DWM reported lower levels of internalizing distress [standardized mean difference (SMD) = 0.72] and higher levels of psychological well-being (SMD = 0.51). In a preventive RCT with refugees and asylum seekers (N = 459), DWM appeared effective in reducing mental health symptoms among individuals with no mental health diagnosis, but did not predict a lower incidence of mental health diagnoses at a long-term follow-up (Purgato et al., 2021). A more recent RCT (N = 642) supported the preventive effects of DWM versus eCAU for Syrian refugees in Turkey (Acarturk et al., 2022). Taken together, these data support the efficacy of DWM as a group-based mindfulness and acceptance-focused GSH-CBT.

Because DWM is made freely available by the WHO (https://www.who.int/publications/i/item/9789240003927) and was developed at an accessible reading level (approximately fourth grade), it has the potential to be a widely scalable intervention. Following the COVID-19 pandemic, several translations of the DWM booklet were made available online. Given the logistic difficulties associated with group-based GSH-CBT (e.g., COVID-19 risk associated with gathering), the WHO adapted the DWM intervention materials such that they could be delivered remotely via telephone or videoconferencing. We conducted an open trial aiming to test the feasibility of conducting GSH-CBT using DWM with individual online coaching, instead of group-based meetings, for individuals across the United States. To our knowledge, this is the first published study of DWM implementation in the US. We assessed the acceptability of the intervention as evidenced by engagement with the guidance in GSH-CBT, self-reported usability, and self-reported satisfaction. In addition to tracking feasibility and acceptability of the intervention, we assessed internalizing symptoms, well-being, and purported emotion regulation mechanisms of change: cognitive reappraisal and expressive suppression (Aldao & Nolen-Hoeksema, 2010). We also measured quality of life and work and social functioning, which may be less responsive to psychosocial intervention than internalizing symptoms (Lorenzo-Luaces & Amsterdam, 2018; Peipert et al., 2022a, b).

Methods

This study was approved by Indiana University’s Institutional Review Board. Participant recruitment began October 17, 2020, and ended February 21, 2022. All recruitment was conducted online and all participants provided informed consent for the study. The study was registered on ClinicalTrials.gov on July 1, 2021 (NCT04870099).

Participants

Participants were adults (ages 18+) living in the United States who were recruited from social media platforms, primarily Facebook, as well as Instagram and Twitter. We promoted the study by writing a post in our lab’s social media accounts directed to individuals “struggling with stress, depression, or anxiety.” The post contained a survey link and a flyer describing major components of the study (e.g., the compensation rate). Given the limited reach of our lab’s webpage, we used the paid features of the above social media platforms to promote the post. We sought to recruit participants using minimal entry criteria. The only criterion for entry was the presence of at least mild psychological distress, as evidenced by a score \(\ge\) 6 on the six-item Kessler Psychological Distress Scale (K6; see “K6” below). The only criterion for exclusion was the presence of recurrent death ideation/suicidality, which we initially operationalized as a score \(\ge\) 1 (i.e., “several days”) on item nine of the Patient Health Questionnaire-9 (PHQ-9; “thoughts that you would be better off dead or of hurting yourself”). We subsequently modified the suicidality criterion to exclude only participants who scored \(\ge\) 2 (i.e., “more than half the days”), thus allowing a PHQ-9 item nine score of one. We made this change because our early data suggested that most individuals who were excluded from the study based on the original criterion had only infrequent death/suicidal ideation. Participants who were excluded on the basis of suicidality, regardless of how it was operationalized, were given resources for emergencies as well as for finding outpatient treatment providers.

Valid Respondents

Given that we conducted this study over the internet and across the United States, there was a possibility of fraudulent responses, including automated bots. To identify fraudulent respondents, we operationalized “bot-like” or fraudulent behavior as (a) responding that appeared implausibly fast (e.g., completing the roughly 100-question intake in 5 min or less), (b) filling out the study screening at a time in which we were not actively promoting the study, (c) providing duplicate contact information across entries (e.g., the same home address or e-mail), (d) providing seemingly fake contact information (e.g., the name “Deez Nutz”), or (e) other behavior that may be indicative of fraudulent responding (e.g., answers to questions that were actually “hidden” from survey participants), including inconsistent responding to survey items. Out of 486 “clicks” on our screening survey link, we deemed that 12.55% (n = 61) were fraudulent respondents. Of the purportedly human respondents (n = 425), 162 individuals did not complete the screening survey, for a total of 263 individuals fully screened for trial entry (see Fig. 1).

Fig. 1 Trial progression for individuals in a trial of transdiagnostic guided self-help for internalizing distress

Intervention

DWM is a five-chapter booklet that discusses principles of ACT and CBT, namely mindfulness (here called “grounding”), cognitive distancing (“unhooking”), value-based behavioral activation (“acting on your values”), gratitude (“being kind”), and acceptance (“making room”). Participants were given the option of reading a digital version of the DWM booklet, though most (88.89%) opted to receive the printed booklet by mail. Eligible participants were assigned an “eCoach” and contacted within a week of their qualification for the study.

eCoaches

The eCoaches were undergraduate/post-baccalaureate (JH, CL, and KB) and graduate students (RDJR, AP, and JB) at the beginning of the study. All were young adults and all were completing degrees in psychology. None of them had experience with ACT, though two had completed an introductory practicum in CBT prior to the start of the study (AP and RDJR). The undergraduate/post-baccalaureate eCoaches completed a four-hour training covering common mental disorders, principles of CBT, and technology-assisted GSH-CBT. All eCoaches (i.e., graduate and undergraduate students) completed a four-hour training over the course of four weeks that included reading the DWM booklet, completing the WHO’s EQUIP training (EQUIP), reading a chapter on “basic helping skills” from the WHO’s Problem Management Plus (PM+) guide, and reading a DWM-specific training manual provided by the WHO. Part of the training involved a safety plan for dealing with emergent suicidality. The safety plan involved administering the Columbia Suicide Severity Rating Scale (CSSRS; Posner et al., 2008) and giving crisis referrals for individuals who had acute suicidal ideation. The eCoaches also had to “pass” a roleplay with the principal investigator (LL-L).

Beyond passing the role-play with the PI, neither competence nor adherence was assessed systematically. For all interactions with participants, eCoaches were given semi-structured scripts. There were no statistically significant differences between the eCoaches in rates of study retention or in changes in any study outcome (ps > 0.09; see “Appendix”).

Guidance

The first communication for the study was an onboarding or “welcome” call in which participants were provided information about the study, given a chance to ask questions, and asked to schedule three to six individual weekly sessions over a 6-week period. We let participants schedule fewer than six sessions aiming to maximize engagement for participants who thought six sessions would be too burdensome. During the welcome call, eCoaches helped participants create a plan to use the guide, primarily via a short exercise of mental contrasting with implementation intentions. As part of the exercise, eCoaches helped participants identify how they wanted to use the guide, predict effects of using the guide, identify potential obstacles to adhering to the guide, and brainstorm potential solutions for the predicted obstacles. Subsequent meetings also followed a semi-structured script.

The guidance meetings began by asking participants to confirm that they were still available to meet and to confirm their understanding of study confidentiality. The eCoaches ensured that study measures had been filled out and applied a principle of measurement-based care: reviewing changes in internalizing symptoms, well-being, and emotion regulation with participants. Next, eCoaches assessed adherence to the reading, which was typically scheduled as one chapter per week. If the participant was able to use the guide with no or only a few challenges, they were praised for their efforts, asked whether they had practiced the specific exercises described in the chapter, and asked to comment on what they noticed while practicing the exercises. If the participant was not able to use the guide, the eCoaches were instructed to show empathy and understanding and to review the participant’s plan to use the book, including whether it adequately addressed the challenges the participant had encountered. Regardless of whether participants were able to read the book, if they were still interested in participating, the eCoach confirmed the next appointment. The PI provided weekly, as well as ad-hoc, supervision. Supervision focused on using concrete behavioral strategies (e.g., problem-solving) for promoting adherence to the intervention and on dealing with resistance by using principles of motivational interviewing.

Outcome Measures

The main outcome measures we used to assess efficacy, potential mechanisms, acceptability, and feasibility are described below. All survey questionnaires were administered online using REDCap. For each survey questionnaire, we present a measure of internal consistency, omega (\(\omega\)), which is interpreted in the same manner as Cronbach’s α but may have more desirable psychometric properties (Flora, 2020).

Acceptability and Feasibility

We report the number of participants who (1) completed our baseline assessment, (2) qualified for the study, (3) were reached for an onboarding call, (4) completed one session of GSH-CBT, (5) completed more than 50% of the scheduled assessments, and (6) completed the post-treatment (week 6) assessments. In addition to these metrics, we administered the System Usability Scale (SUS; Brooke et al., 1996), a brief measure for assessing the usability of given systems (e.g., websites). The SUS has ten items and is rated on a five-point scale with responses ranging from 0 (“strongly agree”) to 4 (“strongly disagree”). Summed item scores are multiplied by 2.5 to produce scale scores ranging from 0 to 100. Prior work supports the reliability and validity of the SUS (Lewis, 2018). We administered the SUS at week 6, but we also administered a slightly modified version of the SUS at baseline to control for between-individual differences (e.g., the tendency to give high ratings). Participants were asked to rate the usability of the book at baseline “based on the small amount of information [they] have on the study.” In the current study, the baseline scores on the SUS appeared internally consistent (\(\omega\) = 0.85, 95% CI 0.81–0.89).

K6

The K6 (Kessler et al., 2002) is a six-item self-report scale that measures internalizing distress (e.g., nervousness, depression). Its items are rated on a scale of 0 (“none of the time”) to 4 (“all of the time”), producing scores ranging from 0 to 24 where higher scores indicate greater psychological distress. Previous studies support the reliability and validity of the K6 (Batterham et al., 2018; Staples et al., 2019) and it has previously been used as an outcome measure in GSH-CBT studies (Lorenzo-Luaces et al., 2018a, b). A score of 13 or higher indicates “severe” symptoms of internalizing distress. In the current study, the baseline scores on the K6 appeared internally consistent (\(\omega\) = 0.70, 95% CI 0.62–0.79). The K6 was administered at baseline and every week thereafter, up to the week 6 post-treatment assessment.

WHO Well-Being Index (WHO-5)

The WHO-5 (Topp et al., 2015) is a five-item self-report scale that measures subjective well-being, an aspect of positive mental health. Its items are rated on a scale of 0 (“at no time”) to 5 (“all of the time”). The raw total scores (0 to 25) are multiplied by 4, producing final scores ranging from 0 to 100, where higher scores indicate greater well-being. Prior work supports the reliability and validity of the WHO-5 (Topp et al., 2015), and it has previously been studied as an outcome measure in GSH-CBT studies (Tol et al., 2020). A score of 50 is considered a useful cutoff for screening for major depression. In the current study, the baseline scores on the WHO-5 appeared internally consistent (\(\omega\) = 0.79, 95% CI 0.73–0.86). The WHO-5 was administered at baseline and every week thereafter, up to the week 6 post-treatment assessment.
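The WHO-5 scoring rule described above can be illustrated with a minimal Python sketch. This is not the study's own code; the function names are hypothetical, and treating scores at or below 50 as a positive screen is an assumption about how the cutoff is applied.

```python
def who5_score(ratings):
    """Return the 0-100 WHO-5 score from five item ratings (each 0-5)."""
    if len(ratings) != 5:
        raise ValueError("the WHO-5 has exactly five items")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("each rating must be between 0 and 5")
    # Raw total (0-25) multiplied by 4 gives the 0-100 metric
    return sum(ratings) * 4


def screens_positive(score, cutoff=50):
    """Assumption: scores at or below the cutoff count as a positive screen."""
    return score <= cutoff


# Hypothetical respondent with low well-being
example = who5_score([1, 2, 1, 2, 1])
print(example, screens_positive(example))
```

Higher scores indicate greater well-being, so a positive screen corresponds to a low score.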

Work and Social Adjustment Scale (WSAS)

The WSAS (Mundt et al., 2002) is a five-item self-report measure that assesses impairment in work, relationships, household, and leisure activities as a result of a specific problem; in our study we queried the effect of “stress” on functioning (e.g., “[b]ecause of my stress my ability to form and maintain close relationships with others, including those I live with, is impaired”). Each item is rated on a nine-point Likert scale ranging from 0 (“not at all”) to 8 (“very severely”), producing scores ranging from 0 to 40 where higher scores indicate greater impairment. Scores over 10 and over 20 are considered to indicate moderate and severe impairment, respectively. Prior work supports the reliability and validity of the WSAS across various patient populations (Zahra et al., 2014). In the current study, the baseline scores on the WSAS appeared internally consistent (\(\omega\) = 0.82, 95% CI 0.77–0.87).

Emotion Regulation Questionnaire (ERQ)

The ERQ (Gross & John, 2003) is a ten-item self-report measure of individual differences in the use of two emotion regulation strategies: cognitive reappraisal (ERQ-reappraisal; items 1, 3, 5, 7, 8, and 10) and expressive suppression (ERQ-suppression; items 2, 4, 6, and 9). Prior work supports the reliability and validity of the ERQ in community samples (Preece et al., 2019, 2021). The ERQ items are rated on a seven-point Likert scale with responses ranging from 1 (“strongly disagree”) to 7 (“strongly agree”). Because the two subscales have different numbers of items, we averaged item scores so that both subscale scores are on the metric of the original items (i.e., 1 to 7) and are therefore comparable. In the current study, the baseline scores on the ERQ-reappraisal (\(\omega\) = 0.83, 95% CI 0.77–0.88) and ERQ-suppression (\(\omega\) = 0.87, 95% CI 0.83–0.91) subscales appeared internally consistent. The ERQ was administered at baseline and every week thereafter, up to the week 6 post-treatment assessment.
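As an illustration of the averaging described above, the following Python sketch computes the two ERQ subscale means. It is not the study's code; the names are hypothetical, and the item assignment assumes the standard ERQ scoring key, in which item 7 belongs to reappraisal and item 6 to suppression.

```python
from statistics import mean

# Assumed standard ERQ scoring key (Gross & John, 2003)
REAPPRAISAL_ITEMS = (1, 3, 5, 7, 8, 10)
SUPPRESSION_ITEMS = (2, 4, 6, 9)


def erq_subscales(responses):
    """responses: dict mapping item number (1-10) to a rating from 1 to 7."""
    if set(responses) != set(range(1, 11)):
        raise ValueError("expected ratings for items 1 through 10")
    if any(not 1 <= r <= 7 for r in responses.values()):
        raise ValueError("each rating must be between 1 and 7")
    # Averaging keeps both subscales on the 1-7 item metric even though
    # they contain different numbers of items (six versus four)
    return {
        "reappraisal": mean(responses[i] for i in REAPPRAISAL_ITEMS),
        "suppression": mean(responses[i] for i in SUPPRESSION_ITEMS),
    }


# Hypothetical respondent: all items rated 4 except item 7
ratings = {item: 4 for item in range(1, 11)}
ratings[7] = 7
print(erq_subscales(ratings))
```

Using the mean rather than the sum is what makes a reappraisal score of, say, 4.5 directly comparable with a suppression score of 4.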

Other Assessments

We also administered (1) the PROMIS Depression Short Form (also known as the Cross-Cutting DSM Severity Measure for Depression; American Psychiatric Association, 2013), an eight-item self-report measure of depression symptoms that produces scores ranging from 8 to 40, (2) the Alcohol Use Disorders Identification Test (AUDIT; Allen et al., 1997), a self-report screening scale for problematic drinking that produces scores ranging from 0 to 40, and (3) a single-item measure of self-rated health (Jylhä, 2009) assessed on a five-point scale. Prior work supports the reliability and validity of both the PROMIS (Cella et al., 2010, 2019; Pilkonis et al., 2014) and the AUDIT (Meneses-Gaya et al., 2009). We included these measures to better characterize the sample, for example, regarding the specific types of internalizing distress experienced (i.e., depression versus other), comorbid externalizing symptoms (alcohol use disorder being the most common type of externalizing disorder), as well as physical health and somatoform symptoms (Brissette et al., 2003). An additional rationale for including these assessments is that these variables have been found to predict treatment outcomes (Kessler et al., 2017).

Sample Size Justification and Power

We initially powered the trial to be able to detect a statistically significant difference in engagement relative to the 65.1% engagement rate reported in Van Ballegooijen et al.’s (2014) meta-analysis. We used an online calculator (https://sample-size.net/sample-size-conf-interval-proportion/; Kohn & Senyak, 2021) to estimate the sample size required to test for a statistically significant difference between a sample proportion and the expected value (i.e., 65.1%) at a p value < .05. This analysis suggested we needed to recruit at least 95 individuals to have an adequately powered acceptability trial. We ultimately recruited more participants than necessary (n = 141), partially due to uncertainty about our ability to retain all individuals. In addition, this was possible because the PI (LL-L) obtained additional funding for an extension assessing the effects of GSH-CBT on natural language metrics of social media data. That subcomponent of the study (i.e., whether GSH-CBT produces effects detectable via social media) is not used in the current analysis. We used G*Power 3.1 (Faul et al., 2007) to estimate the magnitude of within-person change that we had adequate power to detect in our 141-participant sample. The results of this analysis suggested that we could detect small-to-medium within-person changes (d = 0.25, at p < .05 and power of 80%) with the achieved sample size (i.e., n = 141).
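The sensitivity estimate above can be roughly checked with a normal approximation. The sketch below is not the G*Power computation, which relies on the noncentral t distribution and therefore gives slightly different values; the function names are hypothetical, and a two-sided test of a within-person (paired) standardized difference is assumed.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def min_detectable_d(n, alpha=0.05, power=0.80):
    """Smallest within-person d detectable with n pairs (normal approximation)."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = STANDARD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) / sqrt(n)


def approx_power(d, n, alpha=0.05):
    """Approximate power to detect a within-person effect d with n pairs."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    return STANDARD_NORMAL.cdf(abs(d) * sqrt(n) - z_alpha)


print(round(min_detectable_d(141), 3))    # minimum detectable d for n = 141
print(round(approx_power(0.25, 141), 3))  # approximate power for d = 0.25
```

Under this approximation, a sample of 141 gives a minimum detectable effect slightly below d = 0.25 at 80% power, which is in line with the G*Power result reported above.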

Analytic Strategy

All analyses were conducted in R version 4.1.2 (see R Core Team, 2013) using the RStudio graphical user interface. All code and deidentified data are available on the Open Science Framework (OSF) site (https://osf.io/j32uw/). First, we report the percentage of participants who progressed from beginning the survey to the end of the study. We followed an intent-to-treat (ITT) approach, analyzing data from all individuals who reached the onboarding call to confirm trial participation.

Multiverse Analysis of Engagement

There are at least two other definitions of trial entry possible in online trials like ours. One is less conservative: it counts as true study participants only those individuals who completed at least one assessment after the baseline eligibility survey. The other is more conservative: it counts as true study participants all individuals who qualified for the study, regardless of whether they were reachable following the baseline eligibility survey. We also analyzed two different definitions of trial completion: completing 100% of the scheduled GSH-CBT sessions, and completing at least 50% of the scheduled GSH-CBT sessions. Given variability in how study entry and completion could be defined, we conducted a “multiverse” analysis to assess the acceptability of the intervention (see Steegen et al., 2016). In a multiverse analysis, a researcher conducts and presents all reasonable ways of analyzing the data (here, of calculating the engagement rate). Multiverse analysis has been recommended as a way of increasing transparency in psychological science by presenting readers with multiple versions of the results, as opposed to presenting only the analyses that yield the most favorable effects. We chose a non-inferiority margin of 10% to determine whether observed engagement rates were non-inferior to the rate from the meta-analysis.
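One way to operationalize the non-inferiority comparison is to check whether the confidence interval for an engagement rate lies entirely above the benchmark minus the margin. The Python sketch below illustrates that logic. It is not the study's code: the function names are hypothetical, the Wilson interval is an assumed choice of interval method, the three-way classification rule is an assumption, and the counts in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(completers, entrants, confidence=0.95):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = completers / entrants
    denom = 1 + z**2 / entrants
    center = (p + z**2 / (2 * entrants)) / denom
    half_width = z * sqrt(p * (1 - p) / entrants + z**2 / (4 * entrants**2)) / denom
    return center - half_width, center + half_width


def classify_engagement(completers, entrants, benchmark=0.651, margin=0.10):
    """Compare an engagement rate with the benchmark minus the margin."""
    lower, upper = wilson_interval(completers, entrants)
    threshold = benchmark - margin  # 0.551 with the defaults
    if lower > threshold:
        return "non-inferior"   # whole interval above the threshold
    if upper < threshold:
        return "inferior"       # whole interval below the threshold
    return "inconclusive"       # interval straddles the threshold


# Hypothetical (completers, entrants) pairs, one per definition
for completers, entrants in [(70, 100), (50, 100), (40, 100)]:
    print(completers, entrants, classify_engagement(completers, entrants))
```

In a multiverse analysis, the same check would be repeated for every combination of entry and completion definitions, and all of the resulting classifications would be reported.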

Demographics and Change Over Time

We report baseline demographic and clinical characteristics by presenting means and standard deviations for continuous variables and percentages for categorical variables (see Table 1). To assess the preliminary efficacy of the intervention, we conducted mixed regression models using the lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) packages in R to regress internalizing symptoms (K6), well-being (WHO-5), ERQ-reappraisal, and ERQ-suppression on time in GSH-CBT. We coded the time variable by dividing each week by six, the total number of weeks in the study. This creates a variable ranging from 0 to 1 (e.g., week 1 is 0.17, week 2 is 0.33, week 3 is 0.50, etc.) where a one-point increase in “time” represents change from baseline to the end of treatment. We calculated effect sizes for the results of these mixed models by using the framework proposed by Feingold (2009) wherein estimated change over the course of the intervention is divided by the baseline standard deviation of the measure. Additionally, we assessed pre–post change in the perceived usability of DWM (i.e., the SUS score) as well as in psychosocial functioning (i.e., WSAS) by conducting a paired-sample t-test comparing the scores at baseline with the scores at week 6. We calculated the effect size of pre-post change in the SUS and WSAS by dividing change in these measures by the standard deviation of their change scores. We calculated 95% confidence intervals (CIs) for both of these d-type effect sizes by using the “cohen.d.ci” function in the psych package in R (Revelle, 2016).
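The time coding and the two d-type effect sizes described above can be sketched in a few lines of Python. This is an illustration rather than the study's R code; the function names and all numeric inputs are hypothetical.

```python
from statistics import mean, stdev

TOTAL_WEEKS = 6


def code_time(week, total_weeks=TOTAL_WEEKS):
    """Rescale week so that 0 is baseline and 1 is the end of treatment."""
    return week / total_weeks


def feingold_d(slope, baseline_sd):
    """Model-estimated baseline-to-endpoint change divided by the baseline SD.

    Because time runs from 0 to 1, the slope for time is the estimated
    change over the whole intervention.
    """
    return slope / baseline_sd


def paired_d(baseline, week6):
    """Mean pre-post change divided by the SD of the change scores."""
    changes = [post - pre for pre, post in zip(baseline, week6)]
    return mean(changes) / stdev(changes)


print([round(code_time(w), 2) for w in range(7)])
print(feingold_d(slope=-4.0, baseline_sd=2.0))        # hypothetical values
print(paired_d([10, 12, 14, 16], [8, 9, 13, 12]))     # hypothetical scores
```

Note that the two effect sizes use different denominators (baseline SD versus SD of change scores), so they are not directly interchangeable.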

Table 1 Sociodemographic and clinical features of 141 individuals in a trial of transdiagnostic guided self-help

In our analyses of the SUS and WSAS, we used a last-observation-carried-forward (LOCF) imputation approach to deal with missing week 6 scores. In our analyses of the K6, WHO-5, and ERQ, we used hierarchical linear modeling (HLM) to handle missing data (Hox, 2000). We refer to the results of these analyses as the “LOCF-imputed” results.
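For readers unfamiliar with LOCF, a minimal Python sketch of the general technique follows. It is not the study's code, the function name is hypothetical, and the scores are made up. For a measure collected only at baseline and week 6, LOCF amounts to carrying the baseline score forward when the week 6 score is missing.

```python
def locf(scores):
    """Replace each missing value (None) with the most recent observed value.

    Leading missing values stay None because there is nothing to carry forward.
    """
    filled = []
    last_observed = None
    for score in scores:
        if score is not None:
            last_observed = score
        filled.append(last_observed)
    return filled


# Hypothetical participant who dropped out after the week 3 assessment
weekly_scores = [20, 18, None, 15, None, None, None]
print(locf(weekly_scores))

# Hypothetical pre-post measure with a missing week 6 score
print(locf([24, None]))
```

Because LOCF assumes no change after the last observation, it tends to understate improvement among dropouts relative to model-based approaches.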

Missing Data Imputation

Missing data in demographic covariates were minimal, with most individuals providing complete information for most variables. A maximum of six individuals did not fully answer the SUS (4.3%). More longitudinal data were missing for the K6, WHO-5, and ERQ because of dropout or missed assessments, up to a maximum of 33.3% over the course of the intervention. Additionally, several WSAS and SUS ratings were missing, as not all participants completed the week 6 assessment. To address missing data, including missing outcome data, we imputed all missing data using a machine learning algorithm: non-parametric missing value imputation using random forests, with the R package missForest (Stekhoven & Bühlmann, 2012). To preserve the association between the variables (Ginkel et al., 2020), we did not pre-process variables and only minimally recoded them (see OSF). Imputation models that use multiple variables are preferred to LOCF-type imputation models like the one we describe above (Kenward & Molenberghs, 2009; Lachin, 2016). The variables in the imputation model included baseline to week 6 data on the K6, WHO-5, ERQ, WSAS, SUS, PROMIS depression severity, and baseline demographic and clinical data including age, gender, race, marital status, unemployment status, education, income, self-rated health, antidepressant history, and the age at which participants first struggled with internalizing disorder symptoms. We refer to these data as the RF-imputed data. For all of the analyses described above, we present both the random forest imputation (RF-imputed) results and the LOCF-imputed results.

Results

Feasibility

A total of 425 individuals started our screening survey, of whom 263 completed it (61.88%). Of these, 198 met our entry criteria. Nearly half (n = 32, 49.23%) of those who did not meet the entry criteria had symptoms that were too mild (i.e., K6 < 6). An equal number did not meet the criteria because of the suicidality exclusion (n = 32, 49.23%). One participant was excluded for not responding to the item probing suicidality. Of the 198 participants who qualified for the study, we were able to reach 141 individuals for an onboarding call (71.21%) to confirm participation. Of those, 119 (84.4%) initiated GSH. Of the participants whom we reached for an onboarding call, 100 (70.92%) completed at least half of their scheduled sessions and 97 (68.79%) completed the post-treatment assessment. We conducted the “multiverse” analysis by calculating the retention rates from all possible definitions of those who entered the study (i.e., qualified, n = 198; reached for onboarding call, n = 141; initiated GSH, n = 119) as well as the two definitions of those who completed the study (i.e., who attended the post-treatment assessment, n = 97, or who completed at least half of agreed-upon sessions, n = 100). Overall, four of the six engagement rates we calculated were non-inferior to the 65.1% engagement rate from the meta-analysis by Van Ballegooijen et al. (2014) and one was inconclusive (see Fig. 2).

Fig. 2 Multiverse analysis of engagement rates for different definitions of trial entry and trial completion for participants undergoing transdiagnostic guided self-help CBT

Sample Demographics

Most participants in the sample identified as cisgender women. Although different income categories were relatively well represented, the sample was somewhat more educated than one would expect of a representative US sample (see Table 1). On average, participants reported first struggling with internalizing disorder symptoms early in life (age in years: M = 16.47, SD = 9.54). Although our internalizing symptom entry criterion was relatively low (K6 \(\ge\) 6), the average participant reported internalizing distress scores substantially higher than the cutoff (K6: M = 10.99, SD = 3.59), as well as relatively low well-being (WHO-5: M = 29.55, SD = 14.65).

Attesting to the clinical severity of our sample, 131 (92.91%) individuals met the WSAS cutoff for moderate impairment, 88 (62.41%) met the WSAS cutoff for severe impairment, 125 (88.65%) met the WHO-5 cutoff for depression screening, 87 (61.7%) met the PROMIS depression cutoff for moderate depression, and 40 (28.37%) met the K6 cutoff for severe distress.

Acceptability

After reading the consent form, participants’ expectations of the usability of DWM were relatively high (RF-imputed: M = 72.18, SD = 13.41; LOCF-imputed: M = 71.94, SD = 13.59). Roughly 60% of individuals reported scores of 68 or higher (RF-imputed: 62.41%, LOCF-imputed: 60.74%). After 6 weeks of DWM, usability ratings (RF-imputed: M = 84.1, SD = 8.599; LOCF-imputed: M = 84.43, SD = 10.17) had increased substantially [RF-imputed: Mdiff = 11.89, t(140) = 11.16, p \(<\) .001; LOCF-imputed: Mdiff = 8.259, t(134) = 8.088, p \(<\) .001; see Table 2]. Of those queried at week 6, almost all reported that the intervention was usable (i.e., SUS \(\ge\) 68; RF-imputed: 94.33%, LOCF-imputed: 91.75%).

Table 2 Standardized mean differences (SMDs) of within-person changes for 141 participants with internalizing distress treated with transdiagnostic guided self-help CBT
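For readers who want to relate a reported paired t statistic to a standardized effect, the sketch below back-calculates the standard error, the standard deviation of the difference scores, and d_z = t / √n from the RF-imputed usability comparison reported above (Mdiff = 11.89, t(140) = 11.16, so n = 141). This is an illustration only: d_z is one of several conventions for within-person SMDs, and the values in Table 2 may have been computed with a different standardizer. The function name `paired_summary` is hypothetical.

```python
from math import sqrt


def paired_summary(mean_diff, t_value, n):
    """Recover SE, SD of differences, and d_z from a paired t-test report."""
    se = mean_diff / t_value      # t = mean_diff / SE
    sd_diff = se * sqrt(n)        # SE = SD_diff / sqrt(n)
    d_z = t_value / sqrt(n)       # equivalently mean_diff / SD_diff
    return {"se": se, "sd_diff": sd_diff, "d_z": d_z}


# RF-imputed usability change: Mdiff = 11.89, t(140) = 11.16, n = 140 + 1
usability = paired_summary(mean_diff=11.89, t_value=11.16, n=141)
print({name: round(value, 2) for name, value in usability.items()})
```

Under these assumptions the usability change corresponds to a d_z of roughly 0.94, which is in line with the text's description of a substantial increase.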

Symptom Reduction

Over the course of GSH-CBT, there were large reductions in internalizing disorder symptoms [RF-imputed: B = −4.37, SE = 0.29, t(140) = −14.96, p < .001; LOCF-imputed: B = −4.67, SE = 0.4, t(94.38) = −11.77, p < .001]. Similarly, over the course of GSH-CBT, there were large improvements in well-being [RF-imputed: B = 12.7, SE = 1.44, t(140) = 8.8, p < .001; LOCF-imputed: B = 13.8, SE = 1.95, t(107.4) = 7.055, p < .001]. From baseline to the post-treatment assessment, there were improvements in work and social functioning [RF-imputed: Mdiff = −7.87, t(140) = −11.91, p < .001; LOCF-imputed: Mdiff = −5.48, t(140) = −8.256, p < .001]. These changes were large in magnitude (see Table 2; Fig. 3). There were no reports of treatment-emergent suicidality.

Fig. 3

Changes in (a) internalizing distress, (b) well-being, (c) cognitive reappraisal, and (d) expressive suppression in 141 individuals in GSH-CBT

Emotion Regulation

Compared to nationally representative samples (Preece et al., 2021), participants in our study had relatively low levels of self-reported reappraisal use (RF-imputed: M = 4.2, SD = 1.12; LOCF-imputed: M = 4.21, SD = 1.14) but average levels of suppression use (RF-imputed: M = 3.73, SD = 1.32; LOCF-imputed: M = 3.71, SD = 1.33). Over the course of GSH-CBT, there were large increases in cognitive reappraisal use [RF-imputed: B = 0.823, SE = 0.078, t(140) = 10.59, p < .001; LOCF-imputed: B = 0.891, SE = 0.11, t(93.11) = 8.399, p < .001]. The reductions in expressive suppression were more modest [RF-imputed: B = −0.437, SE = 0.09, t(140) = −4.867, p < .001; LOCF-imputed: B = −0.513, SE = 0.12, t(101) = −4.339, p < .001; see Table 2].

Discussion

We conducted a fully remote nationwide clinical trial to assess the feasibility, acceptability, and preliminary efficacy of DWM, a GSH-CBT, in a United States sample. Our results demonstrate that DWM can be delivered in a fully remote fashion, consistent with other studies suggesting that nationwide recruitment of individuals with internalizing symptoms is feasible (Arean et al., 2016). Our multiverse analysis suggested that we were able to retain anywhere from 48 to 79% of individuals. The most conservative retention estimate (48%) was calculated with study entry defined as simply qualifying for the study after completing the baseline survey and study completion defined as completing more than 50% of scheduled GSH-CBT sessions. This conservative approach suggested that we lost half of study participants to dropout, resulting in lower retention rates than we might expect based on meta-analyses of GSH-CBT. All other definitions of retention that we tested yielded rates comparable with or higher than those in prior work (70–79%).

Nonetheless, we observed that with every “step” in the process of engaging individuals in GSH-CBT, there was a sizeable reduction in participants who remained engaged with the intervention. This has also been observed in studies of individual face-to-face psychotherapy (Krendl & Lorenzo-Luaces, 2021). These findings imply that in order to maximize the reach of GSH-CBT, we need to reduce as many of the barriers to its initiation as possible (e.g., through shorter screening times and more relaxed entry criteria). While human engagement generally facilitates desired outcomes and promotes adherence to self-help approaches like bibliotherapy or internet-based treatment (Cuijpers et al., 2019), requiring human engagement (e.g., as in reaching the welcome call for the current trial) may actually be a barrier for some individuals.

Over the 6-week study period, individuals who engaged in the DWM GSH-CBT reported very large decreases in internalizing distress, large increases in well-being, large improvements in functioning, and improved perceptions of the usability of the intervention. We also found that participants experienced large increases in cognitive reappraisal and medium decreases in expressive suppression. These findings continue to underscore the promise of GSH-CBT to address the burden of untreated psychopathology. For example, it may be beneficial to target dissemination of self-help CBT material at individuals currently on waiting lists for more traditional psychological services (Peipert et al., 2022a, b).

The level of attrition we reported underscores the need to optimize engagement in GSH-CBT. Given that human support has been the only replicable predictor of engagement in GSH-CBT (Bur et al., 2022; Cavanagh et al., 2010; Furukawa et al., 2021), it would behoove the field to explore ways of optimizing the human element of GSH-CBT. For example, in face-to-face CBT, greater session frequency (e.g., twice-weekly instead of once-weekly) improves outcomes (Bruijniks et al., 2020). It is possible that increasing the level of human contact in GSH-CBT may similarly improve outcomes or engagement. In our study, most dropout occurred after individuals qualified for the study, when they could not be reached for an onboarding call. The moment in which individuals seek psychological services may be a moment when they are most receptive to such services, and waiting time following that moment (e.g., the wait for the onboarding call) may represent an engagement obstacle (Krendl & Lorenzo-Luaces, 2021). One promising way to increase engagement, then, may be to make very brief interventions, such as single-session interventions (SSIs), immediately available to treatment seekers. SSIs have been widely studied for youth psychopathology (Schleider & Weisz, 2017), and emerging research supports their use in adults (Wasil et al., 2021a, b, c). A logical future direction is to study the combination of SSI and other forms of GSH-CBT in a “stepped care” fashion (Lorenzo-Luaces et al., 2017) to investigate whether immediate access to an SSI increases subsequent engagement with treatment (i.e., the GSH-CBT). Given individual differences in engagement and response to these interventions, it will also be worthwhile to explore predictors of engagement and response that could be used for the purposes of risk stratification (Lorenzo-Luaces et al., 2017, 2020; Lorenzo-Luaces et al., 2021b). Male gender seems to be a predictor of low engagement, as do lower education level and younger age (Karyotaki et al., 2015). Future larger-scale studies should explore a greater number of potential predictors (Kessler et al., 2017). Another option, then, may be to target GSH-CBT efforts at the individuals most likely to adhere to them, allocating treatment resources on the basis of expected engagement.

The major limitation of the current study is its lack of a control group; however, the changes we observed in the K6 and WHO-5 were of the same magnitude as in the large trial by Tol et al. (2020). An RCT is nonetheless warranted, because it is possible that the observed changes in symptoms and emotion regulation are attributable to the natural passage of time, placebo effects, or participant characteristics. Another limitation is that the sample consisted primarily of women. The formative work by the WHO regarding DWM also saw difficulty with recruiting men (Tol et al., 2020). Men are less likely to meet the criteria for internalizing distress than women, and large-scale studies suggest that even when men have internalizing distress, they are substantially less likely to seek treatment when compared to women (Rayner et al., 2021). In addition, non-Hispanic Black, Hispanic, Asian, and other racial–ethnic minority individuals were underrepresented relative to the general US population. Although these individuals were underrepresented in our sample, a meta-analysis by our group (De Jesús-Romero & Lorenzo-Luaces, 2002) suggests that the rates of representation of racial–ethnic diversity in our trial are consistent with those in other trials of digital interventions. Future work should explore how to increase the engagement of cisgender men and racial–ethnic minoritized individuals in GSH studies. Finally, we measured the acceptability of the intervention by using (1) self-rated usability, (2) self-rated satisfaction, and (3) compliance with the guidance. Thus, a limitation of the study is that we did not measure acceptability of the ACT-based content, nor did we systematically address adherence or engagement with the GSH-CBT material (e.g., reading comprehension, percent of time practicing mindfulness).

Several strengths of the study are worth discussing. First, we recruited individuals from all over the US to maximize generalizability. Additionally, we used very relaxed entry criteria to maximize the representativeness of our sample (Lorenzo-Luaces et al., 2018a, b). We measured a variety of outcomes including a transdiagnostic measure of mental health disorder symptoms, two features of emotion regulation, aspects of positive mental health such as well-being and functioning, and explicit assessments of participants’ perceptions of the usability of the intervention as well as their satisfaction with it. Almost all participants had moderate functional impairment, most met cutoff criteria for disorder screening in the various measures we used to characterize psychopathology, and about two thirds met the criterion for severe impairment on the WSAS. These results suggest we were able to recruit individuals with relatively severe clinical profiles.

Interestingly, during the intervention, participants reported more change in cognitive reappraisal than in expressive suppression. Cognitive reappraisal is a basic emotion regulation strategy that is the equivalent of cognitive restructuring in CBT (Lorenzo-Luaces et al., 2015, 2016). In most forms of CBT, restructuring is a recommended way of reducing cognitive distortions to improve mood but is not formally addressed by DWM or ACT-based interventions except via distancing or defusion exercises. Expressive suppression is analogous to the idea of experiential avoidance in ACT and antithetical to acceptance (Hofmann & Asmundson, 2008). Thus, it was unexpected that suppression changed less than reappraisal. It may be that a more direct measure of ACT-relevant processes would capture change better than the ERQ. However, in the larger DWM trial by the WHO, improvement in an ACT-specific measure in DWM versus treatment as usual was relatively modest (d = 0.42; Tol et al., 2020) and not maintained over a follow-up period (d = 0.09). One possibility is that DWM may facilitate symptom change through processes other than those implied by the ACT model. Process-outcome research clarifying the association between changes in emotion regulation and changes in symptoms is needed, including exploration of alternative processes that may explain changes in DWM such as normalization of distress, working alliance, behavioral activation, or other potential mechanisms.

The efficacy of GSH-CBT is well established relative to control conditions and, to some extent, relative to individual therapy (Cuijpers et al., 2019). The areas of uncertainty remaining in relation to GSH-CBT include its dissemination in real world settings, best staging practices for GSH-CBT relative to other treatments, identifying who engages with and responds to these interventions, and isolating mechanisms through which these interventions achieve their effects.