Introduction

Cognitive-behavior therapy (CBT) is the most researched and established psychotherapy for depression and other emotional disorders (Cuijpers et al., 2021; Hofmann et al., 2012). The growing awareness of the impact of common mental disorders across the world is now calling for psychotherapies that can be delivered at scale (Herrman et al., 2022). Various delivery methods, believed to be more efficient than the traditional face-to-face individual format, have been tried for CBT. Of these, self-help CBT appears the most promising (Cuijpers et al., 2019), and with the rapid popularization of the internet we are witnessing a growing number of internet CBT (iCBT) applications (Karyotaki et al., 2021).

While CBT now encompasses a range of cognitive and behavioral intervention skills (Furukawa et al., 2021), cognitive restructuring (CR) remains the central technique, deeply rooted in the cognitive model of emotional disorders (Beck et al., 1979). The theory posits that it is not the stressful situations themselves but rather their appraisal that leads to negative emotions. In CR, patients are encouraged to monitor the emotions and thoughts they experience in stressful situations, and are then prompted to challenge their initial, automatic and often dysfunctional thoughts. The prerequisite of CR is therefore the accuracy of the so-called thought record, in which patients record their emotions and thoughts: unless patients can accurately identify the thoughts that underlie their emotions, i.e. that directly lead to their emotions, challenging those thoughts will not easily lead to a reduction in their negative emotions. In traditional face-to-face CBT sessions, psychotherapists achieve such refinement of identified automatic thoughts through Socratic questioning such as the downward-arrow technique.

In self-help applications, however, the accuracy of the thought record is largely left to the acumen of the users, and no work-up of automatic thoughts has been possible. This difficulty may partly explain the recent findings from the component network meta-analysis of iCBT in which CR was not found to be particularly helpful, in contrast with more directive approaches such as behavioral activation or problem solving (Furukawa et al., 2021). In order to augment the efficacy of CR in iCBT, we need to find ways to help users of iCBT identify the underlying automatic thoughts that correspond with their experienced emotions.

Recent advances in natural language processing (NLP), mainly spurred by the emergence of Large Language Models (LLMs), may bring about a sea change in this regard. LLMs learn universal language representations from large volumes of text data using self-supervised learning and transfer this knowledge to more specific applications through fine-tuning (Subramanyam Kalyan et al., 2021; Wang et al., 2022). In this study we used thought records from two previous randomized controlled trials of iCBT and examined whether the Text-to-Text Transfer Transformer (T5), one of the most advanced LLMs, can help identify thought records in which users may be having difficulties in accurately identifying the automatic thoughts that underlie their emotions. First, we trained the T5 to predict the emotion associated with each automatic thought. Next, we tested the validity of these NLP-based predictions by comparing them with human experts' judgements. Finally, we tested the validity of our assumption that better matched thought-emotion records enable more effective CR by comparing the results of CR for matched vs non-matched thought records.

If the T5 can identify thoughts that may not correspond with the experienced emotions, the smartphone CBT app can issue prompts to reconsider or refine the thoughts recorded, which would then enable more effective cognitive restructuring and ultimately lead to a greater reduction in depression through the therapy.

Methods

Dataset

We used automatic thought records submitted by participants who had taken part in the two randomized controlled trials of smartphone CBT.

One is the FLATT (Fun to Learn to Act and Think through Technology) trial, which examined the effects of adding the smartphone CBT app called “Kokoro App” (“Kokoro” means the mind in Japanese) to pharmacotherapy, compared with pharmacotherapy alone, among 164 patients with treatment-resistant depression (Mantani et al., 2017). “Kokoro App” contains five active components for CBT, namely psychoeducation, self-monitoring, behavioral activation, cognitive restructuring and relapse prevention.

The other is the HCT (Healthy Campus Trial), which used the smartphone CBT app called “Resilience Training App” among 1626 university students, of whom 1093 scored five or higher on the Patient Health Questionnaire-9 (PHQ-9) at baseline (Sakata et al., 2022). “Resilience Training App” is an expanded version of “Kokoro App”, including components for structured problem solving and assertion training in addition to the original five. The HCT was a fully factorial trial to assess the specific efficacy of the five CBT components of self-monitoring, cognitive restructuring, behavioral activation, problem solving and assertion training. All the participants undertook psychoeducation and relapse prevention but were randomly assigned to the presence or absence of each of the five remaining components, and therefore to 2^5 = 32 different combinations.

Variables

In the self-monitoring and cognitive restructuring components, the participants learnt the cognitive model of human responses to stressful situations and filled in “mind maps,” a graphical version of automatic thought records covering situations, feelings, thoughts, body reactions and actions. The participants entered free text for situations, thoughts, body reactions and actions, chose one of the four basic feelings (sad, anxious, angry or happy), and rated its intensity on a 0–6 scale (Fig. 1a).
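To make the data structure concrete, a single mind map entry could be represented roughly as in the sketch below; this is only an illustration, and the field names are hypothetical rather than the apps' actual schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical sketch of one mind map entry; field names are illustrative only
# and do not reflect the actual schema of "Kokoro App" / "Resilience Training App".
@dataclass
class MindMap:
    situation: str                                        # free text
    thought: str                                          # free text (automatic thought)
    body_reaction: str                                     # free text
    action: str                                           # free text
    feeling: Literal["sad", "anxious", "angry", "happy"]  # one of four basic feelings
    intensity: int                                        # rated on a 0-6 scale

example = MindMap(
    situation="A friend did not reply to my message.",
    thought="They must be upset with me.",
    body_reaction="Tight chest",
    action="Kept checking my phone",
    feeling="anxious",
    intensity=4,
)
```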

Fig. 1 Screenshot of a mind map and a thought challenge

The participants could then use one or more of their mind maps to practice cognitive restructuring. The app provides four items to help the users challenge their automatic thoughts: “Fact glasses,” “% calculator,” “Friend’s call,” and “What now microphone” (Fig. 1b). “Fact glasses” asks the users to come up with alternative thoughts, grounded in reality, that do not match the automatic thoughts. “% calculator” asks the users how much they believe in the automatic thought (x%) and then asks what possibilities there are in the remaining (100 − x)%. “Friend’s call” asks the hypothetical question, “What advice would you give to your best friend if they told you the very same thought?” And “What now microphone” asks the users, “What would be the next best thing you can do if you assumed that the automatic thoughts were true?” The participants can use one or more of these items to develop alternative thoughts and re-evaluate how they would feel had they thought otherwise.

All the participants’ entries into mind maps and thought challenges were automatically uploaded to the remote server and were used in the current analyses.

Analyses—Prediction

We used the Japanese Text-to-Text Transfer Transformer (T5) model (https://huggingface.co/sonoisa/t5-base-japanese), pretrained on three Japanese corpora (ca. 100 GB altogether): Wikipedia (https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8), OSCAR (https://oscar-corpus.com/) and CC-100 (http://data.statmt.org/cc-100/), in Python (version 3.9.0).
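As an illustration, loading this pretrained checkpoint with the Hugging Face transformers library might look like the minimal sketch below; it shows only the model and tokenizer loading step, not the study's full training pipeline.

```python
# Minimal sketch (assumes the `transformers` and `sentencepiece` packages are installed).
# Loads the pretrained Japanese T5 checkpoint named in the text; this is not the study's
# full training pipeline, only the model/tokenizer loading step.
from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL_NAME = "sonoisa/t5-base-japanese"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
```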

We applied the T5 to examine the match between the self-reported feeling and its automatic thought. Assuming that the participants generally identify automatic thoughts correctly, we conducted threefold cross-validation for the model to predict the feelings from the automatic thoughts, using 67% of the sample to train (57%) and fine-tune (10%) the model and the remaining sample to test it. We then calculated the overall accuracy, precision, recall (sensitivity) and F1-score. Accuracy, precision, recall and F1-score are defined as follows in a 2 × 2 confusion matrix.

$$\text{Accuracy} = (\text{TN} + \text{TP}) / (\text{TN} + \text{FP} + \text{FN} + \text{TP})$$
$$\text{Precision} = \text{TP} / (\text{FP} + \text{TP})$$
$$\text{Recall} = \text{TP} / (\text{FN} + \text{TP})$$
$$\text{F1 score} = 2 \times \text{Precision} \times \text{Recall} / (\text{Precision} + \text{Recall})$$

                    Predicted feeling
                    Negative    Positive
Actual feeling
  Negative          TN          FP
  Positive          FN          TP

Precision quantifies what proportion of the predicted positive findings are truly positive, recall quantifies what proportion of the truly positive cases can be predicted, and the F1 score shows how well the model performs while balancing the two. All four indices range between 0 and 1, with values closer to 1 denoting better performance.
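For reference, these indices can be computed directly from predicted and actual labels, for example with scikit-learn; the sketch below uses toy binary labels rather than the study data.

```python
# Illustrative only: accuracy, precision, recall and F1 with scikit-learn
# on toy binary labels (1 = positive, 0 = negative); not the study's actual data.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TN + TP) / (TN + FP + FN + TP)
print("Precision:", precision_score(y_true, y_pred))  # TP / (FP + TP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (FN + TP)
print("F1 score :", f1_score(y_true, y_pred))         # 2 * P * R / (P + R)
```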

We excluded free-text entries of automatic thoughts that were very brief (five or fewer Japanese characters in length, which would roughly correspond to two or fewer words in English). We reasoned that it would be difficult for the T5 to predict the underlying feeling in such cases and that such automatic thoughts were unlikely to be faithful descriptions of the thoughts leading to the feeling. The target feelings were treated as multinomial categories.
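A rough sketch of this exclusion step and the threefold split is given below; the variable and column names are assumptions made for illustration and do not necessarily reflect the study's actual data layout or preprocessing code.

```python
# Illustrative sketch of excluding very short thought entries and forming threefold
# cross-validation splits; column names ("thought", "feeling") are assumptions made
# for this example and do not necessarily match the study's actual data.
import pandas as pd
from sklearn.model_selection import KFold

records = pd.DataFrame({
    "thought": ["どうせ失敗する", "無理", "誰も自分を必要としていない", "嬉しい知らせが来た"],
    "feeling": ["sad", "anxious", "sad", "happy"],
})

# Exclude automatic thoughts of five or fewer Japanese characters.
records = records[records["thought"].str.len() > 5].reset_index(drop=True)

# Threefold cross-validation: within each fold, the training portion would be further
# split into training and fine-tuning (validation) subsets, as described above.
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(records)):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```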

Analyses—Validation

We examined the validity of the model through two tests.

First, we randomly selected 40 thought-feeling pairs each from those in which the T5 predictions were correct and from those in which they were incorrect. We then asked three cognitive behavior therapists (two clinical psychologists (MS and MH) and one psychiatrist (TAF)) to independently guess the feeling behind each of these 80 automatic thoughts. We considered the feelings agreed upon by the independent raters as the gold standard judgements. We hypothesized that the participants’ self-reported feelings would be more concordant with the human therapists’ judgements as inferred from the automatic thoughts when they were correctly predicted by the T5 than when not.

Second, we hypothesized that the mind maps (thought records), for which the T5 model successfully matched the feeling and thought, would lead to greater reduction in sad, anxious or angry feelings through cognitive restructuring than those for which the T5 failed to match the feeling and the thought. If this hypothesis was confirmed, it would show that the mind maps in which the T5 failed to predict the feelings from the thoughts could benefit from re-assessing the thoughts behind the self-reported feelings.

Results

Participants and Data

The 164 participants in the FLATT trial, either in the immediate or in the delayed (waitlist) intervention condition, contributed 4369 mind maps. Of the 1626 participants in the HCT trial, 1134 were assigned to self-monitoring and/or cognitive restructuring components and contributed altogether 2813 mind maps. Table 1 presents the baseline characteristics of these participants. The participants of the FLATT trial were older, more severely depressed at baseline and submitted more mind maps per person than those of the HCT trial. The sex distributions were comparable.

Table 1 Baseline characteristics of the FLATT and HCT participants who had contributed one or more mind maps

Prediction

Table 2 presents the confusion matrix between the self-reported feelings and the feelings predicted by the threefold application of the T5 model to the feeling-thought pairs. Table 3 shows the precision, recall and F1-score for each feeling. The T5 was able to correctly predict the self-reported feeling based on the automatic thought in 5280 of the 7182 feeling-thought pairs (overall accuracy 73.5%) but failed to do so in 1902 (26.5%).

Table 2 Confusion matrix between the self-reported vs predicted feelings
Table 3 Precision, recall and F1-score

Validation

We first established the gold standard feeling as estimated from each automatic thought, based on the agreed-upon assessment among the three cognitive behavior therapists. The inter-rater reliability of the three independent raters’ assessments was a Fleiss’ kappa of 0.74 (95%CI 0.61 to 0.87) for the 40 randomly selected congruent thought-feeling pairs and 0.54 (0.37 to 0.70) for the 40 randomly selected incongruent pairs. We therefore considered the feeling agreed upon by at least two of the three raters as the gold standard.
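Agreement statistics of this kind can be computed, for example, with statsmodels; the sketch below uses made-up ratings from three raters over the four feeling categories, not the experts' actual judgements.

```python
# Illustrative Fleiss' kappa for three raters over four feeling categories,
# using made-up ratings (not the study's expert judgements).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = thought records, columns = raters; 0=sad, 1=anxious, 2=angry, 3=happy
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [0, 0, 1],
    [3, 3, 3],
    [1, 1, 1],
    [2, 2, 2],
])

table, _ = aggregate_raters(ratings)  # counts of each category per record
print(fleiss_kappa(table, method="fleiss"))
```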

The self-reported feelings matched these gold standards in 36 out of 40 cases when the T5 algorithm correctly predicted them, but only in 15 out of 40 cases when it could not (90% vs 37.5%, diff = 52.5%, 95%CI of diff = 34.8 to 70.2%, p < 0.001).
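The arithmetic behind this comparison corresponds to a standard two-proportion test; the sketch below reproduces the reported counts, but the normal-approximation confidence interval is an assumption for illustration and may not be the exact method used in the original analysis.

```python
# Two-proportion comparison of agreement with the gold standard (36/40 vs 15/40);
# the Wald-type confidence interval below is an illustrative assumption and may
# differ slightly from the interval reported in the text.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count, nobs = np.array([36, 15]), np.array([40, 40])
stat, p_value = proportions_ztest(count, nobs)

p1, p2 = count / nobs
diff = p1 - p2
se = np.sqrt(p1 * (1 - p1) / nobs[0] + p2 * (1 - p2) / nobs[1])
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"diff={diff:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), p={p_value:.4f}")
```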

We next examined the effects of cognitive restructuring on the automatic thoughts according to whether the T5 correctly predicted the feeling or not (Table 4). We limited these analyses to the three negative feelings of sad mood, anxiety and anger. All items tended to produce a greater reduction in negative feelings for the thought records in which the T5 successfully matched the self-reported feeling and the thought and, taken together, the differences were statistically significant (standardized mean difference 0.11, 95%CI 0.03 to 0.19).
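A standardized mean difference of this kind can be computed as a pooled-standard-deviation effect size; the sketch below uses toy reduction scores, not the trial data, and ignores any clustering of mind maps within participants.

```python
# Illustrative standardized mean difference (Cohen's d with pooled SD) between
# feeling reductions for T5-matched vs unmatched thought records; toy data only.
import numpy as np

matched_reduction = np.array([2, 1, 3, 2, 1, 2], dtype=float)    # toy values
unmatched_reduction = np.array([1, 1, 2, 0, 1, 2], dtype=float)  # toy values

n1, n2 = len(matched_reduction), len(unmatched_reduction)
m1, m2 = matched_reduction.mean(), unmatched_reduction.mean()
s1, s2 = matched_reduction.std(ddof=1), unmatched_reduction.std(ddof=1)

pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
smd = (m1 - m2) / pooled_sd
print(f"SMD = {smd:.2f}")
```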

Table 4 Reduction in negative emotions after cognitive restructuring

Discussion

We applied the T5, one of the most advanced transformer-based pretrained language models for NLP, to 7182 thought records (thought-feeling pairs) provided by people with major or subthreshold depression. In the threefold internal validation, the T5 was able to correctly predict the self-reported feeling based on the recorded thought in 73.5% of the thought records. Thought-feeling pairs which the T5 correctly matched showed greater agreement with the gold standard ratings by human cognitive-behavior therapists than those pairs which the T5 judged to be discordant. Moreover, when submitted to cognitive restructuring, the former thought-feeling records led to a greater reduction in negative emotions than the latter.

Artificial intelligence (AI) in general, and NLP in particular, are finding ever wider applications across society, in medicine, and in mental health (Wang et al., 2022). A recent review of the use of AI and NLP in mental health identified five major categories of usage: to extract clinical symptoms, to classify the severity of illnesses, to compare different therapies, to provide psychopathological clues, and to challenge the current nosography. For example, an NLP system may aim at detecting suicidal tendencies from electronic health records, at predicting their severity, at differentiating more vs less adequate deliveries of therapy, or at identifying new psychopathologies based on the use of language (Le Glaz et al., 2021). Our current study applying NLP to rate the quality of automatic thought records may belong to the third category, namely to distinguish different levels of appropriateness of the therapeutic process.

One of the earlier attempts to apply NLP to cognitive therapy was a system to classify dysfunctional thoughts into categories such as “all-or-nothing,” “negative predictions,” “discounting the positive,” etc. The researchers collected examples of dysfunctional thoughts from cognitive therapy textbooks and used their classification in supervised learning. The system seems to have been partially successful but not yet to a clinically applicable degree (Wiemer-Hastings et al., 2004). As NLP capabilities have progressed, more complex systems have been proposed. Burger et al. used NLP software to classify automatic thoughts provided by crowd-sourced volunteers in terms of their matching schemas as identified by human experts. Among the several models trialed, those based on recurrent neural networks (RNNs) appeared to perform the best, but the study did not provide any clinical applications (Burger et al., 2021). Kawakami et al. developed an NLP-based system to rate the quality of thought records. They trained an RNN to rate the five thought record components (event, thought, mood, behavior, physical symptoms) in accordance with experts’ judgments of their appropriateness. They achieved an accuracy between 0.79 and 0.84 for the five components in internal cross-validation. The authors are now using this system in their iCBT app to provide feedback to the users when their responses are rated as unlikely to be appropriate by the RNN (Kawakami et al., 2021).

Distinct from these previous studies, our system was able to demonstrate the discriminatory power of the NLP-based approach through external validation. Not only did we demonstrate satisfactory discrimination in the internal threefold cross-validation, but the mismatched thought-feeling pairs as judged by the NLP system also showed poorer agreement with the external expert ratings and smaller effects when submitted to cognitive restructuring than those for which the NLP system correctly predicted the thought-feeling match. These findings suggest that prompting the participants to update or revise the thoughts that underlie their feelings, when the NLP finds a mismatch between them, might lead to more accurate identification of automatic thoughts, enable more effective cognitive restructuring, and eventually lead to a greater reduction in depression after the treatment.

Some weaknesses of the current study include the following. First, while showing evidence of internal validity and some external validity, the current study remains essentially exploratory, and the ultimate proof of the value of the system awaits testing in a clinical trial. Second, NLP is evolving very rapidly. Although the T5 is currently one of the most advanced NLP models, newer and better models are being developed. The data to be fed into the NLP system are also accumulating as we offer the current CR module to users. New analyses using a different LLM and an even larger dataset of thought records may improve the performance of the system. However, the model developed in this study already appears promising and ready to be tested with real-world users.

We are currently developing a smartphone CR module which incorporates the AI system demonstrated in the current study. It will prompt app users to revisit their automatic thoughts when there is discordance between the self-reported feeling and the feeling predicted by the AI system from the automatic thought. We are planning to test this system in the platform trial of various CBT modules (Furukawa et al., 2023) and to see whether such advice can ultimately augment the effect of the iCBT.