1 Introduction

Psychological interventions necessitate a high degree of personalization, tailored to individual patients’ unique characteristics and needs [1]. Therapists most often rely on their clinical experience to adapt their approach. This adaptability, however, results in substantial variation in treatment approaches and outcomes, with therapists often unaware of the consequences of this variation for the individual patient. For example, therapists may underestimate patients’ negative experiences and fail to anticipate adverse outcomes, such as dropout from treatment programs [7, 12, 18]. This is undesirable at both the individual and the systemic level, calling for optimization of treatment personalization processes.

To facilitate treatment individualization, mental health systems have increasingly turned to routine measurements obtained through self-report questionnaires administered throughout the treatment process [9]. These questionnaires enable the tracking of patient progress and have been demonstrated to improve symptomatic outcomes when combined with feedback to therapists [1]. This approach produces vast amounts of patient-generated data, providing a valuable resource for developing data-driven decision support tools. Such tools can empirically identify patients at risk of negative outcomes, allowing therapists to intervene proactively and adjust their approach.

Machine learning (ML) algorithms present a promising avenue for leveraging the data derived from self-report questionnaires to create predictive models that enhance personalization and effectiveness in psychological treatments [4]. In a comparative study of 45 ML algorithms and ensembles, researchers aimed to predict dropout from cognitive-behavioural therapy before the first session in clinically labelled data. The top-performing ensemble model achieved an AUC (Area Under the Receiver Operating Characteristic Curve) of 0.6581, indicating modest accuracy [3]. The authors note that this result might not seem very precise, but that the best model significantly outperformed a generalized linear model (GLM). Even tools with limited accuracy may provide clinical value when therapists use them to involve patients collaboratively in modifying treatment strategies to mitigate the risk of adverse outcomes. However, it is key that both therapists and patients understand the reasoning behind these predictions to jointly derive benefits from such models. Consequently, the explainability and interpretability of models become crucial components when designing ML decision support tools for clinical use - an emerging field known as eXplainable AI (XAI) [2].

This study describes the development and validation of an ML model using granular data from patient self-reports to predict treatment outcomes in a historical dataset. Further, it examines the potential of ML-based predictive tools to enhance therapists’ capacity to personalise treatment strategies, decrease dropout rates, and improve patient outcomes. Results are presented using visualisations to demonstrate possible routes to increased explainability and interpretability.

2 Methods

2.1 Sample

The data used in this study was sourced from the digital measurement-based care provider, Mirah, Inc., as part of an international research project [15]. This anonymized data was collected during routine practice in various outpatient clinics across the United States. The sample consisted of patients receiving treatment between March 15th, 2016, and February 17th, 2022. The dataset encompasses a diverse group of patients, each seeking help for their unique mental health challenges.

Mirah routinely gathers anonymized data to drive improvements in their software and contribute to ongoing research. As the data was fully anonymized at the patient level, there was no need for written informed consent in this instance.

2.2 Instruments

Norse Feedback (NF) is a clinical feedback system developed at the District General Hospital of Førde, designed to facilitate personalized therapeutic interventions based on clinicians’ and patients’ needs [14, 16, 17]. The development of NF involved a rigorous process of item generation, testing, and refinement through clinical implementation studies, ensuring its relevance and effectiveness in clinical practice [8].

The NF system used in this study (version 1) comprises a maximum of 88 items (Appendix 1), which load onto 18 dimensions. Patients respond to the items using a seven-point Likert scale (Not at all true for me – True for me), allowing for nuanced feedback on their experiences and symptoms. The measurement provider did not use NF item 55 for the administrations comprising this dataset.

A unique feature of NF is its dynamic and adaptive nature, employing patient-adaptive computer logic to determine which dimensions are relevant to a particular patient. This is achieved by utilizing trigger items for each dimension; based on the patient’s responses, certain dimensions may be opened or closed, ensuring the assessment remains focused on the patient’s specific needs. This adaptive approach results in some dimensions being absent during certain administrations, as they are deemed less relevant to the patient’s current condition.
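
To make the gating concrete, the following is a conceptual sketch in R; the cut-off, item numbers, and dimension mapping are hypothetical and do not reproduce the actual NF trigger rules.

```r
# Conceptual sketch of trigger-item gating (cut-off, item ids, and
# dimension mapping are hypothetical, not the real NF logic)
open_dimensions <- function(responses, triggers, cutoff = 3) {
  # responses: named numeric vector of item scores on the 1-7 Likert scale
  # triggers:  named list mapping dimension name -> trigger item id
  opened <- vapply(triggers, function(item) {
    score <- unname(responses[item])  # NA if the item was not administered
    !is.na(score) && score >= cutoff
  }, logical(1))
  names(triggers)[opened]             # dimensions to open for this patient
}

responses <- c(N_16 = 5, N_30 = 1)              # hypothetical responses
triggers  <- list(SubstanceUse   = "N_16",      # hypothetical mapping
                  EatingProblems = "N_30")
open_dimensions(responses, triggers)            # returns "SubstanceUse"
```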

2.3 Description and Preparation of Data

The initial dataset consisted of n = 11,470 patients and k = 2,065 variables with a high degree of missingness. These variables included various process data and measures other than NF that only select patients had completed, making them unsuitable for inclusion in further analysis. Following a rigorous process of dataset cleaning and variable selection, the final dataset included k = 110 variables (88 NF items, 18 NF dimensions, recent life changes (binary), in-treatment status at the first assessment (binary), age, and gender) for n = 11,663 patients who were 18 years or older at the time of assessment. For the recent life changes variable, a change in one or more of the multiple-choice alternatives (employment status, ER visit, housing status, medications taken, and relationship status) resulted in the variable being coded as 1. The data was received unlabelled from the data provider and contained no clinical information about patient outcomes.

We applied a practical approach to labelling the outcome variables, estimated treatment length and dropout, in collaboration with clinical expertise. The dataset was randomly divided by subject IDs into a training dataset and a test dataset, with 20% of patients reserved for the test dataset. Patients missing more than 90% of observations across all variables for the first assessment were excluded, resulting in n = 8850 in the training dataset and n = 2245 in the test dataset. The estimated length of treatment was encoded as a count variable, represented by the number of unique assessments for each patient. A binary outcome variable, “dropout,” was encoded as 1 when the estimated length of treatment was less than three. Values for each of the NF dimensions were calculated as the average of the corresponding item scores, ignoring missing values. Dimensions with reversed scores (Alliance, Attachment, Connectedness, Resilience, and Social Role) were transformed so that low scores always represent the best outcome, to enhance explainability and visualization.
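
As an illustration, a minimal R sketch of this labelling pipeline, assuming a long-format data frame df with columns patient_id and assessment_id plus the NF item and dimension columns; the item subset, column names, random seed, and the 8 - x reversal for the seven-point scale are assumptions:

```r
# Sketch of outcome labelling and train/test split (assumed column names)
library(dplyr)

labelled <- df %>%
  group_by(patient_id) %>%
  mutate(est_treatment_length = n_distinct(assessment_id),  # count outcome
         dropout = as.integer(est_treatment_length < 3)) %>%
  ungroup()

# Dimension scores: mean of the corresponding items, ignoring missing values
alliance_items <- c("N_12", "N_13")                 # hypothetical item subset
labelled$Alliance <- rowMeans(labelled[alliance_items], na.rm = TRUE)

# Reverse-keyed dimensions rescored so low scores represent best outcomes
reversed <- c("Alliance", "Attachment", "Connectedness",
              "Resilience", "Social_Role")
labelled[reversed] <- 8 - labelled[reversed]        # assumes a 1-7 scale

# 80/20 split by subject ID
set.seed(42)                                        # assumed seed
ids      <- unique(labelled$patient_id)
test_ids <- sample(ids, size = round(0.2 * length(ids)))
test  <- filter(labelled, patient_id %in% test_ids)
train <- filter(labelled, !patient_id %in% test_ids)
```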

2.4 Machine Learning Algorithm and Missing Data Strategy

While variable selection helped reduce the overall missingness in our dataset, a considerable number of missing observations persisted. To address this issue, we opted for a machine learning (ML) algorithm that can effectively handle missing data. Among the various ML algorithms, Gradient Boosted Decision Trees (GBDT) have shown remarkable capabilities in minimizing errors by sequentially fitting models along the gradient of the loss [6]. One such GBDT algorithm, eXtreme Gradient Boosting (XGBoost), has consistently demonstrated superior performance in numerous problems involving tabular data [5].

In the study by Bennemann et al., GBDT algorithms ranked among the top-performing single models when comparing 45 ML models and ensembles for dropout prediction [3]. The XGBoost algorithm offers several advantages, including consistent performance, high speed, built-in regularization, interpretability, and efficient handling of high-dimensional data, which reduces the need for feature selection. Additionally, XGBoost incorporates an in-built mechanism for handling missing data, making it particularly suitable for our dataset. Another important feature of GBDTs relevant to our data is their handling of correlated variables. As we use both the NF items and the NF dimensions in the dataset, there will be multicollinearity between variables. GBDTs are particularly robust to multicollinearity due to the decision tree design [22].

However, the optimal performance of the XGBoost algorithm requires the tuning of numerous hyperparameters, which can be computationally demanding. In order to strike a balance between the model’s performance and computational efficiency, we employed a systematic hyperparameter optimization strategy [21].

To enhance the explainability of the XGBoost models, we incorporated the SHapley Additive exPlanations (SHAP) method for variable impact analyses in the final models and for visualization of the ML model prediction process [11]. SHAP values offer a unified measure of feature importance grounded in cooperative game theory, providing a consistent and locally accurate interpretation of the predictions made by complex ML models. By implementing SHAP values in our analysis, we aimed to provide insights into the variables that contribute most to predicting dropout risk and symptomatic outcomes, thereby facilitating interpretability, a deeper understanding of the factors influencing treatment personalization, and informed clinical decision-making.
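
For reference, the Shapley attribution underlying SHAP, as defined in [11]: for a model $f$ and feature set $F$, the contribution of feature $i$ is the weighted average of its marginal contributions over all feature subsets,

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S\big(x_S\big) \right],$$

and a prediction decomposes additively as $f(x) = \phi_0 + \sum_i \phi_i$, where $\phi_0$ is the expected model output. This additivity is what permits both the global (average impact) and local (single patient) explanations presented in the Results.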

2.5 Development and Validation of Prediction Models

Defining Clinical Objectives.

In collaboration with clinical psychologists at the District General Hospital of Førde, we defined four prediction tasks using data from the first patient assessment:

  • Risk of dropout

  • Length of treatment

  • Probability of completing the predicted treatment length

  • Outcomes on the NF clinical dimensions, given a completed treatment length

Software.

All data analyses were performed using R [19] in the RStudio software for Windows [20]. Individual packages used for data analysis are described in the following sections.

Model Development and Baseline Comparisons.

For each prediction task, we established baseline models and trained an ML model. Hyperparameter optimization and cross-validation were performed using the caret package [10], and the final models were trained using the xgboost package [5]. The performance of the ML models was assessed in the out-of-sample test dataset using appropriate evaluation metrics and compared with the baseline model.
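
To make the pipeline concrete, a minimal sketch for a binary task is shown below; X_train, X_test, and y_train are assumed names for the predictor matrices (with NAs retained) and 0/1 labels, the grid shown is illustrative (the values actually searched are listed in Table 1), and the factor labels are placeholders.

```r
# Sketch of the shared tuning/training pipeline for the binary tasks
library(caret)
library(xgboost)

# caret requires valid factor level names when class probabilities are used
y_train_factor <- factor(y_train, levels = c(0, 1),
                         labels = c("retained", "dropout"))

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

grid <- expand.grid(nrounds = c(100, 300), eta = c(0.05, 0.1),
                    max_depth = c(3, 6), gamma = 0, min_child_weight = 1,
                    colsample_bytree = 0.8, subsample = 0.8)   # illustrative

fit <- train(x = X_train, y = y_train_factor, method = "xgbTree",
             trControl = ctrl, tuneGrid = grid, metric = "ROC")

# Refit the selected configuration with xgboost directly; the tree learner
# routes missing values natively, and scale_pos_weight applies the class
# weighting described under the optimization strategy below
best <- fit$bestTune
bst <- xgboost(data = as.matrix(X_train), label = y_train,
               nrounds = best$nrounds, eta = best$eta,
               max_depth = best$max_depth, gamma = best$gamma,
               min_child_weight = best$min_child_weight,
               colsample_bytree = best$colsample_bytree,
               subsample = best$subsample,
               objective = "binary:logistic", eval_metric = "error",
               scale_pos_weight = sum(y_train == 0) / sum(y_train == 1),
               verbose = 0)

pred <- predict(bst, as.matrix(X_test))   # predicted dropout probabilities
```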

Task 1: Dropout Risk.

We calculated the baseline probability of dropout using the overall dropout rate for the full training dataset. The baseline model was validated using a Monte Carlo simulation with 1,000 repetitions on the test dataset. An XGBoost model was trained and validated, with performance compared to the baseline model using the area under the receiver operating characteristic curve (AUC), Positive Predictive Value (PPV) and Negative Predictive Value (NPV).
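
A sketch of how such a Monte Carlo baseline can be implemented (our assumed reading of the procedure: labels are drawn from the training base rate and metrics are averaged over 1,000 repetitions; the pROC package is an assumed dependency):

```r
# Monte Carlo baseline for dropout: draw test-set labels from the training
# base rate, repeat 1,000 times, and average AUC/PPV/NPV
library(pROC)

p_base <- mean(train$dropout)             # overall training dropout rate
set.seed(1)
metrics <- replicate(1000, {
  pred <- rbinom(nrow(test), 1, p_base)   # random predictions at base rate
  tp <- sum(pred == 1 & test$dropout == 1)
  fp <- sum(pred == 1 & test$dropout == 0)
  tn <- sum(pred == 0 & test$dropout == 0)
  fn <- sum(pred == 0 & test$dropout == 1)
  c(auc = as.numeric(auc(roc(test$dropout, pred, quiet = TRUE))),
    ppv = tp / (tp + fp),
    npv = tn / (tn + fn))
})
rowMeans(metrics)                         # baseline AUC, PPV, NPV
```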

Task 2: Length of Treatment.

The training dataset was filtered for patients not labelled as dropouts. Outliers over the 95th percentile on the estimated treatment length variable were removed. The mean estimated treatment length in the resulting training dataset was used as the baseline prediction. An XGBoost model was trained and validated using the root mean squared error (RMSE).
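
A brief sketch of the Task 2 filtering and baseline, under the assumed column names from the earlier sketches (the XGBoost counterpart is a regression model trained with objective "reg:squarederror"):

```r
# Task 2: remove dropouts and treatment-length outliers, then use the
# training mean as the baseline prediction
completers <- subset(train, dropout == 0)
cutoff     <- quantile(completers$est_treatment_length, 0.95)
completers <- subset(completers, est_treatment_length <= cutoff)

baseline_pred <- mean(completers$est_treatment_length)   # 6.93 in our data

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test$est_treatment_length, baseline_pred)           # baseline RMSE
```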

Task 3: Probability of Completing Predicted Treatment Course.

Patients were deemed to have completed a treatment course if the estimated treatment length was equal to or surpassed the predicted treatment length minus one. Those with eight or more assessments were considered to have completed the series. Predicted treatment lengths were adjusted to fall within a clinically relevant range, set to five for predictions under five and capped at 12 for predictions over 12. The baseline model employed the probability of treatment completion for patients with three or more assessments from the full training set. For baseline model validation on the test dataset, we utilised a Monte Carlo simulation. We compared the baseline model to the performance of the trained and validated XGBoost model using AUC, PPV and NPV.
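
The completion label and the clamping of predicted lengths can be expressed compactly; rounding the continuous length predictions to whole assessments is our assumption:

```r
# Clamp predicted treatment lengths to the clinically relevant 5-12 range
pred_len <- pmin(pmax(round(pred_len_raw), 5), 12)

# Completed = reached predicted length minus one, or eight or more assessments
completed <- as.integer(
  est_treatment_length >= pred_len - 1 |
  est_treatment_length >= 8
)
```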

Task 4: NF Clinical Dimensions Outcomes.

For this task, we excluded NF items from model training due to the high computational demand of a large number of variables. Consequently, after removing the 88 NF items, the training and test datasets contained 22 variables. For the baseline model, we forecast the average outcome for all patients in the training set undergoing a specific treatment length. For example, to predict the Attachment outcome for a patient with a predicted treatment length of eight sessions, we utilised the mean Attachment outcome for all patients with an estimated treatment length of eight. This resulted in 108 mean predictions for all combinations of predicted treatment length (six bins; data for treatment lengths 8/9 and 10/11 were grouped to ensure adequate training data) and outcome variables (18 dimensions). We then trained 108 XGBoost models on the training set, one for each combination of outcome and predicted treatment length (a sketch follows below). A key aspect of the NF system is its ability to adapt to patient needs, meaning dimensions less relevant for a patient may be closed. To account for this, outcome predictions for dimensions with a first assessment score of 1 were set to 1, and those with a score below 1.5 were set to 1.5, for both models. We assessed model performance using RMSE.
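
A condensed sketch of the per-dimension, per-length training loop; the bin boundaries, dimension subset, outcome column naming, and hyperparameters shown are assumptions:

```r
# Task 4: one regression model per (dimension, treatment-length bin) pair
library(xgboost)

dims <- c("Alliance", "Attachment", "Resilience")  # subset; 18 dims in total
bins <- list(`5` = 5, `6` = 6, `7` = 7, `8_9` = 8:9,
             `10_11` = 10:11, `12` = 12:148)       # lengths > 12 binned as 12

models <- list()
for (d in dims) {
  for (b in names(bins)) {
    sub <- train[train$est_treatment_length %in% bins[[b]], ]
    dtr <- xgb.DMatrix(as.matrix(sub[, predictor_cols]),
                       label = sub[[paste0(d, "_outcome")]])
    models[[paste(d, b, sep = "_")]] <-
      xgboost(data = dtr, nrounds = 100, eta = 0.1, max_depth = 3,
              objective = "reg:squarederror", verbose = 0)
  }
}

# Adaptive-closure rule: fix predictions for dimensions closed at intake
adjust_pred <- function(pred, first_score) {
  ifelse(first_score == 1, 1,
         ifelse(first_score < 1.5, 1.5, pred))
}
```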

Model and Hyperparameter Optimization Strategy.

For this project we tuned the following XGBoost hyperparameters for optimal performance [23]:

  • eta – step size shrinkage applied in the boosting process to shrink feature weights and avoid overfitting (the learning rate; more conservative with lower values)

  • max_depth – maximum depth of a tree

  • min_child_weight – minimum number of instances required in a node to allow a further partition (more conservative with higher values)

  • colsample_bytree – subsample ratio of dataset columns used when constructing each tree (more conservative with lower values)

  • subsample – subsample ratio of the training instances used when constructing each tree (more conservative with lower values)

  • gamma – minimum loss reduction required to make a further partition on a leaf node of the tree (more conservative with higher values)

For tasks 1 and 3, we optimized XGBoost hyperparameters in the training dataset using 10-fold cross-validation through a grid search, conducted in a stepwise fashion to minimize computational expense. Each step tested all combinations of a set of parameters to identify optimal values, as the sketch below illustrates. We adjusted the parameters in the following order: (1) eta and max_depth, (2) min_child_weight, (3) colsample_bytree and subsample, (4) gamma, (5) eta, and (6) nrounds. For task 2, we began with the parameters resulting from tuning task 1 and adjusted them manually for optimal performance. For task 4, we adopted a pragmatic approach, selecting a single set of hyperparameters for all models. We applied scaled weighting to binary outcomes during model training to adjust for class imbalances. The loss functions for optimization were the defaults for each prediction task: error for tasks 1 and 3 and RMSE for tasks 2 and 4.
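
A minimal sketch of the stepwise scheme; the grids shown are illustrative (the values actually searched are provided in Table 1), and ctrl, X_train, and y_train_factor are as in the earlier sketch:

```r
# Stepwise grid search: each step re-searches one parameter subset while
# the remaining parameters are held at their current best values
steps <- list(
  expand.grid(eta = c(0.01, 0.05, 0.1, 0.3), max_depth = c(2, 4, 6, 8)),
  expand.grid(min_child_weight = c(1, 3, 5)),
  expand.grid(colsample_bytree = c(0.6, 0.8, 1), subsample = c(0.6, 0.8, 1)),
  expand.grid(gamma = c(0, 0.1, 0.5, 1)),
  expand.grid(eta = c(0.01, 0.025, 0.05, 0.1)),
  expand.grid(nrounds = c(100, 300, 500, 1000))
)

# starting values (illustrative)
tuned <- list(nrounds = 300, eta = 0.1, max_depth = 4, gamma = 0,
              colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.8)

for (g in steps) {
  # caret needs a full grid, so fixed parameters enter as single values
  grid <- do.call(expand.grid, modifyList(tuned, as.list(g)))
  fit  <- train(x = X_train, y = y_train_factor, method = "xgbTree",
                trControl = ctrl, tuneGrid = grid, metric = "ROC")
  tuned <- modifyList(tuned, as.list(fit$bestTune))
}
```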

Model Explainability.

We used the shapviz package [13] to calculate SHAP values for the trained models. Both global and local explanations are presented, with global explanations assessing the average impact of variables on predictions, and local explanations gauging the influence of the variable values for a single patient on a particular outcome prediction.
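
A minimal sketch of the shapviz workflow, using the dropout model and test matrix from the earlier sketches:

```r
# SHAP values and global/local explanation plots for the dropout model
library(shapviz)

sv <- shapviz(bst, X_pred = as.matrix(X_test))

sv_importance(sv)                     # global: mean |SHAP| per variable
sv_importance(sv, kind = "beeswarm")  # global: impact of variable values
sv_waterfall(sv, row_id = 1)          # local: one patient's prediction
```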

3 Results

The modelling strategy resulted in two datasets used in the analyses; a summary of the sample characteristics for these datasets is provided in Table 2. Notably, about half of the patients were already in treatment when they responded to their first assessment.

Table 1. Values for hyperparameters provided to grid search.
Table 2. Dataset characteristics

Description of Training Data.

The dataset exhibited a high rate of attrition, with 51% of patients in the training data completing fewer than three assessments. The estimated length of treatment within the training dataset demonstrated a widely dispersed distribution, with values ranging from a minimum of 1 to a maximum of 148. The amount of data missing from the training set was substantial. Despite a reduction in overall missing data following data cleaning, missingness persisted, especially among certain variables, some of which displayed over 90% missingness. Some of this missingness is due to the adaptive nature of the NF. For instance, the Alliance scale is not administered at the first assessment, so only patients already in treatment responded to these items.

Fig. 1. Training dataset characteristics: (a) distribution of estimated treatment length, (b) proportion of missing data for Norse Feedback (NF) items (items with more than 90% missingness: N_49, N_56, N_86, N_54, N_84, N_10, N_21, N_25), (c) proportion of missing data for NF dimensions.

Task 1: Dropout Risk.

The predictive accuracy of the XGBoost model in the test dataset was modest (AUC 0.624, PPV 0.612, NPV 0.558), but outperformed the baseline model (AUC 0.505, PPV 0.522, NPV 0.478). The ROC curve for this model is presented in Fig. 4 (a). The three most important variables in the model were Norse Item 12 (N_12) – “I feel that my therapist understands me and understands why I am in treatment now”, the Resilience dimension, and Norse Item 16 (N_16) – “People have told me they are worried about my drinking and/or drug use”. Refer to Appendix 1 for a full description of Norse items and dimensions.

Fig. 2. Task 1 - Dropout prediction: (a) overall 10 most important variables for predictions and (b) impact of variable values on the global model. High values (dot colour) indicate patients’ endorsement of the problem or item statement; low values represent absence of problems or disagreement with the item statement. Missing values are visualised as grey dots. Negative SHAP values indicate decreased predicted risk of dropout; positive SHAP values indicate increased predicted risk.

Fig. 3. Task 2 – Treatment length prediction: (a) overall 10 most important variables for predictions, (b) impact of variable values on the global model, and (c) distribution of predictions and test set outcomes. Only outcomes < 30 assessments are included.

Task 2: Length of Treatment.

After filtering for dropout (n = 4507) and estimated treatment length outliers (n = 229; identified at the 95th percentile after dropout removal), 4114 patients remained in the training dataset for this task. The XGBoost model's predictive accuracy was modest (RMSE = 7.731), only marginally outperforming the baseline model (RMSE = 7.88), which predicted the mean estimated length of treatment (6.93 assessments). The three most important variables in the model were recent changes (C_1), Norse Item 12 (N_12), and Norse Item 13 (N_13) – “I feel that my therapist accepts me as a person”. Figure 3 (c) presents the distribution of the XGBoost treatment length predictions alongside the distribution of estimated treatment length outcomes from the test set.

Task 3: Probability of Completing Predicted Treatment Course.

Data attrition prevailed after the first two sessions; only 27.5% of patients completed five or more assessments. The probability of completing the predicted treatment series in the training data was 0.164. The predictive accuracy of the XGBoost model in the test dataset was modest (AUC 0.655, PPV 0.238, NPV 0.879), but outperformed the baseline model (AUC 0.5, PPV 0.16, NPV 0.841). The most important variable in the model was the predicted treatment length (xgb_pred). The other variables were considerably less influential.

Fig. 4. (a) ROC curve for the XGBoost dropout prediction model, (b) ROC curve for the XGBoost completing-treatment-series prediction model.

Fig. 5. Task 3 – Probability of completing predicted treatment course: (a) overall 10 most important variables for predictions and (b) impact of variable values on the global model.

Task 4: Outcomes on NF Clinical Dimensions.

Each dimension and treatment length necessitated a unique training dataset for this prediction task, resulting in 108 training datasets (Appendix 2). Table 3 shows the average treatment outcomes for all permutations of dimensions and estimated treatment lengths, used as the baseline predictions. The average RMSE of all 108 XGBoost models was 1.23 (95% CI 1.15–1.31), significantly outperforming the baseline models’ average RMSE of 1.391 (95% CI 1.30–1.49; p < 0.05, Welch two-sample t-test). Appendix 2 provides detailed results of the validation of predictions in the test set. Overall, most outcomes did not appear to vary systematically with increasing treatment length: patients who completed different numbers of sessions had similar outcomes on most NF dimensions.

Table 3. Mean outcomes (baseline predictions) for NF dimensions for patients with estimated treatment lengths of 5–12 in the training dataset. Patients with treatment lengths of 8 and 9, and 10 and 11, were binned. Estimated treatment lengths > 12 were binned as 12.
Table 4. Hyperparameters used for the final XGBoost models

Hyperparameters.

The hyperparameters resulting from the grid search with 10-fold cross-validation varied only slightly between the models (Table 4).

4 Discussion

We have delineated a methodology leveraging patient self-reported data alongside ML techniques to predict psychotherapy outcomes. Our aim was to elucidate how patients’ responses influence model predictions, employing methods drawn from XAI. Although the ML models derived from our four designated prediction tasks outperformed baseline predictions, the overall performance remained modest. While Bennemann et al. (2022) examined various ML algorithms and ensembles to predict patient dropout, emphasising maximising predictive performance, our focus has been on establishing methodologies that facilitate clinical implementation of predictive models. Nevertheless, our findings align with those of Bennemann et al., who, after training 21 single-algorithm ML models on 77 variables to predict dropout, reported AUCs (Area Under the Receiver Operating Characteristic Curve) ranging from 0.52 to 0.653, and for algorithm ensembles (using various stacking and variable selection methods) from 0.547 to 0.658. With a dropout prediction AUC of 0.624, our model outperformed 12 of the 21 single models and 13 of the 30 ensembles, achieving this with a more parsimonious dataset and modelling methodology. In contrast to Bennemann et al., who required patients to respond to more than 430 items to construct the model variables, our model requires patients to respond to at most 92 items. For most patients the number of items was lower due to the patient-adaptive nature of the data collection. This suggests that the NF assessment alone could suffice for predictive models with reduced patient burden.

For successful clinical implementation, users must perceive a benefit from the new technology. Consequently, we have prioritised explanations that facilitate interpretation over optimising prediction performance. For therapists and patients, identifying areas where self-reports coupled with ML reveal a risk of poor outcomes can prompt discussions about mutual goals, thereby strengthening the patient-therapist alliance. In our data, the ML models highlighted NF items concerning patient-therapist relations as crucial predictors.

The four prediction tasks outlined in our study could be instrumental in reducing patient attrition from treatment and enhancing patient outcomes. Our results do not provide therapists with definitive answers about patients at risk of poor outcomes but can enlighten therapists on areas needing particular focus to retain the patient in treatment.

Handling missing data was a critical component of our process, given the adaptive nature of the data-collection technology. Traditional approaches to missing data, such as eliminating incomplete cases or imputing missing values, were untenable, thereby restricting our choice of ML algorithms. Gradient boosted decision tree algorithms are well equipped to handle this issue, with their inherent capacity for managing missing and correlated data. Our findings reveal that missing data, a byproduct of patient-adaptive data collection, carry predictive value.

Our study is not without limitations. Most notably, the lack of clinical labelling of the dataset necessitated using the information contained in the dataset to label outcomes such as dropout. This knowledge gap regarding reasons for patient dropout may have led to mislabelling patients as dropouts when they, in reality, continued treatment but ceased completing further assessments. Future data collection will ideally include this information, potentially enhancing predictive performance when training new ML models. Moreover, we have used only data from the first assessment for predictions. Including data from additional assessments and session-to-session changes might improve predictions, providing avenues for future research. Lastly, although we have concentrated extensively on using techniques and visualisations to boost explainability and interpretability, we have not yet obtained end-user feedback. We anticipate that further enhancements in presenting predictive model results can be realised via an iterative process involving end-users.

5 Conclusion

Our study indicates that Machine Learning (ML) models, when applied to self-reported patient data, can assist in predicting clinical outcomes with improved predictive performance over baseline models. Additionally, our findings highlight that a meticulously designed, patient-adaptive data collection method can minimise the required number of item responses, alleviating the burden on patients without compromising the efficacy of predictive tasks. We have also demonstrated how such predictive model results can be visualised in a clinical context by implementing the principles of eXplainable AI (XAI). For successful clinical implementation of predictive models, it is imperative to provide end-users with a clear and understandable pathway from input data to recommendations, to foster trust and understandability. XAI provides a strategy to achieve this that will be essential for future work.

This is the first study to use Norse Feedback (NF) patient-reported data for predictive modelling. The implications of this study are noteworthy: the NF assessment, capable of integration into all levels of mental healthcare, could undergo further refinement to incorporate ML predictions, thus supporting therapists in making clinical decisions. However, embedding ML outputs in a clinical scenario will necessitate continued efforts to enhance predictive accuracy and to refine the presentation of results, aimed at improving end-user understanding and interpretation.