Plain English summary

For a patient with advanced cancer, the decisions made together with the medical team about the next steps in cancer care can significantly affect quality of life and the end-of-life care experience. End-of-life care discussions can be hard to initiate because it is difficult to accurately estimate when a person is nearing the end of their life. Currently, there are no data-driven, patient-centered tools to guide end-of-life decision-making. Recognizing that this type of prognostic information may be valuable to patients and medical teams, we studied an innovative approach to predicting life expectancy: we trained machine learning models to identify patients at high risk of short-term mortality using patient reports of symptom burden collected with validated questionnaires. Our machine learning model was able to reliably predict an individual patient's risk of death within the next 180 days.

Introduction

For patients with advanced, relapsed, or refractory cancers, maintaining quality of life emerges as a priority when faced with difficult treatment decisions about supportive care or consideration of clinical trials [1, 2]. While awareness of life expectancy can help inform discussions with clinical teams and individual decision-making for patients and families, prognostication in the clinical setting remains a challenge, with no standardized approach that is personalized to an individual patient with advanced cancer. Current paradigms of prognostication based on clinical experience remain suboptimal [3,4,5]. Physicians rarely provide prognostic information, and when they do, the estimates are often inaccurate [4]. In a study of 343 doctors' prognostic accuracy in survival estimation for 468 patients with advanced disease, around 80% of predictions were overestimates or underestimates [6].

Previous studies of existing tools demonstrate limited value in providing oncology clinicians with accurate identification of patients at risk of short-term mortality [6, 7]. This prognostic uncertainty, coupled with the tendency to overestimate life expectancy caused by systematically optimistic error [6], strains patient-physician communication regarding life-expectancy estimates [8, 9]. Some existing prognostication tools (e.g., the Nottingham Prognostic Index for breast cancer, the Lung Cancer Prognostic Index) are cancer-type-specific diagnostic aids and are not suitable for all patients with cancer [10, 11].

Accurate prognostication for patients with advanced cancer can help inform personal decision-making and align expectations and goals of care at the end of life. This clinical reality establishes an urgent need for safe and reliable mortality prediction to facilitate timely discussions, particularly on advance care planning [12]. Machine learning (ML) techniques have shown encouraging results for mortality prediction in general oncology populations, but previous studies have tended not to incorporate patient-reported data [12,13,14,15,16,17]. For example, an artificial neural network predicted the 5-year survival of 125 non-small cell lung cancer patients with 87% accuracy [13]. Using electronic health record data, one ML model successfully distinguished levels of risk for 60-, 90-, and 180-day mortality among patients with cancer, with areas under the curve ranging from 0.66 to 0.83 [12].

Though previous studies have shown high accuracy in mortality prediction using electronic health record (EHR) data [12, 15,16,17,18,19,20], many such algorithms have struggled to achieve high sensitivity when identifying rare events and, based on their reported low sensitivity, could not reliably identify the patients who actually died [16, 21, 22].

However, these algorithms, with their unique ability to model complex nonlinear relationships between variables [20], might have been expected to perform better at identifying patients at risk of short-term mortality. One possible limiting factor is the available signal within the datasets used, specifically the EHR, which serves as a log of procedures, encounters, and test results rather than a comprehensive picture of a patient's health.

Patient-reported outcome (PRO) measures, in the form of standardized validated questionnaires, offer the ability to routinely assess a patient's own perception of their health, functioning, and quality of life in a time- and cost-effective way. Though often not routinely collected and stored in the EHR, PRO measures can generate flexible and actionable data to inform decision-making at the point of care as well as statistical and quality improvement investigations [23].

We hypothesized that by integrating patients' own symptom and health reports, an ML model could be identified for further prospective testing, ultimately providing key prognostic information to frontline clinical oncology teams, particularly in the context of future treatment planning and end of life. Findings from a survey evaluating the feasibility of ML model-derived mortality predictions in eliciting end-of-life conversations suggest that oncologists reasonably agree that advance care planning conversations are appropriate for patients classified by ML models as at the highest risk of death [20]. We hope, similarly, that the data-driven, shared decision-making approach created by applying PRO data to an ML model can help inform the decision-making process, facilitate timely end-of-life discussions, and tailor personalized treatments to patients' goals and values. The objective of this study was therefore to develop and test the performance of different ML algorithms, using PRO data on symptom severity, for predicting mortality within 180 days in patients with advanced cancer.

Methods

Study sample

We queried a historical database of patients receiving care at our institution from February 2009 to February 2020 to identify a randomly selected cohort of patients with advanced cancers seen in the outpatient supportive care clinic. These dates reflect the period during which standardized PRO data were collected as part of routine care on this service. This cohort was used for overall algorithm development and testing. The outcome variable was binary: mortality within 180 days following the clinical visit.
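
For illustration, a minimal R sketch of how such a binary label could be derived, assuming hypothetical `visit_date` and `death_date` columns in a data frame `dat` (this is our illustration, not the study code):

```r
# Hypothetical sketch: derive the binary 180-day mortality label.
# `death_date` is NA for patients still alive at last follow-up.
days_to_death <- as.numeric(as.Date(dat$death_date) - as.Date(dat$visit_date))
dat$mortality_180d <- factor(
  ifelse(!is.na(days_to_death) & days_to_death <= 180, "dead", "alive"),
  levels = c("alive", "dead")
)
```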

Measurements

The Edmonton Symptom Assessment System (ESAS) is an established, validated, and reliable PRO measure for assessing symptom severity experienced over the past 24 h by patients with advanced cancer [24, 25]. The ESAS has good "global" internal consistency (α = 0.93) [26]. The 12-item ESAS-FS includes 10 core symptoms (pain, fatigue, nausea, drowsiness, appetite, depression, anxiety, shortness of breath, wellbeing, sleep problems) as well as financial distress and spiritual pain (see Appendix A) [27]. Patients report the average severity of each symptom over the previous 24 h on a scale of 0 (not present) to 10 (worst). ESAS scores of 0, 1–3, 4–6, and 7–10 denote symptom levels of none, mild, moderate, and severe, respectively [28]. ESAS scores greater than 4 indicate clinically significant symptom severity [29].
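
As an illustration, this severity banding maps directly onto R's `cut()` (a sketch, not the study code):

```r
# Map a 0-10 ESAS item score to severity categories:
# 0 = none, 1-3 = mild, 4-6 = moderate, 7-10 = severe.
esas_severity <- function(score) {
  cut(score, breaks = c(-Inf, 0, 3, 6, 10),
      labels = c("none", "mild", "moderate", "severe"))
}
esas_severity(c(0, 2, 5, 8))  # none, mild, moderate, severe
```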

Three composite ESAS scores were calculated for our cohort [30]: (1) the Psychosocial Distress Score (PSS), a measure of psychosocial symptom burden, computed as the sum of the ESAS anxiety and depression scores; (2) the Physical Symptom Score (PHS), a measure of physical symptom burden, computed as the sum of the physical ESAS symptoms (pain, fatigue, nausea, drowsiness, shortness of breath, appetite, wellbeing, sleep); and (3) the Global Distress Score (GDS), a measure of total symptom burden, computed as the sum of the first 10 ESAS symptoms, excluding financial distress and spiritual pain.
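
A brief R sketch of these composite scores, assuming one column per symptom in the hypothetical data frame `dat`:

```r
# Composite ESAS scores as defined above (column names are assumptions).
physical <- c("pain", "fatigue", "nausea", "drowsiness",
              "shortness_of_breath", "appetite", "wellbeing", "sleep")

dat$PSS <- dat$anxiety + dat$depression                           # psychosocial distress
dat$PHS <- rowSums(dat[, physical])                               # physical symptom burden
dat$GDS <- rowSums(dat[, c(physical, "anxiety", "depression")])   # global distress (10 core items)
```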

Statistical analysis

We collected demographic, clinical, and PRO data to train and evaluate the models. A brief description of each candidate predictor is given in Table 1. Race was purposefully omitted to prevent racially biased algorithms that may disadvantage minority groups [31]. All candidate predictors had low missingness, ranging from 0 to 2.7%. We used mean and mode imputation for numerical and categorical predictors, respectively. Algorithm development, selection of performance evaluation metrics, and reporting of the multivariable predictive models strictly followed recent guidelines designed specifically for such models [32, 33].
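
A minimal sketch of mean/mode imputation in base R (our illustration, not the authors' code):

```r
# Mean imputation for numeric columns, mode imputation for the rest.
impute_mean <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))  # most frequent observed level
  x[is.na(x)] <- mode_val
  x
}
num <- vapply(dat, is.numeric, logical(1))
dat[num]  <- lapply(dat[num],  impute_mean)
dat[!num] <- lapply(dat[!num], impute_mode)
```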

Table 1 Candidate variables for algorithm development

We randomly partitioned the data into training and testing sets using a 4:1 ratio and compared baseline clinical and demographic characteristics between them (see Fig. 1 for the study design and flow). We fitted 7 widely used ML algorithms to the training set and evaluated model performance on the testing set: regularized linear modeling (GLM), classification tree, K-nearest neighbors (KNN), extreme gradient boosting (XGBoost) trees, multivariate adaptive regression splines (MARS), support vector machine (SVM), and a single-layer neural network (NNET) (see Appendix B). We selected these algorithms based on their promising performance on similar medical prediction tasks in the literature as well as in our group's previously published studies [34,35,36].
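
One plausible way to produce such a stratified 4:1 split in R is via the caret package (a sketch under our assumptions about variable names, not the authors' code):

```r
library(caret)

set.seed(2020)  # arbitrary seed for reproducibility
# Stratify on the outcome so both sets keep a similar event rate.
idx      <- createDataPartition(dat$mortality_180d, p = 0.8, list = FALSE)
training <- dat[idx, ]
testing  <- dat[-idx, ]
```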

Fig. 1 Study design and flow

Before fitting the algorithms to the training set, we applied several data pre-processing techniques to improve model performance. Specifically, we log-transformed and normalized all numeric predictors and converted each categorical predictor into a set of binary predictors using dummy coding, with each binary predictor representing one level of the original predictor (1 for true, 0 for false) and the reference level (e.g., gender_female) omitted. In addition, to avoid multicollinearity, we filtered out predictors whose mean absolute correlation with other variables exceeded 0.9.
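
These steps correspond closely to a standard recipes pipeline in R; the sketch below is one plausible implementation, not the authors' code (the log offset of 1 is our assumption to tolerate zero scores):

```r
library(recipes)

rec <- recipe(mortality_180d ~ ., data = training) |>
  step_log(all_numeric_predictors(), offset = 1) |>     # log(x + 1); offset assumed
  step_normalize(all_numeric_predictors()) |>           # center and scale
  step_dummy(all_nominal_predictors()) |>               # dummy-code, drop reference level
  step_corr(all_numeric_predictors(), threshold = 0.9)  # drop highly correlated predictors

prepped  <- prep(rec, training = training)
train_pp <- bake(prepped, new_data = NULL)
test_pp  <- bake(prepped, new_data = testing)
```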

We used a Bayesian optimization approach to select the best hyperparameter values for each model [37]. To avoid overfitting, we trained each model with repeated tenfold cross-validation (3 repeats). We then selected the 4 algorithms with the best cross-validation performance and evaluated them on the testing set, along with an unweighted voting ensemble of their outputs [38]. To evaluate model performance, we calculated several widely used metrics: overall accuracy (proportion of patients whose outcome was correctly identified), sensitivity (correct identification of patients who died), specificity (correct identification of patients who remained alive), positive predictive value (proportion of predicted deaths that were correct), negative predictive value (proportion of predicted survivals that were correct), and the area under the receiver operating characteristic curve (AUROC). To assess performance differences, we compared AUROC values using 2000 bootstrap replicates drawn from the testing set and stratified by the outcome variable [39].
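
A condensed sketch of this training and evaluation loop with caret and pROC (Bayesian hyperparameter search is elided; packages such as tune or rBayesianOptimization provide it — a grid search is used here for brevity):

```r
library(caret)
library(pROC)

# Repeated 10-fold cross-validation (3 repeats), optimizing AUROC.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit_xgb <- train(mortality_180d ~ ., data = train_pp,
                 method = "xgbTree", metric = "ROC", trControl = ctrl)

# Test-set metrics: accuracy, sensitivity, specificity, PPV, NPV.
preds <- predict(fit_xgb, newdata = test_pp)
confusionMatrix(preds, test_pp$mortality_180d, positive = "dead")

# AUROC, and a stratified bootstrap comparison of two models' AUROCs.
probs   <- predict(fit_xgb, newdata = test_pp, type = "prob")[, "dead"]
roc_xgb <- roc(test_pp$mortality_180d, probs, levels = c("alive", "dead"))
# roc.test(roc_xgb, roc_svm, method = "bootstrap",
#          boot.n = 2000, boot.stratified = TRUE)
```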

We assessed the outcome fairness of the top 4 classification algorithms across white and nonwhite groups based on the extensively studied statistical notion of equalized odds [40], which requires the true positive and false positive rates to be equal across groups [41]. All calculated outcome-fairness values were negative, with larger values (closer to 0) indicating greater fairness. In addition to standard performance metrics, we used several model-agnostic approaches to uncover how the models generated predictions from the data and to provide additional insights for clinical practice and research. First, we identified the most important predictors across the top four algorithms using permutation feature importance analysis. We then calculated the Accumulated Local Effects (ALE) of the important predictors to reveal how model outputs varied, on average, with the values of those predictors [42]. We also used SHapley Additive exPlanations (SHAP) [43] to gain insight into model behavior at the individual level, and constructed calibration plots of model probabilities against observed event rates for the top 4 models to assess their calibration. An additional analysis of misclassified patients was performed to explore possible reasons for the misclassification and better train the algorithms in the future. We conducted all analyses in R version 4.1.1 (see Appendix C).
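
For concreteness, here is a sketch of how an equalized odds gap can be scored across two groups; the exact formula below is our assumption, chosen so that values are negative and closer to 0 is fairer, matching the description above:

```r
# Assumed score: -(|TPR_white - TPR_nonwhite| + |FPR_white - FPR_nonwhite|).
equalized_odds_gap <- function(y_true, y_pred, group) {
  rates <- function(g) {
    yt <- y_true[group == g]; yp <- y_pred[group == g]
    c(tpr = mean(yp[yt == "dead"] == "dead"),   # true positive rate
      fpr = mean(yp[yt == "alive"] == "dead"))  # false positive rate
  }
  w <- rates("white"); n <- rates("nonwhite")
  -(abs(w["tpr"] - n["tpr"]) + abs(w["fpr"] - n["fpr"]))
}
```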

Results

Participant demographics

Overall, 630 of 689 participants (mean age, 59.10 ± 13.18 years) were included in the starting cohort, with 504 in the training set and 126 in the testing set. Most participants (n = 354, 56.19%) were female and over a third (n = 217, 34.44%) were Caucasian/white; 318 patients (50.48%) died within 180 days after the ESAS-FS assessment, and 297 (47.14%) were enrolled in phase I clinical trials. The mean ECOG at C1D1 and mean number of chemo regimens were 1.69 ± 1.01 and 3.74 ± 2.61, respectively. No significant differences were found between the training and testing sets (see Tables 2 and 3). The internal consistency of the 12-item ESAS-FS measure was good (α = 0.80; 95% CI, 0.77–0.82). Mean scores indicated that fatigue (5.34 ± 2.83) and spiritual pain (1.17 ± 2.19) were the most and least severe symptoms, respectively; 63.97% of patients had clinically significant fatigue. Pearson correlations among the 12 ESAS symptoms were positive, ranging from weak to moderate (r = 0.10–0.65) (see Appendix D).

Table 2 Demographic information and clinical characteristics for survey participants
Table 3 Demographic information and clinical characteristics for included patients in training and testing sets

Algorithm performance in the training and testing sets

The number of included variables ranged from 3 to 17 during model training. After training the seven candidate algorithms with tenfold cross-validation on the training set, the best hyperparameters were selected through the tuning process for each algorithm. Given the intuitive interpretability of the GLM, we summarized the coefficients of the regularized regression with elastic net penalty in Table 4 to explore the influence of each predictor on the outcome. Among the included predictors, age (βregularized, − 0.05), gender (βregularized, 0.18), ECOG at C1D1 (βregularized, 0.16), number of chemo regimens (βregularized, − 0.01), pain (βregularized, 0.11), shortness of breath (βregularized, 0.20), appetite (βregularized, 0.13), wellbeing (βregularized, 0.04), and PHS (βregularized, 0.11) were identified as key predictors significantly associated with the prediction of mortality status for patients with advanced cancer.

Table 4 Factors associated with 180-day mortality prediction upon regularized regression with elastic net penalty in training set

The single-hidden-layer NNET demonstrated the highest AUROC (0.659) during tenfold cross-validation, followed by the GLM (0.656), SVM (0.655), and XGBoost (0.655) (see Table 5).

Table 5 Best hyperparameters selected for the top four algorithms and their performance in the training set

The eight most common and important predictors across the top four algorithms were age, appetite, ECOG at C1D1, gender, pain, PHS, number of chemo regimens, and shortness of breath. For the XGBoost algorithm, the most important variable was shortness of breath (see Fig. 2), as it was for the NNET algorithm.

Fig. 2 Variable importance for the XGBoost algorithm

The ALE profiles for the most important variables in the XGBoost algorithm indicated that ECOG at C1D1, pain, shortness of breath, and wellbeing positively influenced the predicted 180-day mortality on average, while age, financial distress, and number of chemo regimens had negative influences.

The top four algorithms (NNET, SVM, XGBoost, GLM) with the best AUROC in the training set were then applied to the testing set. Their performance, along with that of the algorithm ensemble, is presented in Table 6.

Table 6 Prediction performance of 180-day mortality for top four algorithms in testing set

Of the four algorithms, XGBoost performed best on five of the six metrics, the exception being sensitivity. Specifically, it correctly identified 44 of the 65 patients with advanced cancer who died within 180 days of the ESAS assessment (sensitivity = 0.68; 95% CI, 0.56 to 0.79), the same sensitivity as the SVM and GLM. It achieved the highest overall accuracy of 0.65 (95% CI, 0.56 to 0.73), specificity of 0.62 (95% CI, 0.50 to 0.75), PPV of 0.66 (95% CI, 0.54 to 0.77), NPV of 0.64 (95% CI, 0.52 to 0.77), and AUROC of 0.69 (95% CI, 0.60 to 0.78), displayed in Fig. 3. Comparison of AUROC values indicated that XGBoost performed significantly better only than the SVM (p = 0.04).

Fig. 3 ROC curves for all algorithms and the ensemble

The racial fairness evaluations for these four algorithms were best for XGBoost and SVM (both − 0.08) and worst for the GLM (− 0.24).

The calibration plots in Fig. 4 indicate that these models were well calibrated, as the average predicted probabilities on the x-axis mostly matched the proportion of positives on the y-axis. Figure 5 shows, via SHAP values, the varied contribution of each included variable to the 180-day mortality prediction for 4 specific cases under the XGBoost algorithm. Three of the four presented cases were correctly predicted. In addition, the shortness of breath predictor dominated the contributions to predicted death outcomes.

Fig. 4 Calibration plots for all algorithms

Fig. 5 Example SHAP values for four individual predictions

Results of comparative analysis for misclassified patients

Table 7 shows a comparative analysis between the 82 (65.1%) correctly classified and 44 (34.9%) misclassified patients under the XGBoost model. The misclassified patients had a higher mean age (59.93 vs 57.61 years), mean ECOG at C1D1 (1.91 vs 1.65), and mean number of chemo regimens (4.46 vs 3.34). However, no statistically significant differences were found between the groups in demographic or clinical characteristics except the mean number of chemo regimens (p = 0.03).

Table 7 Comparison of demographic information and clinical characteristics for misclassified and correctly classified patients in the testing set for the XGBoost model

Discussion

Main findings

Our research found that ML-based predictive models using patient, clinical, and ESAS-FS PRO scores showed promising performance in predicting the 180-day mortality risk of patients with advanced cancer. The XGBoost algorithm demonstrated the best performance, with an overall accuracy of 0.65, sensitivity of 0.68, and an AUROC of 0.69. The promising results of the supervised machine learning models in this study might help elicit timely conversations between oncologists and patients about navigating the patient's symptom trajectory toward a personalized, data-driven optimal treatment plan during the end-of-life period. Previous studies indicated that these end-of-life discussions can help reduce health care costs, avoid unnecessarily aggressive care, and ultimately improve the quality of death [44,45,46,47].

Currently, downstream clinical actions after ESAS assessment are insufficient, although its importance and usefulness are highly recognized by physicians and oncology professionals [28, 48, 49]. Successful application of ESAS symptom data in ML algorithm-based prediction models may substantially promote appropriate clinical actions following symptom screening [28]. This would fully leverage its meaningful impact in supportive and palliative care, ultimately benefiting patients with cancer.

Furthermore, our study highlights the utility of PROs in ML-based short-term mortality prediction in oncology. Although the predictive performance of these algorithms might be modestly improved by including more mortality-associated covariates from clinicopathologic, tumor entity, comorbidity, and prior treatment information, the focus of this study was to assess the impact of the PRO measure ESAS on mortality prediction in order to facilitate patient-centered care for patients with advanced cancer. The results indicate that the high-signal information contained in PRO symptom measures may provide important benefits to prediction models for patients with advanced cancer. In recent years there has been growing recognition that complex algorithms and multidimensional datasets are not the only prerequisites for effective ML models; the quality of the data used in training matters far more than its quantity, determining both the performance and the generalizability of the resulting models [50]. This is especially evident in health care, where big EHR data may not contain very detailed information about a patient's health at a given time.

Of note, we observed that the ensemble approach did not achieve the best performance in identifying the 180-day mortality of patients with advanced cancer following ESAS assessment, which contradicts some previous research in which the voting ensemble approach achieved the best performance in pattern identification tasks [34, 51]. In this study, we adopted a traditional unweighted voting technique for the ensemble. Previous studies found that this traditional ensemble technique lacks the ability to capture the similarity between the objects used in training and the new objects to be predicted (such as the patients with advanced cancer in this study), which weakens its predictive capacity [51]. This may partly explain why ensemble learning did not outperform the single learning algorithms in this study.

All 59 patients lost to follow-up had taken part in a phase I trial, in contrast to the 630 included eligible patients. In oncology, phase I trials mainly test new drugs in enrolled patients with advanced cancer for whom standard therapy no longer works [52]. Furthermore, these 59 patients had a greater mean number of chemo regimens (4.98 vs 3.74). Their more serious illness may plausibly explain their absence from this study.

Regarding model performance, XGBoost descriptively showed the highest AUROC value and the largest (fairest) equalized odds value (− 0.08). It performed significantly better than the SVM but not than the GLM or the NNET. The overall accuracy of 0.65 for the XGBoost model is not optimal and leaves considerable room for improvement. However, not only model performance but also model generalizability is important for ML applications: the XGBoost model showed similar performance in tenfold cross-validation and in the separate testing set, and it sits between the GLM and the NNET in model performance and complexity. We therefore believe that the clinical feasibility of the XGBoost model should be evaluated in future research. The XGBoost algorithm misclassified 44 patients, among whom 21 patients who died were not correctly identified. Compared to the 82 correctly classified patients, the 44 misclassified patients were more likely to be male (56.82% vs 41.46%), were older (59.93 vs 57.61 years), and had a higher mean number of chemo regimens (4.46 vs 3.34) and mean ECOG at C1D1 (1.91 vs 1.65), but a lower percentage of them had died (47.73% vs 53.66%). Statistics from the National Cancer Institute (NCI) indicate that the 2020 cancer mortality rate for men (189.5/100,000) was higher than that for women (135.7/100,000) [53], and the median age of 66 years at cancer diagnosis reflects that increasing age is the most important risk factor for cancer overall [54]. The Eastern Cooperative Oncology Group (ECOG) performance status scale represents patients' level of functioning on ordinal ratings from 0 (healthy) to 5 (deceased); oncologists use it to assess functional status and to determine eligibility for certain clinical trials [55, 56]. Most deaths in the palliative setting are caused by disease progression [57]. Vasconcellos et al. reported that patients with advanced cancer and poor ECOG performance status had short survival after inpatient palliative chemotherapy [58]. The observed misclassification may be attributable to the complexity of interactions among these predictors.

Regarding variable importance, the shortness of breath, age, and appetite features dominated the variable importance of the XGBoost trees (Fig. 2). Patients who died did so an average of 2.08 months after the ESAS assessment, and those correctly predicted as dying by the XGBoost algorithm died within 2.05 months on average. This suggests the XGBoost algorithm was prioritizing people at immediate risk of death and could therefore be re-framed for a shorter timeframe. This is consistent with Seow et al.'s findings that the ESAS symptoms of appetite and shortness of breath worsened over time in the last 180 days of life for patients with cancer [59]. The SHAP values in Fig. 5 illustrate the varying importance of each symptom in the mortality predictions for four specific cases. Overall, the higher the ESAS symptom burden, the shorter the survival time for patients with advanced cancer [60,61,62].

Limitations and future study

This study has several limitations. First, the model developed in this single-institution study needs to be replicated using data from other clinics and hospitals to ensure its representativeness and enhance its reliability. Second, we measured symptom severity only once using the ESAS-FS; the impact of symptom-severity change over time on the outcome in the target population has not yet been assessed. Longitudinal investigations on this topic are therefore clearly warranted to complement the limitations of the cross-sectional design used here. Third, ML algorithms focus on accurately predicting outcomes rather than on causal reasoning [63]. Causation refers to a relationship between A and B that satisfies both sufficient cause and necessary cause [64]. Randomized controlled trials, regression-discontinuity methods, and interrupted time series are used to assess causal inference in clinical medical research [65]. ML-based algorithms are good at finding correlations in data but lack the ability to reason about causality or environmental changes. We therefore note that the results do not necessarily explain the complex relationships between predictors and the outcome, and a theoretical model is needed to understand causality in future studies. Fourth, this research has not yet been applied to clinical practice, so the effectiveness and reliability of these ML-based models in predicting 180-day mortality in an actual clinical setting still need to be tested. Fifth, the relatively small sample size (n = 630) falls far below the recommended minimum of 200 events per candidate predictor for stable performance and reliable assessment of ML modeling approaches [66, 67]; although the calibration plots showed that predicted probabilities were approximately close to most observed event rates, and the AUROC comparisons evaluated the performance differences among models, further studies with larger sample sizes are still needed to verify the findings.

Conclusion

In this study, we trained seven ML-based models to predict 180-day mortality in patients with advanced cancer using PRO scores collected with the ESAS-FS measure and evaluated the top four models on the testing set, among which XGBoost trees achieved the best prediction results. This research may help patients make informed and reliable decisions, ensuring through timely conversations between oncologists and patients that end-of-life care meets their goals and wishes. Furthermore, this preliminary investigation applying the PRO ESAS to ML-based models for short-term mortality prediction may serve as a benchmark for future researchers exploring its effectiveness in this field.