Background

In their last year of life, individuals with advanced cancer face costly and over-medicalized care, high unaddressed needs, and declining quality of life [1,2,3,4,5,6,7,8,9]. An early palliative care approach is essential to improve end-of-life outcomes, including symptom management, psychoeducation for patient and caregiver empowerment, and advance care planning [10,11,12,13]. Yet, many patients with advanced cancer in the real-world setting either do not receive palliative care or receive it late in their disease trajectory [14,15,16,17]. Given workforce limitations, one proposed approach is to use short-term mortality as a surrogate to identify patients with a high probability of palliative needs who are most likely to benefit from palliative care [18,19,20].

Machine learning models trained on Electronic Health Record (EHR) data have shown promise in cancer prognostication, where advanced computational techniques are used to model linear and non-linear patterns within large datasets [19, 21]. The ability to leverage routinely collected data is attractive as it avoids burdensome external data entry and workflow disruptions. However, several gaps remain in the published literature.

First, while many published cancer prognostic models show promising discriminatory performance, the majority had a high or uncertain risk of bias, with incomplete reporting of modelling processes and selective reporting of performance metrics [21, 22]. With respect to performance metrics among general oncology models, most demonstrate low positive predictive value (0.45–0.53) and sensitivity (0.27–0.60), underperforming at the actual task of identifying patients who will die [21]. Second, alignment of the model development strategy with the articulated use-case is also critically missing in the literature [23,24,25,26]. For example, some oncology prognostic models were developed on all-stage cancer cohorts despite the proposed use-case of increasing palliative care interventions. This fails to account for the fact that the clinical implications and actions prompted by a prediction of short-term mortality can differ vastly between early and advanced stage cancers [27,28,29]. Third, if a model is designed for use as a clinical decision support system, reporting the model without intuitive explanations of its outputs can negatively impact trust and adoption at clinical implementation [30]. In addition, complex models with automated feature selection and engineering may generate largely non-interpretable predictions [31].

This manuscript addresses the gaps highlighted above. We aimed to develop and validate an explainable machine learning model, trained on EHR data of advanced cancer patients, that predicts the risk of 365-day mortality. Envisioning that the model output would nudge clinicians towards a palliative care approach, we sought to enhance model interpretability by leveraging prognostic literature and domain knowledge for feature engineering [32]. Reporting of this study follows the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guideline for prediction model development and validation [33].

Methods

Study design

We identified our cohort from patients with advanced cancer registered with the National Cancer Centre Singapore (NCCS). NCCS maintains a cancer-specific data repository with human-in-the-loop processes that register the cancer diagnosis and stage of each newly diagnosed patient. For each patient, data spanning 1st July 2016 to 31st December 2021 were extracted from the MOSAIQ Oncology Information System and SingHealth’s Enterprise Analytic Platform (eHints), which are unified data repositories that combine data from various healthcare transactional systems [34].

Participants

Our cohort consisted of adults (age ≥ 18) diagnosed with Stage 3 or Stage 4 solid organ cancer between 1st July 2017 and 30th June 2020. To allow sufficient data for prediction, these patients were required to have at least two outpatient encounters within NCCS between 1st July 2017 and 31st December 2020. Non-residents were excluded from the cohort as their mortality outcomes were not accurately reflected in local databases.

Problem framing

We framed our classification problem to match the use-case: “Given a prediction point that corresponds to an actualized outpatient cancer visit, predict mortality within 365 days of the prediction point, using EHR data from up to 365 days prior.” (Fig. 1a) This prediction point effectively divides any patient’s EHR timeline into past events and a virtual future. To ensure baseline data were available for prediction, we restricted predictions to the 2nd outpatient visit and beyond. Patients were allowed more than one prediction point to capture their disease and treatment trajectory over time. To reduce over-training on samples with clustered visits, we allowed only one prediction point per month for patients with more than the median number of outpatient visits (Fig. 1b).
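To make this framing concrete, the following is a minimal sketch (not the study code) of how prediction points could be enumerated from visit records, assuming a pandas DataFrame `visits` with columns `patient_id` and `visit_date`; the column names are illustrative assumptions.

```python
import pandas as pd

def enumerate_prediction_points(visits: pd.DataFrame) -> pd.DataFrame:
    """Enumerate prediction points from outpatient visits.

    Assumes columns: patient_id, visit_date (datetime64). Rules follow the
    framing above: predictions start from each patient's 2nd visit, and
    patients with more visits than the cohort median keep at most one
    prediction point per calendar month.
    """
    visits = visits.sort_values(["patient_id", "visit_date"]).copy()
    visits["visit_rank"] = visits.groupby("patient_id").cumcount() + 1

    # Baseline data must exist, so the first visit is never a prediction point.
    points = visits[visits["visit_rank"] >= 2].copy()

    # Cohort median number of outpatient visits per patient.
    n_visits = visits.groupby("patient_id").size().rename("n_visits")
    median_visits = n_visits.median()
    points = points.join(n_visits, on="patient_id")

    # High-frequency attenders keep one prediction point per calendar month.
    points["month"] = points["visit_date"].dt.to_period("M")
    frequent = points["n_visits"] > median_visits
    deduped = points[frequent].drop_duplicates(subset=["patient_id", "month"])
    points = pd.concat([points[~frequent], deduped], ignore_index=True)

    return points.sort_values(["patient_id", "visit_date"])[
        ["patient_id", "visit_date"]
    ]
```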

Outcome

The outcome was 365-day mortality from the prediction point. Mortality dates were obtained from the Singapore Registry of Births and Deaths and censored at 31st December 2021. Outcome ascertainment was assumed to be complete as death registration is mandatory by law.
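As a simple illustration of the labelling rule (a hypothetical sketch, not the study code), each prediction point receives a binary label from the registry death date; prediction points were restricted so that the full 365-day window closes before the censor date.

```python
import pandas as pd

CENSOR_DATE = pd.Timestamp("2021-12-31")  # administrative censoring date

def label_365day_mortality(prediction_date, death_date=None):
    """Binary 365-day mortality label for one prediction point.

    A missing death_date means the patient was alive at the censor date;
    because every prediction window closes on or before CENSOR_DATE,
    a negative label is unambiguous.
    """
    if death_date is None or pd.isna(death_date):
        return 0
    return int((death_date - prediction_date).days <= 365)
```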

Data pre-processing

Oncologists, palliative specialists, and data scientists were involved in feature selection and engineering. Our data included 5 categories of data commonly available within EHR and clinically relevant to prognostication: (1) Demographics; (2) Clinical Characteristics; (3) Laboratory and Physical measurements; (4) Systemic cancer treatment; and (5) Healthcare visits. (Additional File 1: Table S1 and Table S2)

To derive features on systemic cancer treatment, we extracted dispensed drug data and mapped them to the World Health Organisation (WHO) Anatomical Therapeutic Chemical (ATC) classification. The WHO ATC classification is a system of alphanumeric codes that classifies drugs in a hierarchy with five levels. Subgroup L01 (with its subcodes) comprises antineoplastic agents, while subgroup L02 (with its subcodes) comprises cancer endocrine therapies [35]. We categorised cancer treatments under the subgroups L01A, L01B, L01C, L01D, L01E, L01F, L01X, L02A, L02B, and trial drugs [35]. Additionally, we generated cumulative counts of unique cancer drugs as a surrogate for change in cancer treatment line, as such changes tend to portend poorer prognosis.
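The sketch below illustrates, under assumed column names, how dispensed drugs might be grouped into these ATC subgroups and how the cumulative count of unique cancer drugs could be derived; it is illustrative rather than the study’s extraction code.

```python
import pandas as pd

# ATC subgroup prefixes used in the study; trial drugs are flagged separately.
ATC_SUBGROUPS = ["L01A", "L01B", "L01C", "L01D", "L01E", "L01F", "L01X",
                 "L02A", "L02B"]

def categorise_atc(atc_code: str) -> str:
    """Map a full ATC code (e.g. 'L01XE08') to one of the study subgroups."""
    for prefix in ATC_SUBGROUPS:
        if atc_code.startswith(prefix):
            return prefix
    return "OTHER"

def cumulative_unique_drugs(dispense: pd.DataFrame) -> pd.DataFrame:
    """Running count of distinct cancer drugs per patient, a surrogate for
    change in treatment line. Assumes columns: patient_id, dispense_date,
    drug_name (column names are assumptions for this sketch)."""
    d = dispense.sort_values(["patient_id", "dispense_date"]).copy()
    first_seen = ~d.duplicated(subset=["patient_id", "drug_name"])
    d["cum_unique_drugs"] = first_seen.groupby(d["patient_id"]).cumsum()
    return d
```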

To derive comorbidities, we extracted International Classification of Diseases, Ninth and Tenth Revision (ICD-9 and ICD-10) diagnosis codes and transformed them into Elixhauser diagnosis categories using the R package ‘comorbidity’ version 1.0.5 [36]. To represent laboratory test results and body mass index (BMI), we summarized the data as the minimum, maximum, median, standard deviation, and latest available reading [37]. Engineered features such as healthcare utilization counts and elapsed time since diagnosis were computationally derived.
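For illustration, a minimal sketch of the laboratory summary features, assuming a long-format DataFrame of one patient’s results with columns `test_name`, `result_date`, and `value` (all assumed names):

```python
import pandas as pd

def summarise_labs(labs: pd.DataFrame, prediction_date) -> pd.Series:
    """Summary statistics for one patient's laboratory results within the
    365 days preceding a prediction point (illustrative sketch only)."""
    window = labs[
        (labs["result_date"] <= prediction_date)
        & (labs["result_date"] > prediction_date - pd.Timedelta(days=365))
    ].sort_values("result_date")
    agg = window.groupby("test_name")["value"].agg(
        ["min", "max", "median", "std", "last"]
    )
    # Flatten to features such as albumin_min, albumin_median, albumin_last.
    return pd.Series(
        {f"{test}_{stat}": value
         for test, row in agg.iterrows()
         for stat, value in row.items()}
    )
```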

Missing data handling

Longitudinal EHR data are often sparsely distributed, irregularly clustered, and incomplete [38]. Missingness within EHR data is “not missing at random” (NMAR), as the probability of missing data may be linked to disease severity, healthcare use, or a lack of clinical indication to collect the data [39]. Missingness is therefore informative and should be incorporated within the modelling [40]. Boosted tree models such as XGBoost can handle missingness in features directly, learning a default branch direction for missing values at each split during training (sparsity-aware split finding) [41]. Additional File 1: Table S2 provides a summary of missing data.
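A toy example (synthetic data, not study data) of how XGBoost consumes features containing NaN without any imputation step:

```python
import numpy as np
import xgboost as xgb

# XGBoost accepts NaN directly and learns a default branch direction for
# missing values at each split (sparsity-aware split finding).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.3] = np.nan            # inject ~30% missingness
y = (np.nan_to_num(X[:, 0]) + rng.normal(size=200) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, y)                                  # NaNs handled internally
print(model.predict_proba(X[:5])[:, 1])
```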

Statistical analysis and modelling

We developed the boosted tree model in Python version 3.9.16 using XGBoost (xgboost version 1.7.5). The data were split in a 75:25 ratio into training and validation sets. The area under the receiver operating characteristic curve (AUROC) was used as the primary performance metric, as it reflects the trade-off between sensitivity and specificity. Because AUROC can be misleadingly high in datasets with class imbalance, we also reported the area under the precision-recall curve (AUPRC), which measures the trade-off between positive predictive value and sensitivity [42]. The calibration plot and Brier score were used to compare predicted versus observed rates of 365-day mortality [43]. To explain model output, we used Shapley Additive Explanations (SHAP) values (shap 0.41.0), a methodology that improves the transparency and interpretability of machine-learning models. SHAP values are based on a cooperative game theoretic approach, in which the contribution of each feature to a prediction is calculated by comparing changes in the prediction averaged across all possible combinations of input features [44, 45]. The explainer used was TreeSHAP, which leverages the structure of tree-based models to efficiently compute Shapley values for each feature, providing feature attribution scores for predictions made by tree-based models [46].
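The following sketch shows this evaluation pipeline end-to-end on synthetic data; the hyperparameters and data are illustrative assumptions, not the tuned study settings.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.calibration import calibration_curve
from sklearn.metrics import (accuracy_score, average_precision_score,
                             brier_score_loss, roc_auc_score)

# Synthetic stand-in for the prediction-point feature matrix and labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(size=2000) > 0).astype(int)
X_train, X_valid = X[:1500], X[1500:]
y_train, y_valid = y[:1500], y[1500:]

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                          eval_metric="logloss")
model.fit(X_train, y_train)

p = model.predict_proba(X_valid)[:, 1]
print("AUROC:", roc_auc_score(y_valid, p))
print("AUPRC:", average_precision_score(y_valid, p))     # precision-recall
print("Brier:", brier_score_loss(y_valid, p))
print("Accuracy:", accuracy_score(y_valid, p >= 0.5))    # default 0.5 cut-off

# Calibration: observed vs. mean predicted risk within probability bins.
obs_rate, pred_rate = calibration_curve(y_valid, p, n_bins=10)

# TreeSHAP attributions for each validation prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_valid)   # shap.Explanation with per-feature values
```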

Results

A total of 5926 patients with 52,538 prediction points were included in this study (Additional File 1: Figure S1). To prevent data leakage between the training and validation sets, the 75:25 split was carried out at the patient level. The training cohort consisted of 39,416 prediction points among 4444 patients, while the validation cohort consisted of 13,122 prediction points among 1482 patients.
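As an illustration of a patient-level split (a toy sketch under assumed data shapes, not the study code), scikit-learn’s GroupShuffleSplit keeps all of a patient’s prediction points on the same side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy patient-level 75:25 split: every prediction point from a given patient
# lands on the same side of the split, preventing leakage across repeat visits.
rng = np.random.default_rng(0)
patient_ids = np.repeat(np.arange(100), 5)      # 100 patients x 5 points each
X = rng.normal(size=(500, 8))
y = rng.integers(0, 2, size=500)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups=patient_ids))
assert not set(patient_ids[train_idx]) & set(patient_ids[valid_idx])
```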

The mean age of our population was 66.3 (standard deviation [SD] 11.5) years, with 64.2% male and the majority (84.3%) of Chinese ethnicity. A total of 3725 patients (62.9%) had stage 4 cancer while 2201 patients (37.1%) had stage 3 cancer. By the censor date of 31st December 2021, 3316 (55.6%) patients in the cohort had died (Table 1). In total, 17,149 of the 52,538 prediction points (32.6%) had a mortality event within the 365-day prediction window.

Table 1 Study population characteristics (n = 5926)

Model

Model performance metrics on the validation cohort are reported in Table 2. The confusion matrix and model parameters can be found in the Additional file (Tables S3 and S4, respectively). At the default classification threshold of 0.5, our model achieved an accuracy of 0.781 (95% CI 0.774–0.788), an AUROC of 0.861 (95% CI 0.856–0.867), and an AUPRC of 0.771. In terms of calibration, the Brier score was 0.147, with slight overestimation of the 365-day mortality risk (calibration plot shown in Additional file 1: Figure S2).

Table 2 Performance metrics of XGBoost Model

Explainability

Figure 2a provides a summary ranking of the top data features (from highest to lowest SHAP value) within the model. The model itself considers all features, and SHAP values can be calculated for each of them; we show only the top 10 features for brevity. The three most impactful features are the latest albumin value, stage 4 cancer at diagnosis, and the number of unique cancer drugs given.

Figure 2b shows the interaction between the value of each feature and its impact on the model prediction; again, we illustrate the top 10 features. The values of numeric features are normalized and represented along a colour gradient, with red for larger and blue for smaller values. Categorical features are similarly represented, with red for present (value = 1.00) and blue for absent (value = 0.00). Within each feature, individual coloured dots represent each prediction, plotted along its SHAP value (x-axis). A negative SHAP value (extending to the left) indicates a reduced probability of mortality, while a positive SHAP value (extending to the right) indicates an increased probability of mortality. For example, the lower the albumin value, the higher the probability of mortality (the dots extending to the left are mostly red, while the dots extending to the right are increasingly blue). Predictions with stage 4 cancer are associated with a higher probability of mortality, as they cluster towards the right of the y-axis.
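For reference, one plausible way to produce such panels with the shap package, assuming `shap_values` is the shap.Explanation object computed with TreeExplainer as in the modelling sketch above:

```python
import shap

# Fig. 2a-style panel: mean absolute SHAP value per feature, top 10 shown.
shap.plots.bar(shap_values, max_display=10)

# Fig. 2b-style panel: per-prediction SHAP values coloured by feature value.
shap.plots.beeswarm(shap_values, max_display=10)
```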

Discussion

In this study, we trained and validated an XGBoost model using structured EHR data of advanced cancer patients. The model showed excellent discrimination (AUROC 0.861), precision-recall (AUPRC 0.771), and accuracy (0.781) in predicting the last year of life. Compared with the most similar published machine learning models in general cancer cohorts, our AUROC is similar (published range 0.812–0.890) while our AUPRC is considerably higher (published range 0.340–0.462) [27, 28, 47, 48]. High precision-recall is important for identifying the few patients who will die within a year without overestimating the risk of death for the majority who will survive, especially within resource-limited settings [42].

From the outset, we framed this AI development as a clinician decision support tool in which predictions of high mortality risk within 365 days may nudge clinicians towards considering involvement of palliative care, earlier anticipatory care discussions, and re-assessment of the risk-benefit ratios of standard-of-care next-line therapies. Hence, model interpretability is essential for user adoption and acceptance [49]. Eschewing a completely data-driven approach to feature development, we instead leveraged the domain knowledge of oncologists and palliative specialists in feature design to aid subsequent interpretability [32, 50]. For example, we recognise that disease control rates drop and the risk of disease mortality increases with each change in line of cancer treatment [51]. Hence, an engineered feature of cumulative counts of unique cancer drugs, as a surrogate for change in cancer treatment line, was added and became the third most important feature within our XGBoost model (Fig. 2a). As another example, we incorporated strong literature evidence that an elevated Neutrophil-Lymphocyte Ratio (NLR) is associated with poor prognosis, and engineered features around NLR instead of providing raw neutrophil and lymphocyte data to the model [52]. This feature is the fourth most important (Fig. 2b), with higher NLR values associated with an increased probability of mortality. Our approach of developing explainable models with engineered features that comport with the literature and clinical knowledge resonates with the clinician’s own intuitive understanding of prognostication and may increase model adoption [53].
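As a hypothetical sketch of the NLR feature engineering (column names and summary choices are assumptions, not the study schema):

```python
import pandas as pd

def nlr_features(labs: pd.DataFrame) -> pd.Series:
    """Derive NLR per result date, then summarize it like other lab features.
    Assumes columns: result_date, neutrophil_abs, lymphocyte_abs."""
    nlr = (labs["neutrophil_abs"] / labs["lymphocyte_abs"]).rename("nlr")
    nlr = nlr.set_axis(labs["result_date"]).sort_index()
    return pd.Series({
        "nlr_min": nlr.min(),
        "nlr_max": nlr.max(),
        "nlr_median": nlr.median(),
        "nlr_latest": nlr.iloc[-1] if len(nlr) else float("nan"),
    })
```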

Beyond global interpretability for a “black-box” machine-learning model, we have taken a further step by providing individual prediction explanations. Commonly, a binary classification model requires a set probability threshold (0.5 in our model), yet a patient with a predicted probability of 0.49 is not necessarily different in risk from a patient with a predicted probability of 0.51. Instead of using the binary mortality prediction as a strict rule, we feel that visualizing predicted probabilities with model explainers will provide better clinical decision support for further clinical evaluation and interventions. Figures 3a and 3b show the composition of individualized predictions for a 76-year-old Chinese gentleman with T3N0M1 lung cancer and comorbidities of hypertension and diabetes. “E[f(X)]” denotes the average predicted probability of 365-day mortality for our entire cohort without considering any data features. “f(x)” denotes the final predicted probability of 365-day mortality after summing all the feature contributions. Read from the bottom up, each data feature either increases (red arrows) or decreases (blue arrows) the probability of 365-day mortality additively. In Fig. 3a, the prediction was made 23 days post diagnosis, when, among other features, he had a normal albumin (41.0 g/L), a low neutrophil-lymphocyte ratio (1.46), and a healthy body mass index (22.8). The model predicted the patient to have a 31.6% risk of dying in the next 365 days, which turned out to be a true-negative prediction. In Fig. 3b, the prediction was made 505 days post diagnosis on the same patient, when his albumin remained normal (41.0 g/L), but being older, having received 4 different anti-cancer drugs, and having a higher neutrophil-lymphocyte ratio (6.41) increased his probability of mortality. The model predicted the patient to have a 75.2% risk of dying in the next 365 days, which turned out to be a true-positive prediction.
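For reference, such individualized explanations can plausibly be rendered with the shap package’s waterfall plot, assuming `shap_values` is the shap.Explanation from TreeExplainer above; note that, by default, TreeSHAP attributions for an XGBoost classifier are in log-odds rather than probability space, so additional configuration may be needed to reproduce probability-scale figures.

```python
import shap

# Waterfall plot for one prediction point: features are stacked from the base
# value E[f(X)] up to the individual output f(x), mirroring Fig. 3.
i = 0                                     # index of a single prediction point
shap.plots.waterfall(shap_values[i], max_display=10)
```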

Our model shows potential for clinical implementation in the cancer outpatient setting. The model output can be used in several ways. First, regular reports on identified outpatients can be provided to a back-end triage and case-management system. By proactively reaching out to these at-risk patients and offering regular palliative needs screening, issues can be identified and managed promptly. Second, model explanations and prompts can be sent to oncologists to increase their prognostic awareness, nudge them towards early anticipatory care planning, and reassess the risk-benefit ratios of next-line therapies. Third, the ability to identify the ex-ante end-of-life cancer cohort aids targeted study, formulation of healthcare policy, and prospective outcomes tracking around this at-risk group.

This study has several limitations. First, the model was trained and validated within a single-centre advanced cancer cohort, and external validation will be needed to determine generalizability. Second, because cancer treatment continues to evolve rapidly, temporal validation is needed to determine performance drift over time. Third, algorithmic fairness will need to be ascertained in subsequent work by validating performance within key demographic subgroups (e.g. age groups, ethnicity, and gender) [54]. Fourth, our model was trained on patients with advanced cancer at diagnosis and does not include patients with early-stage cancer at diagnosis who subsequently relapse with metastatic disease. Identification of metastatic relapse is lacking even in established cancer registries such as the Surveillance, Epidemiology, and End Results (SEER) database, and this problem needs to be solved before any model can be used for patients with metastatic relapse [55]. Fifth, the model relies on processed EHR data obtained from institutional data repositories. Future model deployment will require access to these same data repositories and platforms, rather than direct implementation within the operational EHR environment. Lastly, as an AI tool for clinical decision support, performance metrics themselves may not translate into real-world results if clinicians do not act on the predictions or if resource limitations reduce the number of at-risk patients who can receive interventions. With the recent national focus on end-of-life care within population health, we envision that palliative capacity and capabilities will be bolstered to meet the needs of these additionally identified patients [56]. In addition, we are exploring in-silico net-benefit analysis to study the impact of the model on clinical outcomes under simulated scenarios [57].

Conclusions

We have developed a prognostic tree-based model using structured EHR data, with satisfactory discrimination and precision-recall performance. Our model development approach emphasizes problem framing, hand-crafting of features using domain expertise, and interpretable outputs at both the global and individual prediction level. While model performance supports the intended use-case, further external validation is needed to confirm its robustness for real-world implementation. We plan a prospective multi-centre validation study that simulates the envisioned use-case by handling actual weekly volumes of cancer outpatients, allowing us to ascertain the model’s operability and efficiency in a real-world setting. Ultimately, this will enable us to refine and validate an AI solution that systematically identifies, ex ante, cancer patients at risk of mortality, with proactive palliative interventions triggered for these individuals.

Fig. 1
figure 1

a: Framing of the risk prediction problem. b: Sliding window of prediction points along the timeline for a single patient

Fig. 2
figure 2

a. Bar summary of top 10 data features within the model. b. Feature plot summary of top 10 data features within the model

Fig. 3
figure 3

a. Individualized prediction for a true-negative case. b. Individualized prediction for a true-positive case