Background

Leukemia is believed to be one of the most common and deadly known malignancies worldwide [1, 2]. It accounts for 4% of all malignancies and 4% of fatality rates emanating from malignancies [3]. Chronic myeloid leukemia (CML) is one of the most well-known forms of leukemia, accounting for 15 to 20% of all cases of leukemia. CML is a clonal myeloproliferative disorder (CMD) arising from acquired genetic alterations on the hematopoietic stem cells [4,5,6]. The chromosome abnormality known as the Philadelphia chromosome (Ph+) results from the fusion of the Abelson (Abl) tyrosine kinase gene at chromosome 9 and the breakpoint cluster (Bcr) gene at chromosome 22 (BCR-ABL fusion) [7, 8].

Today, there have been considerable advancements in leukemia management. Nonetheless, drug resistance, disease recurrence following treatment, cancer progression to advanced stages, disease prognosis, and survival prediction are of great significance [9]. Early detection of CML cases and active patient triaging help them evade the advanced stages of the disease and increase their survival chances [10, 11]. This requirement is more demanding since numerous clinical and non-clinical factors are involved in CML development [6, 10]. The decision about the best treatment path for an individual CML patient based on their specific clinical and demographic attributes and in the context of very effective treatment options is more multifaceted and often based on subjective evaluation. Furthermore, the existence of numerous disease severity levels and some uncertainty and ambiguities about the disease behavior and outcome further complicate the situation [12,13,14,15].

Conventional statistical methods provide forecasts without illuminating the meaning of the prediction or the associations amid numerous features that might influence patients’ survival. However, artificial intelligence (AI) technologies such as machine learning (ML) offers in-depth, effective, and non-invasive analytical capabilities over traditional statistical and experimental prediction methods in dealing with complex and ambiguous situations such as cancer outcome and survival prediction [16,17,18,19,20,21,22,23,24]. ML extracts comprehensible patterns and applied knowledge from large-scale raw datasets and thereby supports clinical decisions [25, 26].

So far, many studies have compared ML techniques for designing optimal and efficient clinical decision support systems (CDSSs) for the survival prognosis of patients with leukemia. Although a great number of studies have focused on acute myeloid leukemia (AML) [14, 25,26,27,28,29,30], CML [10, 31] has received little attention. Sasaki et al., suggested that identifying the most optimum and effective ML classifiers is necessary for improving therapeutic outcomes and increasing CML patients' life expectancy and survival [15].

Given the high incidence of CML in Iran and the lack of a reliable study to determine predictors of cancer survival using ML algorithms, the present study aimed to initially identify the most effective variables and feed them as data input into different ML techniques for a 5-year CML survival prognosis to assess their predictive power.

Methods

Study design

This was a retrospective and developmental study conducted in 2022 to predict 5-year survival in CML patients based on selected data-driven ML techniques.

Study setting

Our study was conducted in five main steps: data understanding, data preprocessing, feature selection, modeling, and evaluation. First, we aimed to recognize the most related variables to the 5-year CML survival prognosis and then use them as inputs for developing ML-based prediction models. To this end, we chose the most popular data mining method called cross-industry standard process (CRISP) to predict and diagnose CML. Figure 1 presents the proposed model of study steps based on the CRISP model. STATA and Python were used to provide descriptive statistics and data analysis. The scikit-learn library in Python was also used to implement the models.

Fig. 1
figure 1

The roadmap of the proposed system based on the CRISP methodology. SSVM support vector machine, RBF radial basic function, DT decision tree, KNN k-nearest neighborhood, XG Boost eXtreme gradient boosting, AUC area under the curve

Data understanding

The data used in this study were obtained from a database at the Abadan University of Medical Sciences collected from April 2016 to December 2018. The primary dataset contained the information of 1218 patients with CML. The patients would be included in the study only if they met all the following criteria: (1) The patients were diagnosed with CML; (2) their survival status (alive/deceased) was available in their records; (3) in terms of the time frame, we considered patients diagnosed between 2011 and 2016 to have adequate follow-up period (5 years or more) after the diagnosis; (4) the age of more than or equal to 18 years; patients aged under 18 years old should be included in the scope of pediatric exploration [32]; (5) records with missing values of less than 30%. Accordingly, from 1218 patient records, 173 records for patients who were aged < 18 years old were excluded. In the preprocessing phase, 208 incomplete rows of data (with missing data of greater than 70%) were removed. After these criteria were applied, a total of 837 patient records were deemed suitable for inclusion (740: survived within 5 years and 97: deceased within 5 years).

Study variables

Several variables were collected for CML patients in the EMR database. We checked the definition of the variables included in the data dictionary section of the database to completely understand the definitions of the data and the choice of proper variables. The criteria for selecting the candidate variables related to CML for survival prediction were based on consultations with expert oncologists and studying the relevant literature. Survival at 5 years and more was selected as the outcome variable. Survival is a continuous variable with units in months. Thus, we created a binary variable where any patient with a survival of 60 months or more was coded “yes”, or “no” otherwise. The following covariates were extracted based on the literature review coupled with experts’ opinions from the EMR database. Finally, a total of forty-five independent variables were utilized to predict the 5-years survival of patients with CML (dependent variable). These variables were categorized as demographic (two variables), history (seven variables), clinical manifestations (22 variables), and laboratory (14 variables) (see Table 1). After reviewing the patients’ variables, statistical analysis was performed to describe the differences in their features with the target variable (deceased within 5 years or survived). To this end, differences in demographic, history, clinical manifestations, and laboratory information of patients were described based on whether the patients with CML were deceased within 5 years or not, and the relationship of each feature with survival status was checked by the chi-square test.

Table 1 Baseline predictor variables

Pre-processing step

Data preprocessing is an essential step in the CRISP methodology to obtain an optimal, accurate, and beneficial dataset for further ML algorithms. In this study, many pre-processing methods on the dataset were applied before the training of the ML algorithms. Removal of missing values, standard scalar, Min–Max scalar, data validation, under-sampling, and splitting the dataset were examined to obtain an optimal dataset.

Feature selection

Data mining algorithms usually have difficulty dealing with a large number of input variables, which poses a serious challenge to researchers. These irrelevant variables diminish the performance of many ML algorithms. Thus, the selection of important variables is a major step during data mining [33, 34]. Comparison of data mining results before and after feature selection in many studies has shown an improvement in performance criteria through feature selection. Similarly, in the present study, this approach (before and after) was adopted [35, 36]. Here, the minimal redundancy maximal relevance (mRMR) feature selection algorithm was employed. This technique uses a heuristic method to select the most relevant variables. The heuristic search that is utilized in the mRMR technique chooses optimum variables that have maximum relevance and minimum redundancy [37].

Classification algorithms

To predict the survival chance of CML patients, eight ML techniques, including XGBoost, k-nearest neighborhood (KNN), pattern recognition network, probabilistic neural network, multilayer perceptron (MLP), support vector machine (SVM) (kernel = linear), SVM (kernel = RBF), and J-48 were employed. Although there are many supervised ML techniques, these particular models were chosen because they represent a range of modern and common methods in cancer research.

Performance evaluation of classification algorithms

The k-fold cross-validation method was utilized to assess the performance of the examined data mining models and to compare the results of the classification models. Cross-validation is a resampling technique applied for the evaluation of data mining techniques in an unseen data sample. In this method, the ML models are trained and tested k times Additionally, to compare the performance of classification models, the mean of evaluation metrics such as accuracy, specificity, sensitivity, kappa, and area under the curve (AUC) were used (Eqs. 15).

$${\text{classification accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\mathrm{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}{*}100$$
(1)
$${\text{classification sensitivity}} = \frac{{{\text{Tp}}}}{{{\mathrm{TP}} + {\text{FN}}}}{*}100$$
(2)
$${\text{classification specificity}} = \frac{{{\text{TN}}}}{{{\mathrm{TN}} + {\text{FP}}}}{*}100$$
(3)
$${\text{classification error}} = \frac{{{\text{FP}} + {\text{FN}}}}{{{\mathrm{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}{*}100$$
(4)
$${\text{f-measure}} = 2{ }\frac{{\text{precision*sensitivity}}}{{{\mathrm{precision}} + {\text{sensitivity}}}}$$
(5)

Ethical considerations

This study was approved by the Ethical Committee Board, Abadan University of Medical Sciences (code: IR.ABADANUMS.REC.1401.042). To protect the privacy of the patients and the confidentiality of the data, we concealed the unique identifying information of all the patients in the data collection process.

Results

Patient characteristics

Overall, 837 patients with CML met the predefined inclusion criteria. Of 837 eligible patients in our study, 545 (65.11%) were men and 292 (34.89%) were women, and the median age of the participants was 57.25 years (interquartile 18–100). Of these, 740 (88.41%) cases survived and 97 (11.58%) died. Table 1 provides a detailed description of all the variables.

The mean age of CML patients who died within 5 years was 60 ± 2 years old, and the mean age of the patients who survived within 5 years was 57 ± 1 years old (p value = 0.081). Table 1 shows that there was a significant association between some variables of patients who survived within 5 years or not: feature with p value < 0.005 which has a significant difference in patients who survived within 5 years or not. For example, the results showed that there was a significant relationship between elevated blast cell proportion and increased basophil count with the survival status of the patient with CML (p value = 0.042 and p value = 0.018, respectively) (see Table 1).

Selection of patient features

A set of 12 most important features to predict the 5-year survival of CML patients were selected based on a heuristic method. The selected features and their scores are ranked and represented in Fig. 2.

Fig. 2
figure 2

The most important variables selected by the mRMR

According to Fig. 2, spleen palpable, age, unexplained hemorrhage, sex, increased leucocyte count, elevated blast cell proportion, resistance to tyrosine kinase inhibitors, decreased platelet counts, anemia, malaise, and night sweats obtained the highest score for prediction of the 5-year CML survival.

Results of hyperparameter tuning

To use ML classifiers more accurately and sensitively, the RandomizedSearchCV method was used for parameter tuning and optimization models, including XGBoost, MLP, KNN, probabilistic neural network, pattern recognition network, SVM (kernel = linear), SVM (kernel = RBF), and J-48 decision tree classifiers (see Table 2).

Table 2 Best hyperparameters of all the trained algorithms

Results of k-fold cross-validation for the performance of classification algorithms

The tenfold cross-validation method splits the dataset into 10 parts and performs the holdout method 10 times. The algorithms were run for both the full and the selected features of the dataset. Accordingly, the dataset was randomly divided into training (70%) and test parts (30%) for all algorithms. First, the algorithms were trained using the training section and then validated using the test section to determine predictions. The dataset for both full and selected features was examined using eight classification techniques. Firstly, we trained and tested the ML techniques of the full-features dataset, and the second time, we fed the selected features to the ML classifiers. To compare the performance of classification techniques with a 95% confidence interval (CI), the average assessment metrics were obtained. Table 3 presents the results of eight classification algorithms based on the selected features to predict 5-year CML survivability.

Table 3 Performance evaluation of the selected ML algorithms

Table 3 represents the performance of eight ML techniques on the selected feature in 10 independent run times. According to the results, the performance of ML models on the selected variables dataset was higher than that of the full-features dataset. When both the selected and full-features datasets were separately fed into the MLP model, the MLP classifier obtained a mean accuracy of 77%, sensitivity of 72%, specificity of 76%, F-measure of 76%, and kappa statistic of 76.2% on the selected features. It also obtained a mean accuracy of 69%, sensitivity of 71%, specificity of 68%, F-measure of 70%, kappa statistic of 72.2%, and AUC of 70% on the full-features dataset.

The J-48 model was applied to both the selected-features and the full-features data set. As shown in Table 3, by feeding the selected features into the J-48 algorithm, an average accuracy of 83%, a sensitivity of 83%, specificity of 81%, F-measure of 77%, kappa statistic of 83.2%, and AUC average of 83% were obtained. The results of J-48 on the full-features dataset were obtained as follows: an average accuracy of 69%, a sensitivity of 69%, specificity of 68%, F-measure of 69%, kappa statistic of 61%, and AUC average of 69%.

As shown in Table 3, when the selected dataset was used, the XGBoost classifier obtained an average accuracy of 79%, a sensitivity of 78%, specificity of 76%, F-measure of 77%, a kappa statistic of 78.2%, and an AUC average of 76%. The results of the XGBoost classifier on the full-features dataset indicated an average accuracy of 69.3%, a sensitivity of 70.6%, specificity of 68.4%, F-measure of 69.2%, a kappa statistic of 62.1%, and an AUC average of 69.5%.

By using the selected-features dataset, the SVM (kernel = RBF) represents good results with an average accuracy of 85.7%, specificity of 85%, sensitivity of 86%, F-measure of 87%, kappa statistic of 86.1%, and AUC of 85%. Application of the SVM model to the full-features dataset also yielded an average accuracy of 69.7%, specificity of 69.1%, sensitivity of 71.3%, F-measure of 72%, kappa statistic of 75.2%, and AUC of 70.1%. The results of all experimented ML algorithms on the selected-features and full-features datasets are depicted in Fig. 3.

Fig. 3
figure 3

Comparison of ML models' performance on A full and B selected features

Based on Fig. 3a, b, the results obtained for SVM with RBF kernel on the selected features outperformed the other seven ML techniques, yielding 85.7% for average accuracy, 85% for specificity, 86% for sensitivity, 87% for F-measure metrics, 86.1% for mean kappa statistic, and 85% for mean AUC metrics. Based on the full features dataset, the decision tree algorithm obtained the highest performance for all the assessment criteria. The second-highest performance on the selected-features and full-features datasets belonged to J-48 and SVM with RBF kernel for the prediction of the 5-year CML survival. Finally, the worst result was obtained by MLP with a mean accuracy of 77%, a mean sensitivity of 72%, a mean specificity of 76%, a mean F-measure of 76%, a mean kappa statistic of 76.2%, and AUC of 76%. The AUC curve for the top eight ML algorithms and the classification report of the SVM with RBF kernel is depicted in Figs. 4 and 5, respectively.

Fig. 4
figure 4

The ROC curve for the top four ML algorithms

Fig. 5
figure 5

Classification report for SVM with RBF kernel

Discussion

This study retrospectively analyzed the data of 837 CML patients to develop an intelligent model for predicting the 5-year survival of CML patients. First, the most important variables affecting CML survival were identified using the mRMR feature selection technique. Then, the dataset with full and selected variables was fed into the ML models separately. Finally, the models’ performance was evaluated and compared based on the confusion matrix criteria. Initially, feature selection analysis was performed to select the most important variables. Among a total of 45 primary variables, a set of 12 variables (about 27%) including age, sex, spleen palpable, unexplained hemorrhage, increased leucocyte count, elevated blast cell proportion, resistance to a tyrosine kinase inhibitor, decreased platelet counts, anemia, malaise, and night sweats were selected as the most important predictors affecting CML. These variables are largely consistent with those in reviewed studies. The results of comparing the eight selected ML algorithms after feature selection showed that the SVM RBF achieved the highest performance in the 5-year survival prognosis of CML patients with an accuracy of 85.7%, specificity of 85%, sensitivity of 86%, F-measure of 87%, AUC of 0.85, and kappa statistic of 86.1%.

Feature selection is a key prerequisite for data mining, which reduces unnecessary data and improves the speed and efficiency of data mining [38]. Previous studies showed that numerous clinical and non-clinical predictors influence CML survival. In reviewed studies, after performing feature selection, a number of demographical and clinical manifestation variables such as age [27, 31, 39, 40], sex [14, 27, 40], body mass index (BMI) [10, 27, 30, 31], race [27, 40], body pain [14, 25, 27, 28, 40, 41], general malaise [10, 14, 27, 31, 40], fever [10, 26, 28,29,30], night sweats [25, 28,29,30, 41], unexplained hemorrhage [10, 26, 28, 30, 39], general infection [14, 31, 40], enlarged spleen [10, 14, 27, 31, 39, 40], cachexia [25, 27, 30, 31], anorexia [10, 14, 29, 31], and drug resistance [25, 27,28,29, 41] are determined as the most important predictors affecting CML survival. Besides, neutrophil/lymphocyte count [10, 14, 25,26,27, 29, 31, 39], lactate dehydrogenase (LDH) [14, 25,26,27, 29], complete blood count (CBC) [14, 25, 26, 28, 29], platelet count [10, 25, 26, 40, 41], peripheral blast count (PBC) [14, 25, 27, 29, 41], and red blood cell (RBC) count [14, 25,26,27,28, 39, 40] are considered as the paramount laboratory and evaluative variables of CML.

The results of our study are consistent with the studies by Eckardt et al.[28], Coombes et al. [41], and Orgueira et al. [14] In these studies, after implementing various ML models to predict the survival of patients with leukemia, the SVM classifier yielded the best performance. In a study by Karami et al., the SVM had better performance with 85.17% accuracy and 0.93% AUC for the survival prognosis of AML patients [30]. Taiwo et al. evaluated the performance of four ML algorithms to predict the survival of patients with CML. Their results showed that the SVM algorithm would present a better performance with 99.82% classification accuracy [10]. Furthermore, Chen et al. compared the performance of three ML methods for the survival prognosis of chronic lymphocytic leukemia (CLL) patients. Finally, the SVM model with 90% AUC exhibited the best performance [39].

Contrary to the results of the present study, in some studies such as Das et al. [27] and Hauser et al. [31], ensemble models performed better in predicting the survival of CML patients. Das et al. found that gradient boosting with an AUC of 0.77 could best help survival prognosis amongst the selected methods [27]. Hauser et al. also showed that ML technologies, in particular XGBoost and LASSO models, would help with active patient survival prognosis and prompt identification of high-risk CML cases for treatment improvement and care planning purposes (AUC range: 0.87–0.96) [31]. Hu et al. (2021) revealed that ML algorithms offered an effective predictive model for timely, effective, and economical identification and prognosis of leukemia and its survivability [40].

Limitations and implications

The proposed model is likely to accurately predict the 5-year survival of patients with CML. Hence, this can make the designed model applicable to real clinical settings. However, the present study faced some potential limitations. First, as we used a retrospective dataset, there were certain missing and noisy fields (e.g., incoherent, incomplete, abnormal, meaningless, and erroneous) that could have impacted the modeling process. Therefore, to deal with noisy fields, the normal range of each variable was defined using the opinion of two oncologists. Then, we specified all the values that fell outside the defined range (noisy fields) and completed them by referring to patient records or the responsible physician. In addition, the records with more than 70% of empty fields were removed and imputed by mean or mode values for continuous and discrete variables, respectively. Second, we dealt with a single-center dataset with a limited sample size that undoubtedly affected the quality of modeling, comprehensiveness, and generalizability of data. Third, the dataset did not collect data on economic status, lifestyle habits, molecular biology, genomic, proteomic, or metabolomic factors that may affect the survival of CML patients. The inclusion of these factors may increase the predictive power of the models. Furthermore, the dynamic nature of some variables would double the need to use systematic follow-up programs for a more comprehensive picture of the patient. Therefore, it is recommended that more studies be conducted after more accurate validations to improve the quality of modeling and minimize prognosis bias.

Conclusions

ML techniques as new, innovative, and non-invasive methods for the 5-year survival prognosis of CML patients will improve healthcare quality services, offer customized treatment, and reduce the serious complications and deaths associated with the disease. Therefore, we implemented and compared the performance of eight ML-based models for 5-year survival prediction of CML patients. After identifying the most important predictor variables (12 variables) and implementing the classification models, the SVM (RBF kernel) algorithm with an AUC of 0.856 presented the best performance. The algorithm proposed for effective identification of high-risk patients and predicting disease behavior and complications will effectively help medical experts to maintain treatment cost-effectiveness, prioritize resources, and improve safety and care quality. At the same time, it will improve patients' life expectancy. In future studies, our proposed models are expected to be customized to other malignancies and clinical areas. It is also recommended that more ML and even deep learning (DL) techniques be adopted for a profound and more reflective analysis, system user interface implementation, and system external validation in the real clinical environment. We expect our model to be further validated and probably re-optimized based on mixed datasets from multiple settings. While external validation is required, our developed model provides a basis to develop intelligent systems for CML disease.