Code and findings from the feature selection experiments are available in the GitHub repository [25]. The uploaded file in the repository is currently password-protected; the password is ‘colorectalai’.
Data
Records of a prospectively maintained colorectal cancer database by the Colorectal Department in a large NHS Trust were examined. The dataset contains 4336 patients who underwent colorectal cancer surgery between 2003 and 2019. The 47 patient parameters/variables included demographics, peri- and post-operative outcomes, surgical approaches, complications and mortality (see Table 1). Table 2 shows various classes associated with variables in the dataset. The table shows the four different classes of procedure, 17 different classes of specific procedure, two classes of laparoscopic type, surgical approach (open or laparoscopic), robotic (yes/no), operation mode (elective or emergency), and complications (yes/no). Both intra- and post-operative complications are considered together.
Table 1 Available variables in the dataset
Table 2 Classes associated with variables in the dataset
Descriptive statistics of selected variables, summarising the central tendency, dispersion and shape of their distributions (excluding NaN values), are given in Table 3. There were 2494 male and 1942 female patients in the dataset. Among these 4336 cases, 74% (3209) were curative, 13.35% (579) were palliative, and 6.53% (283) were uncertain. 80% (3475) of the surgeries performed were elective, compared with 18% (782) that were emergency. Assistance from the robot was considered for 8.9% (388) of the cases. The laparoscopic approach was applied to 57.45% (2491) of the cases, whereas open surgery was used in 35.79% (1552). The 30-day readmission rate was 7.4%. Moreover, the 30- and 90-day mortality was 3.39% (147) and 5.93% (257), respectively.
Table 3 Statistics of selected variables from the dataset
Data processing
Missing values and mixed data types were handled with standard techniques, guided by medical domain knowledge obtained through discussions with clinicians. Following these discussions, missing values were filled using different techniques (see Table 4). Some columns in the dataset contain mixed data types; for example, the Sex and TumICD10 variables hold both text and numeric data. To make them usable by machine learning algorithms, these mixed-type columns were converted to numeric data using the Pandas Series.str.replace() method [26]. Other columns contain text values only (e.g., Robotic, Radiotherapy). These text-only columns were passed through the LabelEncoder of scikit-learn [27] to convert them to numeric data.
Table 4 Alternative techniques to fill missing values with medical domain knowledge
Prediction model building
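The two conversions described above can be sketched as follows. This is a minimal illustration, not the authors' code: the toy rows and the particular text-to-number mapping for Sex ("M" to 1, "F" to 2) are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; the values and the M/F -> 1/2 mapping are invented for illustration.
df = pd.DataFrame({
    "Sex": ["M", "F", "1", "2", "M"],
    "Robotic": ["Yes", "No", "No", "Yes", "No"],
})

# Mixed text/numeric column: rewrite the text codes as their numeric
# counterparts with Series.str.replace, then cast the column to integers.
sex = df["Sex"].astype(str)
sex = sex.str.replace("M", "1", regex=False).str.replace("F", "2", regex=False)
df["Sex"] = sex.astype(int)

# Text-only column: LabelEncoder assigns integers in sorted label order,
# so here "No" becomes 0 and "Yes" becomes 1.
encoder = LabelEncoder()
df["Robotic"] = encoder.fit_transform(df["Robotic"])
```

Because LabelEncoder orders labels alphabetically, the resulting integers carry no clinical ordering; `encoder.classes_` records which label each integer stands for.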
Model for LOS
Regression predictive modelling was performed to predict the LOS. The data were power-transformed to make them more Gaussian-like [28]. The data were then discretized to map numerical variables onto discrete values. Such mapping creates a high-order ranking of values that can smooth out the relationships between observations and has been found useful for machine learning [29]. A 10-fold cross-validation technique [30, 31] was used for splitting the training and test data. Different algorithms were compared to find the optimal model for LOS prediction, with negative mean absolute error as the evaluation metric. The comparison showed that support vector regression (SVR) outperformed the other algorithms (see Fig. 1). Following this finding, different parameters of the SVR algorithm (see Table 5) were tuned using the GridSearchCV technique [32]. The model was trained with the training dataset and tested on the test dataset. Finally, different evaluation metrics, namely root-mean-square error (RMSE), mean absolute error (MAE), and accuracy, were used to evaluate the model. Data analysis of different variables in predicting LOS was also conducted.
Table 5 Tuning the parameters of the SVR algorithm
Model for readmission
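A minimal sketch of such a pipeline is given below, under several assumptions that the text does not settle: the data are synthetic, the transforms are applied to the features only, the discretizer uses 10 uniform-width ordinal bins, and the parameter grid is a small hypothetical one rather than the grid in Table 5.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer
from sklearn.svm import SVR

# Synthetic, skewed features and a synthetic "LOS" target (illustration only).
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 4))
y = 3.0 + 2.0 * np.log(X[:, 0]) + rng.normal(scale=0.5, size=200)

pipeline = Pipeline([
    # Power transform (Yeo-Johnson by default) to make features more Gaussian-like.
    ("power", PowerTransformer()),
    # Map each numeric feature onto ordinal bin indices (bin count is an assumption).
    ("bins", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")),
    ("svr", SVR()),
])

# Hypothetical grid; the values actually tuned are those listed in Table 5.
param_grid = {"svr__C": [0.1, 1.0, 10.0], "svr__kernel": ["rbf", "linear"]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",  # higher (closer to zero) is better
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

# scikit-learn reports the negated MAE; flip the sign to recover the MAE.
best_mae = -search.best_score_
```

Placing the transforms inside the Pipeline means they are refitted on the training part of each fold, so the held-out fold does not leak into the preprocessing.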
Classification predictive modelling was performed to predict readmission. Models with Random Forest (RF), K Nearest Neighbor (KNN), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Bidirectional Long Short-Term Memory (BI-LSTM) algorithms were compared for readmission prediction. For the BI-LSTM algorithm, the data were reshaped following the work of Masum et al. [33, 34], as the LSTM-based RNN requires its input to be a matrix with the dimensions [samples, time steps, features]. The BI-LSTM network consists of three hidden layers with 100 LSTM units, followed by an output layer with sigmoid activation. The network used binary cross-entropy as the loss function, the Adam algorithm [35] as the optimizer, and accuracy as the metric. The network was fitted with 20 epochs and a batch size of 2. 80% of the dataset was used for training and 20% for testing.
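The reshaping and network structure described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes one time step per patient, 100 units in each of the three bidirectional layers, synthetic data, and the Keras API bundled with TensorFlow. The helper names `to_lstm_input` and `build_bilstm` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def to_lstm_input(X, time_steps=1):
    """Reshape a 2-D table into [samples, time steps, features]."""
    X = np.asarray(X, dtype="float32")
    n_samples, n_columns = X.shape
    if n_columns % time_steps != 0:
        raise ValueError("column count must be divisible by time_steps")
    return X.reshape(n_samples, time_steps, n_columns // time_steps)


def build_bilstm(n_features, time_steps=1, units=100):
    """Three bidirectional LSTM layers followed by a sigmoid output unit."""
    # Imported lazily so the reshaping helper works without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(time_steps, n_features)),
        layers.Bidirectional(layers.LSTM(units, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(units, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(units)),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


# Synthetic stand-in for nine selected features and a binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 9))
y = rng.integers(0, 2, size=50)

# 80/20 split, then reshape each part for the LSTM.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train_3d = to_lstm_input(X_train)
X_test_3d = to_lstm_input(X_test)
```

With TensorFlow available, training under the stated settings would look like `build_bilstm(9).fit(X_train_3d, y_train, epochs=20, batch_size=2)`.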
Model for mortality
Classification predictive modelling was performed to predict mortality. Models with Random Forest (RF), K Nearest Neighbor (KNN), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Bidirectional Long Short-Term Memory (BI-LSTM) algorithms were compared for mortality prediction and for 31- and 91-day mortality prediction. The model structure for readmission prediction described in Sect. 2.2.2 was used for all mortality prediction scenarios.
Comparing variables
The variable that needs to be predicted is known as the target variable, and the variables used to predict it are known as features. Identifying the best features is an important task [36]. A large number of features can lead to a complex model, long training time, the curse of dimensionality, added noise and overfitting. On the other hand, too few variables can lead to the exclusion of relevant ones. ExtraTreeRegressor [37], ExtraTreesClassifier [37], LassoCV [38] and correlation matrix analysis with a heat map, using scikit-learn [27], were considered for feature selection. Moreover, in all prediction cases, 80% of the dataset was used for training and 20% for testing.
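As a rough illustration of how the three techniques rank features, the sketch below applies them to synthetic data in which only two of five features drive the target. The data, the column names `f0` to `f4` and the choice of ExtraTreesRegressor from `sklearn.ensemble` are assumptions made for the example; the heat map itself is only a plot of the correlation values and is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LassoCV

# Synthetic data: only f0 and f1 influence the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=["f0", "f1", "f2", "f3", "f4"])
y = 4.0 * X["f0"] - 3.0 * X["f1"] + rng.normal(scale=0.1, size=300)

# 1) Tree-based importances: impurity reduction averaged over the ensemble.
trees = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)
tree_rank = pd.Series(trees.feature_importances_, index=X.columns)
tree_rank = tree_rank.sort_values(ascending=False)

# 2) LASSO with a cross-validated penalty: rank by absolute coefficient;
#    coefficients shrunk to zero mark features the model discards.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_rank = pd.Series(np.abs(lasso.coef_), index=X.columns)
lasso_rank = lasso_rank.sort_values(ascending=False)

# 3) Correlation with the target: rank by absolute Pearson correlation.
corr_rank = X.corrwith(y).abs().sort_values(ascending=False)
```

The three rankings answer slightly different questions (non-linear importance, sparse linear effect, marginal linear association), which is one reason they can disagree on real data.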
Feature selection for LOS
Extra Tree Regressor showed that Age, BMI, Surgical approach, Operation time, ASA, Blood loss, Preoperative T stage, Stoma formation, Sex and Preoperative nodal stage were the most crucial features in predicting LOS (see GitHub repository [25]). In contrast, the LASSO algorithm showed that Surgical approach, Sex, Chemotherapy, ASA, Operation mode, Stoma formation, TumICD10, Procedure type, Additional procedures, Radiotherapy, Preoperative T stage, Age, and Cancer site were the most important features (see GitHub repository [25]). Moreover, the correlation matrix with heat map identified Surgical approach, ASA, Age, Operation mode, Complication, Stoma formation, Chemotherapy, TumICD10, and Preoperative T stage as essential features (see GitHub repository [25]). Following the findings from these techniques, we considered Age, ASA, Surgical approach, Stoma formation, Preoperative T stage, Chemotherapy, Operation mode, TumICD10, Cancer site, and Radiotherapy as the selected features for predicting LOS.
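One way to see how the three techniques overlap for LOS is to tally how many of them name each feature. The sketch below does this for the lists reported above; it only counts agreement and is not the rule the authors used to arrive at the final feature set, which the text does not state. The tumour code variable is spelled TumICD10 here, as in the data processing section.

```python
from collections import Counter

# Features named by each technique for LOS, as reported in the text.
extra_trees = {
    "Age", "BMI", "Surgical approach", "Operation time", "ASA", "Blood loss",
    "Preoperative T stage", "Stoma formation", "Sex", "Preoperative nodal stage",
}
lasso = {
    "Surgical approach", "Sex", "Chemotherapy", "ASA", "Operation mode",
    "Stoma formation", "TumICD10", "Procedure type", "Additional procedures",
    "Radiotherapy", "Preoperative T stage", "Age", "Cancer site",
}
correlation = {
    "Surgical approach", "ASA", "Age", "Operation mode", "Complication",
    "Stoma formation", "Chemotherapy", "TumICD10", "Preoperative T stage",
}

# Count, for every feature, how many of the three techniques selected it.
votes = Counter()
for selected in (extra_trees, lasso, correlation):
    votes.update(selected)

in_all_three = sorted(f for f, n in votes.items() if n == 3)
in_two = sorted(f for f, n in votes.items() if n == 2)
```

Here `in_all_three` contains ASA, Age, Preoperative T stage, Stoma formation and Surgical approach, all of which appear in the final selection. `in_two` contains Chemotherapy, Operation mode, Sex and TumICD10; of these, Sex was not retained, while Cancer site and Radiotherapy, named by LASSO alone, were.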
Feature selection for readmission
Extra Tree Classifier showed that Surgical approach, Operation time, LOS, BMI, Age, ASA, Blood loss, Preoperative T stage, and Stoma formation were the most crucial features in predicting readmission (see GitHub repository [25]). In contrast, the LASSO algorithm showed that Surgical approach, Operation mode, Previous surgery type, Stoma formation, Preoperative nodal stage, the Specific procedure, ASA, BMI, Age, Sex, LOS and Cancer site were the most important features (see GitHub repository [25]). Moreover, the correlation matrix with heat map identified Surgical approach, Age, Preoperative nodal stage, Pre abdominal surgery, Previous surgery type, Operation mode, Complication, Additional procedures, Robotic, Curative Surgery, Operation time and LOS as essential features (see GitHub repository [25]). Following the findings from these techniques, we considered Age, Surgical approach, Stoma formation, Preoperative nodal stage, Operation time, Operation mode, Previous surgery type, LOS, and the Specific procedure as the selected features for predicting readmission.
Feature selection for mortality
Extra Tree Classifier showed that Age, LOS, Curative Surgery, BMI, ASA, Operation time, Surgical approach, Chemotherapy, Radiotherapy, and Operation mode were the most critical features in predicting mortality (see GitHub repository [25]). The LASSO algorithm showed that Curative Surgery, Chemotherapy, Preoperative T stage, Operation mode, ASA, Preoperative M stage, Stoma formation, Age, the Specific procedure, LOS, OPCS4, BMI, Cancer site, Previous surgery type, and Surgical approach were the most important features (see GitHub repository [25]). Moreover, the correlation matrix with heat map identified ASA, Age, BMI, Preoperative T stage, Preoperative M stage, Operation mode, Procedure type, Complication, Stoma formation, Radiotherapy, Chemotherapy, Surgical approach, Curative Surgery and LOS as essential features (see GitHub repository [25]). Following the findings from these techniques, we considered Age, ASA, BMI, Chemotherapy, Preoperative M stage, Surgical approach, LOS, Curative Surgery, Preoperative T stage and Stoma formation as the selected features for predicting mortality. For 31-day mortality prediction, we selected Chemotherapy, Additional procedures, Operation mode, Complication, Previous abdominal surgery, Previous surgery type, Surgical approach, Curative Surgery, Age and Resection. In contrast, for 91-day mortality prediction, we selected LOS, Additional procedures, Curative Surgery, Complication, Previous surgery type, Pre abdominal surgery, Procedure type, Operation mode, Surgical approach and Resection.
Model for comparing variables in LOS, readmission and mortality prediction
The SVR model described in Sect. 2.2.1 was used to compare variables in LOS prediction. To compare variables in readmission prediction, the BI-LSTM model described in Sect. 2.2.2 was used. Likewise, the BI-LSTM model described in Sect. 2.2.3 was used to compare variables in all mortality prediction scenarios.