Machine Learning Algorithms for Predicting Surgical Outcomes after Colorectal Surgery: A Systematic Review

Background Machine learning (ML) has been introduced in various fields of healthcare. In colorectal surgery, the role of ML has yet to be reported. In this systematic review, an overview of machine learning models predicting surgical outcomes after colorectal surgery is provided. Methods Databases PubMed, EMBASE, Cochrane, and Web of Science were searched for studies using machine learning models for patients undergoing colorectal surgery. To be eligible for inclusion, studies needed to apply machine learning models for patients undergoing colorectal surgery. Studies that did not involve machine learning or colorectal surgery, and studies reporting on reviews, children, or study abstracts, were excluded. The PROBAST risk of bias tool was used to evaluate the methodological quality of machine learning models. Results A total of 1821 studies were analysed, resulting in the inclusion of 31 articles. A vast proportion of ML algorithms have been used to predict the course of disease and response to neoadjuvant chemoradiotherapy. Radiomics have been applied most frequently, with predictive accuracies up to 91%. However, most studies had a retrospective design without external validation or calibration. Conclusions Machine learning models have shown promising potential in predicting surgical outcomes after colorectal surgery. However, large-scale data are warranted to address the lack of calibration and external validation. Clinical implementation is needed to demonstrate the contribution of ML within daily practice. Supplementary Information The online version contains supplementary material available at 10.1007/s00268-022-06728-1.


Introduction
Colorectal cancer is estimated to account for approximately 2 million new cases and 1 million deaths per year [1]. Appendicitis accounted for approximately 18 million cases in the last few years [2]. Performing colorectal surgical procedures comes with several risks, such as postoperative bleeding, anastomotic leakage, or fistulas [3]. These complications could become a burden for surgeons because they lead to readmissions of patients and require revision surgery. Additionally, in patients with colorectal cancer, tumor recurrence or metastasis is commonly discovered, causing a decrease in survival for these patients [4]. Although chemotherapy has already demonstrated improvements in survival for colorectal cancer patients, it is still difficult to predict which patients will completely respond to chemotherapy [5]. Therefore, risk stratification of patients with colorectal cancer remains challenging. Artificial Intelligence (AI) could support surgeons with this risk stratification by predicting postoperative complications, response to chemotherapy, and overall survival of colorectal cancer patients.
Recently, machine learning (ML), an essential branch of AI, has been used for several complex tasks within healthcare. Examples of these tasks are the detection of tumors on radiologic images and prediction of biomarkers [6]. Owing to their ability to train on large datasets and recognize patterns within data, machine learning algorithms are able to improve the accuracy of their prediction models [7]. Based on this capacity, machine learning models could be used to predict surgical outcomes prior to colorectal surgery [8]. By assessing several surgical outcomes with AI, surgeons could preoperatively decide the most efficient clinical pathway for patients undergoing colorectal surgery [9]. Currently, several machine learning algorithms are available to make these predictions; an overview of algorithms is presented in Table 1.
Although machine learning algorithms have shown major potential to improve surgical outcomes, the current status and quality of machine learning models within colorectal surgery have not been evaluated in recent literature. However, it is essential to bridge this gap in order to understand the extent of predicted surgical outcomes, generalizability, and validity of current machine learning algorithms applied in colorectal surgery. Therefore, this systematic review aims to provide a comprehensive overview of machine learning algorithms that have been used to predict any surgical outcome after general colorectal surgery. This review also evaluates the area under the curve and/or accuracy of included machine learning models.

Materials and methods
Literature was retrieved and systematically reviewed in accordance with the Cochrane Handbook for Systematic Reviews of Interventions version 6.0 and PRISMA guidelines.

Literature search strategy
A systematic search was performed in the databases PubMed, Embase.com, Clarivate Analytics/Web of Science Core Collection, and the Wiley/Cochrane Library. The search covered the period from inception to the 7th of July 2021 and was conducted by G.L.B. and M.B. The search included keywords and free text terms for (synonyms of) 'machine learning' combined with (synonyms of) 'digestive system surgical procedures'. This search strategy was peer-reviewed by an information specialist (G.L.B.), using the PRESS checklist. A full overview of the search terms per database can be found in the supplementary information (see Appendix 1 as ESM). No limitations on date were applied in the search. Studies reporting on conference proceedings, book chapters, editorials, errata, letters, notes, surveys, or tombstones were excluded from the search.

Table 1 Overview of machine learning algorithms
Machine learning: Algorithms that are able to improve the prediction accuracies by training on large data [10]
Decision tree: A model that consists of nodes and branches, representing variables and related outcomes. Various combinations of outcomes give several predictions. The end model will be the smallest tree that fits the data best [11]
Gradient boosting (GBM): Builds models that focus on inaccuracies of preceding models and improves these parts until the most accurate model is formed [12]
Random forest: Combines multiple decision trees to build the final accurate prediction model [13]
Support vector machine (SVM): Finds the optimal border in the dataset to classify outcomes in two groups [14]
Artificial neural networks (ANNs): Trains by using various processing layers to automatically find relevant features for the prediction. Additionally, weights of the extracted features are adjusted to form the most accurate model [15]
Convolutional neural networks (CNNs): Similar to ANNs, except these models use filters instead of weights for extracted features [16]
Deep learning: Deep learning algorithms function similarly to neural networks; however, deep learning models have more layers or depth than neural networks [17]

Eligibility criteria
Studies were only eligible if they met the following criteria: (i) described machine learning methods, (ii) involved patients undergoing any type of colorectal surgery, (iii) reported predictive performance of the machine learning model, and (iv) were clinical studies. Regression models could be regarded as machine learning. Nonetheless, regression models have existed in healthcare for many years. As this review addresses new machine learning models only, regression models were excluded. In addition, appendectomy procedures were considered colorectal surgery. Studies were excluded if they (i) were not written in English or (ii) reported on reviews, editorials, letters, or study abstracts. No specific study design or setting was preferred in the inclusion criteria.

Study selection
Two reviewers (M.B. & J.C.P.) independently performed the title and abstract screening in conformity with the inclusion and exclusion criteria. Eligible articles were read in full text, and duplicate studies were eliminated. The full-text screening of the retrieved articles was performed by the same two reviewers (M.B. & J.C.P.) to ensure that they complied with the inclusion criteria. Disagreements were resolved by discussion between the two reviewers, resulting in consensus.

Risk of bias assessment
The PROBAST risk of bias tool was independently applied to each study by two reviewers (M.B. & J.C.P.) to assess the methodological quality of included machine learning models [19]. This tool evaluates the overall risk of bias based on four bias domains: participant selection, predictors, outcomes, and analysis.

Data collection process
A table was formed for the extraction of all data. All data aspects were independently extracted and double-checked by two of the authors (M.B. & J.C.P.). Conflicts were resolved by consensus between the two authors. No additional processes were required for this data.

Data items
An inventory of data items was formed according to the Cochrane guidance for data collection, and the CHARMS checklist [20]. The following information was extracted from each study: first author, publication year, country of research, number of patients, mean age, study design, surgical procedure, intervention, surgical outcome, internal validation method, external validation, predictive performance (discrimination, and calibration). For studies involving multiple machine learning models, predictive performance of each model was described separately.

Data synthesis
A descriptive summary was used to represent the type of machine learning models, predicted surgical outcomes, risk of bias assessment, and model validation. To illustrate the predictive performance of machine learning models, results of machine learning studies were reported for each predicted outcome. To represent the discriminative ability, the range of mean accuracy (ACC) and area under the curve (AUC) was described for machine learning models of each predicted outcome. Additionally, the proportion of machine learning models that have applied calibration was described, along with the calibration method. A comparative meta-analysis of machine learning models was not possible, due to heterogeneity in study methodology and in the reporting of outcomes.
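The two discrimination measures named above can be illustrated with a minimal sketch. It is not taken from any included study: the data, the 0.5 decision threshold, and the function names are hypothetical, and the AUC is computed through its pairwise-ranking interpretation rather than through any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the observed labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive case is
    scored above a randomly chosen negative case (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = outcome observed, 0 = outcome not observed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]

# Accuracy needs hard labels, so an assumed threshold of 0.5 is applied.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc_value = accuracy(y_true, y_pred)  # 6 of 8 correct
auc_value = auc(y_true, scores)       # 15 of 16 positive-negative pairs ranked correctly
```

Accuracy depends on the chosen threshold, whereas the AUC summarises ranking quality across all thresholds, which is one reason the two measures can diverge for the same model.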

Results
The search strategy provided a total of 1821 studies after removal of duplicates (Fig. 1). Therefore, 1821 studies were screened for eligibility based on the title and abstract. After excluding 1763 studies, 58 studies remained for a full-text assessment. In the end, 31 studies were included in this systematic review.

Machine learning models
Various machine learning algorithms have been applied to patients undergoing colorectal surgery. The frequencies of applied machine learning models were as follows: radiomics (n = 13), neural networks (n = 7), multiple machine learning (n = 6), random forest (n = 4), gradient boosting (n = 1).

Surgical outcomes
Surgical outcomes of these machine learning models predominantly included prediction of the clinical staging and prognosis (n = 9), chemoradiotherapy response (n = 7), and postoperative complications (n = 7). Remaining studies involved prediction of diagnosis (n = 4), success of intervention (n = 2), and pre- and postoperative management (n = 2). An overview of key study characteristics is presented in Table 2.

Methodological quality assessment
Based on the PROBAST tool, the majority of studies received a low risk of bias score for the predictors and outcome domains. For most studies, the participants and analysis domains received unclear or high risk of bias scores due to inappropriate inclusion criteria or inadequate measures to account for overfitting and missing data. Consequently, a low overall risk of bias was assigned to 29% of the studies, whereas 48% of the studies received an unclear overall risk of bias. A high overall risk of bias was assigned to 23% of the studies (Fig. 2).

Model validation
For internal validation of machine learning models, most studies used cross-validation (n = 17), a random split of the dataset (n = 11), or bootstrapping (n = 3). External validation was performed in four studies (13%), including two radiomics, one ANN, and one random forest model. The discriminative ability (AUCs) of these models ranged between 0.64 and 0.9, and the calibration was reported for one machine learning model.
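The three internal validation schemes mentioned above differ mainly in how cases are assigned to training and test sets. The following sketch shows one possible way to generate those assignments; the cohort size, the 25% hold-out fraction, the five folds, and the out-of-bag variant of bootstrapping are all assumptions for illustration, since the included studies may have used other settings.

```python
import random


def random_split(n, test_fraction, rng):
    """Shuffle case indices once and hold out a fixed fraction for testing."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold(n, k, rng):
    """Yield (train, test) index lists; every case is tested exactly once."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def bootstrap(n, rng):
    """Sample n cases with replacement; cases never drawn form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [j for j in range(n) if j not in chosen]
    return train, test


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_patients = 20         # hypothetical cohort size

split_train, split_test = random_split(n_patients, 0.25, rng)
cv_folds = list(k_fold(n_patients, 5, rng))
boot_train, boot_test = bootstrap(n_patients, rng)
```

In all three schemes the test cases come from the same cohort as the training cases, which is why none of them substitutes for external validation on data from another setting.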

Discussion
This review illustrates the capabilities of machine learning in predicting several surgical outcomes for patients undergoing colorectal surgery. In this study, promising discriminative abilities of applied ML models have been discovered, especially for radiomic models. Nine studies have used machine learning algorithms to predict the course of disease with accuracies ranging between 70 and 90%. Radiomics models have shown the highest accuracies in these predictions. Theoretically, the use of ML could improve preoperative decision-making for patients undergoing colorectal surgery, eventually enabling individualized surveillance for patients. For patients with high risks of metastasis, treatment decisions such as minimal or aggressive surgery could be reconsidered for optimal surgical outcomes. However, most studies included small cohorts, which might give rise to the problem of overfitting, in which the ML model is overly adjusted to the training dataset and is unable to perform well on the test set [52,53]. Although measures such as cross-validation and feature selection might help, this problem could be addressed by including an external validation cohort [54]. Seven studies have applied machine learning to predict response to neoadjuvant chemoradiation therapy (nCRT) with accuracies between 71 and 91%. Radiomics appeared to perform this prediction with the highest accuracies. Although chemoradiotherapy has already shown improved outcomes for patients with advanced rectal cancer, incomplete therapy response and overtreatment with nCRT could occur [55]. Surgeons experience difficulties in identifying patients who would completely respond to nCRT [56]. By using machine learning, surgeons could improve risk stratification, and decide to tailor therapy to patients with predicted nCRT response. This might eventually enable personalized decision-making for every patient, preventing unnecessary hospital stays and costs.
Seven studies have attempted to predict postoperative complications. Accuracies of ML models have ranged from 47 to 96%, with random forests showing the best predictive performance. Ideally, colorectal surgeons could use machine learning models to accurately predict postoperative complications for every patient. Subsequently, early discharge, enhanced monitoring or prophylactic steps could be implemented based on the predicted risk of complications. In addition, one study developed a predictive model for mortality in patients undergoing acute abdominal surgery [43]. This could potentially be helpful for clinical decision-making in acute surgery. Nonetheless, these ML studies have primarily included preoperative risk factors for postoperative complications. Previous studies have already indicated that postoperative complications are dependent upon several preoperative, intraoperative, and postoperative risk factors [57]. Therefore, more datasets are required to reveal essential intraoperative and postoperative factors for the prediction of postoperative complications.
For predicting patients with acute appendicitis, ML models have performed with accuracies up to 98%. Akmese et al. have demonstrated that ML could be applied with web-based interfaces, with internet access as the only requirement. Prabhudesai et al. have discovered that neural networks are able to predict appendicitis cases better than clinicians. These two findings may suggest that ML models could be practical and accurate tools for improving surgical decision-making. With proper use, surgeons could diagnose faster and prevent unnecessary appendectomies.
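The overfitting problem described above can be made concrete with a toy sketch. Everything in it is hypothetical: a single numeric feature, hand-made data in which two training labels are noisy, a nearest-neighbour model that memorises the training set, and a simple cutoff rule for comparison. It is meant only to show the pattern of high training accuracy paired with poor accuracy on unseen cases.

```python
def one_nn_predict(train, x):
    """Predict the label of the closest training point (memorises the data)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def threshold_predict(x, cutoff=5.0):
    """A deliberately simple rule: predict 1 when the feature exceeds the cutoff."""
    return 1 if x > cutoff else 0


def accuracy(data, predict):
    """Fraction of (feature, outcome) pairs that the predictor gets right."""
    return sum(1 for x, y in data if predict(x) == y) / len(data)


# Hypothetical (feature, outcome) pairs; the labels at x = 3 and x = 8 are noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
# Unseen cases that follow the underlying pattern (outcome 1 above 5).
test = [(1.2, 0), (2.9, 0), (3.1, 0), (6.4, 1), (7.9, 1), (9.2, 1)]

nn_train = accuracy(train, lambda x: one_nn_predict(train, x))   # 1.0
nn_test = accuracy(test, lambda x: one_nn_predict(train, x))     # 0.5
rule_train = accuracy(train, threshold_predict)                  # 0.75
rule_test = accuracy(test, threshold_predict)                    # 1.0
```

The memorising model reproduces the noisy training labels perfectly and then repeats those errors on new cases, whereas the simpler rule fits the training data less closely but generalises better. Performance on data not used for training is what reveals the difference.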
Although high accuracies have been found for machine learning models within this review, some uncertainties remain. External validation was missing in most of the studies (87%), indicating that most machine learning models have not been applied to data from external hospital settings. However, external validation is crucial to demonstrate the generalizability of machine learning models [58]. Additionally, calibration was not reported in most studies (90%), although calibration reflects the similarity between predicted risks and the true observed risks [59]. Poor calibration indicates that the machine learning model is under- or overestimating the desired outcome.
This review has some limitations. Due to the heterogeneity in methodologies of studies, a comparative meta-analysis of ML models was not possible. Additionally, a number of studies have not described predictive performances of ML models in ACC or AUCs, possibly leading to an over- or underrepresentation of actual discriminative abilities.
Future studies should focus on the external validation of ML models. Since external validation is important for the generalizability of machine learning algorithms, gaining this validation could facilitate the introduction of machine learning in daily clinical practice. However, large-scale datasets are required for this external validation; existing patient databases could be used to fulfill this need [60]. With proper use of these data, surgeons may achieve personalized decision-making for patients undergoing colorectal surgery. In addition, the calibration of machine learning models should be demonstrated in future studies to show the extent of agreement between predicted outcomes and outcomes observed in clinical practice.
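Calibration, as described above, can be checked by comparing predicted risks with observed event rates. The sketch below is a simplified illustration under stated assumptions: the data are hypothetical, cases are grouped into two equal-width risk bins, and the overall gap is the mean predicted risk minus the observed event rate. Studies in practice may use other methods, such as calibration plots or formal tests.

```python
def calibration_table(y_true, risks, n_bins=2):
    """Group cases into equal-width risk bins and return, per non-empty bin,
    the mean predicted risk and the observed event rate."""
    bins = [[] for _ in range(n_bins)]
    for y, r in zip(y_true, risks):
        i = min(int(r * n_bins), n_bins - 1)  # a risk of 1.0 goes in the top bin
        bins[i].append((y, r))
    table = []
    for cases in bins:
        if not cases:
            continue
        mean_pred = sum(r for _, r in cases) / len(cases)
        observed = sum(y for y, _ in cases) / len(cases)
        table.append((mean_pred, observed))
    return table


# Hypothetical predicted risks and observed outcomes (1 = event occurred).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

table = calibration_table(y_true, risks)

# Overall gap: a positive value means risk is overestimated on average.
overall_gap = sum(risks) / len(risks) - sum(y_true) / len(y_true)
```

In this toy cohort the low-risk bin has a mean predicted risk of 0.25 but no observed events, so the model overestimates risk there, while the high-risk bin is well calibrated (0.75 predicted, 0.75 observed). A model can therefore rank patients well and still give misleading absolute risks.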
In conclusion, this review shows the promising potential of ML in predicting various surgical outcomes for patients undergoing colorectal surgery. However, clinical implementation is required to demonstrate the contribution of ML within daily practice. The use of large patient databases may be required to fulfill the need for calibration and external validation.
Acknowledgements M Bektaş: participated in the design of the study, data collection and interpretation, wrote and submitted the manuscript. JB Tuynman: participated in the design of the study, and revised the manuscript critically. J Costa Pereira: participated in the design of the study, and interpretation of data. GL Burchell: performed the literature search and wrote parts of the manuscript. DL van der Peet: participated in the design of the study, revised the manuscript critically, and submitted the manuscript. All authors approved the final version of the manuscript.
Funding No grants or funding were received for this research.

Declarations
Conflict of interest The authors declare that they have no conflicts of interest.
Human or animal rights This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent No informed consent was required for this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.