Introduction

Colorectal cancer accounts for approximately 2 million new cases and 1 million deaths per year [1], and appendicitis has accounted for approximately 18 million cases in recent years [2]. Colorectal surgical procedures carry several risks, such as postoperative bleeding, anastomotic leakage, or fistulas [3]. These complications are a burden for surgeons because they lead to readmissions and require revision surgery. Additionally, tumor recurrence or metastasis is commonly discovered in patients with colorectal cancer and reduces survival [4]. Although chemotherapy has demonstrated improved survival for colorectal cancer patients, it remains difficult to predict which patients will respond completely to chemotherapy [5]. Risk stratification of patients with colorectal cancer therefore remains challenging. Artificial intelligence (AI) could support surgeons in this risk stratification by predicting postoperative complications, response to chemotherapy, and overall survival of colorectal cancer patients.

Recently, machine learning (ML), an essential branch of AI, has been used for several complex tasks within healthcare, such as the detection of tumors on radiologic images and the prediction of biomarkers [6]. Because they can be trained on large datasets and recognize patterns within data, machine learning algorithms are able to improve the accuracy of their predictions [7]. On this basis, machine learning models could be used to predict surgical outcomes prior to colorectal surgery [8]. By assessing several surgical outcomes with AI, surgeons could preoperatively decide on the most efficient clinical pathway for patients undergoing colorectal surgery [9]. Several machine learning algorithms are currently available to make these predictions; an overview is presented in Table 1.

Table 1 Terminology of AI subsets
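
As a minimal illustration of how such an algorithm is typically developed, the sketch below trains a random forest classifier on a synthetic, hypothetical preoperative dataset and reports its discriminative ability (AUC) on held-out data. The feature set, sample size, and outcome are assumptions for demonstration only and are not taken from any included study.

```python
# Illustrative sketch only: a hypothetical random forest predicting a binary
# surgical outcome (e.g., a postoperative complication) from synthetic
# preoperative features; all data and settings are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a preoperative dataset (e.g., age, BMI, ASA score)
X, y = make_classification(n_samples=500, n_features=12, n_informative=5,
                           weights=[0.85, 0.15], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Discriminative ability on held-out patients, reported as AUC
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.2f}")
```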

Although machine learning algorithms have shown major potential to improve surgical outcomes, the current status and quality of machine learning models within colorectal surgery have not been evaluated in recent literature. Bridging this gap is essential to understand which surgical outcomes are being predicted and to assess the generalizability and validity of the machine learning algorithms currently applied in colorectal surgery. Therefore, this systematic review aims to provide a comprehensive overview of machine learning algorithms that have been used to predict any surgical outcome after general colorectal surgery. This review also evaluates the area under the curve (AUC) and/or accuracy of the included machine learning models.

Materials and methods

Literature was retrieved and systematically reviewed in accordance with the Cochrane Handbook for Systematic Reviews of Interventions version 6.0 and PRISMA guidelines.

Literature search strategy

A systematic search was performed in the following databases: PubMed, Embase.com, Clarivate Analytics/Web of Science Core Collection, and the Wiley/Cochrane Library. The databases were searched from inception to 7 July 2021, and the search was conducted by G.L.B. and M.B. The search combined keywords and free-text terms for (synonyms of) 'machine learning' with (synonyms of) 'digestive system surgical procedures'. The search strategy was peer-reviewed by an information specialist (G.L.B.) using the PRESS checklist. A full overview of the search terms per database can be found in the supplementary information (see Appendix 1 as ESM). No date limitations were applied in the search. Conference proceedings, book chapters, editorials, errata, letters, notes, surveys, and tombstones were excluded from the search.

Eligibility criteria

Studies were only eligible if they met all of the following criteria: (i) described machine learning methods, (ii) involved patients undergoing any type of colorectal surgery, (iii) reported the predictive performance of the machine learning model, and (iv) were clinical studies. Although regression models can be regarded as machine learning, they have existed in healthcare for many years; as this review addresses newer machine learning models only, regression models were excluded. Appendectomy procedures were considered colorectal surgery. Studies were excluded if they (i) were not written in English or (ii) were reviews, editorials, letters, or study abstracts. No specific study design or setting was required for inclusion.

Study selection

Two reviewers (M.B. & J.C.P.) independently performed title and abstract screening according to the inclusion and exclusion criteria. Duplicate studies were removed, and eligible articles were read in full text. Full-text screening of the retrieved articles was performed by the same two reviewers (M.B. & J.C.P.) to ensure compliance with the inclusion criteria. Disagreements were resolved through discussion between the two reviewers until consensus was reached.

Risk of bias assessment

The PROBAST risk of bias tool was independently applied to each study by two reviewers (M.B. & J.C.P.) to assess the methodological quality of the included machine learning models [19]. This tool evaluates the overall risk of bias across four domains: participant selection, predictors, outcomes, and analysis.

Data collection process

A data extraction table was created to collect all data. All data items were independently extracted and double-checked by two of the authors (M.B. & J.C.P.). Conflicts were resolved by consensus between the two authors. No additional data processing was required.

Data items

An inventory of data items was compiled according to the Cochrane guidance for data collection and the CHARMS checklist [20]. The following information was extracted from each study: first author, publication year, country of research, number of patients, mean age, study design, surgical procedure, intervention, surgical outcome, internal validation method, external validation, and predictive performance (discrimination and calibration). For studies involving multiple machine learning models, the predictive performance of each model was described separately.

Data synthesis

A descriptive summary was used to present the types of machine learning models, the predicted surgical outcomes, the risk of bias assessment, and model validation. To illustrate predictive performance, the results of the machine learning studies were reported for each predicted outcome. Discriminative ability was summarized as the range of mean accuracy (ACC) and area under the curve (AUC) of the machine learning models for each predicted outcome. Additionally, the proportion of machine learning models that applied calibration was described, along with the calibration method. A comparative meta-analysis of machine learning models was not possible due to heterogeneity in study methodology and outcome reporting.

Results

The search strategy yielded a total of 1821 studies after removal of duplicates (Fig. 1). These 1821 studies were screened for eligibility based on title and abstract. After exclusion of 1763 studies, 58 studies remained for full-text assessment. Ultimately, 31 studies were included in this systematic review.

Fig. 1

PRISMA flow chart of the study selection

Machine learning models

Various machine learning algorithms have been applied to patients undergoing colorectal surgery. The applied machine learning models were distributed as follows: radiomics (n = 13), neural networks (n = 7), multiple machine learning models (n = 6), random forest (n = 4), and gradient boosting (n = 1).

Surgical outcomes

The surgical outcomes addressed by these machine learning models predominantly included prediction of clinical staging and prognosis (n = 9), chemoradiotherapy response (n = 7), and postoperative complications (n = 7). The remaining studies involved prediction of diagnosis (n = 4), success of intervention (n = 2), and pre- and postoperative management (n = 2). An overview of key study characteristics is presented in Table 2.

Table 2 Key characteristics of included studies

Methodological quality assessment

Based on the PROBAST tool, the majority of studies received a low risk of bias score for the predictor and outcome domains. For most studies, the participant and analysis domains received an unclear or high risk of bias score because of inappropriate inclusion criteria or inadequate measures to account for overfitting and missing data. Overall, a low risk of bias was assigned to 29% of the studies, an unclear risk of bias to 48%, and a high risk of bias to 23% (Fig. 2).

Fig. 2

Methodological assessment of ML models, according to the PROBAST risk of bias tool

Model validation

For internal validation of the machine learning models, most studies used cross-validation (n = 17), a random split of the dataset (n = 11), or bootstrapping (n = 3). External validation was performed in four studies (13%), comprising two radiomics models, one artificial neural network (ANN), and one random forest model. The discriminative ability (AUC) of these externally validated models ranged between 0.64 and 0.90, and calibration was reported for one machine learning model.
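
For readers less familiar with these internal validation strategies, the sketch below contrasts k-fold cross-validation, a single random split, and bootstrapping for estimating a model's AUC. The data are synthetic and the parameter choices are illustrative assumptions, not those of any included study.

```python
# Illustrative comparison of the three internal validation strategies reported
# by the included studies, applied to synthetic data (assumptions only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)

# 1. k-fold cross-validation (the most common strategy in the included studies)
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"5-fold CV AUC: {cv_auc.mean():.2f} +/- {cv_auc.std():.2f}")

# 2. Single random split of the dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
split_auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
print(f"Random-split AUC: {split_auc:.2f}")

# 3. Bootstrapping: resample with replacement, evaluate on out-of-bag patients
rng = np.random.default_rng(1)
boot_aucs = []
for _ in range(100):
    idx = rng.integers(0, len(y), len(y))          # bootstrap sample indices
    oob = np.setdiff1d(np.arange(len(y)), idx)     # out-of-bag indices
    model.fit(X[idx], y[idx])
    boot_aucs.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))
print(f"Bootstrap AUC: {np.mean(boot_aucs):.2f}")
```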

Predictive performance

The AUCs of machine learning models predicting clinical staging and prognosis ranged between 0.70 and 0.95, with accuracies between 72 and 86% [21,22,23,24,25,26,27,28,29]. For prediction of chemotherapy response, AUCs ranged between 0.71 and 0.91 and accuracies between 71 and 84% [30,31,32,33,34,35,36]. For prediction of postoperative complications, AUCs ranged from 0.60 to 0.96 and accuracies from 81 to 93% [37,38,39,40,41,42,43]. For machine learning models predicting appendicitis, AUCs varied from 0.91 to 0.95 and accuracies from 65 to 98% [44,45,46,47]. For prediction of intervention success, AUCs were up to 0.93 [48, 49]. For prediction of pre- and postoperative management, machine learning models showed AUCs up to 0.86 [50, 51]. Calibration was described for three models (10%): two studies used a Hosmer–Lemeshow test and one study used a calibration plot only. Additionally, one study did not use AUC or accuracy to describe the predictive performance of its machine learning model.
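
To illustrate what such calibration reporting entails, the sketch below bins predicted risks and compares them with the observed event rate per bin, which is the idea underlying both a calibration plot and the Hosmer–Lemeshow test. The model, data, and bin count are assumptions for demonstration only.

```python
# Illustrative sketch of calibration reporting (assumptions only): predicted
# risks from a hypothetical model are grouped into bins and compared with the
# observed event rate per bin.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

model = GradientBoostingClassifier(random_state=2).fit(X_tr, y_tr)
risk = model.predict_proba(X_te)[:, 1]

# Observed vs. predicted event rates per risk bin (the data behind a calibration plot)
observed, predicted = calibration_curve(y_te, risk, n_bins=10, strategy="quantile")
for obs, pred in zip(observed, predicted):
    print(f"predicted risk {pred:.2f} -> observed rate {obs:.2f}")

# Brier score: overall accuracy of the predicted risks (lower is better)
print(f"Brier score: {brier_score_loss(y_te, risk):.3f}")
```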

Discussion

This review illustrates the capabilities of machine learning in predicting several surgical outcomes for patients undergoing colorectal surgery. Promising discriminative abilities of the applied ML models were observed, especially for radiomics models.

Nine studies used machine learning algorithms to predict the course of disease, with accuracies ranging between 70 and 90%; radiomics models showed the highest accuracies in these predictions. Theoretically, the use of ML could improve preoperative decision-making for patients undergoing colorectal surgery, eventually enabling individualized surveillance. For patients at high risk of metastasis, treatment decisions such as minimal or aggressive surgery could be reconsidered to optimize surgical outcomes. However, most studies included small cohorts, which may give rise to overfitting, in which the ML model is overly adjusted to the training dataset and performs poorly on the test set [52, 53]. Although measures such as cross-validation and feature selection can help, this problem is best addressed by evaluating the model in an external validation cohort [54].
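
As a hedged sketch of the point about cross-validation and feature selection, the example below keeps feature selection inside a cross-validation pipeline so that features are re-selected on each training fold, which avoids the optimistic bias that arises when features are chosen on the full (small) dataset. Cohort size, dimensionality, and model settings are illustrative assumptions.

```python
# Illustrative sketch (assumptions only): feature selection nested inside a
# cross-validation pipeline, so each fold selects features on its own training
# data, limiting the optimism caused by overfitting on small cohorts.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Small, high-dimensional synthetic cohort (a typical overfitting scenario)
X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           random_state=3)

pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),          # refit within every fold
    ("clf", RandomForestClassifier(n_estimators=200, random_state=3)),
])

auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Nested-selection 5-fold CV AUC: {auc.mean():.2f}")
```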

Seven studies applied machine learning to predict response to neoadjuvant chemoradiation therapy (nCRT), with accuracies between 71 and 91%; radiomics performed this prediction with the highest accuracies. Although chemoradiotherapy has improved outcomes for patients with advanced rectal cancer, incomplete therapy response and overtreatment with nCRT still occur [55]. Surgeons find it difficult to identify which patients will respond completely to nCRT [56]. By using machine learning, surgeons could improve risk stratification and tailor therapy to patients with a predicted nCRT response. This might eventually enable personalized decision-making for every patient, preventing unnecessary hospital stays and costs.

Seven studies attempted to predict postoperative complications, with ML model accuracies ranging from 47 to 96%; random forests had the best predictive performance. Ideally, colorectal surgeons could use machine learning models to accurately predict postoperative complications for every patient, after which early discharge, enhanced monitoring, or prophylactic steps could be implemented based on the predicted risk of complications. In addition, one study developed a predictive model for mortality in patients undergoing acute abdominal surgery [43], which could be helpful for clinical decision-making in acute surgery. Nonetheless, these ML studies primarily included preoperative risk factors for postoperative complications, whereas previous studies have indicated that postoperative complications depend on several preoperative, intraoperative, and postoperative risk factors [57]. Therefore, more datasets are required to reveal essential intraoperative and postoperative factors for the prediction of postoperative complications.

For predicting acute appendicitis, ML models achieved accuracies of up to 98%. Akmese et al. demonstrated that ML can be applied through a web-based interface, with internet access as the only requirement. Prabhudesai et al. found that neural networks predicted appendicitis better than clinicians. These findings suggest that ML models could be practical and accurate tools for improving surgical decision-making; with proper use, surgeons could diagnose faster and prevent unnecessary appendectomies.

Although high accuracies were found for the machine learning models in this review, some uncertainties remain. External validation was missing in most of the studies (87%), indicating that most machine learning models have not been tested on data from external hospital settings, even though external validation is crucial to demonstrate the generalizability of machine learning models [58]. Additionally, calibration was not reported in most studies (90%), although calibration reflects the agreement between predicted risks and the observed risks [59]. Poor calibration indicates that the machine learning model under- or overestimates the risk of the outcome of interest.
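
A minimal sketch of what external validation adds is given below, under the assumption of synthetic 'internal' and 'external' cohorts with a mild distribution shift: the model is developed and frozen on the internal cohort and then evaluated, without refitting, on the external cohort, reporting both discrimination (AUC) and the overall accuracy of the predicted risks (Brier score).

```python
# Illustrative sketch (assumptions only): a model developed on an "internal"
# cohort is frozen and evaluated, without refitting, on an "external" cohort
# with a mild distribution shift, mimicking data from another hospital.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

X, y = make_classification(n_samples=900, n_features=15, n_informative=6,
                           random_state=4)
X_int, y_int = X[:600], y[:600]            # development ("internal") cohort
X_ext, y_ext = X[600:] + 0.3, y[600:]      # "external" cohort with a mild shift

model = RandomForestClassifier(n_estimators=200, random_state=4).fit(X_int, y_int)

risk_ext = model.predict_proba(X_ext)[:, 1]
print(f"External AUC:         {roc_auc_score(y_ext, risk_ext):.2f}")
print(f"External Brier score: {brier_score_loss(y_ext, risk_ext):.3f}")
```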

This review has some limitations. Due to the heterogeneity in study methodologies, a comparative meta-analysis of ML models was not possible. Additionally, a number of studies did not report the predictive performance of their ML models as ACC or AUC, possibly leading to an over- or underrepresentation of the actual discriminative abilities.

Future studies should focus on the external validation of ML models. Since external validation is important for the generalizability of machine learning algorithms, obtaining this validation could facilitate the introduction of machine learning into daily clinical practice. Large-scale datasets are required for such external validation, and existing patient databases could be used to fulfill this need [60]. With proper use of these data, surgeons may achieve personalized decision-making for patients undergoing colorectal surgery. In addition, future studies should report the calibration of machine learning models to demonstrate the agreement between predicted outcomes and those observed in the clinic.

In conclusion, this review shows the promising potential of ML in predicting various surgical outcomes for patients undergoing colorectal surgery. However, clinical implementation is required to demonstrate the contribution of ML to daily practice. Large patient databases may be needed to fulfill the requirements for calibration and external validation.