Introduction

Esophageal cancer (EC), which occupies the ninth position in terms of global cancer prevalence, is the sixth most common cause of cancer mortality1. Annually, it is responsible for the deaths of over half a million individuals worldwide2,3. Histologically, the disease is mainly divided into esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma (EAC), each exhibiting distinct patterns of metastasis that typically manifest at different stages of disease progression4. Because the early stages are predominantly asymptomatic, esophageal cancer is often diagnosed at an advanced phase, when it is commonly accompanied by distant metastatic spread5. Patterns of metastasis in esophageal cancer can be classified into three major types: lymphatic, hematogenous, and direct diffusion. Direct diffusion typically becomes evident in the advanced stages, marked by tumor invasion into adjacent structures following penetration through the esophageal adventitia. Hematogenous metastasis is primarily secondary to lymph node involvement, facilitating the tumor’s spread to distant organs via the vascular system6. The lymphatic pathway, however, is recognized as the principal route of metastatic dissemination in esophageal cancer and critically affects patient prognosis7,8.

Metastatic sites of esophageal cancer include the liver, brain, lungs, bones, and others. Among these, liver metastasis has a substantial impact on patient prognosis. Not only does it signal advanced-stage disease, but it also portends a poor prognosis and can result in metabolic disorders due to liver dysfunction, circulatory problems stemming from liver failure, pain, weight loss, and the potential development of multiple organ dysfunction syndrome (MODS) in the advanced stages25.

In light of this, advanced machine learning (ML) models were employed in this study. In comparison to traditional logistic models, machine learning techniques unlock richer information within extensive datasets, thus achieving superior outcome prediction accuracy10. ML technology has already found wide-ranging applications in science and society, ranging from driverless cars to board games to decision-making processes11. In the field of biomedicine, the emergence of big data in healthcare12,13 presents tremendous potential for ML to comprehend disease and health. Consequently, ML has been integrated into clinical diagnostics, precision therapeutics, and health monitoring14.

Given that patients with esophageal cancer exhibit varying clinical-pathological stages and receive different treatments, prognostic outcomes also differ significantly. Unfortunately, limited research currently focuses on hepatic metastasis (HM) in advanced esophageal cancer, which poses challenges for clinical decision-making among physicians24. Therefore, the objective of this research is to develop and validate a machine learning model with strong predictive capability, and to integrate this model into an accessible web-based tool that facilitates the prediction of liver metastasis risk in individuals diagnosed with esophageal cancer.

Materials and methods

Study population

In this study, we used SEER*Stat 8.4.1 software to download patient data from the SEER database. Patients diagnosed with esophageal cancer (SCC and AC) between 2010 and 2020 were included. Patients were excluded if any of the following was unknown: (1) bone, brain, liver, or lung metastatic status; (2) AJCC T or N stage; (3) race or histology grade; (4) primary site; (5) histologic type or surgery status; (6) marital status. The study flow chart of case screening is presented in Fig. 1.
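As an illustration only, the six exclusion criteria can be read as a single complete-case filter. The sketch below uses plain Python; the field names and the coding of unknown values as blank or "Unknown" are assumptions made for the example and do not reflect the actual SEER*Stat export.

```python
# Hypothetical field names; the real SEER*Stat export columns will differ.
REQUIRED_FIELDS = [
    "bone_met", "brain_met", "liver_met", "lung_met",  # criterion 1
    "t_stage", "n_stage",                              # criterion 2
    "race", "grade",                                   # criterion 3
    "primary_site",                                    # criterion 4
    "histologic_type", "surgery",                      # criterion 5
    "marital_status",                                  # criterion 6
]


def is_unknown(value):
    """Treat missing, blank, or 'Unknown' entries as unknown (an assumed coding)."""
    return value is None or str(value).strip().lower() in {"", "unknown"}


def apply_exclusions(records):
    """Keep only the records in which every required field is known."""
    return [
        record for record in records
        if not any(is_unknown(record.get(field)) for field in REQUIRED_FIELDS)
    ]


def example_record(**overrides):
    """Build a toy record with all fields known, then apply overrides."""
    record = {field: "known" for field in REQUIRED_FIELDS}
    record.update(overrides)
    return record


# Toy input: one complete record and two that violate a criterion.
cohort = apply_exclusions([
    example_record(),
    example_record(liver_met="Unknown"),
    example_record(marital_status=None),
])
```

Only the first toy record survives the filter, mirroring how a case with any unknown field is dropped during screening.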

Figure 1

The study flow chart of case screening.

Data selection

In this study, 16 variables related to the clinicopathology and demographics of patients were selected for analysis. Demographic variables included age, sex, marital status, and race. Clinicopathological variables included primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, brain metastasis, bone metastasis, lung metastasis, and liver metastasis. According to the ICD-O-3 codes, histological types of esophageal cancer were divided into 2 categories: adenocarcinoma (8140–8573) and squamous cell carcinoma (8050–8082). All esophageal cancer patients were staged according to the AJCC 8th edition guidelines and SEER staging information. In addition, X-tile software was used to calculate the cut-off value of age.

Data pre-processing and feature engineering

All statistical analyses were conducted with Python 3.8 and SPSS 23. We performed logistic regression analysis in SPSS 23 on the data collected from the SEER database to identify suitable variables for the machine learning models. Variables significantly associated with HM were first identified by univariate logistic regression analysis (P < 0.05). These variables were then entered into multivariate logistic regression analysis, and variables with P < 0.05 in the multivariate analysis were carried forward to the ML models. Correlation analysis was used to assess the correlation among the selected features. Because the data set is imbalanced, an over-sampling method was adopted for data processing15. The key idea of this method is to oversample the minority class, increasing its number of samples to improve the accuracy of the model. To compare the importance of each feature, we extracted the feature importance of each variable in the machine learning models according to the permutation importance principle16,17.
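The text does not state which over-sampling variant was used. As a minimal sketch, assuming simple random over-sampling with replacement (one common choice, not necessarily the method applied here), the idea can be written in plain Python as follows. The function name and the toy data are illustrative.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 1 positive among 10 cases, roughly the imbalance seen for HM.
rows = [[i] for i in range(10)]
labels = [1] + [0] * 9
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

In general, over-sampling of this kind is applied only to the training portion of the data, so that duplicated cases cannot appear in both the training and the evaluation data.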

Model establishment and evaluation

Data from the SEER database were randomly assigned to a train set and an internal test set in a ratio of 7:3. Six commonly used classifier algorithms were chosen for this study, including three ensemble algorithms11, namely Random Forest (RF), Gradient Boosting Machine (GBM), and eXtreme Gradient Boosting (XGB), and three simple classification algorithms, namely Logistic Regression (LR), Decision Tree (DT), and Naive Bayes Classifier (NBC). The ML models were trained using Python software. In the training group, the data were divided into 10 parts for 10-fold cross-validation20. The internal test set was then passed directly to the trained model for validation. The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, and F-score were used as evaluation indicators for the ML algorithms. The probability density plot and clinical utility curve (CUC) were used to examine clinical applicability. Furthermore, based on the best-performing model, we built a web-based online calculator.
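The partitioning scheme described above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the code used in the study: the function names, the seed, and the use of an unstratified shuffle are all assumptions. The cohort size of 17,800 is the sum of the train and test set sizes reported in the Results.

```python
import random


def train_test_split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(indices, k=10):
    """Yield (training, validation) index lists; every index is used
    for validation exactly once across the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 17,800 cases split 7:3, then 10-fold cross-validation on the train part.
train_idx, test_idx = train_test_split_indices(17800)
splits = list(k_fold(train_idx, k=10))
```

Each candidate model would be fitted on the training part of every fold and scored on the corresponding validation part, with the held-out test indices used only once at the end.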

Ethical disclosure statement

The authors state that no human or animal experiments were conducted in this study.

Results

Clinical characteristics of patients

In evaluating the train (N = 12,460) and test (N = 5340) sets of esophageal cancer patients, no significant differences were observed in terms of age distribution, sex, marital status, race, tumor characteristics, and treatments received, with P-values exceeding 0.05 for all compared variables. The most common tumor location was the lower third of the esophagus, and adenocarcinoma was the prevalent histology type. The rates of the various interventions and metastases, including liver metastasis (9.3% in training vs. 9.0% in testing), were similarly distributed between the two sets, indicating a well-matched cohort for further predictive analysis (Table 1).

Table 1 Clinical and pathological characteristics of train set and internal test set.

Univariable and multivariable logistic regression analysis

Eleven risk factors associated with hepatic metastasis (age, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, bone metastasis, and lung metastasis) were identified using univariable and multivariable LR analysis (P < 0.05, Table 2). Based on these risk factors, we developed six different models using machine learning (ML) algorithms in this study.

Table 2 Univariate analysis and multivariate logistic regression analysis of variables.

Correlation analysis and Importance of features on prediction

Correlation analysis is commonly employed to assess the level of correlation between factors. In this study, we used Spearman correlation analysis to examine the independence between data features. The resulting correlation heat map (Fig. 2A) showed no significant correlation among the 15 features under investigation. Figure 2B presents the importance of the features extracted from each machine learning algorithm. The variables identified through univariate and multivariate logistic analysis all contributed notably to prediction across the six models. Surgery consistently emerged as the most influential feature in the majority of prediction models, underscoring its significant impact on hepatic metastasis in esophageal cancer. In most algorithms, T stage, age, primary site, N stage, and tumor grade ranked as the last five, with no significant difference in their contributions to the model. In the GBM model, the remaining features ranked in descending order of importance as follows: lung metastasis, radiation, bone metastasis, histology, chemotherapy, T stage, age, primary site, N stage, and tumor grade.
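For readers unfamiliar with the statistic, Spearman's correlation coefficient is the Pearson correlation computed on ranks, with tied values sharing the mean of their rank positions. The following is a minimal plain-Python sketch of that definition; it is not the implementation used to produce Fig. 2A, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Applying `spearman` to every pair of feature columns yields the symmetric matrix that a correlation heat map displays.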

Figure 2

(A) Heat map of the correlation of features. (B) Feature importance of different models.

Model performance

The performance of the six predictive models is described in Fig. 3A,B and Table 3. Internal ten-fold cross-validation (Fig. 3A) showed that the GBM model performed best among the six models, with an average AUC of 0.893, followed by the LR model (AUC = 0.882). Internal test validation is shown in Table 3 and Fig. 3B. The GBM model also achieved the best AUC (0.885) in the internal test validation, with accuracy, sensitivity (recall), and specificity of 0.868, 0.667, and 0.888, respectively. The confusion matrix (Fig. 3C) of the GBM model in the train set and the test set indicated its high accuracy. The probability density plot (Fig. 3D) depicting the predictive distribution showed that the AUC was highest when the predictive score was 0.38. The CUC plot (Fig. 3E) also showed good clinical applicability.
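For reference, the evaluation metrics reported above follow directly from the confusion-matrix counts, and the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below illustrates these standard definitions in plain Python. The counts and scores are invented toy values, not the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs in which the positive
    case scores higher; tied pairs count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy values for illustration only.
example = classification_metrics(tp=2, fp=1, tn=8, fn=1)
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise AUC computation is quadratic in the number of cases and is meant to make the definition explicit; rank-based formulas give the same value more efficiently on large cohorts.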

Figure 3

(A) Ten-fold cross-validation results of different machine learning models. (B) The ROC curves of different machine learning models in the internal test set. (C) The confusion matrix of the GBM model in the train set and the internal test set. TP true positive, TN true negative, FP false positive, FN false negative. (D) Probability density plot of gradient boosting machine model. (E) The clinical impact curve of gradient boosting machine model.

Table 3 Prediction performance of different models.

Web predictor

This study aimed to develop a web predictor based on the GBM model, which exhibited superior predictive performance for hepatic metastasis in patients with esophageal cancer. The primary objective of this web predictor is to provide doctors with a valuable tool for making more precise clinical decisions. By inputting the relevant variables associated with hepatic metastasis into the web predictor, healthcare professionals can conveniently calculate the odds of hepatic metastasis in patients with esophageal cancer. The web predictor is available at the following link: (https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/). Please refer to Fig. 4 for further details.

Figure 4

A web predictor for predicting HM in EC.

Discussion

Esophageal cancer is a remarkably fatal malignancy, with a prevalence of distant metastases reaching up to 42% in newly diagnosed patients, prominently affecting the liver as the most frequently involved organ26,27,28. The effective treatment and comprehensive management of metastatic esophageal cancer necessitate a multimodal strategy, which continues to pose significant challenges. Therefore, it is of crucial significance for clinical decision-making to identify high-risk factors of esophageal cancer and accurately predict whether patients will develop liver metastasis based on their individual and unique clinical and pathological characteristics.

Currently, the HM of advanced esophageal cancer remains understudied in the scientific literature. Prognostic research in this domain is predominantly focused on two key aspects. Firstly, there is a conspicuous paucity of exploratory investigations into the high-risk prognostic factors associated with esophageal cancer. Additionally, further exploration of the interrelationships among these independent prognostic factors is noticeably lacking. Secondly, there is a dearth of research on HM models for advanced esophageal cancer that leverage the immense potential of big data. Consequently, there is an urgent need for comprehensive studies in these areas to contribute to an improved understanding and accurate prognostication of advanced esophageal cancer.

Some studies suggest that smoking and drinking are the most common risk factors for esophageal cancer in men29. Previous studies30 have also shown that the degree of tissue differentiation, pathological N stage, vascular invasion, and neural invasion are recognized factors affecting the prognosis of patients with esophageal cancer31,32,33,34. However, these studies lacked the support of big data and did not address the prediction of HM in advanced esophageal cancer. Based on big data analysis of the SEER database, our study screened out independent high-risk factors associated with HM by logistic regression analysis. This study included 15 clinically common factors relevant to advanced esophageal cancer with liver metastasis: age, sex, marital status, race, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, brain metastasis, bone metastasis, and lung metastasis. To assess the independence between features, we obtained a correlation heat map by Spearman correlation analysis, which showed no strong correlation among these 15 features (Fig. 2A). Moreover, 11 independent high-risk factors related to liver metastasis were identified by logistic regression analysis: age, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, bone metastasis, and lung metastasis.

Undoubtedly, the construction of prediction models for HM of advanced esophageal cancer is equally significant to the exploration of independent high risk factors in this context. Presently, there is a notable dearth of studies focused on risk factors in esophageal cancer patients with distant organ metastases35. For instance, Tang et al. previously constructed a nomogram to predict the survival of patients with metastatic esophageal cancer; however, this study encompassed metastases to all anatomical sites, without specifically exploring a prediction model for predicting the risk of distant metastasis36. Similarly, Cheng et al. established models for predicting both the risk and survival of esophageal cancer patients, albeit those specifically tailored to brain metastasis37. Furthermore, Guo et al. provided detailed characteristics and explored risk and prognostic factors for patients with liver metastasis, yet they did not develop any predictive tools38. Considering that liver metastasis represents the most common site of distant spread, conducting a comprehensive investigation specifically targeting esophageal cancer patients with liver metastasis assumes paramount clinical importance.

Previous studies have constructed nomograms to predict EC metastasis based on traditional logistic models. However, the limitations of this method in prediction accuracy and in processing big data have made it difficult to achieve major breakthroughs in precision medicine9,10. Moreover, traditional research cannot explore the interaction between different independent high-risk factors18,19. In contrast, our study can better capture complex associations between different independent high-risk factors, thereby improving the accuracy of the model20. Previous studies have used nomogram methods to build models for predicting metastasis in patients with esophageal cancer based on data from the SEER database, but these studies did not establish an ML-based model for predicting HM in advanced metastatic esophageal cancer21.

We then constructed six prediction models using ML. Internal ten-fold cross-validation (Fig. 3A) showed that the GBM model performed best among the six models. Leveraging these findings, we devised an openly accessible online calculator (https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/) based on the GBM model. The model we have developed predicts patients' risk of HM based on various clinical indicators. Clinicians can access this model through the provided website to input patient information and obtain corresponding predictions of hepatic metastasis, thereby facilitating clinical decision-making.

Our research has the following advantages. First, this study established a statistical model based on machine learning that can predict HM in patients with EC. To the best of our knowledge, we are the first to use ML to construct a prediction model for HM in EC. This model is more reliable than the traditional nomogram prediction model, and this work expands our knowledge of advanced EC. Second, our study further explores the relationship between different independent high-risk factors, which provides a new direction for future clinical research. In other words, clinical research should not only explore metastasis in patients, but also explore the correlation between different independent high-risk factors, so as to better understand the relationship between these factors and to help address unfavorable factors during the perioperative period.

Meanwhile, this study has some limitations. First, current machine learning is almost entirely statistical or black-box, which brings severe theoretical limitations to its performance23. Second, this study is a single-center study with a limited number of patients, and applying machine learning models to larger data sets could yield more stable results22. Therefore, in subsequent studies, multi-center data can be added for training and external validation, so as to obtain a more reliable prediction model. Third, this study did not include neoadjuvant therapy, surgical methods, circulating tumor DNA, and other factors that may affect the long-term prognosis of patients with esophageal cancer. In the future, with the continuous improvement of the database, we will incorporate more parameters associated with HM in EC into the web predictor to improve its adaptability.

Conclusion

In summary, this study built machine learning models for predicting liver metastasis of esophageal cancer based on 11 clinicopathological features commonly seen in clinical work, among which the GBM model performed best. The GBM model can be used to predict liver metastasis of esophageal cancer and thereby help clinicians make more accurate treatment plans for patients with esophageal cancer.