Abstract
This study aimed to establish a machine learning (ML) model for predicting hepatic metastasis in esophageal cancer. We retrospectively analyzed patients with esophageal cancer recorded in the Surveillance, Epidemiology, and End Results (SEER) database from 2010 to 2020. We identified 11 indicators associated with the risk of liver metastasis through univariate and multivariate logistic regression. Subsequently, these indicators were incorporated into six ML classifiers to build corresponding predictive models. The performance of these models was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity. A total of 17,800 patients diagnosed with esophageal cancer were included in this study. Age, primary site, histology, tumor grade, T stage, N stage, surgical intervention, radiotherapy, chemotherapy, bone metastasis, and lung metastasis were independent risk factors for hepatic metastasis in esophageal cancer patients. Among the six models developed, the model constructed using the gradient boosting machine (GBM) algorithm exhibited the highest performance during internal validation, with AUC, accuracy, sensitivity, and specificity of 0.885, 0.868, 0.667, and 0.888, respectively. Based on the GBM algorithm, we developed a web-based prediction tool (available at https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/) for predicting the risk of hepatic metastasis in esophageal cancer.
Introduction
Esophageal cancer (EC), which occupies the ninth position in terms of global cancer prevalence, is the sixth most common cause of cancer mortality1. Annually, it is responsible for the demise of over half a million individuals worldwide2,3. From a histological viewpoint, the disease mainly bifurcates into esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma (EAC), each exhibiting unique patterns of metastasis that typically manifest at different stages of the disease progression4. Due to the predominantly asymptomatic nature of the early stages, esophageal cancer diagnosis often occurs at an advanced phase, where it is commonly accompanied by distant metastatic spread5. Patterns of metastasis in esophageal cancer can be classified into three major types: lymphatic, hematogenous, and direct diffusion. The latter typically becomes evident in the advanced stages, marked by tumor invasion into adjacent structures following penetration through the esophageal adventitia. Hematogenous metastasis is primarily secondary to lymph node involvement, facilitating the tumor’s spread to distant organs via the vascular system6. The lymphatic pathway, however, is recognized as the principal vector for metastatic dissemination in esophageal cancer, critically affecting patient prognosis and contributing to pertinent prognostic considerations7,8.
Metastatic sites of esophageal cancer encompass the liver, brain, lungs, bones, and others. However, liver metastasis in esophageal cancer engenders a substantial impact on patient prognosis. Not only does it signal advanced-stage disease, but it also portends a poor prognosis, resulting in metabolic disorders due to liver dysfunction, circulatory problems stemming from liver failure, pain, weight loss, and the potential development of multiple organ dysfunction syndrome (MODS) in the advanced stages25.
In light of this, advanced machine learning (ML) models were employed in this study. In comparison to traditional logistic models, machine learning techniques unlock richer information within extensive datasets, thus achieving superior outcome prediction accuracy10. ML technology has already found wide-ranging applications in science and society, ranging from driverless cars to board games to decision-making processes11. In the field of biomedicine, the emergence of big data in healthcare12,13 presents tremendous potential for ML to comprehend disease and health. Consequently, ML has been integrated into clinical diagnostics, precision therapeutics, and health monitoring14.
Given that patients with esophageal cancer present at varying clinicopathological stages and receive different treatments, prognostic outcomes also differ significantly. Unfortunately, limited research currently focuses on hepatic metastasis in advanced esophageal cancer, which poses challenges for clinical decision-making among physicians24. Therefore, the objective of this research is to develop and validate a machine learning model with strong predictive capability, and to integrate this model into an accessible web-based tool for predicting the risk of liver metastasis in individuals diagnosed with esophageal cancer.
Materials and methods
Study population
In this study, we used SEER*Stat 8.4.1 software to download patient data from the SEER database. Patients diagnosed with esophageal cancer (SCC and AC) between 2010 and 2020 were included. Patients were excluded if any of the following was unknown: (1) bone, brain, liver, or lung metastatic status; (2) AJCC T or N stage; (3) race or histological grade; (4) primary site; (5) histologic type or surgery status; (6) marital status. A flow chart of case screening is presented in Fig. 1.
Data selection
In this study, 16 variables related to the clinicopathology and demographics of patients were selected for analysis. Demographic variables included age, sex, marital status, and race. Clinicopathological variables included primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, brain metastasis, bone metastasis, lung metastasis, and liver metastasis. According to the ICD-O-3 codes, histological types of esophageal cancer were divided into two categories: adenocarcinoma (8140–8573) and squamous cell carcinoma (8050–8082). All esophageal cancer patients were staged according to the AJCC 8th edition guidelines and SEER staging information. In addition, X-tile software was used to calculate the cut-off value of age.
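The histology grouping described above can be sketched as a simple range lookup. This is only an illustration: the function and constant names are hypothetical, and the only values taken from the text are the two ICD-O-3 code ranges.

```python
from typing import Optional

# ICD-O-3 histology code ranges as stated in the text (inclusive bounds).
HISTOLOGY_RANGES = {
    "adenocarcinoma": (8140, 8573),
    "squamous cell carcinoma": (8050, 8082),
}


def classify_histology(icdo3_code: int) -> Optional[str]:
    """Return the histology category for an ICD-O-3 code.

    Returns None when the code falls outside both ranges; how such
    records are handled is not shown here.
    """
    for label, (low, high) in HISTOLOGY_RANGES.items():
        if low <= icdo3_code <= high:
            return label
    return None
```

For example, `classify_histology(8070)` returns `"squamous cell carcinoma"`, while a code outside both ranges returns `None`.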
Data pre-processing and feature engineering
All statistical analyses were conducted with Python 3.8 and SPSS 23. We performed logistic regression analysis in SPSS 23 on the data collected from the SEER database to identify suitable variables for the machine learning models. Variables significantly associated with hepatic metastasis (HM) were first identified by univariate logistic regression analysis (P < 0.05). These variables were then entered into multivariate logistic regression analysis, and those with P < 0.05 in the multivariate analysis were carried forward into the ML models. Correlation analysis was used to examine the correlation among the selected features. Because the data set is imbalanced, an over-sampling method was adopted for data processing15. The key idea of this method is to oversample the minority class, increasing its number of samples in order to improve model accuracy. In addition, to compare the importance of each feature, we extracted the feature importance of each variable in each machine learning model according to the permutation importance principle16,17.
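The over-sampling idea can be illustrated with a minimal sketch. The text does not specify which over-sampling variant was used, so the example below assumes the simplest one, random over-sampling with replacement; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap up to the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (assumed): 9 negatives and 1 positive.
rows = [[i] for i in range(10)]
labels = [0] * 9 + [1]
bal_rows, bal_labels = random_oversample(rows, labels)
```

In common practice, resampling of this kind is applied only to the training portion so that the test set keeps its natural class ratio; whether that was done here is not stated in the text.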
Model establishment and evaluation
Data from the SEER database were randomly assigned to a training set and an internal test set in a ratio of 7:3. Six commonly used classifier algorithms were chosen for this study11, including three ensemble algorithms, Random Forest (RF), Gradient Boosting Machine (GBM), and eXtreme Gradient Boosting (XGB), and three simple classification algorithms, Logistic Regression (LR), Decision Tree (DT), and Naive Bayes classifier (NBC). The ML models were trained using Python. In the training group, the data were divided into 10 parts for 10-fold cross-validation20. The internal test data were fed directly into the trained models for validation. The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, and F-score were used to evaluate the ML algorithms. A probability density plot and a clinical utility curve (CUC) were used to examine clinical applicability. Furthermore, based on the best-performing model, we built a web-based online calculator.
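The partitioning scheme can be sketched as follows. This is an illustrative sketch only: it assumes a 7:3 train/test split, which matches the set sizes reported in the Results (12,460 and 5,340 of 17,800), and the helper names are hypothetical. It shows only how indices are partitioned, not how any model is fitted.

```python
import random


def train_test_split_indices(n, train_parts=7, test_parts=3, seed=0):
    """Shuffle indices 0..n-1 and split them train_parts:test_parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = n * train_parts // (train_parts + test_parts)
    return idx[:cut], idx[cut:]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Each index appears in exactly one validation fold.
    """
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val


train_idx, test_idx = train_test_split_indices(17800)
folds = list(k_fold_indices(train_idx, k=10))
```

Each of the ten folds holds out one tenth of the training indices for validation, while the 30% test portion is never touched during cross-validation.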
Ethical disclosure statement
The authors state that no human or animal experiments were conducted in this study.
Results
Clinical characteristics of patients
In evaluating the train (N = 12,460) and test (N = 5340) sets of esophageal cancer patients, no significant differences were observed in terms of age distribution, sex, marital status, race, tumor characteristics, and treatments received, with P-values exceeding 0.05 for all compared variables. The most common tumor location was the lower third of the esophagus, and adenocarcinoma was the prevalent histology type. The rates of the various interventions and metastases, including liver metastasis (9.3% in training vs. 9.0% in testing), were similarly distributed between the two sets, indicating a well-matched cohort for further predictive analysis (Table 1).
Univariable and multivariable logistic regression analysis
Eleven risk factors associated with hepatic metastasis, namely age, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, bone metastasis, and lung metastasis, were identified using univariable and multivariable LR analysis (P < 0.05, Table 2). Based on these risk factors, we developed six different models using machine learning (ML) algorithms in this study.
Correlation analysis and Importance of features on prediction
Correlation analysis is commonly employed to assess the level of correlation between factors. In this study, we used Spearman correlation analysis to examine the independence between data features. The resulting correlation heat map (Fig. 2A) showed no significant correlation among the 15 features under investigation. Figure 2B presents the importance of the features extracted from each machine learning algorithm. The variables identified through univariate and multivariate logistic analysis all played a notable role in predicting outcomes across the six models. Notably, surgery consistently emerged as the most influential feature in the majority of prediction models, underscoring its significant impact on hepatic metastasis in esophageal cancer. In most algorithms, T stage, age, primary site, N stage, and tumor grade ranked as the last five, with no significant difference in their contributions to the model. In the GBM model, lung metastasis, radiation, bone metastasis, histology, chemotherapy, T stage, age, primary site, N stage, and tumor grade ranked in descending order of importance.
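The permutation importance principle behind these rankings can be sketched in a few lines: shuffle one feature column at a time and record how much a performance score drops. The sketch below is a generic illustration under stated assumptions, not the study's implementation; the toy model, the data, and the use of accuracy as the score are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the data with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that uses feature 0 only and ignores feature 1."""
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]
imp = permutation_importance(toy_model, X, y)
```

Because the toy model ignores feature 1, shuffling that column leaves accuracy unchanged and its importance is zero, whereas shuffling feature 0 degrades accuracy substantially. A feature ranked low by this measure contributes little to that particular model's predictions.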
Model performance
The performance of the six predictive models is described in Fig. 3A,B and Table 3. Internal ten-fold cross-validation (Fig. 3A) showed that the GBM model performed best among the six models, with an average AUC of 0.893, followed by the LR model (AUC = 0.882). Internal test validation is shown in Table 3 and Fig. 3B. The GBM model also achieved the best AUC (0.885) in the internal test validation, with accuracy, sensitivity (recall), and specificity of 0.868, 0.667, and 0.888, respectively. The confusion matrix (Fig. 3C) of the GBM model in the training set and the test set indicated its high accuracy. The probability density plot (Fig. 3D) depicting the predictive distribution showed that the AUC was highest when the predictive score was 0.38. The CUC plot (Fig. 3E) also showed good clinical applicability.
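For readers less familiar with these metrics, the following sketch shows how accuracy, sensitivity, specificity, F-score, and AUC are computed from true labels and predicted scores. The labels, scores, and the 0.5 threshold are made-up toy values, not data from this study, and the sketch assumes both classes are present.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)   # recall: share of true positives detected
    specificity = tn / (tn + fp)   # share of true negatives correctly rejected
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


def auc_score(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

On this toy data the threshold yields 2 true positives, 1 false negative, 1 false positive, and 4 true negatives. It also illustrates why accuracy alone can look favorable on imbalanced data even when sensitivity is modest, which is the pattern seen in the reported GBM results.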
Web predictor
This study aimed to develop a web predictor utilizing the GBM model, which exhibited superior predictive performance for hepatic metastasis in patients with esophageal cancer. The primary objective of this web predictor is to provide doctors with a valuable tool for making more precise clinical decisions. By inputting the relevant variables associated with hepatic metastasis into the web predictor, healthcare professionals can conveniently calculate the odds of hepatic metastasis in patients with esophageal cancer. For easy access, the web predictor can be accessed at the following link: (https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/). Please refer to Fig. 4 for further details.
Discussion
Esophageal cancer is a remarkably fatal malignancy, with a prevalence of distant metastases reaching up to 42% in newly diagnosed patients, prominently affecting the liver as the most frequently involved organ26,27,28. The effective treatment and comprehensive management of metastatic esophageal cancer necessitate a multimodal strategy, which continues to pose significant challenges. Therefore, it is of crucial significance for clinical decision-making to identify high-risk factors of esophageal cancer and accurately predict whether patients will develop liver metastasis based on their individual and unique clinical and pathological characteristics.
Currently, the HM of advanced esophageal cancer remains understudied in the scientific literature. Prognostic research in this domain is predominantly focused on two key aspects. Firstly, there is a conspicuous paucity of exploratory investigations into the high-risk prognostic factors associated with esophageal cancer. Additionally, further exploration of the interrelationships among these independent prognostic factors is noticeably lacking. Secondly, there is a dearth of research on HM models for advanced esophageal cancer that leverage the immense potential of big data. Consequently, there is an urgent need for comprehensive studies in these areas to contribute to an improved understanding and accurate prognostication of advanced esophageal cancer.
Some studies suggest that smoking and drinking are the most common risk factors for esophageal cancer in men29. Previous studies30 have also shown that the degree of tissue differentiation, pathological N stage, vascular invasion, and perineural invasion are recognized factors affecting the prognosis of patients with esophageal cancer31,32,33,34. However, these studies lacked the support of big data and did not address the prediction of HM in advanced esophageal cancer. Based on big-data analysis of the SEER database, our study screened out independent high-risk factors associated with HM by logistic regression analysis. This study included 15 clinically common factors potentially associated with liver metastasis in advanced esophageal cancer: age, sex, marital status, race, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, brain metastasis, bone metastasis, and lung metastasis. To assess the independence of these features, we generated a correlation heat map using Spearman correlation analysis. No strong correlation was found among these 15 features (Fig. 2A). Moreover, 11 independent high-risk factors related to liver metastasis were identified by logistic regression analysis: age, primary site, tumor histology, tumor grade, T stage, N stage, surgery, radiation, chemotherapy, bone metastasis, and lung metastasis.
Undoubtedly, the construction of prediction models for HM of advanced esophageal cancer is equally significant to the exploration of independent high risk factors in this context. Presently, there is a notable dearth of studies focused on risk factors in esophageal cancer patients with distant organ metastases35. For instance, Tang et al. previously constructed a nomogram to predict the survival of patients with metastatic esophageal cancer; however, this study encompassed metastases to all anatomical sites, without specifically exploring a prediction model for predicting the risk of distant metastasis36. Similarly, Cheng et al. established models for predicting both the risk and survival of esophageal cancer patients, albeit those specifically tailored to brain metastasis37. Furthermore, Guo et al. provided detailed characteristics and explored risk and prognostic factors for patients with liver metastasis, yet they did not develop any predictive tools38. Considering that liver metastasis represents the most common site of distant spread, conducting a comprehensive investigation specifically targeting esophageal cancer patients with liver metastasis assumes paramount clinical importance.
Previous studies have constructed nomograms to predict EC metastasis based on traditional logistic models. However, the limitations of this method in prediction accuracy and in processing big data have made it difficult to achieve major breakthroughs in precision medicine9,10. Moreover, traditional research cannot explore the interactions between different independent high-risk factors18,19. In contrast, our study can better capture complex associations between different independent high-risk factors, thereby improving the accuracy of the model20. Previous studies have used nomogram methods to build models for predicting metastasis in patients with esophageal cancer based on SEER data, but these studies did not establish an ML-based prediction model for HM in advanced metastatic esophageal cancer21.
We then constructed six prediction models using ML. Internal ten-fold cross-validation (Fig. 3A) showed that the GBM model performed best among the six models. Leveraging these findings, we devised an openly accessible online calculator (https://project2-dngisws9d7xkygjcvnue8u.streamlit.app/) based on the GBM model. The model predicts patients' risk of HM based on various clinical indicators. Clinicians can access this model through the provided website to input patient information and obtain corresponding predictions of hepatic metastasis, thereby facilitating clinical decision-making.
Our research has the following advantages. First, this study established a machine learning-based statistical model that can predict HM in patients with EC. To the best of our knowledge, we are the first to use ML to construct a prediction model for HM in EC. This model is more reliable than the traditional nomogram prediction model, and this work expands our knowledge of advanced EC. Second, our study further explores the relationships between different independent high-risk factors, which provides a new direction for future clinical research. In other words, clinical research should explore not only metastasis in patients but also the correlations between different independent high-risk factors, so as to better understand the relationships between these factors and address those that adversely affect patients during the perioperative period.
This study also has some limitations. First, current machine learning is almost entirely statistical or black-box, which imposes severe theoretical limitations on its performance23. Second, this study is a single-center study with a limited number of patients included; applying machine learning models to larger data sets can yield more stable results22. Therefore, in subsequent studies, multi-center data can be added for training and external validation to obtain a more reliable prediction model. Third, this study did not include neoadjuvant therapy, surgical methods, circulating tumor DNA, or other factors that may affect the long-term prognosis of patients with esophageal cancer. In the future, as the database continues to improve, we will incorporate more parameters associated with HM in EC into the web predictor to improve its adaptability.
Conclusion
In summary, this study built machine learning models for predicting liver metastasis in esophageal cancer based on 11 clinicopathological features commonly seen in clinical work, among which the GBM model performed best. The GBM model can be used to predict liver metastasis in esophageal cancer and thereby help clinicians make more accurate treatment plans for patients with esophageal cancer.
Data availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
References
Sung, H. et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 71, 209–249 (2021).
Lagergren, J., Smyth, E., Cunningham, D. & Lagergren, P. Oesophageal cancer. Lancet 390, 2383–2396 (2017).
International Agency for Research on Cancer (IARC). Global Cancer Observatory (Globocan).
Uhlenhopp, D. J., Then, E. O., Sunkara, T. & Gaduputi, V. Epidemiology of esophageal cancer: Update in global trends, etiology and risk factors. Clin. J. Gastroenterol. 13, 1010–1021 (2020).
Huang, F. L. & Yu, S. J. Esophageal cancer: Risk factors, genetic association, and treatment. Asian J. Surg. 41, 210–215 (2018).
Koizumi, W. et al. Successful resection of pancreatic metastasis from oesophageal squamous cell carcinoma: A case report and review of the literature. BMC Cancer 19, 320. https://doi.org/10.1186/s12885-019-5549-9 (2019).
Isono, K., Sato, H. & Nakayama, K. Results of a nationwide study on the three-field lymph node dissection of esophageal cancer. Oncology 48, 411–420 (1991).
Xi, K., Chen, W. & Yu, H. The prognostic value of log odds of positive lymph nodes in early-stage esophageal cancer patients: A study based on the SEER database and a Chinese cohort. J. Oncol. 2021, 8834912 (2021).
Deo, R. C. Machine learning in medicine. Circulation 132, 1920–1930 (2015).
Goecks, J., Jalili, V., Heiser, L. M. & Gray, J. W. How machine learning will transform biomedicine. Cell 181, 92–101 (2020).
Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362, 1140–1144 (2018).
Aarestrup, F. M. et al. Towards a European health research and innovation cloud (HRIC). Genome Med. 12, 18 (2020).
Zhuang, Y., Chen, Y. W., Shae, Z. Y. & Shyu, C. R. Generalizable layered blockchain architecture for health care applications: Development, case studies, and evaluation. J. Med. Internet Res. 22, e19029 (2020).
Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: Challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).
Solihah, B., Azhari, A. & Musdholifah, A. Enhancement of conformational b-cell epitope prediction using CluSMOTE. PeerJ Comput. Sci. 6, e275. https://doi.org/10.7717/peerj-cs.275 (2020).
Tian, H. et al. Application of machine learning algorithms to predict lymph node metastasis in early gastric cancer. Front. Med. (Lausanne) 8, 759013. https://doi.org/10.3389/fmed.2021.759013 (2021).
Liu, W.-C. et al. Application of machine learning techniques to predict bone metastasis in patients with prostate cancer. Cancer Manag. Res. 13, 8723–8736. https://doi.org/10.2147/cmar.S330591 (2021).
Liu, X. et al. Construction and verification of prognostic nomogram for early-onset esophageal cancer. Bosn J. Basic Med. Sci. 21(6), 760–772 (2021).
Tang, X. et al. A novel nomogram and risk classification system predicting the cancer-specific survival of patients with initially diagnosed metastatic esophageal cancer: A SEER-based study. Ann. Surg. Oncol. 26(2), 321–328 (2019).
Buch, V. H., Ahmed, I. & Maruthappu, M. Artificial intelligence in medicine: Current trends and future possibilities. Br. J. Gen. Pract. 68(668), 143–144 (2018).
Domper Arnal, M. J., Ferrández Arenas, Á. & Lanas, A. Á. Esophageal cancer: Risk factors, screening and endoscopic treatment in Western and Eastern countries. World J. Gastroenterol. 21(26), 7933–7943 (2015).
van der Ploeg, T., Austin, P. C. & Steyerberg, E. W. Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints. BMC Med. Res. Methodol. 14, 137 (2014).
Hu, C. et al. Diagnostic and prognostic nomograms for bone metastasis in hepatocellular carcinoma. BMC Cancer 20, 494. https://doi.org/10.1186/s12885-020-06995-y (2020).
Gong, X. et al. Application of machine learning approaches to predict the 5-year survival status of patients with esophageal cancer. J. Thorac. Dis. 13(11), 6240–6251 (2021).
Luo, P. et al. The risk and prognostic factors for liver metastases in esophageal cancer patients: A large-cohort based study. Thorac. Cancer 13(21), 1 (2022).
Ajani, J. A. et al. Esophageal and esophagogastric junction cancers, version 2. 2019, NCCN clinical practice guidelines in oncology. J. Natl. Compr. Cancer Netw. 17, 855–883 (2019).
Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2017. CA Cancer J. Clin. 67, 7–30 (2017).
Wu, S. G. et al. Patterns of distant metastasis between histological types in esophageal cancer. Front. Oncol. 8, 302 (2018).
Li, S. et al. Changing trends in the disease burden of esophageal cancer in China from 1990 to 2017 and its predicted level in 25 years. Cancer Med. 10(5), 1889–1899 (2021).
Petrelli, F. et al. Effects of hypertension on cancer survival: A meta-analysis. Eur. J. Clin. Invest. 51(6), e13493 (2021).
Gao, A. et al. Prognostic value of perineural invasion in esophageal and esophagogastric junction carcinoma: A meta-analysis. Dis. Mark. 2016, 7340180 (2016).
Shahbaz Sarwar, C. M. et al. Esophageal cancer: An update. Int. J. Surg. 8(6), 417–422 (2010).
Yang, J. et al. Relationship of lymphovascular invasion with lymph node metastasis and prognosis in superficial esophageal carcinoma: Systematic review and meta-analysis. BMC Cancer 20(1), 176 (2020).
Gupta, V. et al. Survival prediction tools for esophageal and gastroesophageal junction cancer: A systematic review. J. Thorac. Cardiovasc. Surg. 156(2), 847–856 (2018).
Ai, D., Chen, Y., Liu, Q., Deng, J. & Zhao, K. The effect of tumor locations of esophageal cancer on the metastasis to liver or lung. J. Thorac. Dis. 11, 4205–4210 (2019).
Tang, X. et al. A novel nomogram and risk classification system predicting the cancer-specific survival of patients with initially diagnosed metastatic esophageal cancer: A SEER-based study. Ann. Surg. Oncol. 26, 321–328 (2019).
Cheng, S., Yang, L., Dai, X., Wang, J. & Han, X. The risk and prognostic factors for brain metastases in esophageal cancer patients: An analysis of the SEER database. BMC Cancer. 21, 1057 (2021).
Guo, J. et al. Lung metastases in newly diagnosed esophageal cancer: A population-based study. Front. Oncol. 11, 603953 (2021).
Acknowledgements
We thank all patients, investigators, and institutions involved in this study, especially the SEER database.
Contributions
Jun Wan and Yukai Zeng designed the experiments. Jun Wan collected and processed the data. Yukai Zeng wrote and polished article. All of the authors read and approved the final manuscript. All authors contributed to data analysis, drafting or revising the article, have agreed on the journal to which the article will be submitted, gave final approval of the version to be published, and agree to be accountable for all aspects of the work. The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wan, J., Zeng, Y. Prediction of hepatic metastasis in esophageal cancer based on machine learning. Sci Rep 14, 14507 (2024). https://doi.org/10.1038/s41598-024-63213-6
DOI: https://doi.org/10.1038/s41598-024-63213-6