Background

Globally, lung cancer has been one of the most common malignancies worldwide in the last few decades; it has the highest incidence and is the leading cause of death. In 2018, there were approximately 2.1 million new lung cancer diagnoses, accounting for 12% of the global cancer burden [1, 2]. Notably, the 5-year survival rate for patients with lung tumours is low, at 18%. However, if early diagnosis of lung cancer can be achieved, the survival rate can be increased to approximately 55%. It has been reported that patients with early-stage lung cancer have a 5-year survival rate of up to 40% if they receive appropriate treatment [3]. Unfortunately, over 70% of patients are diagnosed when their tumour has progressed to an advanced stage, and most of these cases are not suitable for surgery. This is related to the fact that existing diagnostic methods are not sensitive and accurate enough. The current gold standard for diagnosing lung cancer is CT-guided transthoracic aspiration biopsy; however, it is expensive and carries the risk of pneumothorax, pulmonary embolism and significant trauma. As such, it is unacceptable for most patients. There are many other diagnostic methods, such as blood tumour biomarkers and bronchoscopy, for lung cancer screening, but they still have their own limitations [4, 5]. Therefore, finding valuable diagnostic biomarkers for lung cancer, especially for early-stage lung cancer, is important.

In recent years, the advent of metabolomics has provided insight into many diseases, particularly cancer [6]. Metabolomic studies have been used to identify the metabolic pathways and metabolites that regulate tumour progression and physiological function [7, 8]. These metabolites have been used to assess the clinical features of ovarian tumours [9], renal tumours [10], and pancreatic tumours [11]. Metabolomic analysis is a low-cost, high-throughput blood-based test that is feasible and affordable for early lung cancer screening in high-risk groups compared to other biomarkers, including molecular targets, in lung cancer patients [12]. Therefore, for lung cancer, more methods need to be explored to analyse biomarkers with specificity and sensitivity in metabolites.

There are many surprising benefits of applying machine learning techniques in the medical field. Machine learning models use computers to analyse, model and train a large amount of medical data to reveal the relationship between various medical indicators. This method involves great computational power in a short time. At the same time, it can also predict and assist in the diagnosis of diseases through the trained model, which can improve the accuracy of diagnosis [13]. In cancer, machine learning has been used to explore survival and prognosis prediction models for pancreatic, bladder, advanced nasopharyngeal and breast cancers [14,15,16]. Among these, XGBoost models have been applied to identify lung cancer, colon cancer subtypes [17], prediction of lung metastases from thyroid cancer [18] and risk models for identifying lung cancer [19], with all performing at a high level. In the last decade, nomograms have been considered a reliable method for predicting tumour prognosis [20]. They have been applied to prognosis prediction of many cancers, including gastric cancer, breast cancer and testicular cancer [21,22,23,24]. However, the combined application of the XGBoost model and nomogram for prediction of early-stage lung cancer has not yet been reported.

Methods

Source of data and participants

The study participants were recruited from April 2018 to December 2020 at the Department of Thoracic Surgery and Respiratory of the Second Affiliated Hospital of Dalian Medical University (Dalian, China). A total of 478 patients diagnosed with lung cancer and 370 subjects with benign lung nodules (tuberculoma, hamartoma, and inflammatory pseudotumor) were retrospectively recruited. This research protocol was approved by the ethics committee of a hospital in Dalian and is in line with ethical and safe research practices involving human subjects or blood.

Blood samples were collected from all participants enrolled in the study after overnight fasting. For amino acid- and carnitine-targeted metabolomic profiling, LC‒MS/MS was used to test serum samples for 20 amino acids (Ala, Arg, Asn, Asp, Cit, Gln, Glu, Gly, His, Leu, Lys, Met, Orn, Phe, Pro, Ser, Thr, Trp, Tyr, Val) and 27 carnitines (C0, C2, C3, C4, C4OH, C4DC, C5, C5OH, C5DC, C5:1, C6, C6DC, C8, C10, C12, C14, C14OH, C14DC, C141, C16, C16OH, C161OH, C18, C20, C22, C24, and C26). LC‒MS/MS was carried out using an API 3200 quadrupole mass spectrometer (Applied Biosystem, USA) equipped with an electrospray ionization (ESI) probe and Chemo View 1.4.2 and Agilent 1200 high-performance liquid chromatography (Agilent Technologies, USA) at the Dalian Institute of Chemical Physics, Chinese Academy of Sciences, within 48 h of sampling.

The inclusion criteria for patients with lung cancer were as follows.

  1. 1)

    Patients with stage I-II lung cancer according to the eighth edition of the American Joint Committee on Cancer (AJCC8th) tumour-node-metastasis (TNM) staging system.

  2. 2)

    Patients not receiving antineoplastic therapy, radiotherapy or chemotherapy prior to surgery or cancer diagnosis.

The exclusion criteria for patients with lung cancer were as follows.

  1. 1)

    Patients with incomplete medical records and missing data.

  2. 2)

    Patients with combined autoimmune diseases, severe cardiac, hepatic and renal diseases, metabolic syndrome, and all other diseases may lead to metabolic disturbances [25,26,27,28,29].

  3. 3)

    Patients with a history of recurrent tumours, metastatic tumours or a combination of other malignancies.

  4. 4)

    Patients with a history of surgery in the past 6 months Fig. 1.

Fig. 1
figure 1

Visual diagram of the detailed screening process for data collection and initial metabolic indicators

Selection of indicator screening algorithms

We use the parameter indicators of the XGBoost machine learning model as a representative and tested the original dataset, the dataset filtered by the stepwise regression algorithm, and the dataset filtered by the Boruta algorithm by comparing its accuracy, precision, recall, F1 score and area under the receiving operating characteristic (ROC) curve (AUC). AUC values were also statistically analysed by the DeLong test to obtain the algorithm with the best parameter metrics as a subsequent method of filtering the dataset. The process is shown in Fig. 2.

Fig. 2
figure 2

Flowchart of the machine learning process and comparison with the nomogram

Nomogram

We used metabolomics filtered by backwards stepwise regression to construct a model for predicting the probability of lung cancer. The selected indices included sex, age, Arg, Asn, Glu, Orn, Ser, Val, C4OH, C12, C16, C22, C26, C4DC, C5, C5DC, C12, C16, C22, and C26. Logistic regression models were generated to investigate the risk of lung cancer. A nomogram was created using R software version 3.0.4 (Fig. 3).

Fig. 3
figure 3

Nomogram of the 16 metrics used to predict lung cancer after screening

Introduction to machine learning algorithms

Random seeds were used to divide the training set and test set by 7:3, and 4 machine learning algorithms were used to compare the accuracy, precision, recall, and F1 score of the model index values. Support vector machines have a wide range of applications in disease prediction due to their high robustness and ability to model nonlinear decision bounds, their many optional kernel functions, and their ability to efficiently learn high-dimensional data. Extreme gradient boosting (XGBoost) is an ensemble learning algorithm based on the classification tree model that combines classifier groups with low accuracy through an iterative calculation method, making it a high-accuracy classifier. Its characteristics are fast running speed, accurate training results, and loose data requirements. Strong model generalization ability, higher scalability, and faster computing speed are its advantages. Random forest analysis randomly builds a large number of classification trees, and the final classification is determined by voting on the classification results of each tree. The KNN algorithm is more effective than other machine algorithms at multiclassification problems, providing doctors with efficient and high-quality analysis for disease diagnosis and improving the accuracy of diagnosis. SVM, KNN, random forest and XGBoost models using the above indicators were constructed in Python 3.7. Receiver operating characteristic (ROC) curves were generated to assess the predictive performance of the nanogram and machine learning models. The Delong test using Python was used to examine differences in diagnostic performance between model differences, with P < 0.05 indicating a statistically significant difference. The specific process is shown in Fig. 2.

Results

Data description

This study collected 848 patients who visited a hospital in Dalian between 2018.04.06 and 2020.12.15. Among them, there were 478 patients with early-stage lung cancer and 370 patients with benign nodules in the lung. The training and test sets were divided according to a random seed 7:3. In the training set, 339 people were diagnosed with lung cancer, of whom 127 (37.5%) were male and 212 (62.5%) female, with a statistically significant difference (p < 0.001). The mean age of the lung cancer group was 62.2 years, and the mean age of the control group was 56.2 years, with a statistically significant difference (p < 0.001). In the test set, 139 patients were diagnosed with lung cancer, of whom 59 (42.4%) were men and 80 (57.6%) women, though the difference was not statistically significant (P > 0.05). The mean age of the patients with lung cancer was 61.8 years, and the mean age of the controls was 53.4 years. The lung cancer group was older than the control group, with a statistically significant difference (P < 0.001). The characteristics of the amino acid and carnitine targeted metabolome analyses are also presented in Table 1. In Table 2, we show statistical test analysis results for each index in the training and test sets; except for C414, which had a p value < 0.05, all the other indices were p > 0.05, indicating that the difference between the two groups was not statistically significant. Therefore, index C414 was excluded, and the remained were examined in the next step.

Table 1 Sociodemographic and pathological characteristics of patients and controls in the case and control groups in the training and validation sets
Table 2 Statistical test results for each index in the training and test sets

Performance comparison of data index screening algorithms

In this study, we used the XGBoost model as a representative and applied two algorithms for data feature screening, namely, stepwise regression and Boruta. The stepwise regression algorithm is a traditional statistical feature screening method; the basic idea is to reduce the degree of multicollinearity by eliminating variables that are less important and correlate highly with other variables. The Boruta algorithm is a popular feature screening method in machine learning. It is based on the same idea as the random forest classifier, that is, adding randomness to the system and collecting results from random sample sets can reduce the misleading effects of random fluctuations and correlations. We also used the original dataset as a control group. The results are shown in Table 3. In the original dataset, all 49 features were used. After Boruta algorithm screening, 19 features were included, and after stepwise regression algorithm screening, 16 features were included. In terms of the number of included features, the number of features filtered by the stepwise regression algorithm was lowest, which can simplify the subsequent operation process and shorten the operation time. Comparing the accuracy, precision, F1 score and recall index, the accuracy of stepwise regression was 75.29%, the accuracy of Boruta was 72.55%, and the accuracy of the original dataset was 73.73%. The area under the receiver operating characteristic (ROC) curve (AUC) value was 0.79 for the original dataset, 0.78 for Boruta, and 0.81 for the stepwise regression (Fig. 4). The DeLong test results showed that among the three algorithms, the differences in AUC values between the stepwise regression algorithm and Boruta and the original dataset were statistically significant (p < 0.01, p < 0.05) (Fig. 5). Using the XGBoost model as a representative test, the stepwise regression algorithm had the highest accuracy and lowest number of filtered features. Therefore, we next selected 16 features filtered by stepwise regression as the dataset: sex, age, Arg, Asn, Glu, Orn, Ser, Val, C4OH, C4DC, C5, C5DC, C12, C16, C22 and C26.

Table 3 Performance comparison of three indicator screening algorithms represented by the XGBoost algorithm
Fig. 4
figure 4

ROC curve of the dataset after filtering by the three indicator screening algorithms. (AUC = area under the ROC curve, ROC = receiver operating characteristic.)

Fig. 5
figure 5

Visualisation of the results of the mutual Delong test for the three indicator screening algorithms

Performance metrics comparison of machine learning algorithms

The training and test sets were divided 7:3 using random seeds. Four machine learning algorithms were used to compare the accuracy, precision, completeness and F1 scores of the model metric values. The receiver operating characteristic curve (ROC) was used to determine the strength of the predictive ability by the area under the curve (AUC). The larger the AUC value is, the stronger the predictive ability is. The results in Table 4 show that the accuracy of the XGBoost model was 75.29%, with an AUC of 0.81, better than all the other models. The random forest model had an accuracy of 72.55% and an AUC value of 0.78. The accuracy rate of the support vector machine model was 71.37%, and the AUC value was 0.77. The accuracy rate of the adjacent algorithm model was 66.67%, and the AUC value was 0.69 (Fig. 6). Figure 7 shows the results of the mutual DeLong test for the five models. The statistical results show that of the five models, the differences between the RF, LR, KNN and XGBoost models were statistically significant (p < 0.05); conversely, the differences between the SVM models were not statistically significant (p > 0.05). This may be due to the small study size of this sample. The performance of the SVM model compared to the XGBoost model can be seen in terms of accuracy, precision, F1 score and recall. Based on the combined analysis, the XGBoost model had the best diagnostic performance, with an AUC of 0.81 and an accuracy of 75.29% for the test set.

Table 4 Comparison of the performance metrics of the XGBoost model, random forest model, support vector machine model, logistic regression model, and K-nearest neighbour algorithm model
Fig. 6
figure 6

ROC curves for the XGBoost model, Random Forest model, Support Vector Machine model, Logistic Regression model, and K-Nearest Neighbor algorithm model (AUC = area under the ROC curve, ROC = receiver operating characteristic.)

Fig. 7
figure 7

Visualisation of the mutual Delong test results for the 5 machine learning models

Performance comparison of nomogram and machine learning algorithms

The nomogram was used in the following way: the patient's index level was found on a scale that corresponds to the patient's actual level and projected upwards to the top of the scale (points) to obtain the score for each variable, which was summed to give the total points. The total points were summed to give the total points and projected downwards to give the patient's risk of lung cancer. The total score for one patient was 1080, which corresponds to a risk of lung cancer of 87.1%. The case results confirm that the patient had lung cancer. Figure 8. The nomogram showed an accuracy of 68.24%, a sensitivity of 0.71, and a specificity of 0.64. The machine learning model (XGBoost) showed 75.29% accuracy, 0.74 sensitivity, and 0.76 specificity. As indicated in Table 5, the XGBoost model was better than the nomogram in terms of parameter index performance. In the subsequent index feature importance ranking, we applied the XGBoost model for testing.

Fig. 8
figure 8

Application of Nomogram model for predicting lung cancer incidence in patients

Table 5 Comparison of performance metrics for the nomogram and machine learning models

Index importance score ranking

The XGBoost model was used to score the importance of the 16 included indicators, as depicted in Fig. 9. The order of importance was Orn, Val, C16, Arg, Asn, Glu, Ser, age, C4DC, C5DC, C5, C22, C4-OH, C12, C26, and sex. In the amino acid category, the most important index was ornithine; in the carnitine category, the most important feature was palmitoylcarnitine.

Fig. 9
figure 9

Importance ranking of indicators affecting lung cancer occurrence derived from applying the XGBoost model predictions

Discussion

In this study, we used a nomogram and 4 machine learning algorithms to build a model for predicting early-stage lung cancer by amino acid and carnitine indicators. For the 47 kinds of metabolic indices in human serum and 2 kinds of clinical indices of age and sex in clinical data, the backwards stepwise regression algorithm was used to finally screen out 16 indices, which were included in the next step to establish a prediction model. Because data were extracted from retrospective cases, only two demographic indicators, age and sex, were included in this study. Finally, the XGBoost model in the machine learning algorithm was shown to have superior predictive ability. Notably, previous studies have shown that metabolites are relatively strong objective predictors of lung cancer, and 8 acylcarnitines (C16, C4DC, C5DC, C5, C22, C4-OH, C12, C26) were included in our model. Carnitine acts as a shuttle, bringing long-chain fatty acids into the mitochondria for oxidation and conversion to acylcarnitines. Excess acylcarnitines are then released into the bloodstream. Studies have shown that fatty acids are synthesized in tumour cells and are associated with cell proliferation and metastasis in lung cancer [30]. Therefore, the acylcarnitine spectrum can reflect the fatty acid metabolic status and related diseases such as lung cancer. C5DC has been previously shown to be involved in the genetic metabolism of neonatal leukaemia or lymphoma [31]. It also serves as a potential screening marker for autism spectrum disorder in children and Alzheimer's disease [32]. However, to date, there has been no direct link between C5DCs and cancer. C5 is a short-chain acylcarnitine (2–5 carbons in length) that is included in several metabolic signatures used to identify risk of endometrial cancer and Alzheimer's disease [33, 34]. C16 is a long-chain acylcarnitine (more than 12 carbons in length) that can be used as a potential novel biomarker for diagnosis of nonalcoholic fatty liver disease. A high correlation (r > 0.7) has been found between even-carbon long-chain acylcarnitines in patients with nonalcoholic fatty liver disease [35]. This study is the first to incorporate C5DC and C16 into a cancer prediction model.

Two amino acids (Arg and Ser) included in our model have been shown to be closely related to biological functions during lung cancer development. Arginine is a semiessential amino acid that acts as a building block for protein synthesis and a precursor for a variety of metabolites, including polyamines and nitric oxide, which have strong immunomodulatory properties in tumours [36, 37]. In addition, cancer cells show elevated levels of Arg [38], and elevated Arg levels induce overall metabolic changes, including activation of T cells from glycolysis to oxidative phosphorylation, promotion of central memory-like cells with higher viability, and antitumour activity in mouse models. Thus, the intracellular arginine concentration directly affects the metabolic fitness and viability of T cells, which are critical for antitumour responses [39]. Ser is a nonessential amino acid that supports a variety of metabolic processes critical for the growth and survival of proliferating cells, including synthesis of proteins, amino acids, and glutathione. As an important one-carbon donor of the folate cycle, Ser contributes to production of NADPH for nucleotide synthesis, methylation reactions and antioxidant defence [40]. Many rapidly proliferating cells depend on exogenous Ser, and depletion of Ser significantly inhibits the growth of some cancer cells in vitro and in vivo [41]. In this study, the amino acid with the highest index importance score was ornithine. Ornithine is a nonessential amino acid and an intermediate molecule in the urea cycle. It is a key substrate for synthesis of proline, polyamines and citrulline. Previous reports have demonstrated that ornithine plays an important role in regulation of several metabolic processes leading to diseases such as hyperuricaemia, hyperammonemia, gyrate atrophy and cancer in humans [42]. It has also been suggested that the ornithine decarboxylase gene may play an important role in lung cancer and that its overexpression may be associated with development and progression of lung cancer [43].

Traditionally, clinicians have made judgements based on patient consultation and past decisions. Therefore, clinician experience plays an important role in accurate risk estimation and decision-making. This approach raises a huge problem, and risk of bias and patient outcomes can be highly subjective [44]. Nomograms have been used to predict survival in various head and neck cancers [45, 46]. Similarly, machine learning models have shown encouraging risk estimates for patients [47, 48]. Therefore, the introduction of nomograms and machine learning models provides clinicians with a new decision aid that can accurately predict patient conditions. In this study, for comparison of the performance parameters of the two methods, the machine learning model (XGBoost) outperformed the nomogram in predicting the occurrence of lung cancer. To our knowledge, this is the first study to compare nomograms and machine learning models to lung cancer. It is worth noting that the visualization of outcome metrics provided by the nomogram solves the problem of not easily interpreting the results of machine learning models. Thus Overall, published studies have shown [45] that the combination of a nomogram-machine learning (NomoML) approach can provide a more transparent approach to individualized assessment and to develop the most appropriate adjuvant treatment regimen for lung cancer patients. In addition to the remarkable accuracy provided by machine learning models, visualization of model results can make overall research more practical.

In this study, there are certain limitations that need to be considered. Due to the particularity of the indicators, there were data collected from the hospital at this time cannot find no matching public data from the public database for the data collected from the hospital; hence, there was a lack of external verification in the model testing process. Additionally, due to the nature of retrospective case data, only two demographic indicators, sex and age, were included. The amount of data in this study also was not sufficient, and efforts should be made to collect more data such that the indicators of the model are more accurate.

In conclusion, this study proposes an interdisciplinary approach combining metabolomics with a machine learning model (XGBoost) to early predict the occurrence of lung cancer. The metabolic biomarkers ornithine and palmitoylcarnitine showed significant diagnostic power for early lung cancer. This study raises new possibilities for replacing invasive detection methods with blood tests in the future. We will also consider performing laboratory studies and prospective experimental studies.