Introduction

Lung cancer is the most common type of cancer in terms of cancer-related death worldwide1. Non-small-cell lung cancers (NSCLCs) accounts for roughly 85% of all lung cancers based on pathomorphology2. Importantly, 35% of the NSCLCs are diagnosed as LUAD. In spite of the improvement of treatment, 5-year survival rate of NSCLC is poor3. Nearly 30% of stage I NSCLC patients would undergo recurrent disease, and many of them would die due to cancer recurrence after surgical therapy3,4. Therefore, exploring novel prognostic hallmarks could assist clinicians in prognostic evaluations and therapeutic selection for early stage LUAD patients.

Growing researches showed that specific molecules could function as prognostic markers for lung cancer. For example, Wang et al. found potential diagnostic and prognostic biomarkers of circular RNAs for lung cancer in China5. Ning et al. suggested that CPSF3 was a promising prognostic biomarker and predicted recurrence of NSCLC6. Liu et al. revealed lncRNA SLC16A1-AS1 as a novel prognostic hallmark in NSCLC7. Zhang et al. identified six metabolic genes as potential biomarkers for LUAD8. Meanwhile, emerging studies have demonstrated that epigenetics plays a critical role in the initiation, progression, therapeutic response, and result of human tumors9,10. DNA methylation serves as a significant epigenetic modification of cancer cells, which may be the main mechanisms of the inactivation of cancer-associated suppressor genes in the process of tumorigenesis11. Various studies suggested that DNA methylation was closely related to the development, invasion, and metastasis of carcinoma, and may act as a hallmark of prognosis12,13. For example, Guo et al. revealed a five-DNA methylation signature as an effective prognostic hallmark in patients with ovarian serous cystadenocarcinoma14. Li et al. suggested that a four-DNA methylation signature served as a robust prognostic biomarker for survival of patients with gastric cancer15. In addition, a previous study reported that methylation was more likely to occur for DNA in tumor cells than that in normal cells16. Nielsen et al. indicated that aberrant DNA methylation was a relatively stable as well as potentially reversible therapeutic signal17. Therefore, DNA methylation has great potential as biomarker of prognosis for cancer patients. However, few studies have investigated the value of DNA methylation for prognostic prediction in patients with stage I–II LUAD. The identification of novel prognostic DNA methylation hallmark for stage I–II LUAD patients was highly desired.

We analyzed the genome-wide methylation map of stage I–II LUAD patients in TCGA database to investigate a DNA methylation-based predicator and a DNA methylation-associated nomogram. The Kaplan–Meier survival curve and ROC analyses were employed to test the robustness of the 11-DNA methylation signature and the results suggested a good value of our nomogram.

Results

Clinical characteristics of the study populations

A total of 393 stage I–II LUAD samples were enrolled in the present study. The demographics and clinical features of the included patents are summarized in Table 1. Workflow of model generation and subject enrolment was summarized in Fig. 1.

Table 1 Clinical features of enrolled samples.
Figure 1
figure 1

Workflow of model generation and subject enrolment.

Methylation markers associated with OS of stage I–II LUAD patients

We identified 2332 differentially expressed methylation sites between dead and alive cohorts that were adopted for univariate Cox proportional hazard regression analysis. Consequently, 84 DNA methylation sites were revealed to be closely involved in stage I–II LUAD patients’ OS (P < 0.05). Following this, the identified 84 DNA methylation sites were used for LASSO Cox regression model (Supplementary Table S1) and 25 methylation sites were determined as the candidate methylation sites (P < 0.01) (Fig. 2A,B) that were then used for multivariate Cox proportional hazard regression. Of these, 11 methylation sites were screened as optimal model sites for the prognostic evaluation of patients with stage I–II LUAD based on the multivariate Cox regression analysis. The risk scoring formula was calculated as follows: Risk score = 1.31688*cg00237391 + 8.85449*cg04529955 − 3.3827*cg06393879 − 3.34536*cg11539066 + 2.62432*cg12133048 − 6.18862*cg13600632 + 1.50514*cg13643814 + 2.3872*cg17186803 − 3.74192*cg20546263 + 2.57421*cg24311704 + 1.19749*cg27468419. The scores showed strong associations of poor prognosis with hypermethylation of cg00237391, cg04529955, cg12133048, cg13643814, cg17186803, cg24311704, cg27468419 and the hypomethylation of cg06393879, cg11539066, cg13600632 and cg20546263 sites (Fig. 3).

Figure 2
figure 2

Candidate methylation sites selection on the basis of the LASSO Cox regression model. (A) LASSO Cox regression with L1 regularization. Tenfold cross-validation for tuning parameter selection in the LASSO model via minimum criteria (the 1-SE criteria). (B) The LASSO Cox regression model was employed to determine the most robust hallmarks. LASSO coefficient profiles of the 84 methylation sites. A coefficient profile plot was created against log (lambda) sequence. Vertical line was used at the value selected by using tenfold cross-validation, where optimal lambda resulted in 25 non-zero coefficients.

Figure 3
figure 3

Boxplots of methylation β values against risk group in the entire TCGA dataset. “High Risk” and “Low Risk” represent the high-risk and low-risk samples, respectively. The median risk score was applied as a cutoff. Vertical coordinates represent the β-value of 11-DNA methylation sites respectively. Mann–Whitney U test was employed to assess the diferences between the high-risk score and low-risk score groups.

Association between the 11-DNA methylation signature and stage I–II LUAD patients’ OS

The prognostic value of the selected 11-DNA methylation hallmark was assessed on the basis of Kaplan–Meier survival analyses in the training, validation and the all-cohort datasets. The median risk scores were employed as cutoff value to divide the patients into low and high risk groups. As exhibited in Fig. 4A, OS was significantly longer in the low-risk cohort in comprison to the high-risk cohort in the training dataset (P = 7e − 07). A similar result was found in the validation dataset (P = 1e − 08) (Fig. 4C) and in the all-cohort dataset (P = 6e − 14) (Fig. 4E). The results implied a great robustness of the predicator as prognostic indicator in patients with stage I–II LUAD.

Figure 4
figure 4

Kaplan–Meier and ROC analysis of patients with stage I–II LUAD in in training, validation and whole datasets. (A, C, E) Kaplan–Meier analysis for stage I–II LUAD patients between the low-risk and high-risk. The x-axis is follow-up time, the y-axis is OS. The log-rank test indicated the higher risk scores were significantly correlated with worse OS (P < 0.05). (B, D, F) 1-, 3-, 5-year ROC curves of the 11-DNA methylation signature. Blue line, green line and red line represent 1-, 3-, 5-year ROC curves respectively.

Evaluation of the predictive ability of the 11-DNA methylation biomarker by ROC analysis

We performed ROC analysis to further evaluate the value of the 11-DNA methylation biomarker. The AUC of the 11-DNA methylation hallmark at 1, 3, 5 years in training set were 0.849, 0.879, 0.831, respectively (Fig. 4B). A good value was also found in validation set (0.742, 0.807, 0.904, respectively) (Fig. 4D) and the whole validation set (0.747, 0.818, 0.870, respectively) (Fig. 4F), the results showed that the 11-DNA methylation biomarker had a significant capacity for predicting OS of stage I–II LUAD patients.

Then, the patients enrolled in this study were ranked based on their risk scores (Fig. 5A), and the dotplot was drawn in based on their recurrence status (Fig. 5B). The outcomes showed that the low-risk cohort had a longer OS than that in the high-risk cohort. Heatmap of 11 methylation sites distribution on the basis of risk score was observed in Fig. 5C, which had a similar result to our previous boxplot. Finally, we performed subgroup analysis by using several clinicopathological factors (site, age, stage, gender and smoking history). A good predictive ability of the 11-DNA methylation biomarker was verified in most sub-group (Supplementary Figs. S1, Figs. S2, Figs. S3, Figs. S4 and Figs. S5).

Figure 5
figure 5

Methylation risk score analysis of 393 stage I–II LUAD cases in the entire TCGA dataset. (A) Methylation risk score distribution against the rank of risk score. The red triangle represented the high-risk samples, the green ball represented the low-risk samples. (B) Survival status of stage I–II LUAD patients against the rank of risk score. The green ball referred to alive samples, the red ball referred to the dead samples. (C) Heatmap showed 11 methylation sites profiles in low- and high-risk groups. Each row of the heat map represented a profile of a methylation site.

Implementation of the 11-DNA methylation biomarker-related biological pathways

Single-sample Gene Sets Enrichment Analysis (ssGSEA) was carried out in accordance to TCGA LUAD mRNA dataset by employing GSVA package18 for exploring the 11-DNA methylation signature-based signaling pathways. The median score was employed as the cutoff to divide the patients into low and high risk groups. Top 20 pathways that were more activated in the high-risk samples than that in low-risk samples were exhibited in Fig. 6A (Supplementary Table S2). The exact Pearson correlations between enriched pathways and risk score was presented in Fig. 6B.

Figure 6
figure 6

Exploration of the 11 DNA methylation signature-based biological pathways. (A) Heatmap of top 20 enriched pathways associated with high risk group. Each row of the heat map represented a pathway and each column represented a stage I–II LUAD sample. (B) Correlation graph between risk scores and top 20 pathways. Each red ball represents a pathway, and each transverse line represents one sample. The score characteristic of each sample was shown in the graph.

Nomogram development

To assess whether the 11-DNA methylation biomarker was an independent predictor for OS of stage I–II LUAD patients, we carried out univariate and multivariate Cox model via risk score and a few clinicopathological factors. Hazard ratios (HRs) showed that the 11-DNA methylation hallmark was significantly involved in the OS of stage I–II LUAD patients (P < 0.01, HR 2.60, 95% CI 1.91–3.53) (Table 2), which implied that the 11-DNA methylation hallmark was an independent prognosis classifier. To further predict OS of stage I–II LUAD patients via a quantitative strategy, we developed a nomogram (Fig. 7) based on risk score, age and cancer status. The significance between risk score and other clinical factors was found in Fig. 8A. The value of the nomogram was measured based on C-index (0.787, 95% CI 0.751–0.832), AUC (1, 3, 5-year: 0.770, 0.849, 0.979) (Fig. 8B) and calibration plot (Fig. 8C–E), the results suggested a good value of our nomogram. Besides, calibration plot and decision curve analysis (DCA) (A novel method for evaluating prediction models) suggested that the nomogram had crucial clinical appliance potential for the prognostication of stage I–II LUAD patients’ OS than that in treat all or treat none cohort. Net benefit was proved for stage I–II LUAD patients in 3-year recurrent risks (Fig. 8F). The result indicated that our methylomics-associated nomogram had a strong robustness and may have clinical application potential.

Table 2 Univariate Cox regression analysis and multivariate Cox regression analysis outcome on the basis of risk score and other clinical factors.
Figure 7
figure 7

In order to quantify the risk assessment and survival probability for individual stage I–II LUAD patients, a nomogram was developed in the entire TCGA dataset according to the 11 DNA methylation signature-based risk score, age and cancer status.

Figure 8
figure 8

Analysis of the 11-DNA methylation biomarker-related nomogram in the entire TCGA dataset. (A) The horizontal axis stood for clinical factors, the vertical axis stood for the percentage of importance. (B) 1-, 3-, 5-year ROC curves for the 11-DNA methylation biomarker-related nomogram. (CE) referred to the 1-, 3-, 5-year nomogram calibration curves, respectively. (F) The DCA for the nomogram. The net benefit was plotted versus the threshold probability. The red line represented the nomogram. The blue line represented the treat-all and the green line represented the treat-none.

Comparison of our nomogram with several known prognostic hallmarks

We compared the nomogram of this study with other known prognostic biomarkers to confirm whether the nomogram hallmark had the advantage of stable and great performance. The biomarkers for predicting early stage LUAD patients' OS were employed for the comparison19,20,21,22,23,24,25. As indicated in Fig. 9, the result manifested that the nomogram outperformed a few known prognostic hallmarks. The AUC of the nomogram at 5 years was 0.979, which was obviously larger than that in other biomarkers, suggestive of a greater value for the nomogram in comparison to other signatures in predicting LUAD patients' prognosis.

Figure 9
figure 9

ROC curves illustrated the effectiveness of the methylation-related nomogram and a few known biomarkers in predicting the prognosis of stage I–II LUAD patients. The AUC in Rotunno et al. is 0.810, in Chen et al. (0.656), in Kuo et al. (0.802), in Wu et al. (0.940), in Sun et al. (0.765), in Zhao et al. (0.760), in Sun et al. (0.786).

Discussion

Despite advances in diagnosis and therapy of NSCLC over the past few decades, the 5-year survival rate is still poor3. Thus, it is urgently needed to identify potential hallmarks correlated with the prognosis of NSCLC and develop optimum targeted therapy. With the deepening study of epigenetics, increasing studies have suggested that DNA methylation is crucial to gene regulation and acts as early events of a few tumors. DNA methylation serves as one of the earliest detectable neoplastic alterations, which give it a significant superiority as carcinoma diagnosis and prognosis hallmarks26,27,28. Various studies indicated that DNA methylation could function as predicators for cancer patients. For instance, Zheng et al. revealed a prognostic 11-DNA methylation biomarker for lung squamous cell carcinoma29. Peng et al. identified a DNA methylation signature to improve survival prediction of gastric cancer30.

In the present study, TCGA database were applied to analyze the methylation of stage I–II LUAD. We identified a signature which contained 11 methylation sites (cg00237391 (DEF6: 1stExon), cg04529955 (SLC10A7: Body), cg06393879 (MYT1L: Body), cg11539066 (MIR596: TSS1500), cg12133048, cg13600632 (FAM125B: Body), cg13643814 (CHRNA7: TSS1500), cg17186803 (CN4B: Body), cg20546263 (C3orf33: TSS1500), cg24311704 (MUC21: 1stExon) and cg27468419 (SDR16C6: TSS1500)) and corresponded to ten genes (DEF6, SLC10A7, MYT1L, MIR596, FAM125B, CHRNA7, SCN4B, C3orf33, MUC21, SDR16C6) by differential methylation analysis, Kaplan–Meier survival analysis, ROC analysis, and Cox regression analysis. Interestingly, previous studies have indicated that most of these ten genes were related to cancer, respectively. For example, Liew et al. suggested that DEF6 expression in ovarian carcinoma was associated with poor survival of patients31. Liu et al. indicated that SLC10A7 was involved in prognosis-associated alternative splicing events of cancer32. Zhang et al. revealed the clinical significance of MYT1L gene polymorphisms in Chinese patients with gastric cancer33. Liu et al. reported that miR-596 could modulate melanoma growth via regulating cell survival and death34. Xiang et al. found that CHRNA7 inhibited cell invasion and metastasis of LoVo human colorectal cancer cells by PI3K/Akt signaling35. Dai et al. revealed that miR-424-5p promoted the proliferation and metastasis of colorectal cancer via directly targeting SCN4B36.Yuan et al. identified a novel splice variant of AC3-33 (C3orf33) in breast cancer37. Kai et al. discovered that mucin 21 was a novel, negative immunohistochemical marker for epithelioid mesothelioma for its differentiation from LUAD38. The result showed that 8 of these 10 genes were involved in cancer. We speculated the above 10 genes may be related to OS of LUAD patients. In addition, the result of Fig. 3 suggested that hypermethylation of cg00237391, cg04529955, cg12133048, cg13643814, cg17186803, cg24311704, cg27468419 sites was involved in poor OS and the hypomethylation of cg06393879, cg11539066, cg13600632 and cg20546263 sites was associated with a shorter OS. The prognosis predictive ability of a single methylation site was limited while the combination of multiple methylation sites had a stronger robustness for predicting OS of LUAD patients, which was further verified by Fig. 4B,D,F.

To better explore whether different clinical characteristics have different effects on the 11-DNA methylation signature, we divided the patients based on site, age, stage, gender and smoking history and used the subgroup analysis to detect whether there were differences between ROC curves of different clinical characteristics cohorts. An 11-DNA methylation signature was determined in the present study. The result indicated that the biomarker had a high diagnostic ability for prognosis of stage I–II LUAD patients. We hope that it can assist monitor stage I–II LUAD patients’ OS and also offer some assistance for the in-depth study of stage I–II LUAD. In addition, a few significant superiorities were required to be elaborated in the study. We developed a nomogram based on risk score, age and cancer status to predict 3- and 5-year stage I–II LUAD patients’ OS. The result suggested that AUC at 1, 3, 5 years entire TCGA dataset were 0.770, 0.849, 0.979, respectively, indicating good value of the 11-DNA methylation biomarker in the clinical application, which made our study more valuable. On the other hand, we performed LASSO Cox regression analysis to explore the candidate methylation sites significantly involved in stage I–II LUAD patients’ OS, which can filter the variables between univariate and multivariate Cox analysis. That is to say, the application of LASSO Cox regression model can elevate the predictive value of the 11-DNA methylation biomarker.

Following that, a comparison of the nomogram with other known prognostic predicators exhibited that our nomogram had apparently higher value in the result prediction of LUAD. In addition, the AUC of the nomogram was larger than the 11-DNA methylation biomarker in this study, demonstrating that the combination of the risk score with clinical variables was more significant in comparison to the methylation-correlated signature alone in the prediction of LUAD patients' prognosis. In addition, based on our nomogram, clinicians will be able combine clinical and molecular factors to determine an individualized probability of recurrence for patients with stage I–II LUAD, which represents a major advance in the field of personalized medicine for pulmonary oncology, suggesting that this tool could help clinicians overcome one of the most challenging limitations we face when treating patients with stage I–II LUAD.

Whereas, there were a few limitations that needed to be emphasized in this study. Firstly, an independent external validation set was required to improve the predictive accuracy of DNA methylation biomarker in our study. Secondly, it takes a relatively long time for this predictive marker to be used clinically. Thirdly, our study was a retrospective study from TCGA database, which might produce some amount of selection bias. In addition, the samples of our study contained a mixture of methylomes in a tumor-lymphocytes/subsets, macrophage/subsets, fibroblasts, as well as tumor cells, which may make reproducibility challenge. Furthermore, Illumina's 450 K Platform chip measured 450,000 methylation sites, which was less than 1% of all methylation sites, and this may yield some bias in our study.

Conclusions

In this study, we successfully developed a new DNA methylation prognostic predicator. In addition, a nomogram was developed based on methylation risk score, age and cancer status by a comprehensive analysis of bioinformatics methods, which could be used for the prediction of stage I–II LUAD patients’ OS.

Materials and methods

DNA methylation data of stage I–II LUAD patients

DNA methylation data of patients with stage I–II LUAD from TCGA database that was processed using the Illumina HumanMethylation450 BeadChip (Illumina Inc., CA, USA) and the associated clinical information was retrieved from TCGA database by R TCGAbiolinks package39, notably, Illumina's 450 K Platform chip analyzed 450,000 methylation sites, which was less than 1% of all methylation sites, and this may yield some bias in our study. The DNA methylation were evaluated via β-values and calculated as the proportion of M and M + U + 100, in which M represented the signal from methylated beads, and U referred to the signal from unmethylated beads targeting CpG site. Data for a total of 393 stage I–II LUAD patients containing 485,577 DNA methylation sites were enrolled after removing patients for whom clinical survival information was not available. We analyzed the correlation between DNA methylation levels and the related OS of patients with stage I–II LUAD. We randomly divided the total stage I–II LUAD patients into two parts: training group (70%) and testing group (30%). The training group was used for model construction and the testing group for assessing the value of the model. LASSO is a critical regularization in many various analysis approaches. Here, LASSO regression model was used to screen core methylation sites involved in stage I–II LUAD patients’ OS. LASSO COX regression analysis was conducted by using a publicly available R package ‘glmnet’40 for 1000 iterations.

Data processing, normalization, and determination of differentially expressed methylation sites

The preprocessing of the raw data was performed to analyze the prognosis prediction classifier. The methylation site whose beta value was not available (NA) in any sample was deleted. Following this, we performed the normalization of the data by ‘betaqn’ function of wateRmelon package41. Moreover, the whole samples were divided into dead and alive groups based on recurrent status. The normalized beta was transformed to M value with the formulation: M = log(β/(1 − β)). M value was employed for the elimination of the bias caused by different probes. Finally, M value was exploited for the determination of the differentially expressed methylation sites between recurrent cluster and no recurrent cluster with ‘dmpFinder’ function of minfi package42.

Identification of methylomics-based signature

First of all, Univariate Cox regression analysis was conducted to screen the methylation sites associated with the OS of stage I–II LUAD patients. Then, the LASSO Cox regression analysis was performed via the screened methylation sites for the selection of the candidate methylation sites correlated with OS of stage I–II LUAD patients. Following this, the candidate methylation sites were analyzed by multivariate Cox regression analysis to determine the independent prognostic hallmark for OS of stage I–II LUAD patients. Eventually, an 11-DNA methylation signature was determined for the prediction of OS stage I–II LUAD patients. Risk score models were developed on the basis of the 11-DNA methylation signature to calculate the risk score of each sample. The median risk score was set as the cutoff point. Patients with stage I–II LUAD were categorized into “high-risk” or “low-risk” cohorts based on a high and low score, respectively. Log-rank testing of the Kaplan–Meier curve was performed via the “survival” package43 to assess the difference in OS of the two cohorts. ROC analysis was performed via the ‘survivalROC’ package44 and the area under the ROC curve (AUC) was employed to evaluate the predictive performance of the hallmark. The greater the AUC value of a hallmark, the better the predictive capacity of the marker. Unless otherwise noted, all curves were plotted using R (version 4.0.2$2020).

Gene set variation analysis (GSVA)

To reveal the established biomarker-associated signaling pathways, we used a GSVA package18 to evaluate DNA methylation risk score and enriched pathways activity conditions. The median risk score was set as the cutoff point to divide the patients into “high-risk” or “low-risk” cohorts. The exact pearson was drawn to analyze the correlations between enriched pathways and risk score. Significance was set as P < 0.05.

Construction of the nomogram

To improve the predictive value of established predicator for stage I–II LUAD patients’ OS, a nomogram was built via the ‘rms’ R package45. The univariate and multivariate Cox proportional hazard analysis were conducted on the basis of the methylation risk score and other clinical factors. Cox proportional hazard models was adopted to measure hazard ratios (HR) and corresponding 95% confidence interval (CI). We used the factors (P ≤ 0.05) from the multivariate Cox proportional hazard analysis to construct the nomogram. The nomogram was assessed based on C-index, ROC, DCA. The calibrate curve described the outcome of the nomogram, and the 45° line suggested the best prediction.

Ethics approval and consent to participate

Data obtained from the TCGA open-access database was collected from tumors of patients who provided informed consent based on the guidelines from the TCGA Ethics, Law and Policy Group.

Consent for publication

All patients included in the TCGA public domain database consented for publication as detailed in [https://cancergenome.nih.gov/abouttcga/policies/informedconsent].