Introduction

Lung cancer remains the leading cause of cancer mortality worldwide. In 2020, 2.2 million new cases were diagnosed and 1.8 million deaths occurred globally, representing 18.0% of all cancer deaths [1]. Data suggest that around 68–92% of patients survive at least 5 years when diagnosed at the earliest stage, but this falls to just 10% for those diagnosed with the most advanced disease (stage IV) [2]. In most patients, the cancer has already spread beyond its original site to a distant part of the body by the time symptoms appear and medical care is sought. Early detection of lung cancer therefore helps to improve treatment and survival [3].

Low-dose computed tomography (LDCT) is commonly used to detect pulmonary nodules, but ambiguous risk evaluation often leads to overdiagnosis. Numerous blood antigens have been investigated for years as potential biomarkers of lung cancer. The most intensively studied include cytokeratin 19 fragment (CYFRA 21-1) [4], carcinoembryonic antigen (CEA) [5], neuron-specific enolase (NSE) [6], and squamous cell carcinoma antigen (SCC-Ag) [7]. However, the performance of these biomarkers for early diagnosis is unsatisfactory because of their low sensitivity. Accordingly, effective and specific diagnostic biomarkers for early-stage lung cancer are highly desirable.

Numerous studies have shown that DNA methylation is strongly related to the occurrence and progression of various tumors [8]. DNA methylation is an epigenetic modification in which DNA methyltransferases catalyze the covalent transfer of a methyl group from the donor S-adenosylmethionine to the C-5 position of the cytosine ring, forming 5-methylcytosine [9]. Extensive DNA hypomethylation has been reported across the genome of tumor cells, leading to activation of proto-oncogenes and increased genomic instability [10]. In contrast, the promoter regions of tumor suppressor genes and repair genes are frequently hypermethylated in tumor cells, which inhibits expression of the corresponding genes [11, 12]. Hypermethylation in tumor cells mostly occurs in CpG islands in promoter regions, whereas the same CpG islands are mostly unmethylated in normal cells [8]. Aberrations in DNA methylation are found at the genomic level in most tumors, including lung cancer, as well as in patients with non-neoplastic diseases such as Alzheimer’s disease and heart failure [13, 14]. Many studies have found that different diseases, and even different stages of a disease, may have specific methylation patterns [15]. The frequency of CpG island hypermethylation in tumor cells is much higher than that of gene mutation [16]. Therefore, by measuring the methylation level of a specific set of genes or of the whole genome, it may be possible to predict the risk of lung cancer [17,18,19].

In this study, using our DNA methylation sequencing data, we compared the methylation profiles of tissue samples from lung cancer and benign lung disease and identified seven novel methylation biomarkers for early diagnosis of lung cancer. Based on this panel of seven differentially methylated regions (the 7-DMR panel), we constructed a new diagnostic model that predicts the risk of malignancy from blood samples and could be further developed into a noninvasive diagnostic test.

Methods

Participating patients and sample collection

A total of 317 subjects were recruited, including 50 age- and gender-matched healthy controls and 267 patients with lung nodules identified by CT/LDCT scan at The First Affiliated Hospital of Soochow University in China from January 2020 to December 2021. All enrolled patients with lung diseases were at high risk of lung cancer and therefore underwent surgical resection. None of the patients received any preoperative cancer therapy. Peripheral blood (10 mL) was collected from eligible patients 1–3 days before surgery. Formalin-fixed paraffin-embedded (FFPE) tissue samples were obtained from the subsequent surgical resections. Pathological information for all samples was determined from surgically resected tissue sections according to the 2015 WHO Histological Classification of Lung Cancer [20]. Written informed consent was provided by all participants. This study was approved by the Ethics Committee of The First Affiliated Hospital of Soochow University.

Additional details on tissue and blood sample processing, targeted cell-free DNA methylation sequencing, and sequencing data analysis are provided in the online data supplement.

Differential methylation analysis

Differential methylation analysis was conducted using the R package DSS (version 2.14.0) [21]. Differentially methylated CpGs (DMCs) were first identified (criteria: FDR < 0.05, delta > 0.05), and adjacent DMCs were then merged into differentially methylated regions (DMRs). Each DMR was required to contain at least 3 CpG sites, with neighboring CpG sites no more than 100 base pairs apart. DMRs were intersected with protein-coding genes (hg19, Ensembl v75, n = 20,232) using ANNOVAR [22].
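The merging criteria stated above (at least 3 CpG sites per region, neighboring sites no more than 100 bp apart) can be illustrated with a minimal Python sketch. This is not the DSS implementation: the function name `merge_dmcs`, the input format (a list of `(chromosome, position)` tuples for significant CpGs), and the example coordinates are assumptions made for illustration only.

```python
from itertools import groupby


def merge_dmcs(dmcs, max_gap=100, min_cpgs=3):
    """Merge significant CpGs into candidate regions.

    dmcs: iterable of (chrom, pos) tuples (hypothetical input format).
    Returns a list of (chrom, start, end, n_cpgs) tuples.
    """
    regions = []
    ordered = sorted(set(dmcs))  # sort by chromosome name, then position
    for chrom, group in groupby(ordered, key=lambda c: c[0]):
        positions = [pos for _, pos in group]
        run = [positions[0]]
        for pos in positions[1:]:
            if pos - run[-1] <= max_gap:
                run.append(pos)  # close enough: extend the current run
            else:
                if len(run) >= min_cpgs:
                    regions.append((chrom, run[0], run[-1], len(run)))
                run = [pos]  # gap too large: start a new run
        if len(run) >= min_cpgs:
            regions.append((chrom, run[0], run[-1], len(run)))
    return regions


# Made-up coordinates, for illustration only.
example = [
    ("chr1", 100), ("chr1", 150), ("chr1", 240),  # 3 CpGs, gaps <= 100 bp
    ("chr1", 500), ("chr1", 560),                 # only 2 CpGs: discarded
    ("chr2", 10), ("chr2", 20), ("chr2", 30),
]
print(merge_dmcs(example))
```

In this toy input the two-CpG run on chr1 is dropped because it does not reach the minimum of 3 CpG sites.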

Unsupervised hierarchical clustering of DNA methylation profiles

The methylation profiles of tissue samples in the discovery cohort and the independent validation cohort, obtained with our custom-made methylation panel of 9307 informative lung cancer DMRs, were used for unsupervised hierarchical clustering. The methylation level of each targeted region was calculated as the ratio of methylated CpGs to total sequenced CpGs (the sum of methylated and unmethylated CpGs). Before clustering, the methylation level of each targeted region was Z-score normalized: for each region, we subtracted the region's mean across samples from each sample's value and divided the result by the region's standard deviation. The R function “hclust” was used to perform hierarchical clustering with “ward.D2” as the agglomeration method. The R package pheatmap was used to plot the heat map after hierarchical clustering.
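The two calculations described above (the per-region methylation level and its Z-score across samples) can be sketched as follows. The function names and example values are hypothetical, and the use of the sample standard deviation (n − 1 denominator, as in R's `sd`) is an assumption, since the text does not state which estimator was used. The clustering step itself is not shown.

```python
from statistics import mean, stdev


def methylation_level(methylated, unmethylated):
    """Ratio of methylated CpG counts to all sequenced CpG counts in a region."""
    total = methylated + unmethylated
    return methylated / total if total else float("nan")


def zscore_region(levels):
    """Z-score one region's methylation levels across samples.

    Uses the sample standard deviation (n - 1); this is an assumption.
    """
    mu = mean(levels)
    sd = stdev(levels)
    if sd == 0:
        return [0.0 for _ in levels]  # constant region: no variation to scale
    return [(x - mu) / sd for x in levels]


# Made-up counts and levels, for illustration only.
print(methylation_level(30, 70))
print(zscore_region([0.2, 0.4, 0.6]))
```

Each region is normalized independently, so that regions with different baseline methylation contribute on a comparable scale to the distance calculation used for clustering.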

Diagnostic marker selection

The machine learning task aimed to identify a DNA methylation biomarker panel for accurate diagnosis of early-stage lung cancer. We first retained only DMRs with a coefficient of variation of at most 30% across analytical replicates (quality assessment samples) to ensure good analytical reproducibility of the selected DMRs. We then performed marker selection using a Python implementation (https://github.com/smazzanti/mrmr) of the minimum redundancy and maximum relevance (mRMR) feature selection algorithm. We examined the relationship between model performance and the number of features (from 1 to 10) using ten repeats of fivefold cross-validation in the discovery cohort. We limited the maximum number of features to 10 for practical reasons: a marker panel based on a relatively small set of DMRs may be easier to translate and implement in clinical practice. We then determined the optimal number of features (that is, the marker panel) according to the ‘maximal AUC score using the minimal set of DMRs’ principle.
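Two of the steps above lend themselves to a short illustration: the reproducibility filter and the panel-size rule. The sketch below is an interpretation under stated assumptions, not the authors' code. The names `coefficient_of_variation`, `filter_reproducible`, and `smallest_best_panel` are hypothetical, the `tolerance` argument is an added assumption, and all numbers are made up. The mRMR ranking itself is not reproduced here.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mu = mean(values)
    if mu == 0:
        return float("inf")  # undefined CV: treat as not reproducible
    return stdev(values) / mu


def filter_reproducible(replicates, max_cv=0.30):
    """Keep regions whose CV across analytical replicates is at most max_cv.

    replicates: dict mapping region name -> list of methylation levels
    measured in replicate runs (hypothetical input format).
    """
    return [name for name, levels in replicates.items()
            if coefficient_of_variation(levels) <= max_cv]


def smallest_best_panel(mean_auc_by_size, tolerance=0.0):
    """Smallest number of features whose mean cross-validated AUC is within
    `tolerance` of the best AUC (one reading of the 'maximal AUC using the
    minimal set of DMRs' principle; the tolerance is an assumption)."""
    best = max(mean_auc_by_size.values())
    return min(k for k, auc in mean_auc_by_size.items() if auc >= best - tolerance)


# Made-up values, for illustration only.
replicate_levels = {"DMR_A": [0.50, 0.52, 0.48], "DMR_B": [0.10, 0.30, 0.20]}
print(filter_reproducible(replicate_levels))
print(smallest_best_panel({1: 0.85, 2: 0.90, 3: 0.95, 4: 0.95}))
```

In the toy input, `DMR_B` is removed because its replicate measurements vary by about 50% of their mean, and a panel of 3 features is preferred over 4 because both reach the same AUC.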

Gene Ontology and pathway enrichment analyses for diagnostic biomarkers

KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis and GO (Gene Ontology) enrichment analysis were conducted using the clusterProfiler R package.

Analysis of gene expression with TCGA data

Gene expression data were downloaded from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). Data for lung adenocarcinoma (LUAD; 59 normal, 535 cancer) and lung squamous cell carcinoma (LUSC; 49 normal, 502 cancer) tumor tissues and adjacent normal tissues were collected. The expression levels of HOXB4, B3GNTL1, ZNF808, HOXD8, ITGA4, PTGER4, and HOXA7 were compared between lung cancer tissue and adjacent normal tissue.

Performance evaluation of 7 candidate diagnostic biomarkers in TCGA data

DNA methylation datasets, in which the methylation level of each CpG site is given as a beta value, were retrieved from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). The DNA methylation levels of HOXB4, B3GNTL1, ZNF808, HOXD8, ITGA4, PTGER4, and HOXA7 in lung adenocarcinoma (31 normal, 473 cancer) and lung squamous cell carcinoma (42 normal, 370 cancer) tumor tissues and adjacent normal tissues were collected.

Construction of the 7-DMR model

First, to evaluate the diagnostic performance of the 7-DMR methylation panel for classifying lung cancer tissue samples, a predictive model was developed by fitting a logistic regression model in the discovery cohort with the 7 DMRs as input. Python’s scikit-learn package (v0.20.0) was used to perform the logistic regression with the following parameters: penalty = l2, tol = 1e-4, C = 1.0, fit_intercept = True, class_weight = None, solver = lbfgs, max_iter = 100. Then, to test the discriminative power of the 7-DMR methylation panel for noninvasive detection of early-stage lung cancer using cell-free DNA from plasma, all tissue samples were pooled to construct a predictive model with logistic regression. This model was then applied to a plasma cohort consisting of patients with lung cancer, patients with benign diseases, and healthy controls.
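For orientation, the objective that scikit-learn documents for binary logistic regression with an L2 penalty is reproduced below, together with the resulting predicted probability. The notation is scikit-learn's rather than the authors': here $x_i$ is assumed to be the vector of seven DMR methylation levels for sample $i$, $y_i \in \{-1, +1\}$ encodes benign versus cancer, $w$ and $c$ are the fitted weights and intercept, and $C = 1.0$ is the inverse regularization strength listed above. The decision threshold applied to $\hat{p}$ is not stated in this section and is therefore left unspecified.

```latex
\min_{w,\,c}\;\; \frac{1}{2}\, w^{\top} w
  \;+\; C \sum_{i=1}^{n} \log\!\Bigl(1 + \exp\!\bigl(-y_i\,(x_i^{\top} w + c)\bigr)\Bigr),
\qquad
\hat{p}(x) \;=\; \frac{1}{1 + \exp\!\bigl(-(x^{\top} w + c)\bigr)}
```

A smaller $C$ would shrink the seven weights more strongly toward zero; with $C = 1.0$ the penalty and the data term are weighted as in the library's standard setting.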

Statistical analyses

All statistical analyses were conducted using R software (v3.32). Continuous variables were presented as means and standard deviations or medians and ranges, while categorical variables were presented as counts. Continuous variables were compared using Student’s t test, while categorical variables were compared using the chi-square test. The 95% confidence intervals (CI) for the AUC, sensitivity, specificity, and accuracy of the models were calculated using a binomial distribution. Receiver operating characteristic (ROC) analysis was performed using the pROC R package (v1.15.3). Unless otherwise specified, all statistical tests were two-sided with an alpha level of 0.05.
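The text does not name the specific binomial interval that was used. As one possible reading, the sketch below computes a normal-approximation (Wald) interval for a binomial proportion such as sensitivity or specificity; an exact (Clopper-Pearson) interval would give different bounds, especially near 0 or 1. The function name `binomial_wald_ci` and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def binomial_wald_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: this is only one of several binomial-based intervals and may
    not be the one used in the study.
    """
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Made-up counts, for illustration only: 45 correct calls out of 50.
low, high = binomial_wald_ci(45, 50)
print(f"{45 / 50:.2f} ({low:.2f}-{high:.2f})")
```

Note that the Wald interval collapses to a single point when the observed proportion is exactly 0 or 1, which is a known limitation of this approximation in small samples.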

Results

Clinical cohorts

We collected a total of 198 tissue samples to identify differential DNA methylation biomarkers for early diagnosis of lung cancer. Twenty-one subjects were excluded because of inadequate DNA after extraction (n = 15) or failed sequencing (n = 6). Consequently, 96 samples (80 lung cancers, 16 benign lung diseases) formed the discovery cohort for the 7-DMR model, and the remaining 81 samples (64 lung cancers, 17 benign diseases) served as an independent validation cohort. In total, 119 plasma samples were collected to evaluate the performance of the diagnostic model. Thirteen were excluded because of inadequate DNA or failed sequencing, and the remaining 106 subjects (26 lung cancers, 30 benign diseases, 50 healthy controls) were included in subsequent analysis (Fig. 1). There was no statistically significant difference in age among the three cohorts. Early-stage patients (stages 0/I) made up 88.2% of the cohorts, which supported the identification of features correlated with early-stage lung cancer. Of all participants, 71.4% were never smokers. Patient demographic and clinical characteristics are summarized in Table 1.

Fig. 1

Flowchart for identifying candidate diagnostic biomarkers for lung cancer. A total of 317 subjects were enrolled. The 7-DMR model was developed on 96 tissue samples and independently validated on 81 tissue samples. The noninvasive diagnostic performance of the 7-DMR model was evaluated in 106 plasma samples.

Table 1 Patient Demographic and Clinical Characteristics

Identification of differential methylation regions for lung cancer diagnosis

By comparing the methylation profiles of tissue samples from lung cancer and benign lung disease, we found 6604 hypermethylated and 2703 hypomethylated DMRs (Fig. 2a-b), corresponding to 2614 hypermethylated and 1228 hypomethylated genes. The abnormally methylated regions were predominantly located in intron, intergenic, and exon regions (Fig. 2c), and were highly enriched in CpG islands, promoters, CTCF binding sites, promoter flanking regions, and TF binding sites (Fig. 2d). This was consistent with the general characteristics of aberrant DNA methylation in solid tumors. KEGG pathway analysis indicated that the following pathways, among others, were closely associated with these genes: Regulation of actin cytoskeleton, Non-small cell lung cancer, Wnt signaling pathway, Axon guidance, and Hippo signaling pathway (Figure S1). GO enrichment analysis showed that a variety of cellular components, molecular functions, and biological processes may be involved, especially the axon part, DNA-binding transcription activator activity, and embryonic organ morphogenesis (Figure S2a-c).

Fig. 2

Identifying lung cancer-specific differentially methylated regions. (a-b) Heatmaps of the differentially methylated sites in lung cancer and benign pulmonary nodule tissues in the training dataset (a) and validation dataset (b), comprising 6604 hypermethylated and 2703 hypomethylated regions. (c) Genic locations of the hypermethylated and hypomethylated sites. (d) The correlation between hypermethylated and hypomethylated DMRs and regulatory regions in the genome.

Identification of candidate methylated biomarkers for lung cancer diagnosis

We then used the minimum redundancy and maximum relevance (mRMR) algorithm to assess the predictive power of each DMR and selected the 7 most significant DMRs: chr17:46655603–46655750, chr7:27195684–27195794, chr2:176993563–176993743, chr2:182322423–182322574, chr19:53038958–53039010, chr5:40681077–40681250, and chr17:80943984–80944093. The genes corresponding to these 7 methylation regions were HOXB4, HOXA7, HOXD8, ITGA4, ZNF808, PTGER4, and B3GNTL1, respectively (Table 2). To investigate the correlation between the seven differentially methylated regions and the progression of lung cancer, we compared the expression of the seven corresponding genes in lung cancer and normal controls using TCGA gene expression data for LUAD (59 normal, 535 cancer) and LUSC (49 normal, 502 cancer). The expression of B3GNTL1 and HOXD8 was significantly upregulated, while the expression of the remaining five genes (HOXB4, ZNF808, ITGA4, PTGER4, and HOXA7) was significantly downregulated in lung cancer tissues (p < 0.01) (Fig. 3a). This suggests that the seven genes may have biological and clinical significance in the formation of lung cancer.

To test the ability of the seven markers to distinguish lung cancer from normal controls, we analyzed the performance of each DMR individually using TCGA data for LUAD (31 normal, 473 cancer) and LUSC (42 normal, 370 cancer). For LUAD, all 7 markers achieved AUCs ranging from 0.90 to 0.97. For LUSC, the models of PTGER4 and B3GNTL1 reached AUCs of 0.75 (95% CI: 0.70–0.79) and 0.77 (0.73–0.81), respectively. ITGA4 and HOXB4 yielded AUCs of 0.86 (0.81–0.90) and 0.82 (0.78–0.86), respectively. The remaining markers (HOXA7, HOXD8, ZNF808) achieved AUCs greater than or equal to 0.94 (Fig. 3b). Collectively, these results suggested that the 7 DMRs performed well and merited further investigation.
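As a reminder of what the AUC values above measure, the sketch below computes the area under the ROC curve for a single marker as the probability that a randomly chosen cancer sample scores higher than a randomly chosen normal sample, with ties counted as one half. It is a plain Python illustration, not the pROC implementation used in the study. The function name `roc_auc` and the scores are hypothetical, and the sketch assumes that higher values indicate cancer; for a hypomethylated marker the direction would need to be reversed.

```python
def roc_auc(scores_cancer, scores_normal):
    """AUC as the fraction of (cancer, normal) pairs ranked correctly.

    Assumes higher scores indicate cancer; ties count as half a correct pair.
    """
    wins = 0.0
    for c in scores_cancer:
        for n in scores_normal:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(scores_cancer) * len(scores_normal))


# Made-up beta values, for illustration only.
cancer = [0.9, 0.8, 0.4]
normal = [0.5, 0.3, 0.2]
print(round(roc_auc(cancer, normal), 3))
```

In this toy input, 8 of the 9 cancer-normal pairs are ordered correctly, so the AUC is 8/9. An AUC of 0.5 would correspond to a marker that ranks pairs no better than chance.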

Table 2 The 7 DMRs for lung cancer diagnosis
Fig. 3

Gene expression and diagnostic performance of the 7 DMRs in TCGA. (a) The comparison of gene expression levels between lung cancer and normal controls based on TCGA data. (b) The representative ROC curves display the classification performance of each DMR in LUAD/LUSC vs. normal based on TCGA data.

Evaluation of the accuracy of the diagnostic model in tissues

Next, we built a diagnostic model based on the panel of seven DMRs, termed the 7-DMR model, using a training set of 80 lung cancer and 16 benign lung disease tissues. The accuracy of the model in tissues was tested in the validation set of 64 lung cancers and 17 benign diseases. The model achieved an AUC of 0.97 (0.93–1.00), sensitivity of 0.89 (0.82–0.95), specificity of 0.94 (0.89–0.97), and accuracy of 0.90 (0.84–0.96) in the discovery cohort, and an AUC of 0.96 (0.92–1.00), sensitivity of 0.92 (0.86–0.98), specificity of 1.00 (1.00–1.00), and accuracy of 0.94 (0.89–0.99) in the independent validation cohort (Fig. 4a-c; Table 3). Unsupervised hierarchical clustering of these 7 markers was able to distinguish lung cancers from benign lung diseases with high specificity and sensitivity (Fig. 4d-e).
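The relationship between a confusion table and the reported metrics can be made concrete with a short sketch. The counts used here are not read from Fig. 4; they are back-calculated as an assumption, chosen because they are consistent with the validation cohort size (64 lung cancers, 17 benign diseases) and the reported sensitivity, specificity, and accuracy. The function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-table counts."""
    return {
        "sensitivity": tp / (tp + fn),           # cancers correctly called
        "specificity": tn / (tn + fp),           # benign cases correctly called
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Assumed counts for the validation cohort (back-calculated, not from Fig. 4):
# 59 of 64 cancers and 17 of 17 benign cases classified correctly.
validation = diagnostic_metrics(tp=59, fn=5, tn=17, fp=0)
for name, value in validation.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the rounded values agree with the 0.92, 1.00, and 0.94 reported for the independent validation cohort.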

Table 3 The 7-DMR model performance metrics. Values are presented as mean (95% CI)
Fig. 4

Diagnostic performance of 7-DMR model in tissues. (a-b) Confusion tables of binary results of the 7-DMR model in the training (a) and validation data sets (b). (c) The representative ROC curves of 7-DMR model in lung cancer and benign nodule tissues in both discovery and validation cohorts. (d-e) Unsupervised hierarchical clustering of seven methylation markers for 7-DMR model in the training (d) and validation data sets (e) for tissues. LC: lung cancer

Evaluation of the accuracy of the diagnostic model in plasma

Because ideal biomarkers should be detectable non-invasively in biological fluids, we further tested the accuracy of the 7-DMR model in 106 plasma samples. Consistent with its performance in tissue samples, the 7-DMR model achieved AUCs of 0.93 (0.86–1.00) for lung cancers vs. benign diseases and 0.94 (0.86–1.00) for lung cancers vs. healthy controls. When benign diseases and healthy controls were combined into a single non-cancer group, the 7-DMR model maintained stable diagnostic performance, with an AUC of 0.94 (0.86–1.00), sensitivity of 0.81 (0.73–0.88), specificity of 0.98 (0.95–1.00), and accuracy of 0.93 (0.89–0.98) (Fig. 5a-b; Table 3). The diagnostic capability of the seven DMRs was also supported by unsupervised hierarchical clustering of lung cancer, benign disease, and healthy control samples (Fig. 5c).

Fig. 5

Diagnostic performance of the 7-DMR model in plasma. (a) Confusion tables of binary results of the 7-DMR model in plasma. (b) The representative ROC curves of the 7-DMR model in plasma from lung cancer, benign disease, and healthy control subjects. (c) Unsupervised hierarchical clustering of the seven methylation markers of the 7-DMR model in plasma.

Discussion

The incidence of lung cancer and other cancers has risen sharply [23]. Although traditional pathological examination is still the gold standard for the diagnosis of various tumors, an accurate, non-invasive, and rapid diagnostic test is urgently needed in clinical practice. Numerous efforts have been devoted to searching for effective biomarkers. Because aberrant DNA methylation patterns have been identified in lung cancer, DNA methylation biomarkers have been intensively investigated as potential diagnostic markers for early-stage lung cancer [24]. As early as 2005, Schmiemann et al. discovered abnormal methylation of the APC, p16 (INK4a), and RASSF1A genes in lung cancer patients and proposed using methylation biomarkers for early diagnosis of lung cancer [16]. Recent research using prospectively collected, pre-diagnostic peripheral blood samples, a readily accessible sample source, is expected to provide valuable predictive marker data [25].

In this study, we systematically analyzed the methylation data of lung cancer. By comparing the methylation profiles of tissue samples from lung cancer and benign lung disease, we identified seven unique methylation alterations that could serve as promising biomarkers for early diagnosis of lung cancer. Each marker could accurately distinguish lung cancers from normal controls. A diagnostic model built on the panel of 7 DMRs achieved a sensitivity of 92.2% and an accuracy of 93.8% in the independent validation cohort. The diagnostic model was also evaluated in a set of plasma samples, where it discriminated well between groups. The abnormal expression of the DMR-related genes in lung cancer was confirmed with TCGA gene expression data. Together, these data suggest that the 7 DMRs may play significant roles in the development and progression of lung cancer and are promising candidates for diagnostic biomarkers in early-stage lung cancer.

Strengths and limitations

Ideal diagnostic biomarkers are expected to be highly sensitive, specific to lung cancer, and non-invasively detectable at an early stage. We tested the seven biomarkers in a small set of plasma samples, where they showed strong diagnostic performance, indicating that the seven DMRs could potentially be applied as biomarkers in clinical practice. However, the sample size of this study was limited, and we plan to recruit more participants to verify the generalizability of the model. The development of lung cancer is a complex process involving multiple genetic, epigenetic, and protein expression alterations. Predictive models built on methylation biomarkers alone may therefore be inadequate for diagnosing lung cancer. In the future, we will consider combining multi-omics data such as radiomics, DNA fragmentation patterns, and proteomic biomarkers to further improve predictive performance. Furthermore, intensive investigation of the functions of the targeted genes is necessary to elucidate the molecular events occurring during lung cancer development and progression.

Conclusion

In summary, we identified seven novel lung cancer-specific methylation markers that were able to discriminate lung cancer from non-lung cancer. Our study suggests that the 7-DMR panel is of great value in the diagnosis of early-stage lung cancer and may potentially be used as a noninvasive risk assessment tool for lung cancer before resection surgery.