Introduction

Lung cancer is the most common malignant tumor and the leading cause of cancer-related morbidity and mortality worldwide1. Non-small-cell lung carcinoma (NSCLC), which includes lung squamous cell carcinoma and lung adenocarcinoma (LUAD), accounts for approximately 85% of lung cancers2. Despite advances in clinical and experimental oncology, early diagnostic and prognostic biomarkers for NSCLC remain unsatisfactory3. Therefore, there is an urgent need to identify new biomarkers to predict the clinical outcome of NSCLC patients.

Smoking, exposure to environmental tobacco smoke, residential radon, cooking oil fumes and particulate matter 2.5 (PM2.5) are the main environmental risk factors associated with the occurrence of lung cancer4,5. Smoking is the major risk factor, accounting for about 90% of lung cancer cases in men and 70% in women6. Smoking causes multiple alterations to cells and tissues, including DNA single-strand breaks7, chromosome exchanges8 and chromosome instability (CIN)9, which can lead to genomic instability (GI). GI, which mainly comprises CIN and microsatellite instability (MSI), is an important hallmark of various human cancers10,11,12. GI has been shown to be closely related to the clinical diagnosis and prognosis of multiple malignant tumors12,13. Recent studies have shown that GI can be used as a prognostic marker in cervical cancer14 and breast cancer15. Emerging evidence has revealed that GI is associated with tumorigenesis in lung cancer16,17; however, the potential role and mechanism of GI in lung cancer need to be further explored.

LncRNAs are defined as non-protein-coding transcripts more than 200 nucleotides in length18. Accumulating evidence suggests that lncRNAs are involved in regulating gene expression at the epigenetic, transcriptional and post-transcriptional levels19,20,21. Non-coding RNA activated by DNA damage (NORAD) and long non-coding transcriptional activator of miR34a (GUARDIN) can maintain genomic stability by participating in DNA replication and repair in LUAD and colon cancer22,23. LncRNA DDSR1 (DNA damage-sensitive RNA1) modulates DNA repair by regulating homologous recombination in osteosarcoma24. However, the function and clinical roles of GI-associated lncRNAs in NSCLC remain unknown.

In this study, we systematically analyzed genomic data for LUAD from TCGA and found that 185 differentially expressed GI-associated lncRNAs were enriched in chromosome formation and cell cycle checkpoint pathways. Five GI-associated lncRNAs were identified through univariate and multivariate Cox regression analysis and used to construct a GI-associated lncRNA signature (GILncSig) model. Our results suggest that the GILncSig may be a potential biomarker for the diagnosis and prognosis of LUAD and that targeting the five GI-associated lncRNAs could be a therapeutic alternative for NSCLC.

Materials and methods

Data acquisition

RNA sequencing transcriptome data (n = 594, Table S1), somatic mutation information (n = 561, Tables S2 and S3) and the corresponding clinicopathological features (n = 522, Table S4) of the LUAD patients were downloaded from TCGA (https://portal.gdc.cancer.gov/). The mutation information mainly comprises single-nucleotide variants (SNVs), copy number variation (CNV) and MSI. LncRNA expression data were extracted from the previously analysed RNA expression data, which were annotated by the GENCODE project (Version GRCh37, http://www.gencodegenes.org).

Samples from patients with an overall survival (OS) of ≤ 30 days were excluded25. In the end, a total of 490 LUAD patients with paired lncRNA and mRNA expression data, somatic mutation information and clinicopathological characteristics were enrolled in our study to build the GILncSig model.

To increase the reliability of our research, we randomly divided the entire dataset into two approximately equal parts, a training set (n = 246) and a validation set (n = 244); the whole dataset was considered the combination set (n = 490). The workflow of this study is shown in Fig. 1. The clinicopathological features of the 490 LUAD patients are shown in Table 1.
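The random split described above can be illustrated with a short sketch. It is written in Python for illustration only (the study itself used R), and the patient identifiers and the fixed seed are hypothetical placeholders rather than values from the study.

```python
import random

def split_patients(patient_ids, n_train, seed=0):
    """Shuffle patient IDs reproducibly and split them into two disjoint sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers standing in for the 490 enrolled LUAD patients.
patients = [f"P{i:03d}" for i in range(490)]
training_set, validation_set = split_patients(patients, n_train=246)
combination_set = training_set + validation_set

print(len(training_set), len(validation_set), len(combination_set))
```

Keeping the seed fixed means the same training and validation sets can be regenerated when the analysis is rerun.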

Figure 1

The flowchart depicting the process of data collection and analysis.

Table 1 Clinicopathological features of lung adenocarcinoma patients in the training set, validation set and combination set.

Identification of GI-associated lncRNAs

GI-associated lncRNAs were obtained following the method described by Bao et al.26: (a) the cumulative number of somatic mutations for each patient was calculated; (b) patients were ranked in decreasing order of their cumulative number of somatic mutations; (c) the top 25% of patients were categorized as the genomically unstable (GU)-like team, while the last 25% were classified as the genomically stable (GS)-like team; (d) the lncRNA expression profiles of the GU-like team and the GS-like team were compared using the significance analysis of microarrays (SAM) method; (e) differentially expressed lncRNAs (log|fold change| > 1 and false discovery rate (FDR)-adjusted P < 0.05) were defined as genome instability-associated lncRNAs.
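Steps (a) to (c) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the mutation counts are invented toy values, and the rule for rounding the quartile size and for handling ties is an assumption, since the text does not specify it.

```python
def assign_instability_groups(mutation_counts, fraction=0.25):
    """Return (gu_like, gs_like) patient lists from a {patient: mutation count} mapping."""
    # Rank patients by cumulative somatic mutation count, highest first.
    ranked = sorted(mutation_counts, key=mutation_counts.get, reverse=True)
    k = int(len(ranked) * fraction)  # quartile size; the rounding rule is an assumption
    gu_like = ranked[:k]             # top fraction: genomically unstable (GU)-like
    gs_like = ranked[-k:] if k else []  # bottom fraction: genomically stable (GS)-like
    return gu_like, gs_like

# Toy cumulative somatic mutation counts (hypothetical values).
counts = {"P1": 512, "P2": 34, "P3": 220, "P4": 18,
          "P5": 97, "P6": 305, "P7": 61, "P8": 150}
gu, gs = assign_instability_groups(counts)
print("GU-like:", gu)
print("GS-like:", gs)
```

The differential expression comparison in steps (d) and (e) would then be run between the two returned patient lists.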

Evaluation of risk score

A risk-score formula was constructed by linearly combining the expression values of the GI-associated lncRNAs weighted by their coefficients, as follows27:

$$\text{Risk score} = \sum_{i=1}^{n} \text{coef}_i \times \text{expr}_i$$

where the risk score is the prognostic risk score of a LUAD patient, $\text{coef}_i$ is the coefficient and $\text{expr}_i$ is the expression level of the i-th prognostic GI-associated lncRNA, and n is the number of lncRNAs in the signature. Based on the median risk score, the LUAD patients were classified into a high-risk group (n = 254) and a low-risk group (n = 236).
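The risk-score formula and the median split can be sketched as below. The coefficient values and expression values are hypothetical placeholders (they are not the coefficients reported in Table 2); only the signs follow the text, with four risk lncRNAs and one protective lncRNA. Assigning scores strictly above the median to the high-risk group is also an assumption.

```python
from statistics import median

def risk_score(expression, coefficients):
    """Linear combination of lncRNA expression values weighted by Cox coefficients."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)

# Hypothetical coefficients, NOT the values from Table 2.
coefs = {"AC012085.2": 0.30, "FAM83A-AS1": 0.20, "MIR193BHG": 0.10,
         "LINC01116": 0.15, "MIR223HG": -0.25}

# Toy expression profiles for four hypothetical patients.
profiles = {
    "A": {"AC012085.2": 1.0, "FAM83A-AS1": 2.0, "MIR193BHG": 1.0, "LINC01116": 2.0, "MIR223HG": 4.0},
    "B": {"AC012085.2": 2.0, "FAM83A-AS1": 3.0, "MIR193BHG": 2.0, "LINC01116": 1.0, "MIR223HG": 1.0},
    "C": {"AC012085.2": 0.5, "FAM83A-AS1": 1.0, "MIR193BHG": 0.5, "LINC01116": 1.0, "MIR223HG": 3.0},
    "D": {"AC012085.2": 3.0, "FAM83A-AS1": 2.0, "MIR193BHG": 1.0, "LINC01116": 2.0, "MIR223HG": 0.5},
}

scores = {pid: risk_score(expr, coefs) for pid, expr in profiles.items()}
cutoff = median(scores.values())
# Assumption: scores strictly above the median are labelled high-risk.
groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
print(scores, cutoff, groups)
```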

Co-expression network

To measure the correlation between lncRNAs and mRNAs, Pearson correlation coefficients were calculated. For each lncRNA, the top 10 correlated mRNAs were considered its co-expressed partners, and a lncRNA–mRNA co-expression network was constructed.
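A minimal sketch of this co-expression step is given below. The gene names and expression vectors are invented, and ranking partners by the absolute value of the correlation is an assumption, as the text does not say whether negative correlations were considered.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # undefined (division by zero) if either vector is constant

def top_coexpressed(lnc_expr, mrna_exprs, k=10):
    """Return the k mRNAs most strongly correlated with one lncRNA as (gene, r) pairs."""
    scored = [(gene, pearson(lnc_expr, values)) for gene, values in mrna_exprs.items()]
    # Assumption: strength is measured by |r|, so negative correlations also count.
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:k]

# Toy expression vectors across five samples (hypothetical values and gene names).
lnc = [1, 2, 3, 4, 5]
mrnas = {
    "G1": [2, 4, 6, 8, 10],
    "G2": [5, 4, 3, 2, 1],
    "G3": [1, 3, 2, 5, 4],
    "G4": [3, 3, 4, 3, 3],
}
partners = top_coexpressed(lnc, mrnas, k=2)
print(partners)
```

Each returned pair would become one edge of the lncRNA–mRNA network.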

Functional enrichment analysis

To reveal the potential functions of the co-expressed lncRNAs and mRNAs, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analyses of the lncRNA-correlated protein-coding genes (PCGs) were performed using the clusterProfiler software.
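Enrichment analyses of this kind are commonly based on an over-representation test. As an illustration only, and assuming a one-sided hypergeometric test (the text does not state which test or settings were used), the p-value for a single GO term or KEGG pathway could be computed as in the sketch below; all counts are hypothetical.

```python
from math import comb

def enrichment_p_value(background, annotated, selected, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    background: number of genes in the background set
    annotated:  background genes annotated to the term or pathway
    selected:   size of the gene list being tested
    overlap:    selected genes that are annotated to the term
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(background - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, selected)

# Toy numbers: 20 background genes, 5 in the pathway, 4 genes tested, 2 of them in the pathway.
p = enrichment_p_value(background=20, annotated=5, selected=4, overlap=2)
print(round(p, 4))
```

In practice such p-values are computed for every term and then adjusted for multiple testing.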

Building and validation of a Nomogram

To assess the prognostic value of the lncRNAs, a nomogram was constructed following the method of Iasonos et al.28, incorporating the expression of the five lncRNAs in the combination set, and total scores for 1-year, 2-year and 3-year OS were evaluated. A calibration plot was then generated to evaluate the calibration and discrimination of the nomogram using a bootstrap method with 1000 resamples.

Statistical analysis

Hierarchical cluster analysis was performed using Euclidean distances and Ward’s linkage method. Kaplan–Meier analysis was used to estimate OS. Univariate and multivariate Cox regression and stratified analyses were used to verify the independence of the GILncSig from other clinical factors and to investigate the prognostic value of the GILncSig across cancers. Hazard ratios (HRs) and 95% confidence intervals (CIs) were calculated by Cox analysis. Receiver operating characteristic (ROC) curves were used to investigate the time-dependent prognostic value of the GILncSig. Principal component analysis (PCA) was performed to study the expression patterns in the different groups. All statistical analyses were performed using R version 4.0.3. A P-value of less than 0.05 was considered statistically significant.
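For readers unfamiliar with the Kaplan–Meier estimator, the following sketch shows the product-limit calculation on invented follow-up data. It is a plain Python illustration, not the R code used in the study, and it omits confidence intervals and the log-rank test.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve

# Toy follow-up data in months (hypothetical).
km = kaplan_meier(times=[5, 8, 8, 12, 15, 20], events=[1, 1, 0, 1, 0, 1])
for t, s in km:
    print(t, round(s, 4))
```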

Results

Identification of GI-related lncRNAs in LUAD patients

The cumulative number of somatic mutations in each patient was calculated and ranked to identify GI-associated lncRNAs. The top 25% of patients (n = 134) were assigned to the GU-like team and the last 25% (n = 139) to the GS-like team (Table S5). To find lncRNAs with significant differences, the mRNA and lncRNA expression profiles (Tables S6 and S7) of the two teams were compared. With log|fold change| > 1 and FDR-adjusted P-value < 0.05, a total of 185 lncRNAs (candidate genomic instability-related lncRNAs) were considered significantly differentially expressed (Table S8). Hierarchical clustering analysis was conducted on the 535 samples in the TCGA set. Based on the expression of the 185 differentially expressed lncRNAs, all 535 samples were clustered into two groups (Fig. 2A and Table S9). The group with higher cumulative somatic mutations was defined as the GU-like group and the other as the GS-like group. The counts of cumulative somatic mutations, deletion mutations and gene amplifications in the GU-like group were significantly higher than those in the GS-like group (p < 0.001, Fig. 2B–D).

Figure 2

The identification of long non-coding RNAs related to GI and subsequent functional enrichment analysis. (A) Clustering analysis of 535 LUAD patients based on the expression of 185 candidate genomic instability-related lncRNAs. The left cluster is the GU-like group, and the right cluster is the GS-like group. The x-axis represents lung adenocarcinoma patients. The y-axis represents the 185 candidate genomic instability-related lncRNAs. (B–D) Boxplots of somatic mutations (B), deletion mutations (C) and gene amplifications (D) in the GU-like group and GS-like group. (E) Co-expression network of GI-related lncRNAs and mRNAs based on the Pearson correlation coefficient. The blue circles represent lncRNAs, and the blue circles represent mRNAs. (F, G) Functional enrichment analysis of GO (F) and KEGG (G) for co-expressed lncRNAs and mRNAs. ***p < 0.001; LUAD, lung adenocarcinoma; GI, genomic instability; GU, genomically unstable; GS, genomically stable; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

To further explore the potential functions and pathways of the 185 lncRNAs in LUAD, functional enrichment analysis was performed. The 10 most relevant protein-coding genes (PCGs) for each of the 185 lncRNAs were identified by co-expression analysis, and the lncRNA–mRNA co-expression network was then constructed. In the network, the nodes were lncRNAs and mRNAs, which were linked if they were correlated with each other (Fig. 2E). The regulatory effects of the lncRNAs and mRNAs in the network are characterized in Table S10; all lncRNAs and mRNAs had cis-regulatory effects. GO and KEGG analysis of the lncRNA-correlated PCGs revealed that mRNAs in this network were significantly associated with GI-related terms such as cilium movement, microtubule bundle formation, motile cilium, chromosomal region, microtubule motor activity and cell cycle (Fig. 2F and G). These results suggest that the 185 differentially expressed lncRNAs may affect GI through a lncRNA-related PCG regulatory network and can serve as candidate GI-associated lncRNA biomarkers.

Identification of a GILncSig for prognostic validation in the training set

Univariate Cox regression analysis of the 185 differentially expressed lncRNAs in the training set showed that 7 GI-associated lncRNAs (LINC02587, AC026785.3, AC012085.2, FAM83A-AS1, MIR223HG, MIR193BHG, LINC01116) had significant prognostic value in LUAD patients (p < 0.05, Fig. 3A). Correlation analysis indicated that the 7 GI-associated lncRNAs were significantly correlated with each other (Fig. 3B). Five GI-associated lncRNAs (AC012085.2, FAM83A-AS1, MIR223HG, MIR193BHG, LINC01116) were identified by multivariate Cox regression analysis and were then used to develop the GILncSig model (Fig. 3C). In the GILncSig, AC012085.2, FAM83A-AS1, MIR193BHG and LINC01116 acted as risk factors for LUAD, and MIR223HG acted as a protective factor. The coefficients are shown in Table 2. The risk score of each patient was calculated as the sum of the coefficient of each lncRNA multiplied by its expression in that patient. Based on the median risk score, the training set was classified into a high-risk group (n = 123) and a low-risk group (n = 123). As shown in Fig. 3D, with increasing risk score the expression levels of the risk lncRNAs (AC012085.2, FAM83A-AS1, MIR193BHG, LINC01116) were upregulated, while the protective MIR223HG was downregulated, and the count of somatic mutations was positively correlated with the patient’s risk score. The somatic mutation count of patients in the low-risk group was significantly lower than that of patients in the high-risk group (p < 0.001, Fig. 3E), and our results were similar to those of Matuno et al.29. The Kaplan–Meier curve indicated that patients in the high-risk group had a poorer OS than those in the low-risk group (p < 0.001, Fig. 3F). In the time-dependent ROC curve analysis of the GILncSig, the areas under the curve (AUCs) for 1-, 2-, and 3-year OS were 0.785, 0.731 and 0.759, respectively (Fig. 3G). PCA showed that the low- and high-risk groups were distributed in two different directions, indicating that the LUAD patients in the low-risk group were clearly distinguished from those in the high-risk group (Fig. S1A).
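To make the AUC values above more concrete, the sketch below computes a simplified AUC at a fixed time horizon. It is a naive illustration rather than the estimator used in the study: patients censored before the horizon are simply dropped, whereas dedicated time-dependent ROC methods account for censoring more carefully. All scores and follow-up data are hypothetical.

```python
def auc_at_horizon(scores, times, events, horizon):
    """Naive AUC for predicting death by `horizon` from a risk score.

    Cases:    patients who died at or before the horizon.
    Controls: patients still under follow-up after the horizon.
    Patients censored before the horizon are dropped (a simplification).
    """
    cases, controls = [], []
    for score, t, event in zip(scores, times, events):
        if t <= horizon and event == 1:
            cases.append(score)
        elif t > horizon:
            controls.append(score)
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # AUC = probability that a random case scores higher than a random control.
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# Toy data: risk scores, follow-up in days, and event indicators (1 = death).
risk = [2.1, 1.8, 0.4, 0.9, 1.9, 0.2]
days = [200, 300, 900, 150, 800, 1000]
died = [1, 1, 0, 0, 1, 0]
auc_1yr = auc_at_horizon(risk, days, died, horizon=365)
print(round(auc_1yr, 4))
```

An AUC of 0.5 corresponds to a score with no discriminative ability and 1.0 to perfect separation of cases from controls at that horizon.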

Figure 3

Identification of the GILncSig in the training set. (A) Forest plot of the univariate Cox regression analysis that identified 7 GI-associated lncRNAs. (B) Correlation analysis among the 7 GI-associated lncRNAs. (C) Forest plot of the multivariate Cox regression analysis of the 5 GI-associated lncRNAs. (D) The expression and somatic mutation count with increasing risk score of the GILncSig. (E) Boxplot of somatic mutation count in the high- and low-risk groups. (F) Kaplan–Meier survival curve of the high- and low-risk groups. (G) Time-dependent ROC curves and AUC for 1-, 2-, and 3-year OS. ***p < 0.001; GI, genomic instability; ROC, receiver operating characteristic; AUC, area under the curve.

Table 2 Five prognostic GI-associated lncRNAs identified from univariate and multivariate Cox regression analysis.

Validation of the GILncSig

To further verify the accuracy of the GILncSig model, the 5 coefficients were applied to the validation set (n = 244) and the combination set (n = 490) to calculate the risk score of each patient. The validation set was then classified into a high-risk group (n = 131) and a low-risk group (n = 113), and the combination set into a high-risk group (n = 254) and a low-risk group (n = 235). In the validation set, as the patient’s risk score increased, the expression levels of the risk lncRNAs were upregulated, the protective lncRNA was downregulated, and the count of somatic mutations increased (Fig. 4A). The somatic mutation count of patients in the low-risk group was lower than that of patients in the high-risk group (p < 0.001, Fig. 4B). Patients in the low-risk group had better OS than those in the high-risk group (p = 0.027, Fig. 4C). The ROC curves showed that the AUCs for 1-, 2-, and 3-year OS were 0.676, 0.590, and 0.576, respectively (Fig. 4D). PCA showed that the low- and high-risk groups were divided into two different clusters (Fig. S1B). Similar results were obtained in the combination set (p < 0.001, Figs. 5A–D and S1C).

Figure 4

The prognostic value of the GILncSig in the validation set. (A) The expression and somatic mutation count with increasing risk score of the GILncSig. (B) Boxplot of somatic mutation count in the high- and low-risk groups. (C) Kaplan–Meier survival curve of the high- and low-risk groups. (D) ROC curves and AUC for 1-, 2-, and 3-year OS. ***p < 0.001; ROC, receiver operating characteristic; AUC, area under the curve; OS, overall survival.

Figure 5

The prognostic value of the GILncSig in the combination set. (A) The expression and somatic mutation count with increasing risk score of the GILncSig. (B) Boxplot of somatic mutation count in the high- and low-risk groups. (C) Kaplan–Meier survival curve of the high- and low-risk groups. (D) ROC curves and AUC for 1-, 2-, and 3-year OS. ***p < 0.001; ROC, receiver operating characteristic; AUC, area under the curve; OS, overall survival.

Univariate Cox regression analysis was performed using the survival package to investigate the prognostic value of the GILncSig across cancer types. The results indicated that the GILncSig is a significant risk factor for adrenocortical carcinoma (ACC), bladder urothelial carcinoma (BLCA), colon adenocarcinoma (COAD), kidney renal papillary cell carcinoma (KIRP), brain lower grade glioma (LGG), LUAD, pancreatic adenocarcinoma (PAAD), stomach adenocarcinoma (STAD) and thymoma (THYM) (Fig. S2 and Table S11).

Somatic mutation types in the combination set

To explore the mutation types in LUAD, the R package “maftools” was used to visualize mutation data from the 490 samples in TCGA-LUAD. In the high-risk group, missense mutations predominated among the variant classifications (Fig. S3A and S3E), SNPs were the most frequent variant type (Fig. S3B), and C-to-A substitution was the most prevalent SNV class (Fig. S3C). Additionally, the median number of variants per sample was 226 (Fig. S3D). The top 10 mutated genes were TTN, MUC16, CSMD3, RYR2, TP53, LRP1B, ZFHX4, USH2A, FLG, and KRAS (Fig. S3F). Similar results were found in the low-risk group (Fig. S3G–L). The high-risk group had significantly higher arm-level amplification and deletion frequencies than the low-risk group (P < 0.05) in the combination set (Fig. S3M). In conclusion, we observed a distinct pattern in the occurrence of mutations during the progression of LUAD.

Independence of the GILncSig from other clinical factors

To assess the independent prognostic value of the GILncSig, univariate and multivariate Cox regression analyses were performed on age, sex, TNM stage, and the risk signature. TNM stage is determined by the size of the primary tumor (T), the involvement of lymph nodes (N) and the presence of metastasis (M), and is used to assess the severity and prognosis of the tumor. The results indicated that the risk signature and TNM stage were independent prognostic factors after adjustment for age, sex and smoking in all sets (Table 3).

Table 3 Univariate and multivariate Cox regression analysis of the GILncSig and overall survival in each set.

In the multivariate Cox regression analysis, TNM stage was also identified as an independent prognostic factor. Subsequently, a stratified analysis was performed to evaluate whether the GILncSig could predict patient survival within the same clinical subgroup. Patients in the combination set were stratified by clinical parameters, namely age (≤ 65/> 65), sex (female/male) and stage (I + II/III + IV). The results showed that the GILncSig could classify patients within the same stratum of age, sex, and stage into high- and low-risk groups. Patients in the high-risk group had a poorer OS than those in the low-risk group in each stratum (Fig. 6A–F). These results indicated that the GILncSig was an independent prognostic factor for OS in LUAD.

Figure 6

Stratified survival analyses and nomogram of the GILncSig in the combination set. (A–F) Kaplan–Meier survival curves in subgroups stratified by different clinical characteristics. Age ≤ 65 (A), Age > 65 (B), Female (C), Male (D), Stage I + II (E), Stage III + IV (F). (G) Nomogram for predicting 1-, 2-, and 3-year OS of LUAD patients. (H) Calibration curves for the nomogram. OS, overall survival; LUAD, lung adenocarcinoma.

Construction of a nomogram based on the GILncSig in the combination set

To construct a quantitative method for predicting the prognosis of LUAD patients, we integrated the five GI-associated lncRNAs to establish a nomogram (Fig. 6G). The calibration curve indicated that the OS predicted by the nomogram was highly consistent with the actual OS (Fig. 6H).

Discussion

Since lung cancer has no symptoms in the early stage, the majority of patients are already at an advanced stage when diagnosed. Although traditional treatments are constantly improving, the five-year survival rate of lung cancer is only 19%30. GI has been reported in various malignant cancers, including lung cancers11,14,16,17,31,32,33,34,35. CIN is the major type of GI in lung cancer and leads to a high gene mutation burden through alterations in chromosome structure and number in cancer cells11. In this study, we found that a total of 185 GI-associated lncRNAs were significantly differentially expressed. Functional analysis revealed that GI-associated pathways were significantly enriched, which indicates that GI-associated lncRNAs may be associated with tumorigenesis. Five GI-associated lncRNAs were significantly associated with OS. A novel prognostic model integrating these five GI-associated lncRNAs was constructed for the first time and was able to distinguish between risk groups.

Among the 5 GI-associated lncRNAs, 4 (AC012085.2, FAM83A-AS1, MIR193BHG and LINC01116) acted as risk factors for LUAD, and MIR223HG was a protective factor. FAM83A-AS1 is up-regulated in lung cancer tissues and enhances the proliferation, migration, invasion, and epithelial-mesenchymal transition of LUAD36,37,38. MIR193BHG was elevated and showed good clinical value for diagnosing early-onset preeclampsia39. LINC01116 mediated gefitinib resistance of NSCLC cells by affecting IFI44 expression40. LINC01116 was overexpressed in lung cancer tissues and cell lines and was significantly associated with proliferation and metastasis40,41,42. MIR223HG, acting as a competing endogenous RNA, inhibited acute myeloid leukemia progression by inducing IRF4 expression43. These studies are consistent with our results. However, the role of AC012085.2 is unclear and needs to be further explored.

KEGG and GO enrichment analysis indicated that the genes co-expressed with the 185 GI-related lncRNAs in the high-risk and low-risk LUAD patients identified in this study were enriched not only in many biological processes, such as metabolic process and oxidoreductase activity, but also in critical GI-related pathways, including the cilium movement, microtubule motor activity, chromosomal region, and cell cycle pathways. Dyskinesia and structural abnormalities of cilia can cause GI and lead to the occurrence of cancer44. Chromosome segregation requires stable microtubule attachment at kinetochores; when microtubule dynamics are insufficient and the orientation deviates, GI can result in human cells45,46. The mechanisms of GI also include extra centrosomes47, mutations of mitotic checkpoint genes48, faulty cell-cycle regulation49,50,51, and chromatid cohesion52. The mechanism of GI is very complicated and needs to be further studied.

There are several limitations to our study. (1) Owing to the limited availability of lncRNA microarray datasets for LUAD, we could only divide the TCGA dataset randomly into a training set and a validation set; more independent datasets are needed to validate the GILncSig and ensure its robustness and reproducibility. (2) In vivo and in vitro experiments were not performed to verify the role of the GILncSig model; therefore, subsequent experiments are needed to validate the reliability of the results. (3) As this is an observational study, confounding factors such as pharmaceutical treatment, exposure to environmental tobacco smoke, residential radon, cooking oil fumes and PM2.5 may have affected our results.

Conclusions

In summary, we constructed a novel GILncSig model that may be involved in the progression and prognosis of LUAD. Targeting GI-related lncRNAs may be a potential approach for LUAD therapy. However, the underlying mechanisms involved in GI need to be further explored.