Background

Lung cancer is the most common cancer and responsible for the largest number of deaths worldwide with 1.8 million deaths, 18.4% of the total [1]. Lung cancer is categorized into two main categories: non-small cell lung cancer (NSCLC) which occurs in 85% of patients and small cell lung cancer (SCLC) in 15% of cases. NSCLC is grouped into 3 histological sub-types: lung adenocarcinoma (LUAD) which is most common form of lung cancer, lung squamous cell carcinoma (LUSC) and large cell carcinoma [2].

The integration of different types of molecular data has been used to characterize the molecular basis of lung cancer and to determine the clinical status of patients. Shi et al. analyzed 101 LUAD samples by using data from different levels -DNA mutations, gene expression profile, copy number variations and DNA methylation- in order to identify the relation between the genomic status and the clinical status. They determined deleterious mutations at ZKSCAN1 and POU4F2 genes which are two novel candidate driver genes [3]. Furthermore, recent studies have been performed to generate new methods to analyze integrative cancer data. Berger et al. proposed a new method called expression-based variant-impact phenotyping (eVIP) using differentially expressed genes (DEGs) to distinguish impactful from neutral somatic mutations. They characterized 194 somatic mutations related to primary LUAD and claimed that 69% of mutations were impactful. They determined the functionally important and actionable variants such as EGFR (p.S645C), ERBB2 (p.S418T), ARAF (p.S214C) and ARAF (p.S214F) although they are rare somatic mutations [4]. TCGA research network analyzed 230 LUAD samples using mRNA, microRNA and DNA sequencing integrated with copy number, methylation and proteomic data and reported the samples with high rates of somatic mutation [5]. Eighteen genes with high mutation load were reported such as RIT1 activating mutations and MGA loss-of-function mutations. They also identified aberrations in NF1, MET, ERBB2 and RIT1 occurred in 13% of cases and MAPK and PI(3)K pathway activity [5]. Deng et al. presented genomic alterations in LUAD samples from TCGA and found the significantly aberrant CNV segments which are associated with the immune system and 63 mutated genes associated with lung cancer signaling related to cancer progression. They identified important mutations of the PI3K protein family members include PIK3C2B, PIK3CA, PIK3R1 [6].

Recently, studies have been performed to generate gene signatures predicting prognosis risk of patients with lung adenocarcinoma. Krzystanek et al. identified a 7-gene signature by using microarray data of early-stage lung adenocarcinoma from GEO datasets. The genes (ADAM10, DLGAP5, RAD51AP1, FGFR10P, NCGAP, KIF15, ASPM) which have high hazards ratios showed significant results at cox regression analysis and Kaplan-Meier survival plots [7]. Shukla et al. identified 96 genes including five long noncoding RNAs (lncRNAs) among training data which had a prognostic association at test data, by using lung adenocarcinoma RNA-seq and clinical data from TCGA [8]. Shi et al. studied long noncoding RNAs (lncRNAs) expression signature model to predict stage I lung adenocarcinoma from TCGA and determined 31-lncRNA signature to predict overall survival in patients with LUAD [9]. Zhao et al. used gene expression profiles from TCGA and identified 20 genes that were significantly associated with the overall survival (OS). When they combined with GEO data set, they obtained four genes, FUT4, SLC25A42, IGFBP1, and KLHDC8B as common [10]. Li et al. performed RNA-sequencing on LUAD tumor samples and normal tissue samples. They construct protein–protein interaction network by using DEGs which were the intersection of GEO datasets and identified hub genes. Then, they test these genes on patient cohorts and TCGA data. They identified eight genes (DLGAP5, KIF11, RAD51AP1, CCNB1, AURKA, CDC6, OIP5 and NCAPG) which were closely related to survival in LUAD [11]. He et al. studied on previous GEO datasets and TCGA data and they identified a 8-gene prognostic signature (CDCP1, HMMR, TPX2, CIRBP, HLF, KBTBD7, SEC24B-AS1, and SH2B1) by using the step-wise multivariate Cox analysis. These genes were good predictors of survival between the high-risk and low-risk groups of patients with early-stage NSCLC [12]. The studies above determined different gene signatures for prognosis risk prediction by using different methods and presented different genes. Although, mostly gene expression data has been used for this purpose, we integrated SNVs, CNVs, DEGs and active subnetwork DEGs to generate gene signature for risk model by using LUAD data from The Cancer Genome Atlas (TCGA) database which provides simple nucleotide variation, gene expression, miRNA expression, DNA methylation, copy number variation and reverse phase protein array, clinical and biospecimen data from more than 10,000 cancer patients with 39 cancer types [13].

In this study, we built an R pipeline (Fig. 1) to perform an integrative analysis including SNVs and CNVs, differentially expressed genes and clinical data of patients with lung adenocarcinoma in TCGA. We generated different data categories by using significant SNVs, CNVs, DEGs and active subnetwork DEGs. Multivariate Cox proportional hazards model with the Lasso penalty and LOOCV was used to identify best gene signature among different gene categories. We generated 12-gene signature (BCHE, CCNA1, CYP24A1, DEPTOR, MASP2, MGLL, MYO1A, PODXL2, RAPGEF3, SGK2, TNNI2, ZBTB16) for prognostic risk prediction based on overall survival time of the patients with lung adenocarcinoma. When we clustered patients into high-risk and low-risk groups, the survival analysis showed highly significant results for both training and test datasets.

Fig. 1
figure 1

The R pipeline used for construction and validation of the prognosis gene signature

Results

Identification of significant simple nucleotide variations

Mutation data of LUAD patients as maf file generated by mutect pipeline was downloaded by TCGAbiolinks package and maftools package was used to subset original maf file by tumor sample barcodes of 55 LUAD patients (who have paired RNAseq data) and 510 LUAD patients (all patients in LUAD project who have all types of data used in the study). Then, significant mutations for both 55 and 510 LUAD patients were determined separately with their roles as a tumor suppressor or an oncogene by SomInaClust R package. In order to determine important genes including significant mutation clusters, we used SomInaClust R package. EGFR, KRAS, TP53, STK11, RB1 and MGA genes were determined as candidate driver genes in tumor samples of 55 LUAD patients (Fig. 2). EGFR and KRAS genes were classified as oncogenes and STK11, RB1 and MGA genes were classified as tumor suppressors. Although TP53 gene has both OG score and TSG score, TP53 was classified as a tumor suppressor in Table 1 depending on reference information of the cancer gene census. EGFR, KRAS, TP53, STK11 and RB1 have highly significant estimation. While EGFR and TP53 have high number of mutations, KRAS, STK11, RB1 and MGA have low number of mutations. While EGFR, KRAS, TP53, STK11, RB1 are well known cancer related genes, MGA gene is not in the cancer gene census.

Fig. 2
figure 2

Pyramid plot of the significantly mutated genes in tumor samples of 55 patients with LUAD. The oncogene (OG) or tumor suppressor gene (TSG) scores were calculated by SomInaClust R package

Table 1 Significantly mutated genes in tumor samples of 55 patients with LUAD. The genes were classified as an oncogene (OG) and a tumor suppressor gene (TSG) based on their scores and the cancer gene census information by SomInaClust R package

Eighty-two genes were identified as candidate driver genes in tumor samples of 510 LUAD patients (Table 2), including KRAS, TP53, EGFR, STK11, MGA and RB1 which were determined also in tumor samples of 55 LUAD patients (Fig. 3). These genes include very well-known cancer related oncogenes such as BRAF, ERBB2, AKT1 and PIK3CA with the genes which are not listed in the cancer gene census list of the COSMIC database (Table 2).

Table 2 Significantly mutated genes in tumor samples of 510 patients with LUAD. The genes were classified as oncogene (OG) and tumor suppressor gene (TSG) based on their scores and the cancer gene census information by SomInaClust R package
Fig. 3
figure 3

Pyramid plot of the top 40 significantly mutated genes in tumor samples of 510 patients with LUAD. The oncogene (OG) and tumor suppressor gene (TSG) scores were calculated by SomInaClust R package

Identification of the significant copy number variations

CNVs (Copy Number Variations) are important aberrations which results alterations in gene expression in tumorigenesis and tumor growth. In order to determine the significant CNVs among tumor samples of 55 and 510 LUAD patients, gaia R package was used. Significant recurrent CNVs in tumor samples of 55 LUAD patients, over the q-value thresholds (0.01), are mostly observed on Chromosome 1, 8, 9, and 17. Chromosome 1 has the highest number of amplifications followed by Chromosome 8. Chromosome 9 has the highest number of deletions followed by Chromosome 17 as seen in Fig. 4. Chromosome 1 has the highest number of gene aberration with 2006 amplified or deleted genes followed by Chromosome 8 with 1029 aberrant genes and Chromosome 19 with 785 aberrant genes. Top ten significantly amplified and deleted genes which are all from chromosome 1 are listed in Table 3.

Fig. 4
figure 4

Significant CNVs on all chromosomes in tumor samples of 55 patients with LUAD. Orange line shows the q-value threshold (0.01) in -log10 base. Red lines which have positive scores shows the amplification of genomic regions on specified chromosome; blue lines which have negative scores shows the deletion of genomic regions on specified chromosome. The genomic regions above threshold were selected for further gene enrichment analysis. Amp: Amplification, Del: Deletion

Table 3 Top ten significant deleted and amplified genes in tumor samples of 55 patients with LUAD. After gene enrichment analysis, the significantly amplified and deleted genes were listed with their aberration type, q value, genomic aberration region and gene region on specified chromosomes

Significant recurrent CNVs in tumor samples of 510 LUAD patients, over the q-value thresholds (0.01), are mostly observed on Chromosome 4, 9, 10, 11, 12, 13, 14, 16, 18 and 20. But Chromosome 11 has the highest number of aberrations followed by Chromosome 9, 16 and 18. Chromosome 4, 9, 10, 12 and 16 had mostly amplifications (Fig. 5). The pattern of CNVs in tumor samples of 510 patients has a marked difference from the CNV pattern in tumor samples of 55 patients (Fig. 4). Chromosome 1 has the highest number of gene aberration with 3124 amplified or deleted genes followed by Chromosome 6 with 2911 aberrant genes and Chromosome 3 with 2149 aberrant genes. Top ten significantly amplified and deleted genes which are all from chromosome 1 are shown in Table 4.

Fig. 5
figure 5

Significant CNVs on all chromosomes in tumor samples of 510 patients with LUAD. Orange line shows the q-value threshold (0.01) in -log10 base. Red lines which have positive scores shows the amplification of genomic regions on specified chromosome; blue lines which have negative scores shows the deletion of genomic regions on specified chromosome. The genomic regions above threshold were selected for further gene enrichment analysis. Amp: Amplification, Del: Deletion

Table 4 Top ten significantly deleted and amplified genes in tumor samples of 510 patients with LUAD. After gene enrichment analysis, the significantly amplified and deleted genes were listed with their aberration type, q value, genomic aberration region and gene region on specified chromosomes

Differential expression analysis (DEA)

The Transcriptome Profiling data of LUAD patients in mRNA expression level (as unnormalized HTSeq raw counts), was downloaded by TCGABiolinks R package. Differentially expressed genes were determined with FDR adjusted p-values (q-values) in tumor samples (TP) of 55 patients with LUAD compared to normal samples (NT) of the same patients by the limma-voom method using limma and edgeR R packages. The volcano plot in Fig. 6, shows the differentially expressed genes (DEGs) as dots of which black ones represent the genes which have differential expression less than two-fold and not significant while red ones represent upregulated and green ones downregulated more than two-fold (log2 = 1) significantly (q value < 0.01). As a result of this analysis, 3575 genes were dysregulated more than two-fold with 0.01 q-value significance.

Fig. 6
figure 6

Volcano plot of differentially expressed genes in tumor samples of 55 patients with LUAD. Red dots represent up regulated genes more than two-fold significantly (q-value < 0.01) and green dots represent down regulated genes more than two-fold (log2=1) significantly (q-value < 0.01). The black dots represent the genes which have differential expression less than two-fold and/or not significant

As the result of DEA, differentially expressed genes (DEGs) are determined with their log Fold Change (logFC), adjusted p-value (q-value), entrez gene IDs and HGNC symbols after enrichment analysis. The top 10 down-regulated and up-regulated genes are shown in Tables 5 and 6. The list of DEGs was used for pathway analysis and active subnetwork analysis.

Table 5 Top ten significantly down-regulated genes in tumor samples of 55 patients with LUAD. After gene enrichment analysis by using Ensembl gene ID of DEGs, down-regulated genes were listed with their entrez gene ID, HGNC symbol, log2 fold change value and q-value (adj.P.Val)
Table 6 Top ten significantly up-regulated genes in tumor samples of 55 patients with LUAD. After gene enrichment analysis by using Ensembl gene ID of DEGs, up-regulated genes were listed with their entrez gene ID, HGNC symbol, log2 fold change value and q-value (adj.P.Val)

Active subnetwork and pathway analysis

The output of Differentially Expression Analysis (DEA) containing differentially expressed genes with their Ensembl IDs and adjusted p-values (q-values) were used as input of DEsubs R package. The active subnetworks of differentially expressed genes in tumor samples of both 55 LUAD patients were determined by DEsubs package and results were represented as graphs at subnetwork and organism levels. DEsubs package identified 35 subnetworks including 192 genes, 14 of them including more than three genes, 8 of them including three genes and the others including two genes. In Fig. 7, the top ten significant genes which play a role in determined subnetworks are represented with their q-values. These genes are FABP4, WNT3A, EDNRB, TEK, AGER, EPAS1, ACADL, PDIA4, ANGPT4, KL. In this analysis, 35 subnetworks were determined and the first three subnetworks are presented in Fig. 8, 9 and 10. When we look at the subnetworks’ graphs, in subnetwork 1 (Fig. 8), the prominent genes are WNT genes which are members of WNT pathway, a major evolutionary conserved signaling pathway playing role in cell differentiation, cell migration and organogenesis during development and highly related to lung cancer; in subnetwork 3 (Fig. 10), the prominent gene is AKT3 which is one of the AKT family members which play role in tumorigenesis and are modulators of several tumors. The pathways of subnetwork genes are mostly cancer related pathways such as melanoma, glioma, colorectal cancer, chronic myeloid leukemia, basal cell carcinoma, apoptosis, erbb signaling, jak-stat signaling and map kinase signaling pathways (Fig. 11).

Fig. 7
figure 7

Top 10 significant active subnetwork genes in tumor samples of 55 LUAD patients. The values on x-axis showed the -log10(q-value), thus higher value means higher significance level

Fig. 8
figure 8

First active subnetwork from PPI network in tumor samples of 55 patients with LUAD. DEsubs R package identified active subnetworks by using q-values of DEGs determined at previous step. Red graduation in nodes indicate the q-value degree, the edge width indicates the correlation degree between the respective genes. Green or red color in edges indicates the positive or negative correlation respectively

Fig. 9
figure 9

Second active subnetwork from PPI network in tumor samples of 55 patients with LUAD. DEsubs R package identified active subnetworks by using q-values of DEGs determined at previous step. Red graduation in nodes indicate the q-value degree, the edge width indicates the correlation degree between the respective genes. Green or red color in edges indicates the positive or negative correlation respectively

Fig. 10
figure 10

Third active subnetwork from PPI network in tumor samples of 55 patients with LUAD. DEsubs R package identified active subnetworks by using q-values of DEGs determined at previous step. Red graduation in nodes indicate the q-value degree, the edge width indicates the correlation degree between the respective genes. Green or red color in edges indicates the positive or negative correlation respectively

Fig. 11
figure 11

KEGG pathways of active subnetwork genes in tumor samples of 55 patients with LUAD

Statistical analysis

In order to identify a molecular prognosis risk model, the clinical data of all patients in TCGA LUAD project (Table 7) was downloaded by TCGAbiolinks R package and separated as training data of 55 LUAD patients who have paired samples for RNAseq data and used for gene signature construction; and test data of remaining 422 LUAD patients after removing patients who have missing values in clinical data. Different gene signatures were generated from the genes which have the prognostic ability. The univariate cox regression analysis was performed for significant SNV genes, significant CNV genes, significant DEGs and active subnetwork DEGs in tumor samples of 55 patients with LUAD. There were 38 CNV genes, 463 DEGs and 37 subnetwork DEGs (DEsubs) with prognostic ability after univariate analysis and logRank test (p < 0.05). SNV genes did not have significant prognostic ability. Then different data categories (DEGs; DEsubs; CNVs; CNVs + DEGs, CNVs + DEsubs; CNVs + DEGs + SNVs; CNVs + DEsubs + SNVs) were generated by using significant prognostic genes. These data categories underwent the Cox proportional hazards regression with the Lasso penalty and LOOCV. Gene models from different categories were generated by using glmnet R package which gives active genes with their coefficients. The genes in the models were DEPTOR, ZBTB16, BCHE, MGLL, MASP2, TNNI2, RAPGEF3, SGK2, MYO1A, CYP24A1, PODXL2, CCNA1 from DEGs category; THRA, RAPGEF3, LAMB2 from DEsubs category; SNX13, AC080080.1, RNMTL1P2, AC080080.2 from CNVs category; THRA, RAPGEF3, LAMB2 from CNVs + DEsubs. The genes in CNVs + DEGs and CNVs + DEGs + SNVs categories were the same with the genes in the DEGs category; the genes in CNVs + DEsubs + SNVs were in the CNVs + DEsubs category. Then, c-index analysis was performed to identify the survival predictive ability of the gene models identified from different categories. The higher c-index score was 0.858 from DEGs gene model (Fig. 12). This gene model (BCHE, CCNA1, CYP24A1, DEPTOR, MASP2, MGLL, MYO1A, PODXL2, RAPGEF3, SGK2, TNNI2, ZBTB16) was chosen as the best candidate prognosis gene signature for LUAD datasets.

Table 7 Summary of clinical features of 55 and 510 patients with LUAD
Fig. 12
figure 12

The c-index scores of different gene categories in training data and selected signature in testing data. C-index scores were calculated for different gene categories (DEGs; DEsubs; CNVs; CNVs + DEGs, CNVs + DEsubs; CNVs + DEGs + SNVs; CNVs + DEsubs + SNVs). The categories which have unique signature were used for c-index scoring. CNVs: copy number altered genes, DEGs: Differentially expressed genes, DEsubs: DEGs in active subnetworks

Multivariate Cox regression analysis was performed for the genes in the chosen 12-gene signature and risk scores of each patient in training data (55 LUAD patients) were calculated by using coefficient values and normalized expression values (log2 + 1) in tumor samples. Then the patients were clustered into high-risk and low-risk groups by using maxstat (maximally selected rank statistics) method based on optimal cut-points for numerical variables by using survminer R package (Fig. 13a). When we performed Kaplan-Meier (KM) survival analysis to demonstrate the overall survival of risk groups stratified based on gene signature, patients with high-risk score demonstrated poor overall survival (p < 0.0001) than those with the low-risk score in training dataset (Fig. 13b).

Fig. 13
figure 13

Risk clustering of 55 patients with LUAD and KM survival analysis for training data. a Risk score distribution and clustering of patients based on maximally selected standardized log-rank statistic. The patients were clustered into high-risk and low-risk groups based on cutpoint value of the maxstat (maximally selected rank statistics) method. b Kaplan-Meier survival plot was generated for high-risk vs low-risk group of 55 patients with LUAD. Time parameter shows the days of overall survival

The ROC curve analysis was performed to compare the sensitivity and specificity of the predictive ability of risk score based on the chosen gene signature. AUC values were 0.883 for 1-year, 0.813 for 2-year, 0.943 for 5-year and 0.976 for 10-year survival prediction (Fig. 14a). These high AUC values showed that the risk scores calculated based on the chosen 12-gene signature can highly predict the overall survival.

Fig. 14
figure 14

ROC curve analysis for 1, 2, 5 and 10-year survival prediction by the signature. a ROC curve analysis for the 12-gene signature in training data. b ROC curve analysis for the 12-gene signature in test data. AUC score represents the area under the curve which gives probability of accuracy of the prediction

When we performed the correlation analysis between tumor stages, mutation counts and gene expressions of signature genes, there was a significant difference of tumor stages between risk groups although there was no difference of total SNV mutation count between groups (Fig. 15). However, as expected gene expression levels were significantly different between high-risk and low-risk groups in training data (55 LUAD patients) (Fig. 16). The expression levels of the BCHE, DEPTOR, MASP2, MGLL, MYO1A, PODXL2, RAPGEF3, SGK2, TNNI2, and ZBTB16, genes were lower in the high-risk group while the expression levels of the CCNA1 and CYP24A1 genes were higher in high-risk group (Fig. 16).

Fig. 15
figure 15

Correlation analysis between risk groups with total mutation count and tumor stage. a Total mutation count in tumor samples of training data was not significantly correlated with risk groups. b Tumor stage was correlated with risk groups and higher in high-risk group

Fig. 16
figure 16

Violin plots showing the expression levels of the 12-signature genes in training data

In order to validate our signature, we calculated c-index for the prediction of the overall survival of the 442 TCGA patients with LUAD (test data) and the c-index was 0.591 which is lower than the c-index of training data (Fig. 12). Then, multivariate cox regression analysis was performed for the signature genes in test data. The risk score for each patient was calculated by adding the multiplication of the normalized gene expression level in tumor samples and the multivariate Cox regression coefficient value of each gene in the signature. Patients in the test dataset were divided into high-risk and low-risk groups by using maxstat (maximally selected rank statistics) method from using survminer R package (Fig. 17a). Patients in the high-risk group had poor overall survival significantly (p < 0.00055) (Fig. 17b). The ROC curve analysis was performed to compare the sensitivity and specificity of the predictive ability of risk score in the test dataset. AUC values were 0.479 for 1-year, 0.571 for 2-year, 0.622 for 5-year and 0.676 for 10-year survival prediction (Fig. 14b). The AUC values of risk scores calculated based on chosen gene signature were very low according to the AUC values of training data. Although the survival predictive ability (c-index) of our gene signature and AUC values of the risk score in test data was low, our 12-gene signature could separate patients into two groups which have a significant overall survival difference (Fig. 17b).

Fig. 17
figure 17

Risk clustering of 422 patients with LUAD and KM survival analysis for test data. a Risk score distribution and clustering of patients were performed based on maximally selected standardized log-rank statistic. The patients were clustered into high-risk and low-risk groups based on cutpoint value of the maxstat (maximally selected rank statistics) method. b Kaplan-Meier survival plot generated for high-risk vs low-risk group of 422 patients with LUAD. Time parameter shows the days of overall survival

We performed the correlation analysis between tumor stages, mutation counts and gene expressions of signature genes for test data, there was a slight significant difference of tumor stages between risk groups although there was no difference of total SNV mutation count between groups (Fig. 18). The gene expression levels of 6 signature genes (BCHE, CCNA1, DEPTOR, MASP2, MGLL, TNNI2) were significantly different between the high-risk and low-risk groups however, the gene expression levels of other 6 signature genes (CYP24A1, MYO1A, PODXL2, RAPGEF3, SGK2, ZBTB16) do not have significant difference in test data. The expression levels of the CCNA1 and TNNI2 genes were lower in the high-risk group while the expression levels of the BCHE, DEPTOR, MASP2 and MGLL genes were higher in the high-risk group (Fig. 19).

Fig. 18
figure 18

Correlation analysis between risk groups with total mutation count and tumor stage. a Total mutation count in tumor samples of test data was not significantly correlated with the risk groups. b Tumor stage was correlated with the risk groups of test data and slightly higher in the high-risk group

Fig. 19
figure 19

Violin plots showing the expression levels of the 12-signature genes in test data

Discussion

Lung adenocarcinoma (LUAD) is the most common form of lung cancer which is the most common cancer and responsible for the largest number of deaths worldwide. In order to characterize genomic and transcriptomic abnormalities of lung cancer and to determine the clinical status of patients, integrative analysis have been performed by using different types of molecular data. Recently, prognosis risk signatures have been generated to cluster patients with lung adenocarcinoma. However, mostly gene expression data has been used for this purpose. In this study, we performed an integrative analysis by using level-3 data of SNVs, CNVs and RNAseq data of patients with lung adenocarcinoma in TCGA project. We aimed to identify genomic and transcriptomic abnormalities that might be used to generate a molecular signature. We determined the significantly mutated genes; amplified and deleted genes; and differentially expressed genes (DEGs) significantly and their active subnetworks by using R packages. Then we performed univariate and multivariate Cox Proportional Hazards Regression (CPHR) analysis with LOOCV and the Lasso penalty to identify predictor genes on survival time of patients with lung adenocarcinoma.

Firstly, we identified 6 and 82 mutated genes which are candidate driver genes in tumor samples of 55 LUAD patients and those of 510 LUAD patients, respectively. KRAS and EGFR oncogenes with TP53, STK11, RB1 and MGA tumor suppressors were mutated significantly in the small cohort of patients. The mutated 82 genes of a large cohort of patients include the 6 genes above and also previously identified lung adenocarcinoma related genes such as KRAS, TP53, STK11, RB1, NF1, RMB10, BRAF, KEAP1, CDKN2A, SETD2, ARID1A, SMARCA4 and MGA [5]; EGFR and ERBB2 [4, 5]; and PIK3CA [5, 6]. Besides, MAP2K1 and MAP2K4 mutations can be related with MAPK pathway activity as identified in the TCGA lung adenocarcinoma original article [5]. Loss-of-function MGA mutations with MYC amplification in lung adenocarcinoma have been newly described [5] and MGA gene was identified by SomInaClust analysis in our study. MGA, encodes MAX gene-associated protein which is a MYC-interacting transcription factor and antagonizes the transcriptional regulation of MYC involved in cancer processes [14].

We identified amplified and deleted genes which have copy number variations in tumor samples of patients with lung adenocarcinoma. We identified significant copy number altered genes which play role in immune system pathways, metabolism pathways with small cell lung cancer pathway and molecular mechanism of cancer pathway. We analyzed differentially gene expression in tumor samples compared to paired normal samples of 55 patients with lung adenocarcinoma and 3575 genes were dysregulated more than two-fold, significantly (q-value < 0.01). The upregulated genes mostly play role in cell cycle and proliferation pathways such as G2/M damage checkpoint regulation, cell cycle control of chromosomal replication, ATM signaling, hereditary breast cancer signaling, bladder cancer signaling and HIF1 signaling pathways. The downregulated genes play role in cAMP-mediated signaling, g-protein coupled receptor signaling, Gαi signaling and other immune system pathways such as complement system, granulocyte/agranulocyte adhesion and diapedesis, dendritic cell maturation and T helper cell differentiation. Then we determined the differentially expressed genes (DEGs) in active subnetworks of PPI network in tumor samples and we identified 192 DEGs in 35 subnetworks. These genes play role in mostly cancer related pathways such as melanoma, glioma, colorectal cancer, chronic myeloid leukemia, basal cell carcinoma, apoptosis, erbb signaling, jak-stat signaling and map kinase signaling pathways (Fig. 11).

We integrated the significant SNVs, CNVs, DEGs and DEGs in active subnetworks by performing multivariate Cox Proportional Hazards Regression (CPHR) analysis with LOOCV and the Lasso penalty after univariate CPHR, we determined a 12-gene expression signature (BCHE, CCNA1, CYP24A1, DEPTOR, MASP2, MGLL, MYO1A, PODXL2, RAPGEF3, SGK2, TNNI2, ZBTB16) which has 0.858 and 0.591 c-index score for training and test data, respectively. Moreover, this 12-gene expression signature had 0.883, 0.813, 0.943 and 0.976 AUC values for 1, 2, 5 and 10-year survival prediction, respectively, for training data. Same 12-gene expression signature had 0.479, 0.571, 0.622 and 0.676 AUC values for 1, 2, 5 and 10-year survival prediction, respectively, for test data. We clustered the patients for both training and test analysis, into the high-risk and low-risk group based on risk scores calculated by using expression levels and multivariate CPHR coefficients of 12 genes in the signature. Kaplan-Meier survival analysis showed highly significant overall survival difference between the high-risk and the low-risk groups for both training data (p < 0.0001) and test data (p = 0.00055).

All genes in the 12-gene signature are cancer-related and play role in lung cancer pathways which are the candidates of molecular targeting. BCHE (Butyryl cholinesterase) activity in lung adenocarcinoma is less than in the adjacent non-cancerous tissue [15]; and BCHE is one of two potential diagnostic markers in plasma/serum for non-small cell lung cancer [16]. CCNA1 (Cyclin A1) is a cell cycle regulator protein and was down-regulated in non-small cell lung cancer and CCNA1 promoter was hypermethylated in 70% of lung tumors which has wild-type p53, but was not methylated in cells with mutant p53 [17]. CCNA1 plays a role in p53-mediated G2 cell cycle arrest and apoptosis in non-small cell lung cancer cells and upregulation of cyclin A1 resulted in apoptosis [18]. However, Cho et al. determined that knock-down of CCNA1 using siRNA, induced apoptosis in non-small cell lung cancer cells [19]. CYP24A1 expression level was highly increased in lung adenocarcinoma compared to normal lung tissue samples and CYP24A1 overexpression was associated with poorer survival, increased cell growth and invasion, and increased RAS protein expression in lung adenocarcinoma [20,21,22,23]. Knockdown of CYP24A1 significantly decreased cell proliferation resulted in tumor growth delay and smaller tumor size with decreased RAS protein level, thus reducing phosphorylated AKT [21]. DEPTOR (DEP domain-containing mTOR-interacting protein), a natural mTOR inhibitor, was downregulated by activation of EGFR signaling. EGFR inhibition by Gefitinib resulted DEPTOR accumulation. DEPTOR inhibited proliferation, migration, invasion and the tumor growth of lung adenocarcinoma. DEPTOR induction inhibited EGFR mediated tumor progression [24]. DEPTOR depletion can induce EMT in cancer cells and DEPTOR plays a critical role in EMT regulation by BMK1 [25]. DEPTOR was also identified as one of the 77 clinically relevant predictive biomarker at TGFβ-EMT signature generated by microarray analysis of TGFβ-1 treated non-small cell lung cancer cells. TGFβ-EMT gene signature could predicted overall survival and metastasis-free survival in lung adenocarcinoma [26]. MASP-2 (Mannan-binding lectin-associated serine protease 2) is a plasma protein involved in lectin pathway of complement system which promotes cell differentiation, proliferation, migration and reduced apoptosis. Complement activation in the tumor microenvironment enhances tumor growth and increases metastasis [27]. High MASP-2 levels concentration in serum significantly correlated with recurrent cancer disease and with poor survival, thus the MASP-2 level had an independent prognostic value in the patients [28]. MBL/MASP complex activity was significantly increased in patients with colorectal cancer, too [29]. MGLL (Monoglyceride lipase) is highly expressed in aggressive human cancer cells and primary tumors, where it regulates a fatty acid network enriched in oncogenic signaling lipids that promotes migration, invasion, survival, and in-vivo tumor growth [30]. MGLL expression was significantly reduced in the majority of primary human lung cancers and primary colorectal cancers compared to normal tissues [31, 32]. MGLL suppressed colony formation in tumor cell lines and knockdown of MGLL resulted in increased Akt phosphorylation. MGLL plays a negative regulatory role in phosphatidylinositol-3 kinase/Akt signaling and tumor cell growth [32]. MGLL knock-out mice exhibited a higher incidence of neoplasia in lung [31]. MYO1A (Myosin I a) expression was higher in ever smokers than in never smokers [33]. MYO1A had mutations and promoter hypermethylation in patients with colorectal cancer and gastric tumors; therefore, lower levels of MYO1A expression was associated with faster tumor progress and poor prognosis [34, 35]. Podocalyxin is an anti-adhesive transmembrane protein played role in the development of more aggressive breast and prostate cancer [36, 37]. Podocalyxin (including PODXL1, PODXL2 and PODXL3) induction resulted in altered migration and invasion, increased MMP expression with increased MAPK and PI3K activity through forming a complex with Ezrin protein, in breast and prostate cancer [38]. Mammalian exchange protein directly activated by cAMP isoform 1 (EPAC1), encoded by RAPGEF3 gene, acts as guanine exchange factor for Ras-like Rap small GTPases [39]. EPAC1 expression was lower in lung cancer tissue compared to expression in normal specimens and associated with the degree malignancy and lymph-node metastasis [40]. SGK is one of three isoforms of the serum glucocorticoid regulated kinase family of serine/threonine kinases. SGK2 expression was upregulated in hepatocellular carcinoma and its downregulation inhibits cell migration and invasion [41]. Expression level of SGK1 was higher in squamous cell lung cancer and correlated with high grade tumors, tumors size and clinical stage [42]. Protein and mRNA expression of cardiac troponin I (TNNI3) were abnormally detected in non-small cell lung cancer tissues, lung adenocarcinoma cell line and lung squamous cell carcinoma cases while there was negative staining for TNNI3 in non-cancer lung tissues [43]. ZBTB16 (zinc finger and BTB domain containing 16), also known as the promyelocytic leukemia zinc finger protein (PLZF), was down-regulated in lymph node adenocarcinoma metastases and NSCLC samples by hypermethylation in the promoter region [44, 45]. Overexpression of ZBTB16 in lung cancer cell lines inhibited proliferation and increased apoptosis while the depletion of cytoplasmic PLZF was correlated with the high tumor grade, lymph node metastasis, the higher tumor stage and the shorter overall survival [44, 45]. ZBTB16 was also down-regulated in never smoker patients with lung adenocarcinoma [46] and non-small cell lung cancer high-metastatic cell line compared with the low-metastatic cell line [47].

Although the 12-gene signature had low AUC values which means that this 12-signature is not the optimal prediction model, it can be used to cluster patients with LUAD into two risk groups. We could test the signature for different lung adenocarcinoma datasets and check AUC values for them, too. The power of these types of signatures can be increased by performing signature generation from larger cohorts or adding different types of data in order to increase the prediction potential. Although we generated different gene categories to integrate genomic and transcriptomic variations for prognostic risk prediction, DEGs had dominance over genomic alterations. This can be due to the fact that genomic alterations work as promoters which give rise to differential gene expression and this altered gene expression profile determined the new fate of the cell. Therefore, we need integration models for different types of biological data which are not independent of each other. We need also new models for patient-based analysis and/or integration of different data types.

Conclusions

In this study we analyzed the significant SNVs, CNVs and DEGs in active subnetworks, which have impact on overall survival of patients with lung adenocarcinoma in the TCGA project. We determined 12-genes (BCHE, CCNA1, CYP24A1, DEPTOR, MASP2, MGLL, MYO1A, PODXL2, RAPGEF3, SGK2, TNNI2, ZBTB16) which are highly cancer or lung adenocarcinoma related. These 12 genes are candidates to be used as molecular signature for prediction of overall survival-based risk group of patients with lung adenocarcinoma. These genes can be used to cluster patients and determine the best candidates of drugs for the patient clusters which have different molecular nature. These genes also have potential for targeted cancer therapy of patients with lung adenocarcinoma.

Methods

Data

Simple Nucleotide Variation, Transcriptome Profiling, Copy Number Variation and Clinical data of both 55 LUAD patients who have paired (both normal and tumor samples) RNAseq data and of 510 patients who have all four types of data (among all patients in LUAD project) was downloaded separately from TCGA harmonized database by using R/Bioconductor TCGAbiolinks package [48]. We analyzed the genomic alteration data including Simple Nucleotide Variations, Copy Number Variations; and transcriptomic variations from RNAseq data, processed using the reference of hg38; and clinical data of LUAD patients (Table 7).

Identification of the significant simple nucleotide variations

The Mutation Annotation Format (maf) file contained somatic mutations of all patients in TCGA LUAD project, was downloaded using TCGAbiolinks package. The other R/Bioconductor package, maftools [49], were used to subset original maf file by tumor sample barcodes of patients of interest. Maftools package also summarizes the mutations and represents as summary plot and oncoplot. Significantly mutated genes divided into two groups, oncogene (OG) or tumor suppressor gene (TSG), among tumor samples of 55 and 510 patients were identified separately by using SomInaClust R package [50]. SomInaClust works on the basic assumption that important genes in tumor samples have clustered on sequence and high number of inactivating mutations because of the selective pressure during tumorigenesis. Based on this assumption, oncogenes have clustered mutations, while tumor suppressors have inactivating (protein truncating) mutations. SomInaClust uses a reference step in which background mutation rate and hot spots are determined for genes existing in reference mutation database such as the COSMIC database (v88) [51].

Identification of the significant copy number variations

The CNV dataset for primary solid tumor samples of patients with LUAD, generated by Affymetrix Genome-Wide Human SNP Array 6.0 platform, was downloaded using TCGAbiolinks package. The significant aberrant genomic regions in tumor samples of 55 and 510 patients were identified separately by R/Bioconductor GAIA package [52]. NCBI IDs and HUGO symbols of the genes with differential copy number were determined using biomaRt R package [53].

Differential expression analysis (DEA)

The Transcriptome Profiling data in mRNA expression level (as unnormalized HTSeq raw counts) of 55 LUAD patients who have paired samples was downloaded by TCGABiolinks package. Differentially expressed genes were determined with FDR adjusted p-values (q-values) in tumor samples (TP) according to normal samples (NT) of 55 LUAD patients by limma-voom method using limma [54] and edgeR [55] R/Bioconductor packages. NCBI IDs and HUGO symbols of the differentially expressed genes determined by the biomaRt R package.

Active subnetwork and pathway analysis

We identified the active subnetworks of differentially expressed genes in tumor samples of 55 LUAD patients using R/Bioconductor DEsubs package [56]. The output of limma package containing differentially expressed genes with their Ensembl IDs and FDR adjusted p-values (q-values) were used as input of DEsubs package. DEsubs package determines and represents the active subnetworks with their graphs both at subnetwork and pathway levels.

Statistical analysis

Clinical data of 55 and 510 patients was downloaded from TCGA database using the TCGAbiolinks package. Univariate Cox Proportional Hazards Regression analysis [57] and logRank test [58] were performed using survival R package [59] for the significant SNV containing genes, the significant CNV containing genes, the DEGs and the active subnetwork genes to identify genes with prognostic ability. For the genes with prognostic ability (p value < 0.05), multivariate Cox proportional hazards model with LOOCV and the Lasso penalty was used to identify the best gene signature among different combinations of molecular levels (SNV genes, CNV genes, DEGs and active subnetwork genes) by using glmnet R package [60]. Concordance index (c-index) was performed using pec R package [61] to validate the predictive ability of different gene signatures. The larger c-index is used to determine the gene signature which has more accurate predictive ability. Multivariate cox proportional regression analysis was performed using survival R package for genes of selected signature and the risk score of each patient was calculated using coefficient and expression values of the genes. Then, patients were clustered into the high-risk group and the low-risk group and Kaplan-Meier (KM) survival curves [62] were generated using survminer R package [63] to demonstrate the overall survival of risk groups stratified based on gene signature. ROC curve analysis [64] was also performed for risk scores calculated based on selected gene signature by using survivalROC R package.

Significant differences in the tumor stages, the mutation counts and the expression levels of patients in the high-risk and low-risk groups were identified using ggstatsplot R package [65]. In order to validate the prognosis risk signature, the risk scores of 442 TCGA patients with LUAD were calculated using the expression values of the gene signature and their coefficient values from multivariate Cox proportional regression analysis. Similarly, 442 patients (after exclusion of 55 and other patients with missing data from 510 patients) were clustered into high-risk and low-risk groups and the overall survival difference between the two groups of patients was assessed by KM survival curve. Significance level used for identification of genes containing copy number variations and differentially expressed genes, was 0.01 for FDR corrected q-value. Significance level was 0.05 for FDR corrected p values (q value) for identification of genes containing the significant single nucleotide variations; and was 0.05 for p-values for the active subnetwork and the pathway analysis, and for all the statistical analysis.