Background

Lung adenocarcinoma in situ (AIS) is a pre-invasive non-small-cell lung cancer (NSCLC) lesion. AIS patients who are diagnosed early and treated appropriately often have a very high survival rate (almost 100%) [1]. A fraction of AIS lesions can develop into invasive cancer, for which the 5-year survival rate drops to 4% on average [2]. Presently, about 70% of lung cancer cases are diagnosed at the invasive stage [3]. Several studies have investigated the progression of lung cancer invasion [4, 5]. For example, Min et al. followed for 10 years a case of lung cancer that evolved from a pure ground-glass opacity nodule into an invasive adenocarcinoma and studied the growth and aggressiveness of the cancer [6]. Another study indicated that transforming growth factor-β (TGF-β) induces epithelial-mesenchymal transition (EMT) in lung cancer cells and further mediates tumor migration and invasion [7]. A recent study investigated RNA sequencing (RNA-seq) data generated from AIS and invasive lung cancer tissue samples and identified several genes that are potentially involved in the progression from AIS to invasion [1]. However, the regulation of these genes and the underlying molecular mechanisms that govern invasion progression are not well characterized.

We developed a genetic algorithm (GA)-based method to infer lung cancer invasion-related gene networks. We first identified a set of genes that were differentially expressed in invasive lung adenocarcinoma by comparing gene expression alterations in normal, AIS, and invasive tumor tissues based on an RNA-seq dataset [1]. We found that these genes showed consistent expression patterns in a lung adenocarcinoma (LUAD) dataset from The Cancer Genome Atlas (TCGA). Then, we employed a global optimal search algorithm to construct subnetworks for each invasive differentially expressed gene by integrating gene expression, protein-protein interactions (PPIs), protein-DNA interactions, and lncRNA regulations. By further incorporating driver mutation information, we revealed 19 core subnetworks that contained invasive-specific genes and putative driver genes. These subnetworks can lead to the discovery of new pathways responsible for invasive tumor progression.

Fig. 1 Expression alterations of invasive-specific genes. The expression patterns (log2 fold change) of the invasive-specific genes identified by the comparison between normal and invasive, AIS and invasive samples, and TCGA LUAD normal and stage III samples

Results

Identification of invasive-specific genes

The RNA sequencing data of normal, AIS, and invasive tissue samples for six lung cancer patients were collected from Gene Expression Omnibus (GSE52248) [1]. Differential expression analysis revealed diverse patterns of gene expression change. We found that 98 genes were significantly differentially expressed between AIS and invasive tissues (|FC| > 2 and FDR < 0.05). Among these genes, 61 were also differentially expressed in the comparison of normal and invasive tissues (|FC| > 2 and FDR < 0.05). Hence, we considered these 61 genes to be lung invasive-specific differentially expressed genes (DEGs), consisting of 52 protein-coding genes and 9 lncRNAs (Additional file 1: Table S1). Hierarchical clustering on the expression of the invasive-specific genes separated the 18 tissue samples by phenotype (Additional file 2: Figure S1); only one invasive tissue sample was misclustered. This sample clustered together with an AIS sample from the same patient, which might be related to the sample collection. We further validated these invasive-specific genes on an independent RNA-seq dataset of 59 normal and 84 stage III lung adenocarcinoma (LUAD) tissue samples obtained from the TCGA project. Hierarchical clustering based on the expression levels of the invasive-specific genes yielded two distinct tissue clusters, normal and stage III LUAD, and only 5 of 143 (3.5%) tissue samples were misclustered (Additional file 3: Figure S2).
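The selection of invasive-specific genes amounts to intersecting two thresholded gene lists. The following is a minimal Python sketch of that step, not the code used in the study. It assumes that each comparison is available as a mapping from gene symbol to a (log2 fold change, FDR) pair, so that |FC| > 2 corresponds to |log2 FC| > 1; the function names and the toy gene names and values are hypothetical.

```python
import math


def significant_genes(results, min_fold=2.0, max_fdr=0.05):
    """Return genes with absolute fold change > min_fold and FDR < max_fdr.

    `results` maps gene symbol -> (log2 fold change, FDR); this input
    layout is an assumption made for illustration.
    """
    min_log2 = math.log2(min_fold)
    return {
        gene
        for gene, (log2_fc, fdr) in results.items()
        if abs(log2_fc) > min_log2 and fdr < max_fdr
    }


def invasive_specific(ais_vs_invasive, normal_vs_invasive):
    """Genes that pass the thresholds in both comparisons."""
    return significant_genes(ais_vs_invasive) & significant_genes(normal_vs_invasive)


# Toy values with hypothetical gene names (not taken from the study).
ais_vs_inv = {"GENE_A": (-2.3, 0.001), "GENE_B": (1.8, 0.01), "GENE_C": (0.4, 0.2)}
norm_vs_inv = {"GENE_A": (-3.1, 0.0001), "GENE_B": (0.9, 0.03), "GENE_C": (2.5, 0.001)}

print(sorted(invasive_specific(ais_vs_inv, norm_vs_inv)))  # ['GENE_A']
```

In this toy input only GENE_A passes both filters: GENE_B fails the fold-change threshold in the second comparison, and GENE_C fails both thresholds in the first.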

Of the 61 invasive-specific genes, 16 were upregulated and 45 were downregulated in the invasive tissues compared to their expression levels in normal and AIS tissues (Fig. 1). Similar regulation patterns were observed in the normal vs. stage III LUAD (TCGA) comparison, with a few exceptions. In the TCGA samples, TRIM9 and CYP4F3 were regulated in the opposite direction, and two other protein-coding genes and four lncRNAs were not differentially expressed (Fig. 1). Functional annotation of the upregulated and downregulated gene sets by DAVID [8] revealed several cancer-related biological processes. The inflammatory response (P-value = 0.054, downregulated genes) and negative regulation of cell growth (P-value = 0.063, upregulated genes) were marginally enriched, suggesting their roles in invasive cancer development (Additional file 4: Table S2).

Putative driver somatic mutations

Somatic mutations were identified using MuTect2 (Table 1, Methods) based on paired RNA-seq data (normal and AIS, normal and invasive). We found a total of 271,064 and 273,292 significant somatic mutations in AIS and invasive lung tissues, respectively. We then employed Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM) to predict driver mutations [9]. Our results showed that 362 of the 6445 mutated genes in AIS tissues were reported as driver genes, while 411 of the 6509 mutated genes in invasive tissues were identified as drivers (CHASM score > 0.8 and P-value < 0.05, Methods).

Table 1 Identification of somatic mutations and putative driver genes in LUAD

Meanwhile, the somatic mutation data of stage III TCGA lung adenocarcinoma cases were collected. Stage III tumors usually have increased size, extent, or degree of penetration, but no distant metastasis [10]. We found 201 significant putative LUAD driver genes based on the TCGA stage III cases (CHASM score ≥ 0.8 and P-value < 0.05). A total of 27 putative driver genes were common to the TCGA stage III dataset and the AIS lung cancer dataset. The two patient datasets were independent. The relatively low concordance rate (13.4% and 6.6% of the TCGA stage III and AIS datasets, respectively) may be attributable to the heterogeneity of the disease. Of the 201 TCGA putative driver genes, 68.2% (137/201) were consistently highly expressed (median FPKM > 1) in both the normal and invasive samples, while only 29.9% (60/201) were differentially expressed genes.

Lung adenocarcinoma invasion associated gene subnetworks

Next, we constructed subnetworks to explore the relationships between putative driver genes and differentially expressed genes in the disease. Our analysis showed that the majority of invasive-specific genes had no driver mutations (98.4%, 60/61); ADGRL3 (CHASM score = 0.824, P-value = 0.0044) was the only predicted LUAD driver gene among them. Conversely, most (70.1%) putative driver genes were not differentially expressed. Hence, subnetworks built by integrating mutations and expression profiles can connect genotype with transcription and potentially reveal novel pathways in the disease. We then adopted a GA to search for gene groups associated with lung cancer invasion. Guided by the fitness score, the GA searches for optimal solutions over successive generations, potentially yielding globally optimal subnetworks that discriminate tissues with different phenotypes.

We used each invasive-specific differentially expressed gene as a seed for subnetwork construction. All the genes that directly or indirectly interacted with the seed gene through PPIs and protein-DNA interactions formed the original chromosome, which is a potential solution for the GA (Methods). Each position of the chromosome holds an indicator, 1 or 0, denoting whether the corresponding gene is selected or not. In each generation of the GA, the top 5% of the chromosomes with the smallest fitness scores were kept and passed on to the next generation. The fitness score was calculated based on mutual information. The LUAD seed genes were always kept in the gene set for calculating the fitness scores. Compared to the greedy algorithm, the GA search resulted in subnetworks that better distinguish invasive LUAD from normal cases. We used 500 generations in the GA search; no further improvement was observed beyond around 300 generations (Additional file 5: Figure S3). Among the final networks identified by the GA, some seed genes were associated with multiple subnetworks with the same smallest fitness score (Fig. 2). On the other hand, three seed genes had no interacting genes, either by PPI or by protein-DNA interaction. Presently, neither PPI nor protein-DNA interaction information is available for the lncRNAs. Hence, for the construction of subnetworks using lncRNAs as seeds, the interactions between lncRNAs and genes were based on the results from GENIE3 [11] (Methods). Furthermore, we generated a single core subnetwork for each of the seed genes (Methods). Nineteen core subnetworks (31.1% of the 61 seed genes) that contained at least one predicted driver gene were revealed (Additional file 6: Table S3). Two statistical tests based on random gene members and random tissue types (Methods) suggested that the majority of core networks were statistically significant (Additional file 7: Figure S4).
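The search loop described above can be illustrated with a small binary GA written in plain Python. This is a hedged sketch rather than the implementation used in the study: the function `run_binary_ga`, its one-point crossover, the population size, and the toy fitness function (a count of mismatches against a hidden target, standing in for the mutual-information-based score) are all assumptions made for illustration. Only the 0/1 chromosome encoding, the retention of the best 5% of chromosomes, and the convention that smaller fitness is better follow the text.

```python
import random


def run_binary_ga(n_genes, fitness, pop_size=40, generations=60,
                  elite_frac=0.05, mutation_rate=0.05, seed=0):
    """Minimise `fitness` over 0/1 chromosomes of length n_genes (n_genes >= 2).

    Returns the best chromosome and the best fitness seen in each generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    n_elite = max(1, int(elite_frac * pop_size))
    history = []
    for _ in range(generations):
        population.sort(key=fitness)          # smaller fitness is better
        history.append(fitness(population[0]))
        # Elitism: the best chromosomes are copied unchanged.
        elite = [chrom[:] for chrom in population[:n_elite]]
        parents = population[:pop_size // 2]  # assumed truncation selection
        children = []
        while len(children) < pop_size - n_elite:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)   # one-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = elite + children
    population.sort(key=fitness)
    history.append(fitness(population[0]))
    return population[0], history


# Toy fitness: number of positions that differ from a hidden target selection.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]


def toy_fitness(chromosome):
    return sum(a != b for a, b in zip(chromosome, TARGET))


best, history = run_binary_ga(len(TARGET), toy_fitness)
print(best, history[0], history[-1])
```

Because the elite chromosomes are carried over unchanged, the best fitness in `history` can never get worse from one generation to the next, which is the property that makes a plateau in the fitness curve a reasonable stopping signal.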

Fig. 2 The distribution of the number of subnetworks associated with each of the LUAD invasion seed genes

AGER, an invasive-specific gene, is associated with the inflammatory response. Inflammation is an important factor in the development of cancers, including lung cancer. The core subnetwork of AGER consisted of 30 genes, including two predicted driver genes and two transcription factors (Fig. 3). Functional analysis of this subnetwork using DAVID identified several cancer-related KEGG pathways. The top enriched pathway was the VEGF signaling pathway (P-value = 1.4E-6 and adjusted P-value = 1.8E-4). The VEGF pathway has been reported to regulate tumor angiogenesis and drive renal cell carcinoma progression [12, 13]. KRAS was a putative lung cancer driver gene. This gene interacted with MAPK3 and indirectly interacted with PTGS2, RAC1, and AGER in the core subnetwork (Fig. 3). In the VEGF signaling pathway [14], KRAS lies upstream of MAPK3, and both are involved in cell proliferation (Additional file 8: Figure S5). Thus, these invasive-specific subnetworks (Additional file 9: Figure S6) can help identify novel pathways involved in the cancer invasion process.

Fig. 3 The core subnetwork of the seed gene AGER. AGER was under-expressed in invasive tumor cases. KRAS and PIK3CA were driver genes (yellow squares) predicted by CHASM. SP4 and HIF1A were lung cancer-specific transcription factors (orange) that regulated the expression of AGER

HNF4A was another invasive-specific gene suggested by our study. This gene is one of the best-known tumor suppressors in liver and pancreas [15] and is related to the negative regulation of cell growth, a biological process contributing to tumor development and growth [16]. The core subnetwork associated with HNF4A (Fig. 4) showed its interaction with PDGFRA, a gene that encodes a cell surface tyrosine kinase receptor for members of the platelet-derived growth factor family [17]. HNF4A plays a role in organ development, wound healing, and tumor progression. Overexpression of this gene potentially promotes tumor progression and indicates poor prognosis [18]. PDGFRA harbored somatic mutations and was predicted as a driver gene in invasive lung cancer. Thus, abnormal expression of HNF4A in the disease could be a consequence of the mutations in PDGFRA.

Fig. 4 The core subnetwork of the seed gene HNF4A. The gene HNF4A is a lung adenocarcinoma-related transcription factor (TF). PDGFRA was predicted as a lung cancer driver gene

Discussion

Investigating how lung cancer develops from a non-fatal subtype, such as AIS, to the invasive stage provides insights into the mechanisms responsible for the deterioration of the disease. We combined two independent datasets to infer invasive-specific subnetworks. Gene expression alteration patterns tended to be more robust than somatic mutations across different patient groups. Almost 98% of the DEGs were the same in the GSE52248 and TCGA LUAD patients. However, the putative somatic driver genes had an overlap rate of only about 13.4%, reflecting the high genetic heterogeneity of the disease. Two genes, TRIM9 and CYP4F3, had opposite expression patterns between the two datasets, which may be explained by diverse isoform expression patterns, as has been reported for HNF4A. Chellappa et al. found that the diverse isoforms of HNF4A, especially P2-HNF4α, showed different expression patterns in various tissue samples [19]. As a tumor suppressor, HNF4A is usually downregulated in tumor samples. Interestingly, this gene was over-expressed in invasive lung tumor samples relative to normal samples in both the GSE52248 and TCGA data.

The size of the GA chromosome affects the optimal solution that the algorithm is able to find. Here, the size of the chromosome equals the number of candidate genes that directly or indirectly interact with the seed gene. The maximum search distance from the seed gene was three for our subnetwork construction. At the outermost layer of the subnetworks, the total number of candidate genes often reached 18,000, covering the majority of human protein-coding genes (~ 23,000). Compared to the greedy algorithm, the GA can identify globally optimal subnetworks associated with the disease. The fitness function is an important factor in the GA search. Here, we used mutual information to calculate the fitness score, which was estimated using discrete expression bins derived from continuous expression values. When the sample size is small, the number of final subnetworks can increase rapidly and the results become less stable. Thus, for a small sample size, GA-based network construction may need a different fitness function to guide the search. In general, we found that a larger sample size led to more stable optimal gene groups.
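To make the binning step concrete, the sketch below estimates the mutual information between a discretized activity score and the tissue labels in plain Python. It is an illustration under stated assumptions, not the study's code: equal-width binning, the number of bins, the idea of summarizing a subnetwork by a single per-sample activity value, and the use of the negated mutual information as a score to minimize are all hypothetical choices, and the toy values are invented.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Assign each continuous value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Toy per-sample activity of a hypothetical subnetwork and the tissue labels.
activity = [0.10, 0.20, 0.15, 0.90, 1.00, 0.95]
labels = ["normal"] * 3 + ["invasive"] * 3

mi = mutual_information(discretize(activity, n_bins=2), labels)
fitness = -mi  # assumed convention: smaller fitness is better
print(round(mi, 6), round(fitness, 6))  # 1.0 -1.0
```

With few samples, many different gene groups can reach the same binned pattern and hence the same score, which is one way to read the instability noted above for small sample sizes.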

Conclusions

We developed a novel GA-based network construction method for inferring gene subnetworks associated with invasive lung adenocarcinoma. The method integrated gene expression, PPIs, transcription factor-gene interactions, and lncRNA regulation to uncover globally optimal subnetworks underlying invasive progression. Two independent patient datasets were used to derive invasive-specific differentially expressed genes. The 19 core subnetworks associated with invasive-specific genes each contained at least one putative driver gene and were significantly enriched in several biological processes and pathways involved in tumor growth. These results could enhance our understanding of cancer progression, which helps to develop strategies for preventing cancer invasion and improving the survival of cancer patients.

Methods

Identification of differentially expressed genes (DEGs)

RNA-seq data (GSE52248) generated from normal, AIS, and invasive tissues of six patients were downloaded from GEO. The sequencing quality was assessed by FastQC. Low-quality reads were trimmed by Trimmomatic (v0.36, LEADING:28 TRAILING:28 SLIDINGWINDOW:4:24 MINLEN:70) [20]. Tophat2 (v2.1.0) was applied for read alignment, with human genome hg38 as the reference genome [21]. HTSeq-count (v0.8) [22] and Cufflinks (v2.2.1) [23] were used to calculate the raw read counts and Fragments Per Kilobase of transcript per Million mapped reads (FPKM), respectively, based on the gene annotation of Ensembl version GRCh38.87. After filtering out the unexpressed genes with a median raw count equal to zero, edgeR [24] was used for differential expression analysis. Genes with an absolute fold change greater than 2 and FDR < 0.05 were considered significantly differentially expressed between tissue types.

Putative driver mutation identification

The paired RNA-seq reads of GSE52248 for normal vs. AIS and normal vs. invasive lung tissue samples were passed to MuTect2 [25]. The normal samples were used as controls in the comparisons to obtain somatic mutations. We further collected the mutation profiles (VCF format) of 84 stage III lung adenocarcinoma cases from TCGA as comparable invasive tumor samples. After the PASS filtering of MuTect2, the resulting somatic mutations were fed into CHASM-5.2 [9], an online tool that calculates mutation scores and then reports the putative driver genes. Lung adenocarcinoma was used as the disease type for prediction. Somatic mutations with score > 0.8 and P-value < 0.05 were predicted as putative driver mutations. The CHASM score ranges from 0 (likely passenger) to 1 (likely driver). The P-value is an empirical value representing the probability that a passenger mutation is misclassified as a driver.
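The driver-gene cutoff can be written as a simple filter. The sketch below is illustrative only: the record layout with the keys `gene`, `chasm_score`, and `p_value` is a hypothetical stand-in and does not reproduce the actual CHASM output format, and the toy records are invented.

```python
def putative_driver_genes(mutations, min_score=0.8, max_p=0.05):
    """Genes with at least one mutation scoring > min_score at P < max_p.

    Each record is a dict with the hypothetical keys
    'gene', 'chasm_score', and 'p_value'.
    """
    drivers = set()
    for record in mutations:
        if record["chasm_score"] > min_score and record["p_value"] < max_p:
            drivers.add(record["gene"])
    return drivers


# Toy records with hypothetical gene names and values.
calls = [
    {"gene": "GENE_A", "chasm_score": 0.91, "p_value": 0.003},
    {"gene": "GENE_A", "chasm_score": 0.40, "p_value": 0.600},
    {"gene": "GENE_B", "chasm_score": 0.85, "p_value": 0.200},
    {"gene": "GENE_C", "chasm_score": 0.80, "p_value": 0.010},
]

print(sorted(putative_driver_genes(calls)))  # ['GENE_A']
```

A gene is reported once if any of its mutations passes both thresholds; GENE_C is excluded here because the score cutoff is strict.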

Global subnetworks construction by GA

For each seed gene, the genes that interacted directly or indirectly with it through protein-DNA (TF and target gene) interactions and PPIs were considered the candidate genes of the network. The maximum radius of the network from the seed gene was set to three. Unexpressed genes (median FPKM < 1) were removed from the candidate gene sets. Non-redundant PPIs were collected from five databases: IntAct, MINT, BioGRID, DIP, and Reactome [26,27,28,29,30]. The lung cancer-specific transcription factor and target gene pairs were downloaded from Regulatory Circuits [31]. GENIE3 [11] was applied to infer the target genes of the lncRNAs. GENIE3 uses Random Forest to predict the regulatory relationships between genes according to their expression levels. The top 200 target genes that were potentially regulated by each lncRNA were selected for network construction.
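Collecting the candidate genes within a radius of three from a seed is a breadth-first search over the interaction graph. The sketch below shows one way to do this in Python. It assumes that PPIs and TF-target pairs have been merged into a single list of undirected edges, which is a simplification made for illustration; the function names and the toy edges and FPKM values are hypothetical.

```python
from collections import deque


def candidate_genes(edges, seed, max_radius=3):
    """Map each gene within max_radius steps of `seed` to its distance."""
    neighbours = {}
    for a, b in edges:                      # edges treated as undirected
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        gene = queue.popleft()
        if distance[gene] == max_radius:    # do not expand past the radius
            continue
        for nb in neighbours.get(gene, ()):
            if nb not in distance:
                distance[nb] = distance[gene] + 1
                queue.append(nb)
    distance.pop(seed)                      # the seed itself is not a candidate
    return distance


def expressed_only(genes, median_fpkm, min_fpkm=1.0):
    """Drop genes whose median FPKM is below min_fpkm (or unknown)."""
    return {g for g in genes if median_fpkm.get(g, 0.0) >= min_fpkm}


# Toy chain SEED - A - B - C - D with hypothetical expression values.
edges = [("SEED", "A"), ("A", "B"), ("B", "C"), ("C", "D")]
fpkm = {"A": 12.0, "B": 0.3, "C": 4.5, "D": 9.0}

within_three = candidate_genes(edges, "SEED")
print(within_three)                               # {'A': 1, 'B': 2, 'C': 3}
print(sorted(expressed_only(within_three, fpkm)))  # ['A', 'C']
```

In this sketch the expression filter is applied after the traversal, so an unexpressed gene such as B can still connect the seed to genes farther out; filtering before the traversal would be a different design with smaller candidate sets.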

The R package genalg [32] was used to perform the GA analysis. We used the binary GA (0 indicates that the corresponding gene is unselected, whereas 1 indicates that the gene is selected) to search for the optimal subnetwork member genes. The length of the chromosome is equal to the number of candidate subnetwork genes for each seed. The mutation rate was set to 5%, and the zeroToOneRatio argument of genalg was set to 19 to control the gene selection. A larger zeroToOneRatio value results in fewer genes being retained in each generation. To find the core subnetwork, the subnetworks that did not contain any putative driver genes were removed first. Then, we calculated the frequency of each gene in the remaining subnetworks and filtered out the genes with low frequency. Here, the cutoff frequency was set to 50%. We further conducted two statistical tests to evaluate the significance of the subnetworks compared to the genomic background. We constructed two null distributions of fitness scores, one by permuting the sample labels and one by randomly selecting network members, with 1000 repetitions each. Then, we calculated the corresponding P-values for each core network to assess its significance.
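The core-subnetwork filter and the permutation-based P-value can be sketched as follows. This is a minimal Python illustration and not the study's R code: the function names, the inclusive treatment of the 50% cutoff, and the add-one correction in the empirical P-value are assumptions, and the toy subnetworks and null scores are invented.

```python
from collections import Counter


def core_subnetwork(subnetworks, driver_genes, min_frequency=0.5):
    """Keep genes present in at least min_frequency of driver-containing subnetworks."""
    drivers = set(driver_genes)
    # Step 1: drop subnetworks without any putative driver gene.
    kept = [set(s) for s in subnetworks if set(s) & drivers]
    if not kept:
        return set()
    # Step 2: count how often each gene occurs in the remaining subnetworks.
    counts = Counter(gene for s in kept for gene in s)
    return {g for g, c in counts.items() if c / len(kept) >= min_frequency}


def empirical_p_value(observed, null_scores):
    """Fraction of null fitness scores at least as small as the observed one.

    Smaller fitness is better; the +1 terms are an assumed convention that
    avoids reporting a P-value of exactly zero.
    """
    hits = sum(1 for score in null_scores if score <= observed)
    return (hits + 1) / (len(null_scores) + 1)


# Toy subnetworks for one seed, with a hypothetical driver gene D1.
subnetworks = [
    {"SEED", "D1", "X"},
    {"SEED", "D1", "Y"},
    {"SEED", "Z"},            # no driver gene, removed in step 1
    {"SEED", "D1", "X", "W"},
]

print(sorted(core_subnetwork(subnetworks, {"D1"})))        # ['D1', 'SEED', 'X']
print(empirical_p_value(-1.0, [-0.2, -0.5, -1.2, -0.1]))   # 0.4
```

With 1000 permutations, the smallest P-value this estimator can return is 1/1001, which sets the resolution of the significance test.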