Background

The rapid development of high throughput gene expression profiling technology such as microarray and high throughput sequencing has enabled the development of many new bioinformatics data analysis methods for identifying disease related genes, characterizing disease subtypes and discovering gene signatures for disease prognosis and treatment prediction. For instance, in breast cancer research, a supervised approach was adopted to select 70 genes as biomarkers for breast cancer prognosis [1, 2] and was successfully tested in clinical settings [3]. However, a major drawback of such approach is that the selected gene features are usually not functionally related and hence cannot reveal key biological mechanisms and processes behind the difference of the two patient groups.

In order to overcome this issue and identify functionally related genes associated with disease development and prognosis, several approaches have been adopted. One of such approaches is to use gene co-expression analysis. For instance, in [4] and [5], we carried out gene co-expression network analysis for biomarker discovery in different types of cancers.

The goal of gene co-expression network (GCN) analysis is to identify group of genes which are highly correlated in expression levels across multiple samples. The genes in the same co-expression sub-network are often enriched with similar functions. The metric to measure the correlation is usually the correlation coefficient (e.g., Pearson correlation coefficient or PCC) between expression profiles of two genes [68]. Then for each dataset, a weighted graph can be derived with the vertices being the genes and the weights of the edges being the PCC values between the two gene expression profiles. However, many network mining algorithms take only binary edges by imposing a threshold on the PCC values (i.e., two genes are connected by an edge only if the PCC value between them is higher than a pre-defined threshold) and transforming the network into a sparse unweighted gene co-expression network (UWGCN). For instance, in [6], an algorithm called CODENSE was developed to identify frequent UWGCNs from multiple datasets and this method has been applied to cancer biomarker discovery. Issues with the UWGCN approach include how to determine the threshold of PCC values and the threshold may be too rigid to include edges with weights around that threshold. Thus weighted GCN (WGCN) methods have been developed.

For WGCN, Stephen Horvath's group has developed a series of methods for identifying gene clusters which are highly correlated using hierarchical clustering based approach [7, 9, 10]. This method was applied to identify disease associated genes such as the ASPM gene in glioblastoma [9]. However, there are several drawbacks of using the hierarchical clustering approach. First, it does not allow direct control over the intracluster connectivity such that the vertices within a cluster have high correlations on average. Second, the clustering approach does not allow shared genes between two sub-networks even though in biology, many genes have multiple functions and can be shared by multiple functional groups and dense sub-networks. Finally, clusters identified using this approach are often large (e.g., more than 100 genes), thus smaller gene networks which contain subtle functional information may not be detected.

In this paper, we take advantage of the dense sub-network finding method in the graph mining community and apply it to mine functional networks using the WGCN approach to identify dense co-expression sub-networks in glioblastoma. Specifically, using The Cancer Genome Atlas (TCGA) data sets, we identified co-expressed sub-networks (sub-networks for short in the following) for genes then we tested if these sub-networks can be used as features to separate patients into groups with different survival times. Using this approach, we identified 16 gene networks associated with GBM prognosis. Our results not only confirmed previous findings in GBM, but also suggested important epigenetic events (histone acetylation) in GBM development and prognosis.

Methods

Gene expression dataset for GBM

We downloaded gene expression data from the Cancer Genome Atlas (TCGA) project webpage (http://cancergenome.nih.gov, downloaded on 11/24/2010) for all GBM patients with gene expression data generated using Affymetrix HU133 Genechip. We also downloaded all available public clinical data including survival information. In total, we selected 361 patients with complete data (i.e., each has one set of gene expressions, one set of microRNA expressions, and public clinical information). Among them, 345 have a valid vital status (i.e., either LIVING or DECEASED) and they are good for survival tests. The gene expression data were normalized using RMA normalization as described in the TCGA NCI Wiki.

Building WGCN for genes

After normalization a total of 12,042 unique genes were available. PCC were computed between every pair of genes. We then set the genes to be the vertices of the WGCN with the absolute values of PCC (|PCC|) being weights of the edges.

Identify quasi-cliques in the WGCNs

We first define the density of a weighted network with N vertices with w ij being the weight, normalized between 0 and 1, between vertices v i and v j (i = 1, 2,..., N, j = 1, 2,..., N, and i≠j ) as d= i = 1 N - 1 j = i + 1 N w i j N ( N - 1 ) / 2

For mining densely connected networks in the WGCN, our approach is based on an existing algorithm previously developed for mining weighted networks [11]. Different from many graph mining approaches (e.g., [12]) that focus on unweighted graphs, the algorithm of [11] targets primarily at identifying dense components (or sub-networks) in a weighted graph (i.e., each edge has a weight), although it is called Quasi-Clique Merger (QCM). To mine dense-sub-networks in a gene-coexpression network, we slightly revise the original QCM algorithm by removing the hierarchical construction which does not contribute to our dense-sub-network finding, and changing the new search start condition from checking vertex coverage to checking edge coverage to ensure that each edge with its weight no less than the weight threshold (γ times the maximum edge weight) will be covered by at least one dense-sub-network. The revised algorithm is sketched below:

Algorithm 1 eQCM (edge-covering Quasi-Clique Merger, a revised version of QCM [11]. Input G=(V, E), γ, λ, t, β, Output: C )

1: Sort edges in E in descending order of their weights;

2: for i = 1:μ {eμ is the last edge in the above sorted list with weight ≥γ·e1}

3: if ei is an edge in any sub-network in C

4: continue;

5: endif

6: C = V(ei); U = V \ V(ei);

7: while max{v∈U}(contribute(v,C)) ≥ threshold

8: C = C ∪ {v};

9: U = U \ {v};

10: endwhile

11: C = C ∪ {C};

12: endfor

13: Merging highly overlapped sub-networks in C with respect to β;

14: Output C ;

To be consistent with the original QCM algorithm [11], contribute (v, C) is defined as the ratio of the edge weight increase of G(C) on adding the vertex v, over the size of C, and threshold is 1 - 1 2 λ | C | + t *density (G(C)), which is determined by the input parameter λ, t, the size of C, and the density of the sub-network induced by C. Readers may refer to [11] for additional details. The last second step (merging) is the step 4 in the original QCM algorithm. Since we are interested in identifying gene sub-networks with potential consistent functions, we select only the sub-networks with at least 10 genes to facilitate gene function enrichment analysis.

Survival test for identified networks

For each sub-network, we test if the genes in it can be used as potential prognostic markers for predicting GBM survival. For a network with k genes, we extract the expression values for them for all patients and use them as the feature vectors for patients. The patients are then divided into two groups using the unsupervised K-means clustering algorithm (K = 2, 100 time replicates, correlation distance measure).

The survival times for the two patient groups are plotted in Kaplan-Meier curves and the difference between the two groups is tested using log-rank test (code at http://www.mathworks.com/matlabcentral/fileexchange/20388). P-values for the log-rank tests for all the identified networks are recorded.

Select representative sub-networks with significant p-values

Since many of the identified networks have large overlaps, we cannot directly apply multiple test compensation method such as the Bonferroni correction as the tests are not independent and such correction would be too conservative. Instead, we design a randomized test to determine the false discovery ratio (FDR) for selecting significant sub-networks.

For an N-gene sub-network, we randomly selected a list of genes from the entire gene list in the dataset such that the expected length of the selected gene list is N. Then we repeat the survival test process as described above. Such random test is repeated 1000 times. The lower 5th percentile of the 1000 p-values is used as the threshold for p-value for selecting sub-networks with significant prognostic power. Since we have a large number of sub-networks and cannot carry out 1000 random tests for every possible N, we do such random tests for N = 1*101, 2*101,...9*101, 1*102, 2*102,... and the p-value thresholds are p10, p20,..., p100, p200,... Our results show that the p-value thresholds are close when N are close. Thus for a sub-network with size N', its p-value for survival test is compared to p i where i = N 10 lg N * 10 lg N to determine if it is significant. For example, a gene list with 28 genes compares its p-value to p 20 , and a gene list with 250 genes compare its p-value to p 200 .

We also noticed that many of the selected significant sub-networks have substantial overlaps and they form exclusive clusters. To identify such clusters, we iteratively merge networks with substantial overlaps (i.e, the overlap ratio r between two networks is larger than 50%) into clusters. The overlap ratio between two sub-networks G1=(V1, E1) and G2=(V2, E2) is defined as | V 1 V 2 | m i n ( V 1 , | V 2 | ) Then we pick the sub-network with the lowest p-value in each cluster as the representative sub-network for further analysis.

For the representative sub-networks, we used TOPPGene (http://toppgene.cchmc.org) for gene ontology and pathway enrichment analysis without Bonferroni correction.

Results

Using the eQCM algorithm (γ = 0.7, λ = 1, t = 1, β = 0.99999), we identified 8,124 sub-networks with at least ten vertices in the WGCN. The survival tests were then carried out on them and 866 show p-values less than 0.05. In addition, random tests were performed to obtain p10, p20,..., p90, p100,..., p500 and all of them are smaller than 0.01. 170 sub-networks with significant p-values were selected and their densities range from 0.612 to 0.862. Then sub-networks with substantial overlaps (overlap ratio > 50%) were iteratively merged into sixteen clusters. The representative sub-networks for every cluster and their p-values and enriched GO functions are shown in Table 1. For cluster 1, the representative sub-network is highly enriched with genes involved in extracellular matrix organization (p = 8.22 × 10-7) which also engage in many important biological processes such as cell-cell signaling and immune responses. Indeed, the entire set of genes in cluster 1 are highly enriched with immune system process genes (p = 1.01 × 10-46). Figure 1 shows examples of the Kaplan-Meier curves for some of the representative sub-networks in separating the patients using the unsupervised K-means algorithm, and heatmaps for these sub-networks.

Table 1 List of representative networks with log-rank test p-values.
Figure 1
figure 1

Kaplan-Meier curves and heatmaps for two groups of patients with significantly different survival times identified using three networks. Top: Left - The Kaplan-Meier curve for the patients separated using the immune response network #1. Right - The heatmap of the gene expression values for the genes in the representative network #1. The vertical white line indicates the separation between short survival (left to the white line) and long survival (right to the white line) groups. The density of the representative network is 0.6928 and the p-value is 5.7 × 10-5. Middle: The Kaplan-Meier curve and heatmap for the chromatin modification network #6. The density of the representative network is 0.6951 and the p-value is 0.0016. Bottom: The Kaplan-Meier curve and heatmap for the mitotic cycle network #13. The density of the representative network is 0.6764 and the p-value is 0.006774.

Discussion

In this paper, we carried out a co-expression analysis on GBM gene expression data to screen for biological processes involved in patient prognosis. In previous studies, using co-expression analysis based on clustering algorithm, ASPM has been identified as an important target gene in GBM [9]. ASPM is involved in cell cycle and mitosis functions and many networks with ASPM were identified in our study. We also identified a mitosis related sub-network with a significant p-value in our study (sub-network #13 in Table 1). Besides cell cycle networks, immune response networks also prove to be critical in GBM development as shown in sub-networks #1 and #11, which is consistent with the previous report on the importance of immune and inflammation genes in GBM [13]. As shown in Figure 1, genes in sub-network #1 show higher expression levels in the short survival group. Since a characteristic of GBM is its high metastasis occurrence and extracellular and immune genes play important roles in metastasis, the genes in this group may be potential targets for treatment for reducing metastasis. An interesting observation is that two sub-networks (#2 and #6) related to chromatin modification are identified. Particularly in sub-network #6, histone acetylation genes are highly enriched including well known chromatin modification genes such as CTCF [14] and EP400 [15]. The expression levels of these genes show down-regulation in the short survival group which indicates a possibly reduced histone acetylation activity. Histone acetylation is an important epigenetic event [16] and our findings suggest that epigenetics may play an important role in GBM development and prognosis and ChIP-seq experiments targeting histone acetylation changes associated with GBM development may be necessary. These findings are subject to further cross-validation and experimental investigation. Besides genes, our approach can be applied to identify microRNA modules which show strong association with patient survival and the results can also shed light on microRNA transcription regulation.

Conclusions

In this paper, we introduced eQCM algorithm for mining dense network clusters in weighted graphs and used this approach to identify 16 gene networks associated with GBM prognosis on weighted gene co-expression network. Our results not only confirmed previous findings including the importance of cell cycle and immune response networks in GBM, but also suggested important epigenetic events in GBM development and prognosis.