Background

Through the development of high-throughput sequencing technology and collaborative projects such as The Cancer Genome Atlas (TCGA), the integrative analysis of clinical data and genomic data at different molecular levels has emerged as a prominent tool for improving our understanding of the biological mechanisms underlying cancer. Many computational attempts have been made to identify molecular abnormalities that affect clinical outcomes and therapeutic targets by integrating multiple genomic profiles and clinical data [1–15]. In particular, the association between various genomic features and the clinical outcome of cancer patients has been studied extensively. Previous studies have often focused on the association between each single gene and clinical outcomes [16–19], and have not been able to detect the combined effects of multiple genomic features. Other approaches are based on regression models that can describe the effects of multiple features. For example, Cox regression or a sparse regression framework such as elastic net analysis is effective in finding gene expression signatures associated with the overall survival of cancer patients [20]. However, these methods are limited to detecting the additive effect of multiple features on clinical outcome, and do not extend well to more general types of interaction effects.

More recently, network information either between patients or between genes has been shown to significantly improve the accuracy of predicting clinical outcomes, such as survival in cancer patients. Kim et al. developed an integrated framework based on graph-based semi-supervised learning to handle multi-level genomic data for the prediction of clinical outcomes in ovarian serous cystadenocarcinoma [10]. The similarity network between patients is first constructed by using genomic feature values, and then the network information is utilized in learning the clinical label of new patients. Cox-regression for predicting cancer patient survival has also been successfully extended to incorporate the network structure among genes [21]. However, many of the existing networks used for such analyses are either constructed by a simple correlation approach between features or taken from an existing knowledge base, such as protein-protein interaction networks. Neither type of network contains information about the effect of gene interactions on clinical outcomes in a given dataset. Alternatively, some studies have taken clinical outcomes into account when constructing networks. Vandin et al. identified mutated sub-networks associated with clinical outcome using the HotNet algorithm [22, 23], and Pauling et al. proposed a network integration method with hybrid network construction and differential network mapping for condition-specific key pathways [24]. However, these studies focused only on the association between single genes and clinical outcomes.

In terms of genomic features, gene signatures based on mRNA expression have been the most widely investigated to date, while other features such as Copy-Number Alteration (CNA), miRNA, or methylation levels have recently been gaining more attention. For example, Gorringe et al. tried to identify interactions between genomic loci affected by CNA in samples from ovarian cancer patients, although they found no association with survivability [25].

In this paper, we propose a new integrative framework to identify interacting gene pairs that affect the clinical outcome of cancer patients. Our approach, Mutual information-based Integrative Network Analysis (MINA), allows systematic investigation of gene-gene interactions associated with clinical outcome via gene network construction and analysis. Unlike many existing models, which consider the effects of each single gene or of multiple genes with additive effects on clinical outcome, the proposed method focuses on identifying gene-gene interaction effects of any type on clinical outcome. By building a gene interaction network, we obtain a global view of the gene interaction landscape that is associated with the clinical outcome of patients. To gain better insight into the gene interactions that affect clinical outcome, we utilized available genomic profiles across different molecular levels. We find that the resulting integrated network has a markedly higher level of scale-freeness and biological significance than each network based on a single genomic profile.

Our method differs from many previous computational network analysis schemes in that an edge between genes in our network directly implies the interactive effect of a pair of genes on clinical outcome. For instance, Languino et al. constructed a correlation gene network from data for the NCI-human tumor cell lines [26]. Hong et al. proposed an integrative network construction scheme based on two independent datasets of ovarian cancer patients [27]. Network-based stratification, proposed by Hofree et al., uses a mapping scheme from public databases to construct a gene-gene interaction network [28]. However, all of these methods construct networks using feature information only, without considering the clinical outcome during network construction. Thus, edges in those networks represent only the strength of interaction between two genes, regardless of differences in outcome among the samples. In contrast, we propose an outcome-guided mutual information network in which edges reflect both the interaction effect and the difference in clinical outcome of the given samples. Moreover, the outcome-guided network could improve the survivability prediction performance of network-based Cox-regression in comparison with traditional networks such as a correlation network or a static protein-protein interaction network [29].

Instead of relying on parametric tests, which may suffer from a large number of pairwise tests and multiple testing issues, we use an information-theoretic measure of mutual information and a non-parametric approach to extract significant interactions among genes. Mutual information has been widely used as an association measure in the context of genome-wide association studies for detecting epistasis, but rarely in the association between general genomic features and clinical outcomes. It has the advantages of being flexible and easily applied to both discrete and continuous variables. We implemented an efficient non-parametric testing scheme based on permutation, for measuring the statistical significance of detected interactions.

Here, we apply the proposed method to TCGA data from ovarian cancer patients. Ovarian cancer is a fatal gynecological cancer that is the leading cause of genital system cancer death and fifth-most common fatal cancer among women in the United States [30]. The cancer shows a high recurrence and poor survival rate [31], which cannot be addressed by standard treatment. In this study we detected novel strong pair-wise interactions associated with survival in ovarian cancer, including many genes with little marginal effect. We also present the topological properties and biological significance of networks constructed from multiple genomic profiles.

Methods

Mutual information for identifying gene-gene interactions associated with clinical outcome

Using genomic profile data, we identify genomic interactions that are associated with clinical outcome by utilizing the information-theoretic measure of mutual information [32]. Mutual information has been used successfully to detect linear or non-linear association between two random variables [33–36]. In most previous studies that detect interactions based on mutual information, it has been used as a measure of association between a pair of genes [33, 34]. In other words, the focus was on interactions or correlations between genes. We take a different approach by using mutual information to assess the strength of association between a pair of genes and the clinical outcome of given samples. Below, we include a brief description of mutual information and how we modify it to capture genomic interactions associated with clinical outcome.

Entropy of a discrete random variable X is defined as

$$ H(X)=-{\displaystyle {\sum}_{x\in X}p(x){ \log}_2p(x),} $$

and joint entropy of two random variables X and Y is defined as

$$ H\left(X,Y\right)=-{\displaystyle \sum_{x\in X}}{\displaystyle \sum_{y\in Y}}p\left(x,y\right){ \log}_2p\left(x,y\right). $$

Mutual information of two random variables X and Y is defined as

$$ I\left(X; Y\right)=H(X)+H(Y)-H\left(X,Y\right). $$

In order to measure the strength of association between a pair of genes and clinical outcome, we use the extended version of mutual information, which is as follows:

$$ I\left({X}_1,{X}_2; Y\right)=H\left({X}_1,{X}_2\right)+H(Y)-H\left({X}_1,{X}_2,Y\right). $$

Here, \( {X}_1 \) and \( {X}_2 \) denote the random variables for two genes, and Y denotes the random variable for the clinical outcome of patients.
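The definitions above can be illustrated with a minimal Python sketch. It is not taken from MINA (which is written in C++); the function names and the toy data are assumptions made for illustration. The toy outcome is the XOR of two binary gene features, so each gene alone carries no information about the outcome while the pair determines it completely.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more equal-length discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def pair_outcome_mi(x1, x2, y):
    """I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y)."""
    return entropy(x1, x2) + entropy(y) - entropy(x1, x2, y)

# Hypothetical toy data: the outcome is the XOR of two binary gene features.
x1 = [0, 0, 1, 1] * 5
x2 = [0, 1, 0, 1] * 5
y = [a ^ b for a, b in zip(x1, x2)]

single_1 = mutual_information(x1, y)   # no marginal association
single_2 = mutual_information(x2, y)   # no marginal association
pair = pair_outcome_mi(x1, x2, y)      # the pair fully determines the outcome
```

In this toy case the single-gene values are 0 bits and the pairwise value is 1 bit, which is the kind of interaction with little marginal effect that the extended measure is intended to capture.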

When a random variable is discrete, its probability distribution can be easily approximated by the frequency of each possible value. If a genomic profile consists of continuous-valued features, then it is not straightforward to calculate mutual information directly, because the probability distribution of a continuous variable cannot be determined directly from the given values [37]. To address this, we use the histogram-based technique [34] to discretize continuous values. This technique divides the range of a set of continuous values into equal-sized bins. The binning interval of the i-th gene in a genomic profile is determined as \( \frac{Max\left({V}_i\right)-Min\left({V}_i\right)}{B} \), where B denotes the number of bins and \( {V}_i \) is the continuous-valued vector for the gene in the profile. The size of the vector is the number of samples in the profile. As a result of discretization, each continuous value in a profile falls into one of the B bins.
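As a rough sketch of the equal-width binning described above, the following Python function assigns each value of one gene to a bin index from 0 to B − 1. The function name is hypothetical, and two details are assumptions not stated in the text: the maximum value is placed in the last bin, and a constant feature is mapped to a single bin.

```python
def equal_width_bins(values, n_bins):
    """Discretize continuous values into n_bins equal-width bins (indices 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins          # (Max(V_i) - Min(V_i)) / B
    if width == 0:
        # Assumption: a constant feature is mapped to a single bin.
        return [0] * len(values)
    # Assumption: the maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical expression values for one gene across five samples.
bins = equal_width_bins([0.0, 1.0, 2.5, 5.0, 10.0], n_bins=5)
```

With a range of 10 and B = 5, the bin width is 2, so the five values fall into bins 0, 0, 1, 2, and 4.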

We also discretize the clinical outcome variable as binary and divide patients into two groups based on survival months. As in previous studies dealing with binarized clinical information [14, 38], we define the short-term and long-term groups as the patients that survived less than or equal to 36 months, or more than 36 months, respectively.

Discretization of a genomic profile induces a partition on the set of samples. Then entropy of a random variable X can be defined in terms of the partition as follows:

$$ H(X)=-{\displaystyle \sum_{i=1}^n}\frac{\left|{A}_i\right|}{\left|S\right|}{ \log}_2\frac{\left|{A}_i\right|}{\left|S\right|}, $$

where \( X=\left\{{A}_1,{A}_2,\dots ,{A}_n\right\} \) is a partition on the set of samples S, i.e., \( S={A}_1\cup {A}_2\cup \cdots \cup {A}_n \) and \( {A}_i\cap {A}_j=\emptyset \) for distinct i and j. Joint entropy of two partitions \( X=\left\{{A}_1,{A}_2,\dots ,{A}_n\right\} \) and \( Y=\left\{{B}_1,{B}_2,\dots ,{B}_m\right\} \) can also be defined as follows:

$$ H\left(X,Y\right)=-{\displaystyle \sum_{i=1}^n}{\displaystyle \sum_{j=1}^m}\frac{\left|{A}_i\cap {B}_j\right|}{\left|S\right|}{ \log}_2\frac{\left|{A}_i\cap {B}_j\right|}{\left|S\right|}. $$

It can be naturally extended to joint entropy of any number of multiple partitions.

Extraction of outcome-associated gene-gene interactions by permutation test

Since the exact probability distribution of mutual information computed on a dataset is generally unknown, the p-value for the significance of a computed mutual information value is not directly available. Instead of using an approximate scheme such as chi-square distribution approximation [39], we use a non-parametric approach based on the permutation strategy in [34] and derive a threshold for the mutual information value. Specifically, clinical outcome labels (short-term vs. long-term) are randomly permuted and the mutual information values with respect to the permuted labels are calculated for every pair of genes. We repeat this 30 times and compute the average mutual information across the 30 runs by \( {I}_{\mathrm{avg}}\left(i,j\right)=\frac{1}{30}{\displaystyle {\sum}_{p=1}^{30}I\left({g}_i,{g}_j;{Y}_p\right)} \) for each pair of genes \( {g}_i \) and \( {g}_j \), where \( {Y}_p \) denotes the permuted clinical outcome labels at the p-th run.

The threshold θ is determined as the maximum of the average mutual information values, i.e., \( \theta =\underset{i\ne j}{ \max }{I}_{\mathrm{avg}}\left(i,j\right) \). The pairs of genes having mutual information above this threshold with respect to the original clinical outcome labels are considered as associated with the clinical outcome and included for further analysis.
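The permutation procedure can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the MINA source: the function names, the fixed random seed, the dictionary-based profile layout, and the toy data are all hypothetical, and the comparison uses "greater than or equal to" with an optional factor (1 + α) to match the network definition given later in Methods.

```python
import random
from collections import Counter
from itertools import combinations
from math import log2

def entropy(*columns):
    """Joint Shannon entropy (in bits) of equal-length discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())

def pair_outcome_mi(x1, x2, y):
    """I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y)."""
    return entropy(x1, x2) + entropy(y) - entropy(x1, x2, y)

def permutation_threshold(profile, labels, n_perm=30, seed=0):
    """theta = maximum over gene pairs of the MI averaged over n_perm label shuffles."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    pairs = list(combinations(sorted(profile), 2))
    totals = dict.fromkeys(pairs, 0.0)
    for _ in range(n_perm):
        permuted = list(labels)
        rng.shuffle(permuted)  # one shuffle is shared by all pairs in this run
        for gi, gj in pairs:
            totals[(gi, gj)] += pair_outcome_mi(profile[gi], profile[gj], permuted)
    return max(total / n_perm for total in totals.values())

def significant_pairs(profile, labels, theta, alpha=0.0):
    """Gene pairs whose MI with the original labels reaches theta * (1 + alpha)."""
    cutoff = theta * (1 + alpha)
    return {
        (gi, gj)
        for gi, gj in combinations(sorted(profile), 2)
        if pair_outcome_mi(profile[gi], profile[gj], labels) >= cutoff
    }

# Hypothetical toy profile: the outcome is the XOR of g1 and g2; g3 is unrelated.
profile = {
    "g1": [0, 0, 1, 1] * 10,
    "g2": [0, 1, 0, 1] * 10,
    "g3": [0] * 20 + [1] * 20,
}
labels = [a ^ b for a, b in zip(profile["g1"], profile["g2"])]
theta = permutation_threshold(profile, labels)
edges = significant_pairs(profile, labels, theta)
```

Because every pair is evaluated against the same shuffled labels in each run, the cost grows with the number of gene pairs times the number of permutations, which is consistent with the use of parallel processing described for MINA.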

Construction of integrative gene networks

We compute the mutual information for every pair of genes and clinical outcome by using each genomic profile separately and obtain those interactions that are associated with clinical outcome by the proposed method. This results in an outcome-guided mutual information gene network in which two genes are connected if their combination is associated with clinical outcome. We denoted a network for each profile as follows:

$$ {G}_{\alpha}^{profile}=\left\{\left({g}_i,{g}_j\right)\Big|{g}_i,{g}_j\in P\ \mathrm{and}\ I\left({g}_i,{g}_j;Y\right)\ge \theta \left(1+\alpha \right)\right\} $$

where \( {g}_i \) and \( {g}_j \) are two genes in the set of all genes P, θ is the threshold from the permutation strategy, and α is the parameter for adjusting the statistical significance level. We constructed gene networks by applying the proposed method to each of the mRNA expression, CNA, and methylation profiles, which we denote as \( {G}_{\alpha}^{mRNA} \), \( {G}_{\alpha}^{CNA} \), and \( {G}_{\alpha}^{METH} \).

To broaden our view of the gene interactions associated with clinical outcome across multiple genomic profiles, we can further construct an integrated network by merging the three networks. As a pilot study, two types of integrated networks are considered in order to examine the overall characteristics of, and relations among, the different genomic profiles: \( {I}^{\cup }={G}^{mRNA}\cup {G}^{CNA}\cup {G}^{METH} \) (integrated network with one-or-more occurrence of association across profiles) and \( {I}^{\cap }={G}^{mRNA}\cap {G}^{CNA}\cap {G}^{METH} \) (integrated network with co-occurrence of associations in every profile). The one-or-more occurrence network \( {I}^{\cup } \) is the union of the associations that exist in at least one of the genomic profiles. In contrast, an edge between two genes in the co-occurrence network \( {I}^{\cap } \) must be present in every single-profile network.
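The two integration schemes amount to set operations on edge lists, as the following minimal Python sketch suggests. The gene names, the dictionary layout, and the function names are hypothetical; edges are stored as unordered pairs so that (A, B) and (B, A) are treated as the same interaction.

```python
def as_edge_set(pairs):
    """Store undirected edges as frozensets so (a, b) and (b, a) coincide."""
    return {frozenset(pair) for pair in pairs}

def integrate(networks):
    """Return the one-or-more occurrence (union) and co-occurrence (intersection) edge sets."""
    edge_sets = [as_edge_set(pairs) for pairs in networks.values()]
    union = set().union(*edge_sets)
    intersection = set.intersection(*edge_sets)
    return union, intersection

# Hypothetical single-profile networks at one value of alpha.
networks = {
    "mRNA": [("A", "B"), ("B", "C")],
    "CNA": [("B", "A"), ("C", "D")],
    "METH": [("A", "B")],
}
union, intersection = integrate(networks)
```

In this toy example the union contains three edges, while only the A-B edge appears in all three profiles and therefore survives in the intersection.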

Survival analysis of identified gene pairs

Once we obtain pair-wise gene features associated with the clinical outcome, we perform the following survival analysis to validate the result. For a given pair of genes, the patients are stratified into two groups based on the feature value combination of the selected genes, as in the grouping method of Multifactor-Dimensionality Reduction (MDR) [40, 41]. We first set a threshold ρ as the ratio of the number of short-term survival patients to the total number of patients in a given dataset, which was 146/340 in our study. For each possible combination of feature values at the gene pair, we identify patients with the feature combination and examine the ratio of the number of short-term survival patients to the total number of patients among the extracted ones. Each combination of gene feature values is considered as high-risk if the ratio from the combination is above the threshold ρ, and otherwise, as low-risk. This stratifies the patients into two groups of high-risk and low-risk, based on the values of gene pairs. We then apply the log-rank test to assess the significance of the difference in survivability by the gene pair. This is performed on the identified gene pairs as well as on each gene for comparison.
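The MDR-style grouping step can be sketched as follows. This is an illustrative Python version with hypothetical names and toy data; it covers only the assignment of patients to the high-risk and low-risk groups, and the subsequent log-rank test is left out.

```python
from collections import defaultdict

def stratify_by_pair(x1, x2, short_term):
    """Label each patient 'high' or 'low' risk from the value combination of two genes.

    x1, x2: discretized feature values of the two genes, one entry per patient.
    short_term: booleans, True for patients in the short-term survival group.
    """
    rho = sum(short_term) / len(short_term)   # overall short-term ratio (threshold)
    counts = defaultdict(lambda: [0, 0])      # combination -> [n_short, n_total]
    for a, b, s in zip(x1, x2, short_term):
        counts[(a, b)][0] += int(s)
        counts[(a, b)][1] += 1
    # A combination is high-risk if its short-term ratio is above rho.
    high = {combo for combo, (n_short, n_total) in counts.items()
            if n_short / n_total > rho}
    groups = ["high" if (a, b) in high else "low" for a, b in zip(x1, x2)]
    return groups, rho

# Hypothetical toy data for six patients.
x1 = [0, 0, 1, 1, 0, 1]
x2 = [0, 1, 0, 1, 0, 1]
short_term = [True, False, False, True, True, False]
groups, rho = stratify_by_pair(x1, x2, short_term)
```

Here the combination (0, 0) has a short-term ratio of 1.0, which exceeds ρ = 0.5, so its two patients are labeled high-risk. The combination (1, 1) has a ratio of exactly 0.5, which is not above the threshold, so it is labeled low-risk.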

Network analysis

We analyzed the constructed gene networks in terms of the network topologies and then in terms of the biological functionality through functional enrichment tests. As many previous studies have revealed the scale-freeness of gene networks [2, 42–46], we examined the scale-freeness of the constructed gene networks along with other topological properties at each significance level. In a scale-free network, the distribution p(k) of the node degrees follows a power law \( p(k)\sim {k}^{-\gamma } \), where p(k) is the fraction of nodes whose degree is k. To measure the scale-freeness of a network, Zhang and Horvath [45] proposed to use the coefficient of determination \( {R}^2 \), which is the model-fitting index of the linear model that regresses log p(k) on log k. If \( {R}^2 \) is close to 1.0, the network is considered scale-free. For a network constructed from each genomic profile and for each significance level with varying parameter values of α = 0.0, 0.1, 0.5, 0.8, and 1.0, we measured the number of nodes, the number of edges, the number of connected components, the size of the largest component, and the measure of scale-freeness \( {R}^2 \).
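The scale-freeness index can be sketched in Python as an ordinary least-squares fit of log p(k) on log k. The function names are hypothetical, and the sketch uses the raw degree frequencies without any binning of degrees, which is a simplifying assumption and may differ from the procedure used in the cited work.

```python
from collections import Counter
from math import log10

def degree_counts(edges):
    """Map each degree k to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

def scale_free_r2(counts):
    """R^2 of the linear regression of log p(k) on log k (unbinned degrees)."""
    total = sum(counts.values())
    xs = [log10(k) for k in counts]
    ys = [log10(c / total) for c in counts.values()]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("need at least two distinct degrees and frequencies")
    return sxy ** 2 / (sxx * syy)

# Hypothetical degree distribution that follows p(k) ~ 1/k exactly.
r2_power_law = scale_free_r2({1: 100, 10: 10, 100: 1})

# Degree counts of a small star graph: one hub with four leaves.
star = degree_counts([("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")])
```

For the exact power-law counts the fitted line passes through all three points, so the index is 1 up to floating-point error. Real networks would give values below 1.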

We performed enrichment analysis on the obtained networks to assess common or related biological functionalities of the genes belonging to the same connected component of the network. We ran gene ontology (GO) [47] enrichment analysis for the network in Cytoscape [48] with the Biological Network Gene Ontology tool (BINGO) [49]. We used the ontology and annotation data available at http://www.geneontology.org/. We ran these analyses for the co-occurrence network, the one-or-more occurrence network, and each of the three networks constructed by using each profile separately.

MINA: mutual information based network analysis framework

We developed a tool named MINA that automates the process of identifying significant gene interactions associated with clinical outcome and of generating various networks from those pairs. Figure 1 illustrates the overall process performed inside MINA. Genomic profiles, clinical outcomes, and the model parameters (B, C, and α) are used as the input. MINA then transforms the continuous feature values that may exist in some genomic profiles (e.g., mRNA expression or methylation) and the clinical outcome into discrete values based on the parameters B (the number of bins) and C (the threshold for survival months), and calculates the mutual information value for every possible pair of genes. The tool then outputs the significant pairs of genes for a given genomic profile and the resulting networks.

Fig. 1 Illustration of MINA

MINA is written in C++ and runs on UNIX-based operating systems. We also used OpenMP (Open Multi-Processing) (http://www.openmp.org), a shared-memory parallel programming API, to speed up the overall process. For the TCGA dataset, it took about 2 to 3 h to run the entire process on a standard desktop computer. The source code for MINA is publicly available at https://github.com/hhjeong/MINA.

Results

Ethics statements

All data related to human subjects used for this study are de-identified and publicly available from The Cancer Genome Atlas project (http://cancergenome.nih.gov/). Therefore, this research is not classified as human subject research and no Institutional Review Board approval is required.

TCGA data and pre-processing

We used genomic and clinical profiles of patients with ovarian serous cystadenocarcinoma from TCGA to demonstrate our proposed method. The genomic profiles included mRNA expression (mRNA), copy number alteration (CNA), and methylation (METH). We initially focused on the genomic features of 20,642 genes in the protein-coding region of 575 patients. The clinical information for the patients was also extracted. All datasets were downloaded from cBioPortal [50, 51] (http://www.cbioportal.org) that provides convenient data acquisition tools for TCGA data. Table 1 summarizes platforms and data types used in our study. We further pre-processed the datasets to filter out genes or patients and to discretize the data as described below.

Table 1 Summary of datasets used in this study

We applied a two-step procedure to filter genes and patients. In the first step, the following three filters were applied sequentially. First, each gene with missing values across the patient group was removed from all genomic profiles. Then, each patient with all missing values for the remaining genes was removed from all profiles. Finally, each gene with a missing value in at least one of the three profiles on the remaining patients was removed. Thus, we had 10,022 protein-coding genes in common across the three profiles of mRNA expression, DNA methylation, and copy number alteration.

As our analysis employed clinical information as a binary outcome of short-term versus long-term survival, in the second filtering step we further excluded from the analysis those patients whose label assignments were ambiguous. That is, patients with no survival status, or with a survival status of living and an observed survival time of <36 months, were filtered out in the second step. As a result, we had 146 patients in the short-term group and 194 patients in the long-term group.
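The labeling and filtering rule can be sketched as a small Python function. The status strings and the function name are hypothetical. One boundary case is an assumption: a living patient observed for exactly 36 months is treated here as ambiguous and excluded, which the text does not state explicitly.

```python
def survival_label(months, status, cutoff=36):
    """Return 'short', 'long', or None when the patient is excluded as ambiguous.

    months: observed survival time in months (None if missing).
    status: 'deceased' or 'living' (hypothetical encoding; None if missing).
    """
    if months is None or status is None:
        return None                 # no survival status: excluded
    if months > cutoff:
        return "long"               # known to have survived more than the cutoff
    if status == "deceased":
        return "short"              # died within the cutoff
    # Living with follow-up not exceeding the cutoff: the final group is unknown.
    # Assumption: the boundary case months == cutoff is also excluded.
    return None

# Hypothetical patients: (observed months, status).
patients = [(12, "deceased"), (50, "living"), (20, "living"), (48, "deceased"), (30, None)]
labels = [survival_label(m, s) for m, s in patients]
```

In the toy list, the living patient with 20 months of follow-up and the patient with missing status are excluded, while the others receive a short-term or long-term label.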

The copy number alteration profile had discrete-valued features with five values of −2, −1, 0, 1, and 2, and therefore we directly used this representation from GISTIC [52] to compute mutual information. We discretized the mRNA expression and DNA methylation profiles as described in Methods, with the number of bins set to B = 5 to be consistent with the CNA profile.

Distribution of mutual information on each genomic profile

We calculated mutual information values using the original and permuted clinical outcome labels of patients, for every pair of genes on each genomic profile in the TCGA datasets. Figure 2 shows the empirical distribution of mutual information computed on each real profile (mRNA, CNA, METH) used in this study. The solid lines are with respect to the original clinical outcome labels, and the dotted lines are with respect to the permuted labels averaged over 30 runs. The averaged mutual information values obtained from the permuted labels did not exceed 0.0763, 0.0664, and 0.0782 on the mRNA, CNA, and methylation profiles, respectively. Therefore, we set these numbers as the threshold mutual information θ for each profile separately. A pair of genes with mutual information above this threshold was considered to be associated with clinical outcome.

Fig. 2 Empirical distribution of mutual information values. We show the distribution of mutual information values computed for every pair of genes in each profile of mRNA expression (red), CNA (blue) and methylation (yellow). The solid lines correspond to the values with respect to the original clinical outcome labels, and the dotted lines are with respect to the permuted labels averaged over 30 permutations

Gene interactions associated with clinical outcome occur more typically with respect to mRNA expression or copy number alteration levels, but less so with respect to methylation levels. The mRNA expression profile produced the highest number of gene pairs (2,562,178). The CNA profile was second with 2,472,048 pairs, and the methylation profile had far fewer interactions with 554,048 gene pairs (Table 2). This corresponds to about 1–5 % of all pairs of genes (i.e., out of approximately \( 5\times {10}^7 \) pairs). When we increase the significance level by setting the threshold as θ × (1 + α) and varying α = 0.0, 0.1, 0.5, 0.8 and 1.0, the number of remaining edges (or gene pairs) decreases substantially. For example, when α = 0.5, the numbers of gene pairs are 20,219, 23,143, and 3,641 for the mRNA expression, CNA, and methylation profiles, respectively. The overall result is summarized in Table 2.
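The proportions quoted above can be checked with a short calculation, shown here as a Python sketch. It assumes that the total number of pairs is the number of unordered pairs among the 10,022 genes retained after filtering; the variable names are hypothetical.

```python
from math import comb

n_genes = 10022                      # genes retained after filtering
total_pairs = comb(n_genes, 2)       # unordered gene pairs, about 5 x 10^7

# Numbers of outcome-associated gene pairs at alpha = 0.0, as reported in the text.
significant = {"mRNA": 2562178, "CNA": 2472048, "METH": 554048}

# Percentage of all gene pairs that pass the threshold in each profile.
percent = {name: round(100 * n / total_pairs, 1) for name, n in significant.items()}
```

Under this assumption there are 50,215,231 gene pairs in total, and the three profiles retain roughly 5.1 %, 4.9 %, and 1.1 % of them, in line with the stated range of about 1–5 %.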

Table 2 Threshold mutual information on each genomic profile

Survival analysis of selected pair-wise genes

We validated the significance of identified gene interaction effects on clinical outcome by applying the survival analysis described in Methods. Table 3 shows the results of the log-rank test applied to the top 10 gene pairs from each genomic profile. All of the top 10 gene pairs induced a significant difference in survival, with p-values ranging from \( 1.67\times {10}^{-3} \) to \( 5.08\times {10}^{-7} \) across different profiles. In Fig. 3, the Kaplan-Meier survival curve of the gene pair that has the highest mutual information is shown for each profile, along with the ones derived by each single gene. The top pair of genes from the mRNA expression profile was MYO3A, a previously identified cancer gene [53], and SWI5, a recombination repair homolog. The p-value from the log-rank test for survival difference according to the gene pair was \( 6.62\times {10}^{-5} \), while each single gene produced p-values of 0.02 (MYO3A) and 0.4 (SWI5). In the case of the CNA profile, the top pair was SNRPB2 and WSB2, both cancer genes documented in COSMIC [54], with a p-value of \( 1.21\times {10}^{-4} \), whereas the p-values based on each gene separately were 0.08 and 0.3, respectively.

Table 3 Top 10 gene pairs for each genomic profile
Fig. 3 Kaplan-Meier survival plots of the gene pair with the highest mutual information value for each single profile. We show the Kaplan-Meier survival curve of the gene pair having the highest mutual information along with the ones derived by each single gene

For a more comprehensive analysis, we ran the survival analysis for all the extracted gene pairs obtained from four different significance levels of α = 0.0, 0.5, 0.8 and 1.0. The distribution of the resulting p-values is shown in Fig. 4 as a box plot. For comparison, we also included the box plots of the p-values for each single gene in the identified gene pairs. Overall, the association significance was substantially stronger in the case of gene pairs than in single genes, across different profiles and parameter settings. This means that there are many genes that individually have weak or no effects, but have a strong interaction effect on clinical outcome. Moreover, at each value of α, the most significant p-value becomes much larger, that is, −log(p-value) becomes much smaller, when we consider the single genes separately in the case of the mRNA and CNA profiles. The methylation profile behaved differently in that the top p-value at α = 0.0 was very similar in both the pairwise and single analyses. It appears that the gene-gene interaction at the methylation level is not as prominent as in other profiles, and the top interaction effects are largely based on the marginal effects of single genes.

Fig. 4 Boxplots for p-values from survival analysis. The distribution of p-values from the survival analysis for the extracted gene pairs obtained from different significance levels of α is shown as a boxplot

Outcome-guided mutual information gene networks

We constructed outcome-guided mutual information gene networks by considering genes as nodes, and connecting two gene nodes if their combination was significantly associated with clinical outcome. For a network constructed from each genomic profile and also for each significance level with varying parameter values of α = 0.0, 0.1, 0.5, 0.8, and 1.0, we measured the number of nodes, the number of edges, the number of connected components, the size of the largest component, and the measure of scale-freeness \( {R}^2 \) (Table 4).

Table 4 Network Topologies for different α values

Overall, networks based on mRNA expression and CNA profiles tended to have a larger value of \( {R}^2 \) as α increased, with the maximum at α = 0.8. The networks based on the methylation profile tended to have smaller \( {R}^2 \) as α increased. We then examined the one-or-more occurrence network \( {I}^{\cup } \) and the co-occurrence network \( {I}^{\cap } \) at each setting. The number of gene interactions appearing across all three profiles was relatively small. For example, at α = 0.1, the number of edges in the co-occurrence network \( {I}^{\cap } \) was only 95, while the one-or-more occurrence network \( {I}^{\cup } \) at the same significance level had more than 2 million edges. There was no common edge across all of the profiles at α = 0.5 or higher. Also, we did not find a shared edge between any pair of profiles at α = 0.8 or higher.

Interestingly, the integrated network, obtained by taking either the intersection or the union of edges, appeared to have a significantly enhanced scale-freeness. The co-occurrence network \( {I}_{0.1}^{\cap } \) had the highest \( {R}^2 \) value of 0.950, and the one-or-more occurrence network \( {I}_{0.8}^{\cup } \) had the second highest \( {R}^2 \) value of 0.913. This may suggest that integrated networks are more effective in identifying functional gene modules across multiple molecular levels than networks constructed by using each profile separately. We selected these two networks for further analysis. The graphical representations of the selected intersection network and union network are shown in Fig. 5 and Fig. 6, respectively.

Fig. 5 Co-occurrence network \( {I}_{0.1}^{\cap } \) of whole genomic profiles

Fig. 6 One-or-more occurrence network \( {I}_{0.8}^{\cup } \) of whole genomic profiles

We performed gene ontology (GO) enrichment analysis to assess common or related biological functions of the genes belonging to the same connected component of the constructed network. We ran the analysis for each of the three networks based on mRNA, CNA, and methylation profiles, and for their one-or-more occurrence network at α = 0.8. The co-occurrence network was analyzed at α = 0.1 because of its superior scale-freeness and because the co-occurrence network becomes too sparse at higher significance levels.

We first compared the number of enriched GO terms from each constructed network (Fig. 7). The mRNA profile revealed the greatest number of significant terms among the single networks, which was expected. There was no shared GO term between the CNA and methylation profiles, which may suggest distinct functional roles for each profile on clinical outcome. The one-or-more occurrence network \( {I}_{0.8}^{\cup } \) yielded the greatest number of enriched GO terms, with 62 additional BP (Biological Process), 21 CC (Cellular Component), and 11 MF (Molecular Function) terms that were not found in the networks constructed from any single genomic profile. Therefore, the integration of networks may provide a better insight into the gene interaction landscape associated with clinical outcome.

Fig. 7 Four-way Venn diagram summarizing the number of shared and unique GO terms enriched in the network from each profile

We further investigated the genes in the largest component of the one-or-more occurrence network \( {I}_{0.8}^{\cup } \), which were enriched with 176 GO terms (112 BP, 42 CC, and 22 MF terms). The five most significant GO terms in the largest component were poly(A) RNA binding (GO:0044822), nucleoplasm (GO:0005654), extracellular vesicular exosome (GO:0070062), apoptotic process (GO:0006915), and protein ubiquitination (GO:0016567). These GO terms are closely related to ovarian cancer, based on previous studies. For example, apoptotic process is a cell death term, and Jäättelä reported that defects in apoptotic signaling pathways are common in cancer cells [55]. In addition, protein ubiquitination is a highly relevant term, as ubiquitin-mediated proteins have an important role in the mutation of a target oncogene [56]. Table 5 summarizes the significantly enriched GO terms with the corresponding p-values for the largest connected component of \( {I}_{0.8}^{\cup } \). To present more specific functionality, when multiple terms along the same path from the root of the directed acyclic graph are found to be significant, we show only the term at the lowest level for each GO category.

Table 5 Significantly enriched GO terms in the largest component of \( {I}_{0.8}^{\cup } \)

We also found that the major hub genes of the one-or-more occurrence network \( {I}_{0.8}^{\cup } \) are related to ovarian cancer-related pathways. For example, Cytohesin 3 (CYTH3), the first hub having the largest number of neighbors in the network, is involved in the PI3K pathway (M14532) in MSigDB [57]. This pathway is a common drug target in human cancer, including ovarian cancer [58, 59]. Furthermore, Minichromosome maintenance complex component 3 (MCM3), the third hub, is included in the cell cycle pathway (hsa04110) [60], which is important to cancer research because alterations in this mechanism characterize the abnormal proliferation of human malignant tumors [61]. Previous research also reported cell cycle arrest in the G2/M phase via the blockade of cyclin B1/CDC2 in human ovarian cancer cells [62]. From this observation, we presume that interactions of major hub genes with connected neighbors can play an important role in determining the overall survival of ovarian cancer patients.

For the co-occurrence network \( {I}_{0.1}^{\cap } \), many BP terms were discovered in the largest connected component, but none from the CC or MF categories. Table 6 shows the most significant GO terms for the largest connected component of the co-occurrence network. The five most significant GO terms were hemopoiesis (GO:0030097), immune system development (GO:0002520), aging (GO:0007568), T cell differentiation (GO:0030217) and positive regulation of apoptotic process (GO:0043065). Immune system development and T cell differentiation are terms corresponding to the immune system, which has a significant role in cancer development and progression [63]. Positive regulation of apoptotic process is a cell death term, and is enriched in genes regulated by Ubiquitin carboxyl terminal hydrolase 1 (UCHL1) [64], which is a putative tumor suppressor in ovarian cancer. The hub genes also have known roles in cancer progression. For example, the top hub gene in the network was ST6GALNAC1, which is known to have an important role in ovarian cancer [65].

Table 6 Significantly enriched GO terms in the largest component of \( {I}_{0.1}^{\cap } \)

Discussions

We have proposed a new network-based analysis framework to detect gene pairs associated with the clinical outcome and to analyze the resulting networks systematically. Our survival analysis showed that there are a large number of gene pairs that are significantly associated with survival in ovarian cancer in which each single gene has very weak or no association. From the integration of the profiles, we also showed that networks constructed by combining information across different genomic profiles had better scale-freeness and revealed more biological significance than a network that was constructed by using only one genomic profile.

In our analysis, the co-occurrence network consisted of interactions of moderate strength in the single genomic profiles, but integration of the interactions revealed high biological significance in terms of GO BP terms. In contrast to the co-occurrence network \( {I}_{0.1}^{\cap } \), the one-or-more occurrence network \( {I}_{0.8}^{\cup } \) consisted of stronger interactions for each genomic profile, and significant CC and MF terms were enriched. Interestingly, networks from interactions with high association strength at each profile did not have any shared edges. We also found that sub-networks in \( {I}_{0.8}^{\cup } \) that were connected by interactions of mRNA and methylation had many hubs connected to many peripheral nodes, whereas sub-networks from CNA had a tendency to interconnect genes without any dominant hub gene structure.

In this study, we took a simple network integration scheme, which showed enhanced network properties despite its simplicity. A more complicated network integration scheme may be employed in our future analyses, such as that used in similarity network fusion using multiple genomic datasets [15]. In addition, we plan to investigate the detection power and robustness of the proposed method through extensive simulation studies and real data experiments. Another extension is the application of the integrative network to the network-based Cox-regression method using heterogeneous types of data. We expect that this application would enhance the prediction power and help to understand the complex interaction between different types of genomic profiles for the survivability of cancer patients.

Conclusions

In this paper, we have proposed a simple but powerful method to detect gene pairs that are associated with the clinical outcome. By being network-based, our approach could provide a better insight into the underlying gene-gene interaction mechanisms that affect the clinical outcome of cancer patients.