Background

Cancer genes are involved in the dysfunction of a wide range of cellular functions including cell proliferation, angiogenesis, tumor invasion, DNA repair, chromosome stability, cell–cell communication, cell–matrix interactions, motility, metastasis, and apoptosis [1]. Much of recent cancer research has been devoted to identifying genes related to cancer initiation and progression computationally, and many different types of approaches have been suggested to this end. A comprehensive recent survey on computational approaches for the identification of cancer genes and pathways has been provided in [2].

One possible categorization of the computational approaches for cancer gene identification is based on the data they employ. Those employing mutation data to extract candidate cancer genes presuppose that driver genes can be identified through a thorough examination of recurrent mutations, whose observed frequency in a large cohort of cancer patients is much higher than expected. However, the alterations of alternative driver genes usually show a significantly low overlap, giving rise to what is known as mutual exclusivity. Several approaches relying on mutation data have thus developed specialized techniques to deal with the issue of exclusivity [3–7]. A second class of approaches consists of those employing gene expression data in the form of expression profiling, gene coexpression, or differential expression analysis [1, 8–10].

Recent integrative approaches employ one or both types of expression and mutation data together with interaction network data in the form of genetic or protein-protein interactions (PPI) [11–14]. Approaches combining gene expression data with the relevant interaction data in the context of long non-coding RNAs (lncRNA) have shown promising results in identifying lncRNA-disease associations [15–19]. In particular, the interactome has demonstrated its usefulness in explaining the observed patterns of mutations in both healthy and diseased individuals [20]. Rather than identifying a set of cancer-related genes, the goal of the integrative computational approaches is usually to extract modules deemed central to the cancer. HotNet2 employs a random walk on the PPI network that distributes the mutation frequencies of genes throughout the network, giving rise to a directed graph whose strongly connected components represent the output modules [21]. MEMCover combines mutual exclusivity data of mutations across several tissue types with the PPI network data to produce modules of cancer genes [22]. Although potentially useful for pan-cancer analysis, such approaches have limited use for specific cancer types where a relatively small number of samples does not provide adequate information in the form of mutual exclusivity of the mutations. Furthermore, they focus on the discovery of cancer modules rather than on prioritizing individual genes as cancer drivers. By contrast, a recent cancer gene prioritization method, MUFFINN, applies a network-centric analysis of mutation data, thereby integrating mutational information for individual genes and their neighbors in functional/interaction networks. It is suggested that MUFFINN’s cancer gene prioritization performs well even when only data from a limited number of samples is employed [23].

We employ mutation data, gene expression data, and network data in the form of PPI networks to identify individual driver genes related to breast cancer. The general framework consists of a comparative analysis of graph-theoretical measures. It is based on the differential identification of breast cancer genes via a pairwise comparison of the values attained for a specific graph-theoretical measure applied on a normal and a tumor tissue sample, over all available samples. Although recent studies have compared normal and tumor samples with regard to changes in genetic data, including mRNA expression, miRNA expression, or methylation alterations, our study extends these approaches by introducing a network aspect and several common graph centrality measures into the comparison [24–26]. We note that graph centralities have been employed in the context of identifying breast cancer genes in the past [27]. Such an approach has been revisited recently, and an extension employing two different machine learning classifiers on computed centrality scores has been suggested [28]. However, rather than incorporating gene expression and mutation data, as is done in our study, these approaches are limited to gene signatures; a set of centrality measures is applied to PPI networks limited to genes already known to be related to breast cancer, to assign a degree of importance. Furthermore, our framework, which involves a comparative analysis of network centralities in pairs of graphs generated from normal and tumor tissue samples, introduces a novelty that enables a differential analysis of genes involved in breast cancer.

Methods

We summarize the overall methodology in Fig. 1. The three main components are data preparation, algorithmic computations, and analysis and evaluation of results. Data preparation involves the necessary preprocessing of gene expression, mutation, and network data. This is followed by the algorithmic computations step involving several graph-theoretical measures. The output, consisting of lists ordering genes with respect to their degrees of involvement in breast cancer, is evaluated in the final step. This involves ROC and precision/recall analysis against two golden standard databases, COSMIC and NCBI BioSystems, and a gene ontology analysis that employs the GO database in addition to these two golden standard datasets. The output list of the best performing measure is further filtered, and its top genes are reviewed in detail through literature verification.

Fig. 1

Flowchart summarizing the overall methodology. The first step, depicted in part a, consists of data processing and the necessary filtering of the input databases TCGA and IntAct. The second step, depicted in part b, involves the generation of pairs of normal/tumor graphs based on expression, mutation, and interaction data. Measures based on graph centralities are applied to the resulting graphs. Ten lists of genes, eight from centrality measures and two from control measures, each ordering genes with respect to their computed weights, are provided as output. The final step, depicted in part c, consists of analyzing the ten lists with regard to ROC, precision/recall (P/R), and GO consistency (GOC). Two datasets, NCBI BioSystems [37] and COSMIC [38], are employed in all three analyses, whereas the GOC analysis additionally employs the GO database [39]. Among all tested centrality-based measures, \(M_{bw}\) provides the best performance in all three analyses. The \(M_{bw}\) list is analyzed in more detail by filtering it based on a maximum weight independent set (MWIS) formulation, and the top genes from the resulting filtration go through a final literature verification step. a Data preparation, b Algorithmic computations, c Analysis and evaluation

Input data sets and data preparation

We gather the breast cancer data from The Cancer Genome Atlas Project (TCGA). There are 99 instances; each instance contains the expression levels of genes in the normal and tumor tissue samples of a patient, together with the relevant mutation information for the tumor sample. For gene expression, we consider the RPKM (reads per kilobase per million mapped reads) normalization, which includes a gene length normalization of RNA-seq data, and apply a threshold of 1 to designate a gene as expressed. All somatic mutations other than those marked as silent are taken into account. In addition, we employ the H. sapiens protein-protein interaction network from the October 2016 version of the IntAct database [29]. The PPI network is filtered so that both interactors of each pair are proteins and each interaction is a physical interaction.

Graph-theoretical framework

Let H be the H. sapiens PPI network. Employing the TCGA data, for each instance i of the available 99 instances, we create a pair of graphs, \(N_{i}, T_{i}\), corresponding to the normal and tumor graphs, respectively. The graph \(N_{i}\) is the subgraph of H induced by the node set corresponding to the set of genes expressed in the normal sample of instance i, whereas \(T_{i}\) is the subgraph induced by the genes that are expressed and non-mutated in the tumor sample of the same instance i.
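
The construction of a normal/tumor graph pair can be illustrated with a minimal Python sketch. It is not the implementation used in the study; the function names, the dictionary-of-sets graph representation, and the choice of "greater than or equal to" for the RPKM threshold are assumptions made for illustration.

```python
def induced_subgraph(adj, keep):
    """Return the subgraph of `adj` induced by the node set `keep`."""
    keep = set(keep) & set(adj)
    return {u: adj[u] & keep for u in keep}


def build_pair(ppi, expr_normal, expr_tumor, mutated, threshold=1.0):
    """Build the pair (N_i, T_i) for a single instance.

    ppi         -- dict: gene -> set of interacting genes (the network H)
    expr_normal -- dict: gene -> RPKM value in the normal sample
    expr_tumor  -- dict: gene -> RPKM value in the tumor sample
    mutated     -- genes with a non-silent somatic mutation in the tumor
    threshold   -- RPKM cutoff; using >= here is an assumption
    """
    expressed_n = {g for g, x in expr_normal.items() if x >= threshold}
    expressed_t = {g for g, x in expr_tumor.items() if x >= threshold}
    normal = induced_subgraph(ppi, expressed_n)
    # Tumor graph: expressed AND non-mutated genes only.
    tumor = induced_subgraph(ppi, expressed_t - set(mutated))
    return normal, tumor
```

A mutated gene thus disappears from the tumor graph together with all of its incident edges, which is what the measures defined next pick up on.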

Let P be a list of pairs of graphs such that \(P=(N_{1},T_{1}),\ldots,(N_{r},T_{r})\), where \(N_{i}\) and \(T_{i}\) correspond respectively to the normal and tumor graphs of instance i. Let \(\mathcal {V} = V_{N_{1}}\cup \ldots \cup V_{N_{r}}\cup V_{T_{1}}\cup \ldots \cup V_{T_{r}}\), where \(V_{G}\) denotes the node set of a graph G. A measure \(M_{x}\) is a function defined on P that orders the nodes in \(\mathcal {V}\) according to some graph-theoretical property x. The performance of a measure depends on how well the position of each gene in this ordering matches its relevance to the cancer under study. The measures we consider are based on the following graph-theoretical properties commonly employed in network analysis studies: betweenness centrality, random walk distances, graph-theoretical distances, clustering coefficient, degree centrality, and Jaccard indices. All of these measures are defined on the nodes of a graph. According to the traditional classification of graph-theoretical properties, the first three are global measures, whereas the last three are local measures. A global measure defined on a node is a function of the whole graph, whereas a local measure defined on a node is usually a function of some locality centered around the node. For the purposes of this study, we introduce a novel classification, that of unlabeled versus labeled measures. A measure of the former type on a node considers the rest of the graph as unlabeled; the topology of the network matters but not the relationships between specific node pairs. For the latter, the node labels are important as well as the network topology. The betweenness centrality, the clustering coefficient, and the degree centrality are unlabeled measures, whereas the random walk distance, the graph-theoretical distance, and the Jaccard index based neighborhood overlaps are labeled measures.

Once an ordering of the nodes with respect to a measure is determined, we apply a filtering based on maximum weight independent sets (MWIS) to select a subset of crucial nodes deemed important for the cancer under study.

Unlabeled graph-theoretical measures

In what follows we provide detailed descriptions of the employed measures. For each measure we provide a node weight assignment scheme, which defines the ordering of the measure. For the following let G=(V,E) be an undirected graph where V denotes the node set and E denotes the edge set of the graph G. We first provide the definitions of four unlabeled graph-theoretical measures.

\(M_{bw}\): This measure is based on the betweenness centrality. Given G=(V,E), the betweenness of a node \(v\in V\) is defined as \(bw_{G}(v)=\sum _{\forall s,t\in V, s\neq v\neq t}\frac {\sigma _{st}(v)}{\sigma _{st}}\), where \(\sigma _{st}\) is the number of shortest paths between nodes s,t and \(\sigma _{st}(v)\) is the number of such paths that go through the node v. This value is multiplied by \(\frac {2}{(|V|-1)(|V|-2)}\) for normalization. Note that for a node \(v\notin V\), \(bw_{G}(v)=0\) trivially. Our first measure \(M_{bw}\) sorts the nodes of \(\mathcal {V}\) in non-increasing order of the node weight function \(W_{bw}\), defined for a node v as,

$$ W_{bw}(v) = \sum_{\forall (N_{i}, T_{i})\in P}\left|bw_{N_{i}}(v)-bw_{T_{i}}(v)\right| $$
(1)
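
As an illustration of Eq. 1, the following Python sketch computes normalized betweenness with Brandes' algorithm and accumulates the absolute normal/tumor differences. It is a sketch under stated assumptions rather than the code used in the study: graphs are dictionaries mapping each node to its neighbor set, the sum in the betweenness definition is taken over unordered pairs s,t, and the helper names are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Normalized betweenness of every node (unweighted, undirected graph)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, dist = [], {s: 0}
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest s-v paths
        sigma[s] = 1
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                    # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Every unordered pair was visited twice, so halving the sum and then
    # scaling by 2/((n-1)(n-2)) amounts to dividing by (n-1)(n-2).
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 0.0
    return {v: b * scale for v, b in bc.items()}


def w_bw(pairs):
    """W_bw(v): sum over all (N_i, T_i) of |bw_N(v) - bw_T(v)|."""
    nodes = set()
    for normal, tumor in pairs:
        nodes |= set(normal) | set(tumor)
    weights = dict.fromkeys(nodes, 0.0)
    for normal, tumor in pairs:
        bn, bt = betweenness(normal), betweenness(tumor)
        for v in nodes:
            # A node absent from a graph has betweenness 0 there.
            weights[v] += abs(bn.get(v, 0.0) - bt.get(v, 0.0))
    return weights
```

Sorting the resulting dictionary by decreasing weight yields the ordering of \(M_{bw}\).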

\(M_{cc}\): This measure is based on the clustering coefficient. For a node v in a graph G=(V,E), the clustering coefficient of v, \(cc_{G}(v)\), is defined as \(2|C|/(deg_{G}(v)(deg_{G}(v)-1))\), where C is the set \(\{(s,t)\in E: (v,s)\in E, (v,t)\in E\}\). We note that for a node \(v\notin V\), \(cc_{G}(v)=0\) trivially. The measure \(M_{cc}\) sorts the nodes of \(\mathcal {V}\) in non-increasing order of the weight function \(W_{cc}\), defined for a node v as,

$$ W_{cc}(v) = \sum_{\forall (N_{i}, T_{i})\in P}\left|cc_{N_{i}}(v)-cc_{T_{i}}(v)\right| $$
(2)
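
A minimal Python sketch of Eq. 2 follows. The graph representation (node to neighbor-set dictionary), the function names, and the convention that a node of degree below 2 has clustering coefficient 0 (the formula is otherwise undefined there) are assumptions; the graphs are assumed to have no self-loops.

```python
def clustering(adj, v):
    """Clustering coefficient of v; 0 if v is absent or has degree < 2."""
    if v not in adj:
        return 0.0
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0          # assumption: undefined case treated as 0
    # Each edge among the neighbors is seen from both endpoints.
    links = sum(1 for s in nbrs for t in adj[s] if t in nbrs) // 2
    return 2.0 * links / (d * (d - 1))


def w_cc(pairs):
    """W_cc(v): sum over all (N_i, T_i) of |cc_N(v) - cc_T(v)|."""
    nodes = set()
    for normal, tumor in pairs:
        nodes |= set(normal) | set(tumor)
    return {
        v: sum(abs(clustering(n, v) - clustering(t, v)) for n, t in pairs)
        for v in nodes
    }
```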

\(M_{deg1}\), \(M_{deg2}\): These measures are based on the degree centrality. Let \(Ne_{G}(v)\) denote the set of neighbors of v in G and let \(Ne_{G}^{2}(v)\) denote the set consisting of \(Ne_{G}(v)\) together with the neighbors of all nodes in \(Ne_{G}(v)\). The measure \(M_{deg1}\) sorts the nodes of \(\mathcal {V}\) in non-increasing order of the node weight, defined for a node v as,

$$ W_{deg1}(v)=\sum_{\forall (N_{i}, T_{i})\in P}||Ne_{N_{i}}(v)|-|Ne_{T_{i}}(v)|| $$
(3)

whereas the measure \(M_{deg2}\) employs the weighting defined as,

$$ W_{deg2}(v)=\sum_{\forall (N_{i}, T_{i})\in P}\left|\left|Ne_{N_{i}}^{2}(v)\right|-\left|Ne_{T_{i}}^{2}(v)\right|\right| $$
(4)
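
Equations 3 and 4 differ only in the neighborhood they count, which the following Python sketch makes explicit. The names are hypothetical, and the two-hop set is taken literally from the definition above, so it contains v itself whenever v has at least one neighbor; whether the original implementation excludes v is not stated.

```python
def ne(adj, v):
    """Ne_G(v): direct neighbors of v (empty if v is not in the graph)."""
    return set(adj.get(v, ()))


def ne2(adj, v):
    """Ne^2_G(v): Ne_G(v) plus the neighbors of every node in Ne_G(v)."""
    first = ne(adj, v)
    reach = set(first)
    for s in first:
        reach |= adj[s]     # literal reading: this may add v itself
    return reach


def degree_weights(pairs, neighborhood=ne):
    """W_deg1 with neighborhood=ne, W_deg2 with neighborhood=ne2."""
    nodes = set()
    for normal, tumor in pairs:
        nodes |= set(normal) | set(tumor)
    return {
        v: sum(abs(len(neighborhood(n, v)) - len(neighborhood(t, v)))
               for n, t in pairs)
        for v in nodes
    }
```

For a four-node path A-B-C-D whose tumor graph loses C, `degree_weights` assigns C the largest weight under both neighborhoods, followed by its former neighbors.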

Labeled graph-theoretical measures

We provide the definitions of four labeled graph-theoretical measures.

\(M_{rw}\): We employ proximity matrices based on random walks on the networks for this measure.

We note that similar methods have been employed in many previous PPI network analysis studies [30–32]. Let \(Ne_{G}^+(v)=Ne_{G}(v)\cup \{v\}\). Assuming the origin of the walk is node u, let \(Pr_{G}'[u,v]\) denote the probability that the random walker is at node v after a certain number of time steps and \(Pr_{G}[u,v]\) denote the same probability after one more time step. Initially \(Pr_{G}'[u,u]=1\) and \(Pr_{G}'[u,v]=0\) for \(v\neq u\). \(Pr_{G}[u,v]\) is computed from \(Pr_{G}'[u,s]\) for \(s\in Ne_{G}^{+}(v)\). The contribution of a neighbor s of v to \(Pr_{G}[u,v]\) is \(\frac {Pr_G'[u,s]}{|Ne_{G}(s)|+1}\). A small constant ε is subtracted from this contribution to increase the chances of the walker remaining close to the origin. Each probability is normalized by dividing it by \(\sum _{v\in V}Pr_{G}[u,v]\). The procedure is repeated until the sum of the differences between the probabilities and those of the previous time step does not exceed a predefined constant threshold. \(Pr_{G}[p,q]=0\) trivially, if \(p\notin G\) or \(q\notin G\). The measure \(M_{rw}\) based on random walk distances sorts the nodes of \(\mathcal {V}\) in non-decreasing order of the node weight \(W_{rw}\), defined for a node v as,

$$ W_{rw}(v) = \sum_{\forall (N_{i}, T_{i})\in P}PCC\left(Pr_{N_{i}}[-,v],Pr_{T_{i}}[-,v]\right) $$
(5)

where \(Pr_{G}[-,v]\) denotes the column vector corresponding to v in the random walk-based proximity matrix \(Pr_{G}\) and \(PCC(x,y)\) denotes the Pearson correlation coefficient of the vectors x,y.

\(M_{gt}\): Our next measure \(M_{gt}\) is based on graph-theoretical distances and is defined in exactly the same way as the previous measure \(M_{rw}\), except that now an entry \(Pr_{G}[u,v]\) of the proximity matrix \(Pr_{G}\) is the graph-theoretical distance between nodes u,v in G, that is, the length of the shortest path between u,v.
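
The distance-based variant can be sketched in Python as below. Several details are not specified in the text and are assumptions here: distances to unreachable or absent nodes are set to 0 (in line with the convention for nodes outside the graph), the column vectors are indexed by all nodes of \(\mathcal {V}\) in a fixed order, and the Pearson correlation is taken to be 0 when either vector is constant. Function names are hypothetical.

```python
from collections import deque
from math import sqrt


def bfs_distances(adj, source):
    """Shortest-path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def pearson(x, y):
    """Pearson correlation; 0.0 if either vector is constant (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def w_gt(pairs):
    """Sum over all (N_i, T_i) of the PCC between distance columns of v."""
    nodes = sorted(set().union(*[set(g) for pair in pairs for g in pair]))
    weights = dict.fromkeys(nodes, 0.0)
    for normal, tumor in pairs:
        for v in nodes:
            dn = bfs_distances(normal, v) if v in normal else {}
            dt = bfs_distances(tumor, v) if v in tumor else {}
            col_n = [dn.get(u, 0) for u in nodes]   # 0 for missing entries
            col_t = [dt.get(u, 0) for u in nodes]
            weights[v] += pearson(col_n, col_t)
    return weights
```

Because a high correlation means that the distance profile of a gene is preserved between the normal and tumor graphs, the ordering is non-decreasing: the least correlated genes come first.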

\(M_{j1}\), \(M_{j2}\): We define two measures based on Jaccard indices with respect to neighborhood overlaps. The measure \(M_{j1}\) sorts the nodes of \(\mathcal {V}\) in non-decreasing order of the node weight, defined for a node v as,

$$ W_{j1}(v) = \sum_{\forall \left(N_{i}, T_{i}\right)\in P}{\frac{\left|Ne_{N_{i}}(v)\cap Ne_{T_{i}}(v)\right|}{\left|Ne_{N_{i}}(v)\cup Ne_{T_{i}}(v)\right|}} $$
(6)

whereas the measure \(M_{j2}\) employs the weighting defined as,

$$ W_{j2}(v) = \sum_{\forall (N_{i}, T_{i})\in P}{\frac{\left|Ne_{N_{i}}^{2}(v)\cap Ne_{T_{i}}^{2}(v)\right|}{\left|Ne_{N_{i}}^{2}(v)\cup Ne_{T_{i}}^{2}(v)\right|}} $$
(7)
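
Equations 6 and 7 can be sketched in Python as follows. The names are hypothetical, and one case is left open by the formulas: when a node has an empty neighborhood in both graphs, the ratio is 0/0. The sketch treats this as an overlap of 1 (identical neighborhoods), which is an assumption.

```python
def neighborhood(adj, v, radius=1):
    """Ne_G(v) for radius 1, Ne^2_G(v) for radius 2 (empty if v absent)."""
    first = set(adj.get(v, ()))
    if radius == 1:
        return first
    reach = set(first)
    for s in first:
        reach |= adj[s]
    return reach


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|; 1.0 when both sets are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def w_jaccard(pairs, radius=1):
    """W_j1 (radius=1) or W_j2 (radius=2), summed over all (N_i, T_i)."""
    nodes = set()
    for normal, tumor in pairs:
        nodes |= set(normal) | set(tumor)
    return {
        v: sum(jaccard(neighborhood(n, v, radius), neighborhood(t, v, radius))
               for n, t in pairs)
        for v in nodes
    }
```

Unlike the degree-based weights, these weights depend on which neighbors are retained, not only on how many, which is what makes the measure labeled.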

Filtering based on maximum weight independent sets

The graph-theoretical measures of the previous subsections provide a node weight assignment scheme in which the weight of a node represents the importance of the corresponding protein for the cancer under study. However, due to the network influence-based nature of some of these measures, they may be susceptible to guilt by association; a node may end up with a large weight designating it a crucial protein only because some of its neighbors have large weights. This is especially evident in measures based on betweenness centrality, random walks, or graph-theoretical distances, as the weight of a node is dependent on the weights of its neighbors in the PPI network. In order to alleviate this issue and produce only a small set of crucial proteins, we apply a filtering on the node-weighted PPI network. The network consists of all the proteins involved in all normal and tumor instances under study, and the node weights are those resulting from applying one of the mentioned graph-theoretical measures. Given a node-weighted graph G, the maximum weight independent set (MWIS) of G is the set of nodes with maximum total weight such that no two nodes are neighbors in G. We note that the computational problem is NP-complete [33]. Several greedy heuristics have been investigated in [34]. The GWMIN2 heuristic, which selects the node u in the conflict graph \(\mathcal {C}\) that maximizes \(\mathcal {W}(u) / \sum _{v\in N_{\mathcal {C}}^{+}(u)}{\mathcal {W}(v)}\), where \(N_{\mathcal {C}}^{+}(u)\) denotes the neighborhood of u in \(\mathcal {C}\) together with the node u itself, provides better results than the rest of the known heuristics [35]. Furthermore, it provides a theoretical guarantee that the weight of the output independent set is at least \(\sum _{u\in V_{\mathcal {C}}} \left [{\mathcal {W}(u)}^{2} \left / \sum _{v\in N_{\mathcal {C}}^{+}(u)}{\mathcal {W}(v)}\right.\right ]\), where \(V_{\mathcal {C}}\) denotes the vertex set of the conflict graph \(\mathcal {C}\). Therefore, the filtration step is implemented via the GWMIN2 heuristic for the MWIS problem.
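
The greedy selection rule can be sketched in Python as below. This is an illustrative version, not the code used in the study: it assumes the usual greedy scheme in which the selected node and its neighbors are removed before the next selection, breaks ties by node name, and uses a hypothetical function name and a node to neighbor-set dictionary for the graph.

```python
def gwmin2(adj, weight):
    """Greedy MWIS heuristic: repeatedly pick the node u maximizing
    W(u) / sum of W over u and its remaining neighbors, then delete
    u and its neighbors from the graph."""
    remaining = {u: set(nbrs) for u, nbrs in adj.items()}
    chosen = []

    def ratio(u):
        total = weight[u] + sum(weight[v] for v in remaining[u])
        return weight[u] / total if total > 0 else 0.0

    while remaining:
        # sorted() makes tie-breaking deterministic (assumption).
        best = max(sorted(remaining), key=ratio)
        chosen.append(best)
        removed = remaining[best] | {best}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return chosen
```

On a three-node path with weights 1, 3, 1 the heuristic keeps only the middle node (ratio 3/5 against 1/4 for the end nodes), so a heavy node suppresses its lighter neighbors, which is the intended filtering effect.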

Results and discussion

We implemented the described measures in C++ using the LEDA library [36]. We show that in determining the quality of a graph-theoretical measure for identifying genes related to breast cancer, the labeled/unlabeled classification is more important than the traditional local/global classification of the measures. Furthermore, we show that under this classification, the unlabeled measures perform better than the labeled measures in extracting breast cancer genes via comparison of normal/tumor network instance pairs, contrary to the intuition that the latter employ more information in the form of labeled networks. Our evaluations indicate that the measure based on betweenness centrality is the best performer in terms of differential identification of breast cancer genes across all normal/tumor samples.

Evaluations with respect to known cancer databases

Comparing against known cancer databases taken as golden standards, we measure the performances based on Receiver Operating Characteristic (ROC) and Precision/Recall (PR). As the golden standard to compare against the gene list of each of the graph-theoretical measures under study, we employ two separate databases. One is the integrated breast cancer pathway from the NCBI BioSystems database [37] and the other is the Cancer Gene Census of the COSMIC database [38]. We note that whereas the NCBI BioSystems data is specific to breast cancer, the COSMIC database covers genes relevant to all types of cancer. Thus we can evaluate how well each of the defined measures identifies both breast cancer-specific genes and cancer genes not specific to any certain type.

Every evaluated measure is designed so that it orders the genes from the most relevant to the least. We extract the top k % genes from the list of each of the defined graph-theoretical measures, for every k between 1 and 100 in increments of 1. In addition to the measures under study, we introduce two control measures. The first one is the expression difference (ED) measure, which orders the genes with respect to their ED values. ED(v) for a gene v is defined as the absolute value of the difference between the number of normal samples and the number of tumor samples that include v as an expressed gene. The second control measure is the mutation frequency (MF), which orders the genes with respect to the number of tumor samples including them as mutated genes.
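
The two control measures reduce to simple counting, as the following Python sketch shows. The input layout (one pair of expressed-gene sets per instance, one set of mutated genes per tumor sample) and the function names are assumptions for illustration.

```python
from collections import Counter


def expression_difference(samples):
    """ED(v) = |#normal samples expressing v - #tumor samples expressing v|.

    samples -- list of (expressed_in_normal, expressed_in_tumor) gene sets
    """
    count_n, count_t = Counter(), Counter()
    for expressed_n, expressed_t in samples:
        count_n.update(expressed_n)
        count_t.update(expressed_t)
    genes = set(count_n) | set(count_t)
    return {g: abs(count_n[g] - count_t[g]) for g in genes}


def mutation_frequency(mutated_sets):
    """MF(v) = number of tumor samples in which v is mutated."""
    counts = Counter()
    for mutated in mutated_sets:
        counts.update(mutated)
    return dict(counts)


def rank(weights):
    """Order genes from the largest weight to the smallest (ties by name)."""
    return sorted(weights, key=lambda g: (-weights[g], g))
```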

Figure 2 provides the ROC curves of all the employed graph-theoretical and control measures. In the left plot, the true positives and false positives are computed based on the comparison of the top k % genes of the output list of each measure against the NCBI BioSystems database, whereas in the right plot the reference database is COSMIC. The respective PR curves are provided in Fig. 3. The corresponding AUROC and AUPR values are provided in Table 1. With respect to the ROC/PR curves and the AUROC/AUPR values, the best performing measure is \(M_{bw}\). The AUROC value of the \(M_{bw}\) list as compared to the NCBI BioSystems dataset is 0.77 and its AUPR value in the same setting is 0.042. With regard to the COSMIC dataset, the AUROC value of the \(M_{bw}\) list is 0.709, whereas its AUPR value is 0.091. It is clear that the rest of the unlabeled measures also perform better than the labeled measures for most values of k. It is interesting to note that a measure as simple as degree differentiation between normal and tumor samples across all samples, that is \(M_{deg1}\), provides a better recognition of cancer-related genes than the more complicated measures making use of extra information in the form of labels, such as the measures based on graph-theoretical distances or Jaccard indices. Note also that all the unlabeled measures perform consistently better than the control measures ED and MF with respect to both of the employed golden standard cancer gene databases.
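
The way a ROC curve and its area are obtained from top k % cutoffs can be sketched in Python as follows. The rounding of the cutoff size (rounded up here), the restriction of the reference set to genes present in the ranked list, and the trapezoidal area are assumptions; the function names are hypothetical and the ranked list is assumed to contain no duplicates.

```python
def top_k_percent(ranked, k):
    """First k % of a ranked gene list, rounding the size up (assumption)."""
    size = (len(ranked) * k + 99) // 100
    return ranked[:size]


def roc_curve(ranked, positives):
    """(FPR, TPR) points for k = 1..100, preceded by the origin."""
    universe = set(ranked)
    pos = set(positives) & universe
    n_pos, n_neg = len(pos), len(universe) - len(pos)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    points = [(0.0, 0.0)]
    for k in range(1, 101):
        top = top_k_percent(ranked, k)
        tp = sum(1 for g in top if g in pos)
        fp = len(top) - tp
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A list that ranks every reference gene first yields an area of 1 and a list that ranks them last yields 0; precision/recall points can be collected in the same loop from `tp` and `len(top)`.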

Fig. 2

ROC plots. ROC curves for the measures under consideration for k ranging from 1 to 100 in increments of 1. True positive and false positive rates are with respect to the NCBI BioSystems database (left) and the COSMIC database (right)

Fig. 3

PR curves. PR curves for the measures under consideration for k ranging from 1 to 100 in increments of 1. Precision and recall are with respect to the NCBI BioSystems database (left) and the COSMIC database (right)

Table 1 AUROC and AUPR values for all the defined graph-theoretical measures and the control measures

Evaluations based on gene ontology

An additional database is employed in setting up the next evaluation: the Gene Ontology (GO) database [39]. The GO database annotates proteins from several species with appropriate GO categories organized as a directed acyclic graph (DAG). In order to standardize the GO annotations of proteins, similar to the evaluation methods of [40–42], we restrict the protein annotations to level 5 of the GO DAG by ignoring the higher-level annotations and replacing the deeper-level category annotations with their ancestors at the restricted level. For a node \(u\in V\), let GO(u) indicate the set of standard GO annotations of the protein corresponding to u. For a given list T of genes to be tested and a reference list R, we define a GO Consistency (GOC) score as,

$$\frac{\sum_{t\in T}\sum_{r\in R}|GO(t)\cap GO(r)| / |GO(t)\cup GO(r)|}{|R|}. $$

The list T consists of the top k % of the genes provided by one of the graph-theoretical measures under study or one of the two control measures (ED, MF), and R corresponds to one of the two golden standard datasets. Small values of k are of more interest, since the output candidate list of genes is usually intended for further detailed inspection. The results for k up to 25 are presented in Fig. 4. We only show the plot for the case where the golden standard list R is the NCBI BioSystems pathway; the plot resulting from the GOC evaluations with respect to the COSMIC database is almost the same. It is clear that the performance trends of the evaluated measures are almost the same as those of the previous metrics based on ROC and PR, although the differences are less pronounced.
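
The GOC score is a sum of pairwise Jaccard overlaps of GO annotation sets, normalized by the size of the reference list. A minimal Python sketch follows; the function name, the dictionary layout of the annotations, and the choice to count a pair with two empty annotation sets as 0 are assumptions.

```python
def goc_score(test_genes, reference_genes, go):
    """GO Consistency of a test list T against a reference list R.

    go -- dict: gene -> set of standardized (level-5) GO annotations
    """
    total = 0.0
    for t in test_genes:
        for r in reference_genes:
            a, b = go.get(t, set()), go.get(r, set())
            union = a | b
            if union:                      # skip pairs with no annotations
                total += len(a & b) / len(union)
    return total / len(reference_genes)
```

Since the normalization is by |R| only, the score grows with the size of T, so scores are comparable between measures at the same k rather than across different values of k.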

Further detailed simultaneous inspection of the top two lists, \(M_{bw}\) and \(M_{deg1}\), and the GO consistency analysis with respect to the NCBI BioSystems data reveals that the top contributors to the corresponding GOC scores show significant overlap. At k=5, that is when the top 5% of the gene lists are considered, the four genes contributing most to the GOC score in both lists, \(M_{bw}\) and \(M_{deg1}\), are IGF1R, RAF1, YWHAB, and MYC. Note that none of these are directly listed in the golden standard gene list of NCBI BioSystems. Among the notable GO categories they commonly or independently share with those associated with the golden standard genes are GO:0008284 (positive regulation of cell proliferation), GO:0009890 (negative regulation of biosynthetic process), GO:0016310 (phosphorylation), GO:0031325 (positive regulation of cellular metabolic process), and GO:0010648 (negative regulation of cell communication). The same analysis with respect to the COSMIC database provides CTBP2, ATF3, FHL2, and NFKB2 as shared top contributors in both lists \(M_{bw}\) and \(M_{deg1}\). It is worth emphasizing that other than the last one, none of these genes is listed in the COSMIC database itself.

Fig. 4

GO consistency evaluations. The results of the GO consistency evaluations with regard to the NCBI BioSystems data, for k ranging from 1 to 25 in increments of 1

Evaluations with rewired networks

Employing the criteria of the previous subsections, that is the criteria based on the ROC analysis and the GO consistency analysis with respect to the two golden standards, we further tested the two best-performing measures, \(M_{bw}\) and \(M_{deg1}\), on different networks. The networks under consideration are again based on the IntAct PPI network but modified with the introduction of varying degrees of random error via rewirings: r % of the existing edges are removed randomly and the same number of edges are inserted between random pairs of nodes not adjacent in the original network. This procedure is repeated four times, giving rise to four randomly rewired networks for each value of r=5,10,15,20. For each rewired network the rest of the framework is the same; a pair of normal and tumor networks is generated based on the expression and mutation information of each instance by taking the induced subnetwork of the rewired network, and the relevant functions \(M_{bw}\), \(M_{deg1}\) are computed over all the networks. Thus, considering the induced graphs of all the samples, 99 normal and 99 tumor, a total of 3168 graphs are generated (4 rewiring ratios × 4 rewired networks × 198 sample graphs) and the suggested measures are executed on all these graphs. The experiments on the rewired networks also serve the purpose of testing how sensitive the suggested graph-theoretical measures are to noise in the network data.
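
The rewiring procedure can be sketched in Python as below. The function name, the edge representation (sorted node pairs), the rounding of the number of rewired edges, and the use of a seeded generator for reproducibility are assumptions, not details given in the text.

```python
import random


def rewire(edges, nodes, r, seed=None):
    """Remove r % of the edges at random and insert the same number of
    edges between node pairs that are not adjacent in the original graph."""
    rng = random.Random(seed)
    original = {tuple(sorted(e)) for e in edges}
    nodes = sorted(nodes)
    m = round(len(original) * r / 100)      # number of edges to rewire
    free_pairs = len(nodes) * (len(nodes) - 1) // 2 - len(original)
    if m > free_pairs:
        raise ValueError("not enough non-adjacent pairs to insert edges")
    removed = set(rng.sample(sorted(original), m))
    added = set()
    while len(added) < m:
        u, v = rng.sample(nodes, 2)         # two distinct random nodes
        e = tuple(sorted((u, v)))
        if e not in original and e not in added:
            added.add(e)
    return (original - removed) | added
```

Calling `rewire` four times with different seeds for each r in 5, 10, 15, 20 would produce the sixteen perturbed networks on which the normal/tumor graph pairs are then induced.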

We present the resulting AUROC and AUPR values in Table 2. Note that the true positive, false positive, precision, and recall values are computed as an average of the respective values attained in the four randomly rewired networks generated with the same ratio r. As expected, the general tendency for AUROC and AUPR values with respect to both golden standard datasets is to decrease as the random rewiring ratio r increases. The slight discrepancies are due to the randomness in the rewirings. It should be noted that even though there is a performance decrease with growing random error in the network, this degradation in performance is relatively small. For \(M_{bw}\), the AUROC values decrease by only 4.5% and 4.9%, respectively, for the NCBI and COSMIC databases, even with a 20% random rewiring of the original network. The respective percentages of degradation in the AUROC values of \(M_{deg1}\) are 2.2% and 3.3%. The performance degradations with respect to the AUPR values are slightly higher; for \(M_{bw}\) they are 7.1% and 9.9%, and for \(M_{deg1}\) they are 8.1% and 6.7%. This is an indication that in addition to providing good performance, the suggested measures for cancer gene prioritization are also relatively robust to random noise in the interaction network data. A closer comparative look at the rates of degradation in the AUROC and AUPR values of \(M_{bw}\) and \(M_{deg1}\) reveals that the former becomes more error-prone as the degree of noise in the network increases.

Table 2 AUROC and AUPR values for \(M_{bw}\) (multicolumns in the middle) and \(M_{deg1}\) (multicolumns on the right) on randomly rewired networks with rewiring ratio r=5%,10%,15%,20%. For a fixed ratio r, each value is computed as an average of four randomly rewired networks

The same phenomenon is also evident in the GO consistency analysis. The plot of the GOC values of the prioritized lists of \(M_{bw}\) and \(M_{deg1}\) on randomly rewired networks, for each ratio r, with respect to the NCBI database is provided in Fig. 5. Since the plot with respect to the COSMIC database is almost the same, we do not present it. Note again that the plotted values are averaged over the values resulting from experimental runs on four randomly rewired networks, for each r. As with the ROC analysis, it is clear that \(M_{bw}\) and \(M_{deg1}\) are both quite resilient to noise in the interaction network simulated via random rewirings, with \(M_{deg1}\) even more so than \(M_{bw}\).

Fig. 5

GO consistency evaluations on rewired networks. The results of the GO consistency evaluations on rewired networks with regard to the NCBI BioSystems data, for k ranging from 1 to 25 in increments of 1. The plot only shows GOC values for k≥15, since the previous values are mostly convergent. The numbers in parentheses indicate the ratio r. For each ratio r, the experiments are run on four randomly rewired networks and an average GOC value is taken

Comparisons against an alternative gene prioritization

We compare the results of the two best-performing measures, \(M_{bw}\) and \(M_{deg1}\), against an alternative method for cancer gene prioritization. MUFFINN is similar to the gene prioritization methods suggested in this study both in terms of the employed data and the goal of disease gene prioritization in the presence of data from a limited number of patient samples [23]. In terms of input datasets, it also employs mutation data from patient samples and network data in the form of functional networks or interaction networks. The underlying hypothesis of MUFFINN is that a gene is more likely to represent a true cancer driver if it is functionally associated with other genes in an interaction network. For such a network-based mutation data analysis, MUFFINN considers two ways to take into account mutational information among direct neighbors in the network. One is to consider mutations in the most frequently mutated neighbor and the second is to consider mutations in all direct neighbors with normalization by their degree connectivity. We call the former \(MUFFIN_{max}\) and the latter \(MUFFIN_{sum}\).

We executed both \(MUFFIN_{max}\) and \(MUFFIN_{sum}\) with the same data employed in this study, that is the interaction network is the same IntAct network and the samples are the same TCGA samples as those used by our graph-theoretical prioritization methods. We extract the top k % genes from the list of each of the prioritization methods under comparison, \(M_{bw}\), \(M_{deg1}\), \(MUFFIN_{max}\), and \(MUFFIN_{sum}\), for every k between 1 and 100 in increments of 1. We then apply ROC and precision/recall analysis. In the left plot of Fig. 6 the true positives and false positives are computed based on the comparison of the top k % genes of the output list of each method against the NCBI BioSystems database, whereas in the right plot the reference database is COSMIC. The numbers in parentheses indicate the AUROC values of the relevant methods. The respective PR curves are provided in Fig. 7 and the numbers in parentheses indicate the corresponding AUPR values.

Fig. 6

ROC plots including MUFFINN results. ROC curves for the \(M_{bw}\), \(M_{deg1}\) measures versus MUFFINN for k ranging from 1 to 100 in increments of 1. True positive and false positive rates are with respect to the NCBI BioSystems database (left) and the COSMIC database (right)

Fig. 7

PR plots including MUFFINN results. PR curves for the measures under consideration for k ranging from 1 to 100 in increments of 1. Precision and recall are with respect to the NCBI BioSystems database (left) and the COSMIC database (right)

Our proposed graph-theoretical measure M_bw provides the largest AUROC and AUPR values with respect to both of the gold standard datasets. Even our second-best measure, M_deg1, provides better results than both MUFFINN_max and MUFFINN_sum. Note that the AUROC and AUPR values of M_bw and M_deg1 are slightly different from those provided in Table 1. This is because MUFFINN uses only genes in the Consensus CDS. For a fair comparison, we filtered the reference gold standard databases to remove the genes not considered by MUFFINN, which led to slight differences in the values attained in the tests of M_bw and M_deg1.

Filtering the M bw list

Since M_bw is the best performer among all the employed measures, we inspect its output in detail. The top 50 genes with respect to M_bw are listed in Table 3 in descending order of their weights, as shown in the W_bw column. We first apply the MWIS heuristic on the node-weighted PPI network to implement the filtration. The rows of Table 3 marked in bold correspond to the filtered nodes, that is, the nodes in the MWIS output. The columns of the table are as follows. Column N provides the number of normal samples in which the gene is expressed, and column T provides the corresponding number for tumor samples. Column M provides the number of tumor samples in which the gene is mutated. Column GS1 indicates whether the gene is listed in the first gold standard dataset, NCBI BioSystems, and column GS2 provides the analogous information for the COSMIC database. Finally, the last column lists the genes in the table that are in the MWIS of the W_bw-weighted PPI network and that are neighbors of the given gene in the network. As a sample, Fig. 8 provides the neighborhood subgraphs of the top four MWIS genes of the list. Each subgraph is induced by the protein corresponding to the center node and its neighbors in the PPI network. Nodes are weighted with the corresponding W_bw values. The labeled nodes in the periphery are those that are in the top 50 list but are filtered out of the MWIS because the central node is included in it.
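
To make the filtration step concrete, the following is a minimal sketch of one simple greedy heuristic for the maximum weight independent set problem, together with the extraction of a neighborhood subgraph of the kind shown in Fig. 8. It is an assumption for illustration only: the heuristic actually used in this study may differ, and the function names `greedy_mwis` and `neighborhood_edges`, the toy graph, and the toy weights are hypothetical.

```python
def greedy_mwis(adj, weight):
    """Greedy heuristic: scan nodes by descending weight, keep a node unless
    one of its neighbors has already been kept."""
    selected, blocked = [], set()
    for node in sorted(adj, key=lambda n: (-weight[n], n)):
        if node in blocked:
            continue
        selected.append(node)
        blocked.add(node)
        blocked.update(adj[node])  # neighbors of a kept node are filtered out
    return selected


def neighborhood_edges(adj, center):
    """Edges of the subgraph induced by `center` and its direct neighbors."""
    nodes = {center} | adj[center]
    return sorted((u, v) for u in nodes for v in adj[u] if v in nodes and u < v)


# Hypothetical toy PPI network (symmetric adjacency) and node weights.
adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
weight = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0}

mwis = greedy_mwis(adj, weight)
subgraph_A = neighborhood_edges(adj, "A")
```

In the toy example, B and C score higher than D but are dropped because their neighbor A is kept, which mirrors the situation of the labeled peripheral nodes in the neighborhood subgraphs.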

Fig. 8

Gene neighborhoods. The top four nodes in the filtered M_bw list of proteins and their neighborhood subgraphs. The labeled nodes in the periphery are those that are in the top 50 list but are filtered out because the central node is selected in the MWIS

Table 3 Top 50 genes with respect to M_bw

A literature review of the proteins resulting from the filtration, marked in bold in the table, reveals that almost all of them play significant roles in breast cancer. Below we review each such protein that is not verified by either of the employed gold standard datasets. IKBKE has been shown to be a breast cancer oncogene via integrative genomic approaches [43]. More recently, Sang Bae et al. have shown that CK2/CSNK2A1 phosphorylates SIRT6 and is involved in the progression of breast carcinoma [44]. MDFI is considered a candidate tumor suppressor gene involved in cellular and viral transcriptional regulation [45]. TK1 is a widely accepted biomarker for cancer [46]. Roosmalen et al. have suggested SRPK1 as a breast cancer metastasis determinant via a tumor cell migration screen [47]. The relationship between MAP3K1 and breast cancer, including the possible mechanisms by which MAP3K1 mutations affect pathways important in breast carcinoma, has been discussed in [48]. The role of PTN in the malignant progression of breast cancer has been established since early work [49]. The role of TNFRSF1B in triple-negative breast cancer (TNBC) has been studied in [50]. It is suggested that MAP3K3 contributes to breast carcinogenesis and that it may prove to be a valuable therapeutic target in patients with MAP3K3-amplified breast cancers [51]. KDM1A/LSD1 is suggested as a predictive marker for breast carcinogenesis and a novel attractive therapeutic target for the treatment of ER-negative breast cancers. PIK3R3 is identified as one of the crucial genes regulating triple-negative breast cancer cell migration [52]. It is shown that HLA class I expression, including HLA-B, in breast cancer is significantly associated with nodal metastasis, TNM, lymphatic invasion, and venous invasion [53]. Furlan et al. have shown, in vitro and in vivo, an unsuspected facet of ETS1 in breast tumorigenesis. 
They show that while promoting malignancy through the acquisition of invasive features, ETS1 also attenuates breast tumor cell growth and could therefore repress the growth of primary tumors and metastases [54]. Due to the NR4A1-dependent regulation of TGF-β signaling, NR4A1 is considered to promote breast cancer invasion and metastasis [55]. It is shown that PLSCR1 binds to onzin, a negative transcriptional regulatory target of c-Myc that regulates cell proliferation, which potentially implicates PLSCR1 in cancer cell survival and proliferation [56]. HSPB1 downregulation in human breast cancer cells has been shown to induce upregulation of PTEN, a tumor suppressor gene [57]. Human Pirh2 (p53-induced RING-H2 protein) is encoded by the RCHY1 gene. A decrease of Pirh2 expression in breast cancer cells results in reduced tumor cell growth via the inhibition of cell proliferation and the interruption of cell cycle transition [58]. It is suggested that TFAP2C overexpression correlates with poor overall survival 10 years after the diagnosis of breast cancer [59]. Koo et al. have proposed that RIPK3 deficiency is positively selected during tumor growth/development in breast cancer [60].

In addition to these genes already verified by the relevant literature, the MWIS genes in the top 50 list contain three novel genes whose associations to breast cancer are as yet unclear: MAP3K14, MAPK8IP2, and PRKAB1. Although they are not verified by the literature, the M_bw measure suggests these three as candidate breast cancer genes that deserve further investigation.

Conclusion

We defined a framework to evaluate the performance of several network measures in differentially identifying cancer-related genes on tumor versus normal network instance pairs, and applied this framework to breast cancer data. We defined two separate classifications of the network measures: local/global and labeled/unlabeled. We demonstrate that, on the available data, the local/global classification is not as reliable as the labeled/unlabeled classification for separating the well-performing measures from the poorly performing ones. Surprisingly, unlabeled network measures outperform labeled ones. The best-performing measure is based on betweenness centrality, a global and unlabeled network measure. Applying the measures employed in this study to instances from various other types of cancer is part of the planned future work. Extending the defined measures to node-weighted, edge-weighted graphs, where a node weight represents the expression level of the corresponding gene and an edge weight represents the confidence attributed to the corresponding interaction in the PPI network, may also provide valuable information for the identification of cancer-related genes. Finally, we note that the main purpose of the MWIS filtration is to compress the list of all scored genes into a shorter list for detailed inspection, such as the literature verification done in this study. Such a compression is not done blindly, for instance by simply taking the top 50 genes; the effects of guilt-by-association are taken into consideration through the heuristic idea of independent sets for providing true positives. Nevertheless, the compressed list can be susceptible to error in terms of false negatives. Due to the nature of independent sets, for every interacting pair of possibly high-scoring genes, at most one of the two is retained. 
Thus, further biological evaluations could focus on such high-scoring pairs, with one gene present in the compressed list and the other absent from it, and on the significant genes in gene neighborhoods such as those in Fig. 8, for further simultaneous inspection.
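
As an illustration of this last point, the sketch below lists interacting pairs of high-scoring genes in which one gene is kept in the compressed list and the other is filtered out. It is a hypothetical helper, not part of the study's pipeline: the function name `pairs_to_revisit`, the cutoff `top_n`, the requirement that both genes of a pair be among the top-scoring genes, and the toy data are all assumptions.

```python
def pairs_to_revisit(adj, weight, kept, top_n):
    """Return (kept_gene, dropped_gene) pairs of interacting high-scoring genes.

    adj:    dict mapping gene -> set of interacting genes.
    weight: dict mapping gene -> score.
    kept:   set of genes retained by the independent-set filtration.
    top_n:  number of top-scoring genes regarded as high scoring (assumed cutoff).
    """
    top = set(sorted(weight, key=lambda g: (-weight[g], g))[:top_n])
    pairs = []
    for gene in sorted(top & set(kept)):
        for nbr in sorted(adj.get(gene, ())):
            # The neighbor is high scoring but was excluded from the compressed list.
            if nbr in top and nbr not in kept:
                pairs.append((gene, nbr))
    return pairs


# Hypothetical toy network, scores, and filtration output.
adj = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
weight = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0}
kept = {"A", "D"}

revisit = pairs_to_revisit(adj, weight, kept, top_n=3)
```

Each returned pair marks a dropped gene that may be a false negative of the compression and could be inspected together with its retained neighbor.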