Background

Cancer driver mutations are DNA changes that are causally implicated in oncogenesis [1, 2]. Typically between two and eight mutations, targeting several cellular pathways, are needed for cancer to develop [3]. To disrupt a single pathway or a group of functionally related genes in a way that promotes cancer growth, often only one mutation is needed [46].

Most DNA mutations are not cancer drivers. Mutations in DNA accumulate throughout life – for example, comparing skin or gastrointestinal epithelium cells in cancer samples from patients 85 and 25 years old showed that the younger patient on average has half the number of mutation compared to the older patient. More than half of all mutations found in cancer tissue are estimated to have occurred before the start of the disease [7]. Additionally, mutation rate tends to increase in cancer cells [8], although it can differ significantly even among subclones within the tumor [9]. Most of these new random mutations do not contribute to the progression of the disease. An analysis of large number of cancer samples gathered in the Cancer Genome Atlas (TCGA) [10] shows that the total number of mutations present in a tumor tissue from a single patient can range from 10 to more than 100, and only about 2 to 6 among them are driver mutations [11]. Hence, the majority of mutations present in a cancer tissue sample are passenger mutations, with no positive impact on oncogenesis. Due to the potential of using driver genes, that is, genes that harbor driver mutations, for therapeutic, prognostic, or diagnostic purposes, assembling a comprehensive catalogue of driver genes an important ongoing endeavor [1214]. The main challenge in this task is discovering new driver genes while avoiding false positives stemming from the abundance of passenger mutations.

Statistical and computational methods for detecting driver genes often rely on finding certain pattern of mutations in a group of driver genes across a cohort of patients. To develop cancer, multiple cellular functions must be perturbed, and in different patients, different genes with the same function may be mutated. Often, the cancer develops and is detected before a second mutation in genes with a given function occurs. Thus, for a given cancer type, for a group of patients, each patient would have at least one mutation in a functionally-related group of driver genes, but rarely would have more than one mutation – the gene set exhibits mutual exclusivity pattern of mutations. Several methods detect a set of driver genes by quantifying mutual exclusivity, including Dendrix [15] and Multi-Dendrix [16], RME [17], CoMEt [18], TiMEx [19], MEMo [20], and our own method, QuaDMutEx [21]. An alternative approach involves knowledge of networks linking genes. Frequently mutated genes and their less-frequently mutated neighbors in known human gene- or protein-level pathways or networks are detected as drivers. Methods such as HotNet2 [22, 23], MEMo [20] and DriverNet [24] adopted the network-oriented driver detection approach.

Biological network connectivity and mutual exclusivity are both important sources of information in discovering driver genes. At the same time, both types of information must be used with caution. The available biological networks are incomplete and are expected to include false positives, which might affect the true structure of the network in a way that is unknown. Deviations from mutual exclusivity pattern are expected in individual patients, especially in slow-growing tumors where random mutations have more time to accumulate before cancer is detected. Therefore, an algorithm that uses biological networks and mutual exclusivity at the same time will be able to utilize two complementary, imperfect sources of information to improve the quality of the discovered putative driver gene sets.

We propose a tool, QuaDMutNetEx, which combines the network and exclusivity based approaches. As in our previous tool QuaDMutEx [21], the objective function that is used to select driver genes penalizes for any deviation from the mutual exclusivity pattern. Additionally, QuaDMutNetEx shows preference for genes that are connected in known human biological networks. Compared to mutual exclusivity-based tools such as QuaDMutEx or Dendrix, this additional source of information can help in finding rare driver mutations, for which neither the network connectivity and mutation frequency alone, nor exclusivity alone, are selective enough. The tool models both the network and the mutual exclusivity terms of the objective function as convex, quadratic terms, resulting in a binary quadratic problem, which is solved using our previously proposed technique of efficiently exploring the space of gene sets by using stochastic search through a series of globally optimal solutions to subproblems. Comparisons with existing state-of-the-art methods on four cancer datasets show that the approach of combining network and exclusivity approaches results in improved ability to detect highly connected, mutually exclusive rare driver genes.

Results

We evaluated QuaDMutNetEx using its default parameters that have been selected experimentally: the maximum size of the gene set is ν=50; k=1, indicating equal preference for optimizing coverage and excess coverage; C=2.5; the network parameter was set to α=0.3; the number of iterations was set to T=10,000. In the network-connectivity term of the objective function, we used combined three human protein-protein interaction networks previously used in HotNet2 [23]. The first network is the iRefIndex network, which consists of 91,872 interactions among 12,338 proteins. The second network is MultiNet network which consists of 109,597 interactions among 14,445 proteins. The last network is HINT+HI2012 which is created by considering two interactome databases: HI-2012 data in human HI2 Interactome database, HI2012, and high-quality interactomes database, HINT. The HINT+HI2012 network consists of 40,783 interactions among 10,008 proteins.

We used four datasets to evaluate the proposed algorithm: triple-negative breast cancer (TN), glioblastoma multiforme (GBM), high-grade serous ovarian cancer (HGS), and another breast cancer (METABRIC) dataset (see Table 1). These datasets were previously used in evaluating the DriverNet tool. Following standard practice, known hypermutated genes such as mucins, titin, olfactory receptors, which are unlikely to play role in cancer, were removed [25].

Table 1 Summary of DriverDB tool datasets used in experimental validation of QuaDMutNetEx

Quantitative and qualitative assessment of QuaDMutNetEx results

The results of the tests, presented in Table 2, show that the proposed method returns gene sets that are statistically significant at 0.05. To assess statistical significance of the results returned by QuaDMutNetEx, we used permutation test proposed in [15]. In short, we randomly permuted the patient-gene matrix in a way that preserves the number of mutations in each patient, and in each gene. This process results in a randomized dataset in which any correlations of mutations are only appearing by chance, but the gene mutation frequencies and patient mutation counts are the same as in the original dataset, which keeps the randomized dataset similar to the original. We created 256 randomized datasets and ran QuaDMutNetEx on each dataset. To obtain a p-value estimate, the final penalty score obtained from running QuaDMutNetEx on the original dataset was compared with the distribution of final penalty scores from running QuaDMutNetEx on the 256 randomized datasets.

Table 2 Quantitative characteristics of QuaDMutNetEx results

The genes discovered by QuaDMutNetEx are presented in Table 3. To evaluate the gene’s driver status, we used COSMIC Cancer Gene Census database [26, 27] and the cancer driver gene database DriverDBv2 [28]. To check QuaDMutNetEx’s effectiveness in discovering rare cancer drivers, we focused on the genes in the solutions that are least frequently mutated in the datasets, and preformed literature review to analyze if these are true or false positives. Additionally, we visually evaluated the resulting gene networks – the largest connected component for each dataset is presented in Fig. 1. To show how inclusion of the network connectivity term in the objective function improves the solution, we have denoted genes found on the same datasets by our mutual-exclusivity-only method, QuaDMutEx. We see that most of the discovered driver genes, especially those mutated with low frequency, result from including the network connectivity term.

Fig. 1
figure 1

Known interactions between driver genes discovered by QuaDMutNetEx on the four datasets: TN: triple-negative breast cancer, GBM: glioblastoma multiforme, HGS: high-grade serous ovarian cancer, and METABRIC: breast cancer

Table 3 Putative driver gene sets discovered by QuaDMutNetEx

In the triple-negative breast cancer (TN) dataset, out of thirteen identified driver genes, eight are each mutated in only two out of 94 patients, and are analyzed below. A chromatin-remodeling gene CREBBP was found to be overexpressed in breast cancers [29], and is frequently mutated in bladder cancers [30]. DAPK1 is a potential tumor suppressor gene [31, 32]. NCOA1 was found to promote angiogenesis in breast cancers [33]. SLC39A7 is a potential oncogene in colorectal cancer [34]. IDH3B is upregulated in breast cancer and is significantly involved in energy metabolism in tumor progression [35, 36]. HIST1H4A is known to play a role in cell death induction in tumor cells [37]. HIF1A functions as a tumor promoter in cancer-associated fibroblasts, and as a tumor suppressor in breast cancer cells, also it is already a vaccine target in triple-negative breast cancer [3840]. Finally, MLL methyltransferase are known to have a haematopoietic-specific tumorigenic capability [41].

In the ovarian cancer (HGS) dataset, twenty-three out of twenty-five identified genes are low-frequency driver genes – each is mutated only in two out of 316 patients. Of these twenty-three genes, CTNNB1 is implicated in malignant ovarian transformation [42]. DAG1 and HSPA5 are already drug targets [43, 44]. ERBB2, MST1R, STAT3, VAV3, ERBB3, NTRK2 and JAK2 are known oncogenes [4550], and FANCA is a potential oncogene [51]. GRB2 is a therapeutic target for solid tumor prevention [52]. PIK3R1 represents a critical driver of endometrial cancer pathogenesis and is a therapeutic target [53]. TSHR signaling promotes the proliferation of ovarian cancer [54]. HSP90AA1 is considered as a potential protein target in therapy of ovarian cancer. [55]. UBC is potential drug resistance-related gene in ovarian cancer [56]. Finally, WRN and CDKN2A are tumor suppressor genes [57, 58].

In the glioblastoma multiforme (GBM) dataset, out of six identified driver genes, one is mutated in five, and two in only two out of 120 patients. Of these, MAPK9 was found to be significantly upregulated in glioma stem cells [59]. MDM2 is a known oncogene [60], while RPL11 is a tumor suppressor gene that acts together with MDM2 in p53 activation pathway [61]. In the METABRIC breast cancer dataset, out of twenty five genes identified by QuaDMutNetEx, six are very rare – each mutated in only two out of 696 patients. Of these, JAK1 is known for its key role in breast cancer progression [62]. EGFR signaling pathway also has a crucial role in mammary cancers [63], and polymorphism in the EGFR ligand, EGF, was found to affect cancer progression [64]. PIK3R1 and VAV1 are known oncogenes [65, 66], SYK is a tumor suppressor gene [67], and PTPN6 has a tumor suppressor role [68]. Together, these results confirm that QuaDMutNetEx is highly effective in identifying cancer driver genes with low mutation frequency.

For comparison, we used two network-based methods, DriverNet and HotNet2. We also used a mutual exclusivity tool, Dendrix. We ran the three tools on the same four datasets: TN, GBM, HGS, and METABRIC. DriverNet was designed to utilize genomic, transcriptomic, and biological network information, HotNet2 utilizes genomic and biological network information, while Denrix used only the genomic information. In all three methods, as well as in QuaDMutNetEx, we used the default parameters. For each method, we analyzed coverage, excess coverage, conformity to the mutual exclusivity pattern as quantified by the Dendrix score \(n - \sum _{i=1}^{n} |G_{i} x - 1|\), and the number of connected components in the subgraph of the biological network consisting of the genes in the solution returned by the method.

Testing in cancer molecular subtypes dataset

Mutual exclusive pattern in tumor can be resulted from other factors [69]. Hence, methods using mutual exclusivity need to account for that. Here we are using GBM subtypes to test the effectiveness of our method [69]. Using copy number, gene expression and methylation, GBM classified into proneural, neural, classical, and mesenchymal [70, 71]. We downloaded GBM data from TCGA and divided them into four subtypes according to TCGA IDs given in [71].

The genes discovered by QuaDMutNetEx are presented in Table 4. To evaluate the gene’s driver status, we used COSMIC Cancer Gene Census database [26, 27] and the cancer driver gene database DriverDBv2 [28]. All of the resulted genes exist in both COSMIC and DriverDB2 or one of them.

Table 4 Putative driver gene sets and metrics in GBM subtypes discovered by QuaDMutNetEx

Comparison with existing methods

The quality of the solutions returned by QuaDMutNetEx is higher than solutions from other methods (see Table 5). QuaDMutNetEx consistently produces more mutually exclusive gene sets than the network-based methods, and the gene sets are more highly connected – the interaction networks have fewer separate connected components. Compared to Dendrix, the tool that only considers mutual exclusivity, QuaDMutNetEx solutions show slightly lower mutual exclusivity, but at the same time are much more highly connected.

Table 5 Comparison between QuaDMutNetEx, HotNet2, DriverNet, and Dendrix

Effects of parameters on QuaDMutNetEx

The behavior of proposed method can be adjusted by three parameters, based on the knowledge of the tumor under study. Parameter α quantifies the reward for gene connectivity in cellular networks – higher values indicate stronger preference for finding densely connected genes. Parameter k controls how steeply the penalty for multiple mutations in a single pathway grows – lower values of k lead to lower penalization of excess coverage in relation to coverage, and is appropriate for slower growing tumors, where additional mutations in any given pathway have more time to accumulate by chance. Finally, higher values of parameter C penalize for solutions sets with many genes.

We have analyzed how these parameters affect the solution by running QuaDMutNetEx for 100,000 iterations for parameters α=0.01,0.05,0.1,0.3,0.6,1 with C=0.25,0.5,1,1.5,2,2.5 and α=0.01,0.05,0.1,0.3,0.6,1 with k=0.25,0.5,1,1.5,2,2.5. Figure 2 shows that the parameter α achieves its design goal, that is, solution with higher α include fewer connected components and prefer connected network. The α parameter has the following effect on coverage and excess coverage: as the value of α increases, the coverage decreases and the excess coverage increases. Furthermore, as the value of α increases, it decreases the effect of C and k. Setting α to a low value, such as 0.001, makes the effect of C and k to be more dominant. Higher coverage and higher excess coverage, as expected, are observed for low k values. High values of C lead to solution sets involving only a few genes, while low values of C lead to high coverage.

Fig. 2
figure 2

Effects of parameters on QuaDMutNetEx. a, b: effect on connected components; c, d: effect on coverage; e, f: effect on excess coverage. Results shown are for the HGS dataset, the results for other datasets are similar

Discussion

The proposed method, QuaDMutNetEx, relies on two sources of information to detect cancer driver genes. It uses observed somatic mutations in a cohort of cancer patients, to detect mutual exclusivity patterns, and a biological network encoding functional relationships between genes to provide context for the observed data. Relying on two sources of information is a strength of the proposed method, since treated individually, each source is imperfect. Biological networks are know to be incomplete and contain false positives, and the functional, regulatory, and signaling relationships they capture are not all directly relevant to cancer. Mutual exclusivity patterns may not be perfectly present in the observed patient mutation data. This may be true especially for slower growing tumors, where the time from onset of the disease to its detection is long enough to allow for accumulation of additional mutations in functionally-related sets of genes that contribute to cancer. Depending on the knowledge of the analyzed type of cancer and characteristics of the studied patient cohort, the users of QuaDMutNetEx should adjust the parameters of the methods that govern the strength of preference for mutual exclusivity in relation to patient coverage, the weight assigned to the network knowledge, and the strength of preference for small driver gene sets. Users of QuaDMutNetEx should also keep in mind that it uncovers driver genes relevant to the cohort from which the mutation data comes from. That is, it detects drivers present in the particular set of patients, based on the particular type of mutation data provided.

Conclusions

Experiments on four datasets show that QuaDMutNetEx has the ability to detect driver genes mutated with low frequency, genes that may be missed by existing tools that rely on mutual exclusivity alone, or on frequency and network information alone. Improvements in the quality and interpretability of the discovered putative driver gene sets makes QuaDMutNetEx a valuable addition to the family of driver discovery tools.

Methods

Input for QuaDMutNetEx

QuaDMutNetEx input has two sources of information. The first source is the binary somatic mutation matrix as in many mutual-exclusivity tools [15, 21]. Specifically, the data for n patients, each with total of p genes explored for possible existence of somatic mutations, arrives in a form of a mutation matrix G, an n by p sparse binary matrix. We expect Gij=1 if patient i has a somatic mutation in gene j, that is, a difference between cancer tissue and matched healthy tissue from the same patient is detected; otherwise, Gij=0. The change can be a point mutation in the coding region of the gene, potentially affecting its function. It could also be a mutation in the non-coding, regulatory elements of the DNA associated with the gene, or copy number alternation of the gene in case of homozygous deletions and high-level amplifications, affecting its expression. A row of the matrix describing mutations in patient i will be referred to as a vector Gi. The second source of information is a gene connectivity matrix A, with nonzero Aij values for genes i and j that are known to be related in a biologically meaningful way, for example one gene regulates the other, or proteins encoded by the genes are known to interact in a signaling pathway. The output of the method is a column vector x of length p, with xj=1 indicating that gene j is a putative cancer driver gene, that is, its mutations can contribute to cancer growth, and zero otherwise. The non-zero elements of the solution will be referred to as the solution gene set.

Defining the quality of potential driver gene sets

For the binary solution vector x with length p genes, there are 2p−1 possible non-zero solution vectors, each encoding a different gene set. The challenge is to find a gene set that is composed of driver genes. To this end, we designed a penalty score that reflects how likely it is that genes encoded by a solution vector form a comprehensive set of driver genes impacting a single cellular function. The lower the penalty score, the more likely the solution consists of related driver genes. The penalty score has two major terms: a network term, and a mutual-exclusivity term.

The network terms captures our preference for solution gene sets in which the products of transcribing the genes are connected in known human protein-protein interaction networks. Other sources of pairwise gene relationships could be used, for example functional similarity, or regulatory interactions. This network connectivity preference is captured by a term N(A,x) in the objective function, where A is the undirected adjacency matrix corresponding to the network, and x is the solution gene set. The new term introduces a reward for any two genes in a solution that are connected. Since the solution vector is binary, the additional term can be defined as \(N(A,x)=-x^{T}Ax=-\sum _{i,j} A_{ij} x_{i} x_{j}\). The scaled term αN(A,x) with nonnegative weight α corresponds to providing a reward of α every time two genes i and j present in the solution, that is, with xi=1 and xj=1, are connected by an edge, that is, when Aij=1. The effect of introducing the network term can be seen in Fig. 3.

Fig. 3
figure 3

Illustration of the role of the network term N(A,x). Based solely on the mutual exclusivity, potential solutions 1 and 2 are equally good, both show perfect mutual exclusivity. Inclusion of network term N(A,x) makes potential solution 2 the preferred one, since it consists of more highly connected genes

The second term in the objective function captures mutual exclusivity pattern among solution genes. We use a flexible, quadratic term previously used in our mutual exclusivity method, QuaDMutEx [21]. Briefly, the term penalizes for solutions that leave some patients not showing any mutation in the solution genes, as well as for solutions in which patients are covered by more than one mutation. The penalty for excess mutations grows quadratically with the number of mutations over one. The ration of penalty for multiple mutations to penalty for no mutation can be tweaked by parameter k. For example, for a slow growing tumor, where there is ample time for additional mutations to accumulate in a single pathway, k should be low. In addition, we define parameter C to be a penalty incurred by adding one more gene to the solution set.

For any possible solution vector x, the penalty score is a sum of the two terms described above, and is

$$ {\begin{aligned} L(G,A,x)&=-\alpha x^{T} A x + \sum_{i=1}^{n} \frac{1+k}{2}\left(G_{i} x - 1\right)\left(G_{i} x - \frac{2}{1+k}\right) + C ||x||_{0}. \end{aligned}} $$
(1)

Algorithm for finding high-quality driver gene sets

The minimization of the quadratic penalty function L(G,A,x) over binary vectors x is an example of an unconstrained binary quadratic problem (BQP). While BQPs are known to be NP-hard [72], for problems small enough the optimal solution can still be found. For example, for datasets with up to 1000 patients, if one focuses on only ν=50 genes, the solution x that is the global minimum of L(G,A,x) can be found in below a second.

As we have shown before [21], high-quality approximate solutions to BQP problems involving thousands of genes can be found efficiently by an iterative algorithm that maintains an evolving set of size ν consisting of candidate driver genes, and in each of the T iterations finds an optimal solution to a small instance of the problem in Eq. 1 involving only the current candidate genes. This allows for improving the quality of the driver gene set, while exploring a diverse set of possible genes as candidates.

A single run of QuaDMutNetEx will return a set of functionally-related driver genes with high mutual exclusivity and high network connectivity. Running QuaDMutNetEx in sequence, removing discovered genes from input matrices G and A after each iteration, will allow to uncover genes from multiple pathways needed for oncogenesis, although the joint solution is no longer expected to have high connectivity, nor to conform to the mutual exclusivity pattern.