Introduction

Cancers, complex diseases with high lethality and heterogeneity, are caused by the clonal proliferation of cells, which is attributed to the selective growth advantage conferred by gene mutations [18, 19, 21]. Nevertheless, the majority of mutations are passenger mutations that are irrelevant to cancers: they are biologically neutral and do not confer a growth advantage on the cell in which they occur. That is to say, only a minority of the mutations, called driver mutations, have been subject to positive selection and are causally implicated in the clonal proliferation of cells, contributing to the formation and progression of cancers [49]. Differentiating driver genes from passenger ones therefore sheds light on cancer pathogenesis [38, 40]. Furthermore, studies have demonstrated that driver genes are generally engaged in critical cellular signaling or regulatory pathways of the human body [24], and that a single aberrant driver gene is generally enough to disturb the signaling pathway it is involved in and lead to the generation of cancer cells. This may explain why high heterogeneity exists in cancers. Consequently, to explore this heterogeneity, it is important to investigate gene mutations at the pathway level instead of the gene level [5, 16]. With the rapid development of high-throughput sequencing technology, vast amounts of cancer omics data have been collected by cancer genome sequencing projects such as The Cancer Genome Atlas (TCGA) [10] and the International Cancer Genome Consortium (ICGC) [30]. It has become feasible to economically detect driver pathways or driver modules (sets of driver genes enriched in cancer-related biological pathways) by using computational methods [14, 25, 26, 64].

A number of studies have addressed the identification of driver pathways or driver modules. One kind of approach, namely de novo identification [52, 63, 65], detects them from genetic data alone by virtue of the fundamental features of driver pathways or driver modules. The other kind, namely prior knowledge-based identification [1, 34, 42], exploits the known interactions between genes/proteins in addition to genomic data. This study focuses on the latter.

Among the methods based on prior knowledge, most apply the intrinsic topology of biological networks to the identification. The HotNet2 method [42] performs an insulated heat diffusion process with gene mutation frequencies as well as gene interactions, and constructs a weighted graph for identifying driver modules. The Mutex method [4] searches a large gene network for sets of mutually exclusively mutated genes that share a common downstream target. Ahmed et al. [1] argued that methods relying only on mutation frequency may neglect driver modules with low mutation frequencies. Unlike the HotNet2 method, their MEXCOWalk method [1] weights the protein-protein interaction (PPI) network in terms of the mutual exclusivity among genes in addition to the gene mutation frequencies, and conducts an insulated heat diffusion process based on the weights of both vertices and edges. In 2021, Wu et al. [55] pointed out that the noise contained in biological networks inevitably impairs identification performance, and filtered it out by introducing subcellular localization data. Additionally, they devised a parthenogenetic algorithm to solve their proposed recognition model, which introduces hops between genes within a module in addition to coverage, mutual exclusivity, and network connectivity. The next year, they argued that attention should be paid to the discrepancy of mutation frequency among different cancers [56]. Their HMCEwalk method, which identifies modules based on a random walk process, weights the integrated PPI network by using the harmonic mean of the scores concerning coverage and mutual exclusivity. In the same year, Wu et al. [54] presented the ECSWalk method based on MEXCOWalk; it weights gene interactions in a complex biological network in terms of the similarity of node topological structure in addition to coverage and mutual exclusivity between mutated genes. There are also studies that attempt to reconstruct or alter the topology of biological networks. The MEMo method [12] builds a graph with edges indicating the functional similarity between a pair of genes, and outputs cliques exhibiting patterns of mutual exclusivity. The MEMCover method [35] uncovers pan-cancer dysregulated pathways from an adjusted functional interaction network in which the interactions fall into the ACROSS_ME level.

Among the above-mentioned identification methods, only IDM-SPS, proposed by Wu et al. [55], has addressed latent noise such as false positive interactions in biological networks, which may be caused by a less precise confidence interval for classification [31, 32]. In this paper, we study how to alleviate the negative effects of noise by means of other omics data. We begin by constructing a weighted protein-protein interaction (PPI) network with the aid of the gene-microRNA network as well as the somatic mutation profile, and generate gene feature vectors with the graph embedding method Node2vec. Then a set of gene clusters is produced with the DIvisive ANAlysis (DIANA) hierarchical clustering algorithm [48]. Finally, the set of gene clusters is processed based on gene influence to obtain the final set of cancer driver modules. The major contributions are as follows: (1) Introduce a new evaluation index to weight the PPI network. (2) Embed the vertices of a PPI network into a low-dimensional vector space with the graph embedding method Node2vec. (3) Devise a heuristic dropping and extracting process on a set of gene clusters, generated by clustering the genes based on their low-dimensional feature vectors. (4) Conduct extensive trials with real pan-cancer datasets, and compare the identification performance with that of the other advanced methods Hotnet2, MEXCOwalk, ECSwalk, and HMCEwalk.

Definitions and notations

Given a set of cancer samples S={\(s_i\vert i\)=1, 2, ..., m} as well as a group of mutated genes G={\(g_j\vert j\)=1, 2, ..., n}, let \(A_{m\times n}\) be a binary somatic mutation matrix recording whether gene \(g_j\) mutates in sample \(s_i\) or not, i.e., \(a_{ij}\)=1 (i=1, 2, ..., m, j=1, 2, ..., n) if gene \(g_j\) mutates in sample \(s_i\), and \(a_{ij}\)=0 otherwise. Let PP=(V, E) represent a connected PPI network, where the vertex set V records the proteins expressed from the genes in G (\(n=|V|\)), and the edge set E records the undirected interactions between the proteins. To simplify the later description, each vertex in PP is represented by its corresponding gene \(g_j\) (\(g_j\in G\)). Let PM=(\(V^g\), \(V^m\), \(E^{gm}\), \(W^{gm}\)) denote a gene-microRNA network, where each vertex \(v^g_j\in V^g\) represents a gene (corresponding to the gene \(g_j\in V\)), each vertex \(v^m_k\in V^m\) represents a microRNA, and each edge \(e^{gm}\)(\(v^g_j\), \(v^m_k\))\(\in E^{gm}\) has a weight \(w^{gm}\)(\(v^g_j\), \(v^m_k\))\(\in W^{gm}\), measuring the relationship between gene \(v^g_j\) and microRNA \(v^m_k\). For each \(g_j\in V\), let \(S_j\) record the samples in which gene \(g_j\) is mutated:

$$\begin{aligned} S_j= {\left\{ \begin{array}{ll} \{s_i\vert a_{ij}=1 \}, &{}\text{ if } g_j\in G\\ \emptyset , &{}\text{ otherwise }. \end{array}\right. } \end{aligned}$$
(1)

Assume that M is a module composed of selected genes. The mutual exclusivity MEX(M) as well as the coverage COV(M) of M are defined in Formulas (2) and (3) [1]:

$$\begin{aligned} MEX(M)= & {} \frac{\vert \bigcup _{\forall g_i\in M} S_i\vert }{\sum \limits _{\forall g_i\in M }\vert S_i\vert }, \end{aligned}$$
(2)
$$\begin{aligned} COV(M)= & {} \frac{\vert \bigcup _{\forall g_i\in M} S_i\vert }{m}, \end{aligned}$$
(3)

where MEX(M)=1 means that the genes within M are completely mutually exclusive, i.e., each sample carries mutations in at most one gene of M, and COV(M)=1 indicates that each sample carries a mutation in at least one gene of M. Let P={\(M_1\), \(M_2\), ..., \(M_r\)} be a group of driver modules, where \(M_i\), \(M_j \in P\), \(M_i \ne M_j\), i, j=1,2, ..., r, \(i \ne j\). The relative size of the module \(M_i\), namely \(RS(M_i)\), is formulated as follows:

$$\begin{aligned} RS(M_i)=\frac{\vert M_i\vert }{\vert \bigcup _{\forall M_t\in P}M_t\vert } \end{aligned}$$
(4)

Then for the group of driver modules P={\(M_1\), \(M_2\), ..., \(M_r\)}, let MS(P) and CS(P) denote the mutual exclusivity score and the coverage score [1], respectively, defined as follows:

$$\begin{aligned} MS(P)= & {} \sum _{\forall M_i\in P}MEX(M_i)\times RS(M_i) \end{aligned}$$
(5)
$$\begin{aligned} CS(P)= & {} {\left\{ \begin{array}{ll} \sum \limits _{\forall M_i\in P}\frac{COV(M_i)\times (1-RS(M_i))}{\sum \nolimits _{\forall _{M_t\in P}} 1-RS(M_t)}, &{}\text{ if } |P|>1\\ COV(M_1), &{}\text{ if } |P|=1. \end{array}\right. } \end{aligned}$$
(6)

According to the above definitions, an optimization problem for identifying cancer driver modules is depicted as follows: Given a PPI network PP, somatic mutation matrix A, gene-microRNA network PM, the total number of genes Totalg, and the minimum size of a module Mins, identify a group of non-overlapping modules P to maximize Driver Module Set Score DMSS(P), as shown in Formulas (7) to (10).

$$\begin{aligned}&\max DMSS(P)=MS(P)\times CS(P), \end{aligned}$$
(7)
$$\begin{aligned}&s.t.~PP(M_i)~is~connected,\forall M_i\in P,\end{aligned}$$
(8)
$$\begin{aligned}&\vert \bigcup _{\forall M_i\in P} M_i\vert =Totalg,\end{aligned}$$
(9)
$$\begin{aligned}&\min _{\forall M_i\in P}\vert M_i\vert =Mins. \end{aligned}$$
(10)
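As an illustration of Formulas (1) to (7), the following minimal Python sketch evaluates MEX, COV, RS, MS, CS, and DMSS on a toy mutation matrix. The function names, the toy data, and the convention of returning 0 for a module without any mutation are assumptions made for this example only; the connectivity and size constraints of Formulas (8) to (10) are not checked here.

```python
from typing import Dict, List, Set


def mutated_samples(A: List[List[int]]) -> Dict[int, Set[int]]:
    """S_j of Formula (1): indices of the samples in which gene j is mutated."""
    return {j: {i for i, row in enumerate(A) if row[j] == 1}
            for j in range(len(A[0]))}


def covered(M: Set[int], S: Dict[int, Set[int]]) -> Set[int]:
    """Union of the sample sets of all genes in module M."""
    return set().union(*(S[g] for g in M))


def mex(M: Set[int], S: Dict[int, Set[int]]) -> float:
    """Formula (2). Returning 0.0 for an unmutated module is an assumption."""
    total = sum(len(S[g]) for g in M)
    return len(covered(M, S)) / total if total else 0.0


def cov(M: Set[int], S: Dict[int, Set[int]], m: int) -> float:
    """Formula (3): fraction of the m samples covered by module M."""
    return len(covered(M, S)) / m


def dmss(P: List[Set[int]], S: Dict[int, Set[int]], m: int) -> float:
    """Formulas (4) to (7): DMSS(P) = MS(P) x CS(P)."""
    n_genes = len(set().union(*P))
    rs = [len(M) / n_genes for M in P]                      # Formula (4)
    ms = sum(mex(M, S) * r for M, r in zip(P, rs))          # Formula (5)
    if len(P) == 1:                                         # Formula (6)
        cs = cov(P[0], S, m)
    else:
        cs = (sum(cov(M, S, m) * (1 - r) for M, r in zip(P, rs))
              / sum(1 - r for r in rs))
    return ms * cs


# Toy data: 4 samples (rows) x 4 genes (columns), two modules of two genes.
A = [[1, 0, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 1, 1]]
S = mutated_samples(A)
P = [{0, 1}, {2, 3}]
score = dmss(P, S, m=len(A))
```

In this toy case the first module is perfectly exclusive (MEX=1) while the second is not (MEX=0.75), and both cover three of the four samples, so the score is 0.875×0.75.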

The ICDM-GEHC method

In this section, a method for Identifying Cancer Driver Modules by Graph Embedding and Hierarchical Clustering (ICDM-GEHC) is proposed. The method takes matrix A, PPI network PP, and gene-microRNA network PM as inputs, and produces a set of driver modules P as output. As shown in Fig. 1, the method has four main steps, namely assigning weights, extracting features, clustering genes, and constructing driver modules. Each step is described in detail in the four subsequent subsections.

Fig. 1 The pipeline of method ICDM-GEHC

Assigning weights

It has been reported that microRNAs (miRNAs) exert critical functions in the progression and development of human cancers through regulating the expression of cancer-related genes [33]. In this paper, the gene-microRNA interaction network is introduced to weight the protein-protein interactions of a PPI network. For the convenience of description, let PP still represent the weighted PPI network, i.e., PP=(V,E,W), where w(\(v_j\), \(v_k\))\(\in W\) denotes the weight of edge (\(v_j\), \(v_k\)).

Given a pair of genes \(g_i\) and \(g_j\) (\(g_i\), \(g_j\in V\)), the confidence between them CF is defined as Formula (11):

$$\begin{aligned} CF(g_i, g_j)= {\left\{ \begin{array}{ll} \frac{\sum \limits _{\forall v_{k}^m\in NM_{ij}} [w^{gm}(v_{i}^g,v_{k}^m)+ w^{gm}(v_{j}^g,v_{k}^m)] }{2\times |NM_{ij}|}, &{}\text{ if } |NM_{ij}|>0,\\ \lambda \mu , &{}\text{ if } |NM_{ij}|=0, \end{array}\right. } \end{aligned}$$
(11)

where \(NM_{ij}\) records the microRNA neighbors common to genes \(g_i\) and \(g_j\), \(\mu \) is the arithmetic mean of the edge weights in PM (Formula (12)), and \(\lambda \) is an adjustable parameter.

$$\begin{aligned} \mu =\frac{\sum \nolimits _{e(v^g_j,v^m_k)\in E^{gm}}w^{gm}(v^g_j,v^m_k)}{|E^{gm}|}. \end{aligned}$$
(12)

Let ME(\(g_i\), \(g_j\)) represent the mutual exclusivity between genes \(g_i\) and \(g_j\) (\(g_i\), \(g_j\in V\)), as calculated in Formula (13):

$$\begin{aligned} ME(g_i, g_j)=\frac{MEX(Ne(g_i))+MEX(Ne(g_j))}{2}, \end{aligned}$$
(13)

where Ne(x) records gene x as well as its direct neighbour genes:

$$\begin{aligned} Ne(x)=\{y|e(x, y)\in E\}\cup \{x\}, \end{aligned}$$
(14)

Then each edge of the PPI network PP is weighted as Formula (15):

$$\begin{aligned} w(v_i, v_j)={\left\{ \begin{array}{ll} ME(g_i, g_j)\times COV(\{g_i\})\times COV(\{g_j\})\times CF(g_i, g_j), &{} \text{ if } ME(g_i, g_j)\ge \theta ,\\ 0, &{}\text{ otherwise }, \end{array}\right. } \end{aligned}$$
(15)

where \(\theta \) is the threshold of mutual exclusivity.
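The weighting of Formulas (11) to (15) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the dictionary-based data layout, the function names, and the toy values are all hypothetical.

```python
from typing import Dict, Set

GeneMirna = Dict[str, Dict[str, float]]  # gene -> {microRNA: w^gm}


def mean_gm_weight(gm: GeneMirna) -> float:
    """mu of Formula (12): mean weight over all gene-microRNA edges."""
    weights = [w for nbrs in gm.values() for w in nbrs.values()]
    return sum(weights) / len(weights)


def confidence(gi: str, gj: str, gm: GeneMirna, lam: float) -> float:
    """CF of Formula (11), based on the microRNAs shared by gi and gj."""
    common = set(gm.get(gi, {})) & set(gm.get(gj, {}))
    if not common:
        return lam * mean_gm_weight(gm)
    return sum(gm[gi][k] + gm[gj][k] for k in common) / (2 * len(common))


def mex(M: Set[str], S: Dict[str, Set[int]]) -> float:
    """MEX of Formula (2); 0.0 for an unmutated gene set is an assumption."""
    total = sum(len(S[g]) for g in M)
    return len(set().union(*(S[g] for g in M))) / total if total else 0.0


def edge_weight(gi: str, gj: str, adj: Dict[str, Set[str]],
                S: Dict[str, Set[int]], m: int, gm: GeneMirna,
                lam: float, theta: float) -> float:
    """w of Formula (15) for the PPI edge (gi, gj)."""
    me = (mex(adj[gi] | {gi}, S) + mex(adj[gj] | {gj}, S)) / 2  # (13), (14)
    if me < theta:
        return 0.0
    cov_i, cov_j = len(S[gi]) / m, len(S[gj]) / m  # COV of single genes
    return me * cov_i * cov_j * confidence(gi, gj, gm, lam)


# Toy data (hypothetical): path a - b - c, five samples, two microRNAs.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
S = {"a": {0, 1}, "b": {2}, "c": {0, 3}}
gm = {"a": {"m1": 0.6, "m2": 0.2}, "b": {"m1": 0.8}}
w_ab = edge_weight("a", "b", adj, S, 5, gm, lam=0.25, theta=0.7)
w_bc = edge_weight("b", "c", adj, S, 5, gm, lam=0.25, theta=0.7)
```

Here genes a and b share microRNA m1, so CF(a, b) is the mean of their two weights, whereas b and c share none and fall back to \(\lambda \mu \).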

Extracting features

Given an undirected weighted graph PP=(V, E, W), the node embedding algorithm Node2vec [22] is adopted to learn continuous feature representations of the vertices. The feature extraction can be formulated into a maximum likelihood optimization problem:

$$\begin{aligned}&\max _f\sum _{\begin{array}{c} v_i\in V \end{array}}\Bigg [-\log {\sum _{\begin{array}{c} v_j\in V\\ v_i\ne v_j \end{array}}\exp {(f(v_i)\times f(v_j))}}\nonumber \\&\quad +\sum _{\begin{array}{c} v_k\in N_s(v_i)\\ v_i\ne v_k \end{array}}f(v_k)\times f(v_i)\Bigg ], \end{aligned}$$
(16)

where f(\(v_x\)) denotes the d-dimensional feature vector representation of vertex \(v_x\) (\(v_x\in V\)) obtained from a biased random walk, and \(N_s\)(\(v_x\)) records the network neighbours of vertex \(v_x\) generated with the neighbourhood sampling strategy, i.e., the sequence of vertices in the walking path starting from vertex \(v_x\). Let \(v_p\), \(v_c\) and \(v_n\) denote three successive vertices in a walk; vertex \(v_n\) is then chosen with a conditional probability of P(\(v_n|v_c\)):

$$\begin{aligned} P(v_n|v_c)= {\left\{ \begin{array}{ll} \frac{\alpha _{pq}(v_p, v_n)\times w(v_c, v_n)}{Z}, &{}\text{ if } e(v_c, v_n)\in E,\\ 0, &{}\text{ otherwise }, \end{array}\right. } \end{aligned}$$
(17)

where Z is a normalization constant, and \(\alpha _{pq}\) is the bias parameter, ascertained as Formula (18):

$$\begin{aligned} \alpha _{pq}={\left\{ \begin{array}{ll} \frac{1}{p}, &{} \text{ if } {v_p=v_n},\\ 1, &{} \text{ if } {e(v_p, v_n)\in E},\\ \frac{1}{q}, &{} \text{ if } {e(v_p, v_n)\notin E}, \end{array}\right. } \end{aligned}$$
(18)

where parameters p and q bias the random walk towards Breadth-First Search (BFS) or Depth-First Search (DFS) behaviour.
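To make Formulas (17) and (18) concrete, the sketch below computes the transition distribution for one step of the biased walk, given the previous vertex \(v_p\) and the current vertex \(v_c\). The adjacency-dictionary layout, the function name, and the toy graph are assumptions for illustration; returning an empty distribution when all outgoing weights are zero is also an assumption, since Formula (15) can assign zero weights.

```python
from typing import Dict

WeightedGraph = Dict[str, Dict[str, float]]  # vertex -> {neighbour: weight}


def transition_probs(prev: str, cur: str, graph: WeightedGraph,
                     p: float, q: float) -> Dict[str, float]:
    """P(v_n | v_c) of Formula (17) for every neighbour v_n of cur."""
    unnormalized = {}
    for nxt, w in graph[cur].items():
        if nxt == prev:              # step back to the previous vertex
            alpha = 1.0 / p
        elif nxt in graph[prev]:     # v_n is adjacent to v_p
            alpha = 1.0
        else:                        # v_n is not adjacent to v_p
            alpha = 1.0 / q
        unnormalized[nxt] = alpha * w
    z = sum(unnormalized.values())   # normalization constant Z
    if z == 0:
        return {}
    return {v: x / z for v, x in unnormalized.items()}


# Toy graph (hypothetical): triangle a-b-c plus a pendant vertex d on b.
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0, "d": 2.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"b": 2.0},
}
probs = transition_probs("a", "b", graph, p=4, q=1)
```

With (p, q)=(4, 1), returning to a is penalized by 1/p, so the walk leaving b favours c and especially the heavier edge to d.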

Clustering genes

In this section, the DIANA hierarchical clustering algorithm [48] is applied to the weighted PPI network PP to generate a set of gene clusters. Suppose that F={f(\(v_1\)), f(\(v_2\)), ..., f(\(v_n\))} records a set of n d-dimensional feature vectors, where f(\(v_i\))\(\in F\) represents the feature vector of vertex \(v_i\in PP\). Given f(\(v_i\)), f(\(v_j\))\(\in F\), the Mahalanobis distance [46] \(D_m\)(\(v_i\), \(v_j\)) is adopted to measure the dissimilarity between vertices \(v_i\) and \(v_j\), as shown in Formula (19):

$$\begin{aligned} D_m(v_i, v_j)=\sqrt{(f(v_i)-f(v_j))^T\Sigma ^{-1}(f(v_i)-f(v_j))}, \end{aligned}$$
(19)

where \(\Sigma \) denotes the covariance matrix between vectors f(\(v_i\)) and f(\(v_j\)). The DIANA-based clustering algorithm is described in Algorithm 1.

Algorithm 1 Hierarchical Clustering Algorithm.
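The distance of Formula (19) can be sketched in plain Python as below. Two points are assumptions of this example rather than statements of the paper: \(\Sigma \) is estimated as the sample covariance of all feature vectors, and \(\Sigma ^{-1}\) is applied by solving a linear system with Gauss-Jordan elimination instead of forming the inverse. The clustering steps of Algorithm 1 themselves are not reproduced.

```python
import math
from typing import List

Vector = List[float]


def covariance(X: List[Vector]) -> List[Vector]:
    """Sample covariance matrix of the row vectors in X (assumed estimator)."""
    n, d = len(X), len(X[0])
    mean = [sum(x[k] for x in X) / n for k in range(d)]
    return [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in X) / (n - 1)
             for b in range(d)] for a in range(d)]


def solve(sigma: List[Vector], b: Vector) -> Vector:
    """Solve sigma * y = b by Gauss-Jordan elimination with partial pivoting."""
    d = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(sigma)]
    for c in range(d):
        piv = max(range(c, d), key=lambda r: abs(aug[r][c]))
        if abs(aug[piv][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(d):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][d] / aug[i][i] for i in range(d)]


def mahalanobis(u: Vector, v: Vector, sigma: List[Vector]) -> float:
    """D_m of Formula (19): sqrt((u - v)^T Sigma^{-1} (u - v))."""
    diff = [a - b for a, b in zip(u, v)]
    y = solve(sigma, diff)  # y = Sigma^{-1} (u - v)
    return math.sqrt(sum(a * b for a, b in zip(diff, y)))


# Toy feature vectors (hypothetical, d = 2).
F = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
sigma = covariance(F)
dist = mahalanobis(F[0], F[1], sigma)
```

With an identity covariance matrix the function reduces to the Euclidean distance, which gives a quick sanity check.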

Constructing driver modules

Based on the set of generated gene clusters P, a dropping and extracting algorithm is designed to construct the driver modules. Given a vertex \(v_i\in V\) of PPI network PP, let NI(\(v_i\)) measure the node influence of vertex \(v_i\), as defined in Formula (20):

$$\begin{aligned} NI(v_i)=\frac{\sum \nolimits _{v_k\in V}w(v_i, v_k)}{|Ne(v_i)|-1}\times COV(\{v_i\}). \end{aligned}$$
(20)

As depicted in Algorithm 2, the algorithm iteratively drops the vertices with the lowest node influence, and extracts connected components in each cluster of P. Specifically, the iteration does not stop until the total number of genes in P is less than or equal to Totalg (Step 2 to Step 23). Each iteration begins by dropping the L vertices with the lowest NI(\(\cdot \)) scores from P and PP, together with their incident edges in PP (Step 2 to Step 11). Then each cluster \(p_i\) in P is substituted with a set of connected components, each of which is extracted from \(p_i\) and has a minimum size of Mins (Step 13 to Step 23). The complete procedure is given in Algorithm 2.

Algorithm 2 Dropping and Extracting Algorithm.
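The following Python sketch gives one possible reading of the dropping and extracting process described above, together with the node influence of Formula (20). It is a simplified illustration and not Algorithm 2 itself: the loop structure, the tie-breaking by gene name, the treatment of isolated vertices (NI=0), and the use of every remaining edge for connectivity are assumptions.

```python
from typing import Dict, List, Set

WeightedGraph = Dict[str, Dict[str, float]]  # vertex -> {neighbour: weight}


def node_influence(v: str, graph: WeightedGraph, cov1: Dict[str, float]) -> float:
    """NI of Formula (20); |Ne(v)| - 1 equals the degree of v."""
    degree = len(graph[v])
    if degree == 0:
        return 0.0  # assumption: isolated vertices have no influence
    return sum(graph[v].values()) / degree * cov1[v]


def components(cluster: Set[str], graph: WeightedGraph) -> List[Set[str]]:
    """Connected components of the subgraph induced by one cluster."""
    seen, result = set(), []
    for start in cluster:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in graph[u]:
                if w in cluster and w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(comp)
    return result


def drop_and_extract(P: List[Set[str]], graph: WeightedGraph,
                     cov1: Dict[str, float], total_g: int,
                     min_size: int, L: int = 1) -> List[Set[str]]:
    graph = {u: dict(nbrs) for u, nbrs in graph.items()}  # work on a copy
    P = [set(c) for c in P]
    while sum(len(c) for c in P) > total_g:
        genes = [g for c in P for g in c]
        lowest = sorted(genes, key=lambda g: (node_influence(g, graph, cov1), g))[:L]
        for g in lowest:  # drop the vertices and their incident edges
            for nb in graph.pop(g):
                graph.get(nb, {}).pop(g, None)
        P = [c - set(lowest) for c in P]
        P = [comp for c in P for comp in components(c, graph)
             if len(comp) >= min_size]
    return P


# Toy data (hypothetical): a path a-b-c-d-e forming a single cluster.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0, "d": 1.0},
         "d": {"c": 1.0, "e": 1.0}, "e": {"d": 1.0}}
cov1 = {"a": 0.5, "b": 0.5, "c": 0.1, "d": 0.5, "e": 0.5}  # COV({v}) per gene
modules = drop_and_extract([{"a", "b", "c", "d", "e"}], graph, cov1,
                           total_g=4, min_size=2)
```

In the toy example, gene c has the lowest node influence; dropping it splits the cluster into the two connected modules {a, b} and {d, e}.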

Experiment results and analysis

To test the performance of method ICDM-GEHC, extensive experiments were conducted on real cancer datasets. The TCGA pan-cancer somatic aberration data were acquired from Ahmed et al. [1], consisting of 3110 cancer samples and 11565 genes of 12 cancer types. A widely used H. sapiens PPI network, HINT+HI2012 [13, 42, 60], was adopted, which contained 9859 vertices and 40705 edges. The genes present in both the somatic mutation data and the PPI network were retained, and the processed data are as follows: cancer sample number m=3110, gene number n=6930, edge number |E|=25251. The gene-microRNA network, obtained by querying the mirDIP database [51] with the 6930 genes, consisted of 229,135 interactions between 6145 genes and 2734 microRNAs.

We first tested the ICDM-GEHC method under different parameter settings, then compared its performance with four cutting-edge identification methods based on prior knowledge, i.e., Hotnet2 [42], MEXCOwalk [1], HMCEwalk [56], and ECSwalk [54]. All the experiments were performed on a workstation with an Intel i7-7700 CPU, 24 GB RAM, Windows 10, and Python 3.9.12.

Parameter settings

The parameter settings adopted in the comparison methods were consistent with the literature [1, 42, 54, 56]: the mutual exclusivity threshold \(\theta \)=0.7, the probability \(\beta \)=0.4, and the minimum module size Mins=3. The total number of genes Totalg was set to {100, 200, ..., 2000} for methods Hotnet2, MEXCOwalk and ECSWalk, and {100, 200, ..., 900} for method HMCEwalk. In the ICDM-GEHC method, some parameters were set to the optimal values described in the related literature, i.e., \(\theta \)=0.7, Mins=3, and Node2vec parameters (p, q)=(4, 1) [1, 22]. In addition, a number of pre-experiments were performed to determine appropriate values for the other parameters required by method ICDM-GEHC.

In the experiments of determining clustering number K, the candidate values of K are calculated as Formula (21):

$$\begin{aligned} K=PNum\left( \bigg \lceil \frac{n}{ms}\bigg \rceil \right) , \end{aligned}$$
(21)

where function PNum(x) returns the smallest prime number not less than x, and \(ms\in \){10, 20, ..., 90} denotes the presumptive size of a module. Therefore, for n=6930, the candidate values are \(K\in \){701, 347, 233, 179, 139, 127, 101, 89, 79}. The other parameters were tested as follows: \(\lambda \in \){0.25, 0.5, 0.75, 1}, \(d\in \){16, 32, 48, 64, 80, 96}, \(L\in \){1, 2, 3}, \(wl\in \){20, 40, 60, 80, 100}, \(nw\in \){100, 200, 300, 400, 500} (wl and nw are two important parameters used in algorithm Node2vec, where wl represents the walk length, and nw denotes the number of walks per node [22]). Figure 2a–f displays the DMSS scores under different parameter settings. Based on the pre-experimental results, method ICDM-GEHC adopts the following parameter settings: \(\lambda \)=0.25, d=48, wl=80, nw=400, K=89, L=1.
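The candidate values of K can be reproduced with the short sketch below. It assumes that PNum(x) is the smallest prime not less than x, which is the reading that matches the listed candidates (for ms=10, \(\lceil 6930/10\rceil \)=693 maps to 701, although 691 is a closer prime); the function names are illustrative.

```python
import math
from typing import Iterable, List


def is_prime(x: int) -> bool:
    """Trial division, sufficient for the small values used here."""
    if x < 2:
        return False
    return all(x % k for k in range(2, math.isqrt(x) + 1))


def pnum(x: int) -> int:
    """Smallest prime number not less than x (assumed meaning of PNum)."""
    while not is_prime(x):
        x += 1
    return x


def candidate_k(n: int, sizes: Iterable[int]) -> List[int]:
    """Formula (21): K = PNum(ceil(n / ms)) for each presumptive size ms."""
    return [pnum(-(-n // ms)) for ms in sizes]  # -(-n // ms) is ceil(n / ms)


ks = candidate_k(6930, range(10, 100, 10))
```

For n=6930 and ms=10, 20, ..., 90, this yields the nine candidate values listed above.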

Fig. 2 The Driver Module Set Score (DMSS) with different parameter settings for ICDM-GEHC

Static evaluation

In this section, static evaluations were conducted in terms of a pair of reference gene sets, namely the COSMIC Cancer Gene Census (CGC) database [17] and the Network of Cancer Genes (NCG) [15]. Following previous literature, both the Receiver Operating Characteristic (ROC) curve [7] and Fold Enrichment analysis [2] were adopted to evaluate the capability of detecting known cancer genes, i.e., by comparing the union of genes in all recognized modules with a cancer reference gene set.

(1) Receiver Operating Characteristic Curve (ROC)

The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various Totalg settings, i.e., each point on the curve indicates a pair of TPR and FPR obtained with a given Totalg. TPR (resp. FPR) is the ratio of the number of identified reference (resp. non-reference) genes to the total number of reference (resp. non-reference) genes, as shown in Formula (22) (resp. Formula (23)). TPR is the sensitivity of the method and FPR equals one minus its specificity; together they reflect the robustness of the method [6].

$$\begin{aligned} TPR= & {} \frac{TP}{TP+FN}, \end{aligned}$$
(22)
$$\begin{aligned} FPR= & {} \frac{FP}{FP+TN}, \end{aligned}$$
(23)

where TP (resp. FP) counts the number of reference genes (resp. non-reference genes) identified as cancer-related genes, and TN (resp. FN) counts the number of non-reference genes (resp. reference genes) that are not identified as cancer-related genes.
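One ROC point per Totalg setting can be computed from gene sets as sketched below. The set-based interface and the toy gene names are assumptions for illustration; the universe is taken to be all genes under consideration.

```python
from typing import Set, Tuple


def tpr_fpr(identified: Set[str], reference: Set[str],
            universe: Set[str]) -> Tuple[float, float]:
    """Formulas (22) and (23) for one set of identified genes."""
    tp = len(identified & reference)             # reference genes identified
    fp = len(identified - reference)             # other genes identified
    fn = len(reference - identified)             # reference genes missed
    tn = len(universe - identified - reference)  # other genes not identified
    return tp / (tp + fn), fp / (fp + tn)


# Toy data (hypothetical): 10 genes, 4 reference genes, 3 identified genes.
universe = {f"g{i}" for i in range(10)}
reference = {"g0", "g1", "g2", "g3"}
identified = {"g0", "g1", "g4"}
tpr, fpr = tpr_fpr(identified, reference, universe)
```

Repeating the computation for the module sets obtained with each Totalg value gives the sequence of (FPR, TPR) points that forms the curve.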

(2) Fold Enrichment analysis

Fold Enrichment measures the ratio between the proportion of reference genes among the identified genes (Recovered/Selected) and the proportion of reference genes among all genes (Reference/All), calculated as in Formula (24):

$$\begin{aligned} Fold\;enrichment=\frac{Recovered\times All}{Reference\times Selected}, \end{aligned}$$
(24)

where Reference counts the number of genes in the reference gene set, Recovered counts how many genes in the reference gene set are identified, All counts the number of genes (vertices) in the HINT+HI2012 network, and Selected denotes the number of recognized genes. The CGC and the NCG databases contained 616 and 591 genes, respectively, of which 436 and 410 were included in both the somatic mutation data and the HINT+HI2012 network. Therefore, Reference was set to 436 for the CGC database, and 410 for the NCG one.
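Formula (24) amounts to a single expression, sketched below. The Recovered and Selected values in the usage line are hypothetical, and using 6930 for All (the number of retained genes) is an assumption made only for this example; Reference=436 is the CGC value given above.

```python
def fold_enrichment(recovered: int, reference: int,
                    selected: int, all_genes: int) -> float:
    """Formula (24): (Recovered / Selected) divided by (Reference / All)."""
    return (recovered * all_genes) / (reference * selected)


# Hypothetical case: 50 of 100 selected genes belong to the CGC reference set.
fe = fold_enrichment(recovered=50, reference=436, selected=100, all_genes=6930)
```

A value of 1 corresponds to a selection that contains reference genes at the same rate as the whole network, so larger values indicate stronger enrichment.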

In Fig. 3, the ROC curves of methods Hotnet2, MEXCOwalk, ECSWalk, HMCEwalk, and ICDM-GEHC are compared based on databases CGC and NCG, where Totalg=100, 200, ..., 2000. The vertical dotted lines indicate the ROC values when Totalg=900 and 2000, respectively. The bracketed data in the legends represent the area under the ROC curve (the AUC value), where \(A_1\) (resp. \(A_2\)) denotes the AUC value for Totalg between 100 and 900 (resp. between 100 and 2000). From this figure, we can observe that the ICDM-GEHC method achieves better identification performance than the other four methods, for it produces the steepest curve among the five approaches. Taking the CGC database as an example, when Totalg ranges from 100 to 900, method ICDM-GEHC acquires the highest AUC value among the five methods. When Totalg ranges from 100 to 2000, the AUC value of method ICDM-GEHC is 0.073, which is still higher than those of methods Hotnet2 (0.055), MEXCOwalk (0.069), and ECSWalk (0.061).

Fig. 3 The comparison of ROC curve among different methods

Tables 1 and 2 compare the fold enrichment analysis results based on databases CGC and NCG, respectively. It can be seen from the two tables that the fold enrichment obtained by the ICDM-GEHC method is higher than or equal to that acquired by the other four methods in most cases.

Table 1 Fold enrichment analysis on the CGC dataset
Table 2 Fold enrichment analysis on the NCG dataset

In addition, the performance of detecting low mutation frequency genes was also evaluated for the five methods, i.e., the above analysis was implemented again under the condition that the reference gene has a mutation rate lower than 1% or 2%. For the CGC database, there are 291 genes with mutation rates lower than 1%, and 374 genes with mutation rates lower than 2%. For the NCG database, there are 266 and 352 genes accordingly.

Tables 3, 4, 5 and 6 display the fold enrichment analysis results of the various approaches on the two databases. From these tables, we can observe that the ICDM-GEHC method exhibits better performance in most cases for reference gene frequency\(\le \)2%, while it does not show a significant advantage when the reference gene frequency is \(\le \)1%.

Table 3 Fold enrichment analysis on the CGC dataset (reference gene frequency\(\le \)1%)
Table 4 Fold enrichment analysis on the NCG dataset (reference gene frequency\(\le \)1%)
Table 5 Fold enrichment analysis on the CGC dataset (reference gene frequency\(\le \)2%)
Table 6 Fold enrichment analysis on the NCG dataset (reference gene frequency\(\le \)2%)

Table 7 compares the total execution time among the five approaches under the condition that Totalg=100, 200, ..., 900. The experiment results indicate that method ICDM-GEHC takes the longest time among these methods, followed by methods ECSWalk and HMCEwalk, and the least time-consuming methods are Hotnet2 and MEXCOwalk.

Table 7 The execution time under different Totalg (second)

Modular evaluation

As mentioned above, the static evaluation assesses the performance of identification methods in terms of the union of genes within the detected modules. In this section, modular evaluations were further performed to assess the specific identified modules and their interrelationships based on two indexes, the Driver Module Set Score (DMSS) and the Cancer Type Specificity Score (CTSS) [1].

The CTSS score is adopted to estimate the cancer-type specificity of a group of identified modules P={\(M_1\), \(M_2\), ..., \(M_r\)}. Given \(M_i\in P\) and cancer type t, let \(S_{M_i}\) represent the set of samples that have at least one mutated gene belonging to module \(M_i\), i.e., \(S_{M_i}\)=\(\bigcup _{\forall g_j\in M_i}S_j\). Assume that \(S^t\) and \(S_{M_i}^t\) denote the subsets of samples in S and \(S_{M_i}\) diagnosed with cancer type t, respectively. The probability \(p_i^t\) is calculated with a Fisher’s exact test from the four values \(|S_{M_i}^t|\), \(|S^t-S_{M_i}^t|\), \(|S_{M_i}-S_{M_i}^t|\), and \(|S-S^t|-|S_{M_i}-S_{M_i}^t|\). It estimates whether module \(M_i\) is specific to cancer type t, and is used to calculate the CTSS score of P as follows:

$$\begin{aligned} CTSS(P)=-\frac{\sum _{M_i\in P}{\log (\min _{\forall t}(p_i^t))}}{r} \end{aligned}$$
(25)

where \(p_i^t\) is calculated as follows:

$$\begin{aligned} p_i^t=\frac{\begin{pmatrix} |S^t| \\ |S_{M_i}^t| \end{pmatrix} \begin{pmatrix} |S-S^t| \\ |S_{M_i}-S_{M_i}^t| \end{pmatrix} }{\begin{pmatrix} |S| \\ |S_{M_i}| \end{pmatrix}} \end{aligned}$$
(26)
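Formulas (25) and (26) can be evaluated as in the sketch below. It implements the point probability exactly as written in Formula (26); a full Fisher's exact test would additionally sum over more extreme tables. The natural logarithm, the data layout, and the toy samples are assumptions of this example, and a probability that underflows to zero would make the logarithm fail.

```python
import math
from typing import Dict, List, Set


def point_probability(n_t: int, n_mt: int, n_all: int, n_m: int) -> float:
    """Formula (26) with n_t=|S^t|, n_mt=|S_M^t|, n_all=|S|, n_m=|S_M|."""
    return (math.comb(n_t, n_mt) * math.comb(n_all - n_t, n_m - n_mt)
            / math.comb(n_all, n_m))


def ctss(module_samples: List[Set[str]], sample_type: Dict[str, str]) -> float:
    """Formula (25); the natural logarithm is an assumption."""
    n_all = len(sample_type)
    types = sorted(set(sample_type.values()))
    total = 0.0
    for s_m in module_samples:  # s_m is S_{M_i}, the samples covered by M_i
        p_min = min(
            point_probability(
                sum(1 for t2 in sample_type.values() if t2 == t),
                sum(1 for s in s_m if sample_type[s] == t),
                n_all, len(s_m))
            for t in types)
        total += math.log(p_min)
    return -total / len(module_samples)


# Toy data (hypothetical): six samples of two cancer types, two modules.
sample_type = {"s0": "A", "s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "B"}
module_samples = [{"s0", "s1", "s2"}, {"s0", "s3"}]
score = ctss(module_samples, sample_type)
```

The first toy module covers exactly the type A samples and therefore receives a small probability (1/20), whereas the second module is spread evenly over both types (9/15).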

In Fig. 4, the DMSS scores obtained by the five methods are illustrated. From this figure, we can observe that the ICDM-GEHC method obtains the highest DMSS score among the five recognition approaches under different Totalg settings. This demonstrates that the modules identified by method ICDM-GEHC exhibit better coverage and mutual exclusivity than those recognized by the other four methods.

Fig. 4 The comparison results of DMSS scores

Figure 5 compares the CTSS scores obtained by the five approaches. Following Ahmed et al. [1], COLON and RECTAL tumors were grouped together, so that 11 cancer types instead of 12 were used. From this figure, we can observe that the ICDM-GEHC method again exhibits the best overall performance in terms of the CTSS score: it acquires a higher CTSS score than the other methods for each Totalg setting except Totalg=200, 300, 400, suggesting that its output modules are significantly enriched for specific cancer types.

Fig. 5 The comparison results of CTSS score

Analysis of ICDM-GEHC modules

Figure 6a illustrates the eight modules detected by method ICDM-GEHC when Totalg=100. The module sizes range from 3 to 25, and the coverage of the modules ranges from 5.24% to 69.99%. Node sizes are proportional to gene mutation frequencies, indicating that each gene identified by method ICDM-GEHC has a mutation frequency greater than zero. An edge is colored black if it connects two genes belonging to the same module, and grey otherwise. The thickness of a line is proportional to the edge weight. The color of a module represents the cancer type that has the highest enrichment for mutations in genes of that module. Each module is named after the gene with the highest mutation frequency in that module. Figure 6b exhibits the cancer type specificity, where the rows represent modules, the columns represent cancer types, and the colors of entries indicate the significance of enrichment for cancer types in terms of Fisher’s exact test p-values.

Fig. 6 a The modules produced by the ICDM-GEHC method when Totalg=100. The legend for the color codes is displayed on the right. b The results of cancer type specificity

In Fig. 7, the heat maps of gene mutations in three modules of different sizes are displayed. Each column represents a cancer sample (samples without mutations in any gene of the three modules are not shown), and each row denotes a gene. The top-left scale indicates the number of mutations in a sample. The bottom-left scale and the right one denote the proportion and the number of samples with mutations in a certain gene, respectively. Distinct colors denote distinct kinds of cancer samples. The same conventions apply in Fig. 8 (see Appendix A). As can be seen from this figure, the number of samples covered by a module increases markedly with module size, i.e., 253 for the CCNE1 module, 523 for the KAT6A module, and 1007 for the EGFR module. Although the three modules exhibit satisfactory mutual exclusivity, i.e., most samples mutate in just one gene of the module, the exclusivity deteriorates as module size increases: the proportion of samples having more than one mutated gene is 9.52% for the CCNE1 module, 19.69% for the KAT6A module, and 37.91% for the EGFR module. Furthermore, the central gene, which has the highest coverage in a module, is not always the one with the greatest degree.

Fig. 7 The heat maps of gene mutations on three different sizes of modules

As displayed in Fig. 6a, most of the detected modules, i.e., those centered at TP53, CCND1, MYC, PIK3CA, EGFR, and KAT6A, are part of pathways known to be associated with carcinogenesis. In the following analysis, the referred biological pathways are acquired from the KEGG database (https://www.genome.jp/kegg/).

The genes within module TP53 are primarily engaged in four cancer-related pathways: p53 signaling pathway (PTEN, CDKN2A, TP53, CDK4), Neurotrophin signaling pathway (BRAF, AKT1, TP53), Pathways in cancer (PTEN, BRAF, CDKN2A, TP53, CDK4, AKT1), and PI3K-Akt signaling pathway (PTEN, AKT1, CDK4, TP53). The module is involved in eleven cancer types, with KIRC (Kidney Clear Cell Carcinoma) being the most specific one, for it possesses the lowest Fisher’s exact test p-value, 3.19e\(-\)80. As the central gene of the TP53 module, TP53 exhibits a high mutation frequency (41.51%) across all samples, while it has a comparatively low mutation rate (1.91%) in the KIRC samples. This is consistent with the report that gene TP53 mutates at a relatively low frequency in KIRC [53]. In this module, genes TP53 and PTEN share the highest edge weight, and demonstrate moderately strong mutual exclusivity, i.e., they mutate in 1259 and 317 samples, respectively, and mutate simultaneously in 95 samples. In addition, several cancer-related genes with low mutation frequencies (<1%), connected directly with gene TP53 or gene PTEN, are recognized in this module, such as WRN, WWOX, HIPK2, BMP1, RNF20 and ZNF384. WRN takes part in the replication, repair, and recombination of DNA, and loss-of-function mutations in WRN result in genetic instability and cancer [9]. WWOX has been suggested to exert tumor suppressive activity, and the suppression of its expression can make cancer cells resistant to death [50]. HIPK2 could represent a significant prognostic marker, and even a therapeutic target [8]. Studies have identified that BMP1 is engaged in the progression of renal cancer as an independent predictor of prognosis in clear cell renal cell carcinoma (ccRCC) patients [57], and that RNF20 overexpression inhibits ccRCC cell proliferation through downregulation of SREBP1c [41]. ZNF384 participates in the genesis and development of tumors as a significant signal molecule [27].

The CCND1 module has the same specific cancer type as the TP53 module. Its genes are engaged in three cancer-related pathways: p53 signaling pathway (MDM2, MDM4, CCND1, CDK6), Pathways in cancer (RB1, MDM2, E2F3, CCND1, CDK6), and PI3K-Akt signaling pathway (MDM2, CCND1, CDK6). CTNNB1 shares relatively large edge weights with two cancer genes, RB1 and CDK6. The three genes exhibit extremely high mutual exclusivity: they mutate in 89, 117 and 47 samples, respectively, while only 5 samples have more than two mutations of them.

Both the MYC and SMAD4 modules are specific for the cancer type CRC (Colorectal Cancer). Besides Colorectal cancer, several genes (MYC, APC, CTNNB1, TCF7L2) in the MYC module are also engaged in the following cancer-related pathways: Pathways in cancer, Hippo signaling pathway, Wnt signaling pathway, and Endometrial cancer. In addition, the former three genes exhibit the top three coverage values in this module (as shown in Fig. 1d) and have satisfactory mutual exclusivity. They mutate in 283, 226 and 89 samples, respectively, while just 32 samples have more than two mutations among them. Three genes in the MYC module, i.e., NUP153, FAM214A, and MYO6, demonstrate low mutation frequency while covering many kinds of cancers. For example, MYO6 mutates in ten kinds of cancers, while its mutation frequency is only 0.96%. MYO6 has been verified to be an important substance linking miRNA, circRNA, and glucose metabolism in colorectal cancer [29, 44].

The PIK3CA module is most specific for the cancer type UCEC (Uterine corpus endometrial carcinoma). Genes of this module (PIK3CA, PIK3R1, KIT, FGFR1, ARHGEF11) are involved in four pathways: Rap1 signaling pathway, PI3K-Akt signaling pathway, Pathways in cancer, and Breast cancer. The central gene PIK3CA has a much higher coverage in the UCEC samples than in the whole sample set, i.e., it mutates in UCEC samples with a frequency of 50%, but in the whole sample set with a frequency of 19%. It has been verified that the proportion of patients with PIK3CA mutations is very high in UCEC [28]. Furthermore, as the pair of genes with the highest coverage in the module, PIK3CA and PIK3R1 also display strong mutual exclusivity. Although they mutate in 596 and 152 samples, respectively, only 18 samples are mutated in both of them. Genes FGFR1 and ARHGEF11 cover many kinds of cancers while having low mutation frequency, i.e., they mutate in six and eleven types of cancers, respectively, with a mutation frequency of about 0.9%.

The EGFR module involves eleven cancer types, with GBM (Glioblastoma) being the most specific one, since it has the lowest Fisher’s exact test p-value, \(1.49 \times 10^{-31}\). The module contains fourteen genes, eleven of which are engaged in the following seven pathways: FoxO signaling pathway (EGFR, SOS1, IGF1R, IRS1, INSR), MicroRNAs in cancer (EGFR, ERBB2, SOS1, PDGFRB, PDGFRA, IRS1), Rap1 signaling pathway (EGFR, PDGFRB, PDGFRA, IGF1R, INSR, CDH1, TLN1), MAPK signaling pathway (EGFR, ERBB2, PDGFRB, PDGFRA, IGF1R, INSR, SOS1, ERBB4), PI3K-Akt signaling pathway (EGFR, ERBB2, PDGFRB, PDGFRA, IGF1R, INSR, SOS1, ERBB4, IRS1), Pathways in cancer (EGFR, ERBB2, PDGFRB, PDGFRA, IGF1R, SOS1, CDH1), and Prostate cancer (EGFR, ERBB2, SOS1, PDGFRB, IGF1R, PDGFRA). Three of these pathways have been reported to be closely related to GBM. Alterations in the Rap1 signaling pathway are significant in the progression of certain glioblastomas [23]. The MAPK pathway plays an important role in the co-activation of cell proliferation and CREB, which is an essential regulator of cyclin-D1 expression in GBM cells [37]. PI3K/AKT signaling has been regarded as one of the most frequently deregulated pathways in glioblastoma, and its suppression has been acknowledged as a prospective therapeutic target for glioma [43].
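
The cancer-type specificity reported above rests on Fisher’s exact test. The following is a minimal sketch of how such a p-value could be computed, not the authors' implementation: the layout of the 2x2 contingency table (cancer type versus other samples, module mutated versus not mutated), the choice of the enrichment tail, the function name `fisher_enrichment_p`, and the counts in the example are all assumptions made for illustration. A library routine such as `scipy.stats.fisher_exact` could be used instead.

```python
from math import comb


def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test (enrichment tail) for the table

                       module mutated   module not mutated
        cancer type          a                 b
        other samples        c                 d

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    The table layout is an assumption, not taken from the paper.
    """
    row1 = a + b          # samples of the cancer type
    col1 = a + c          # samples in which the module is mutated
    n = a + b + c + d     # all samples
    k_max = min(row1, col1)
    # Sum the hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, k_max + 1))
    return tail / comb(n, col1)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    p = fisher_enrichment_p(a=30, b=10, c=70, d=390)
    print(f"one-sided p-value: {p:.3e}")
```

Under this assumed layout, the cancer type with the smallest p-value across all types would be reported as the most specific one for the module. A test for depletion rather than enrichment would sum the opposite tail.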

The genes in the KAT6A module are primarily engaged in the following two cancer-related pathways: Pathways in cancer (EP300, CREBBP, NFE2L2, NCOA3, RUNX1, KEAP1) and Thyroid hormone signaling pathway (EP300, CREBBP, NCOA3). The most specific cancer type of this module is LUSC (Lung Squamous Cell Carcinoma). The central gene KAT6A exhibits a much higher mutation rate in LUSC samples (about 10.6%) than in the whole sample set (about 4.9%). It has been suggested that KAT6A plays an oncogenic role in LUSC [47]. In addition, the EGR1 gene, which has the lowest mutation frequency in the module, has been confirmed to play a tumor-suppressive role in this cancer [62].

The most specific cancer type of the CCNE1 module is OV (Ovarian Cancer). The centering gene CCNE1 shares the highest edge weight with gene FBXW7. They exhibit a high degree of mutual exclusivity, mutating in 145 and 86 patients respectively, and mutating in just 5 patients simultaneously. The amplification of CCNE1 has been demonstrated as a major oncogenic driver in a subset of high-grade serous ovarian cancer [20]. FBXW7 has been identified to inhibit angiogenesis, migration, and invasion of ovarian cancer cells by inhibiting VEGF expression through inactivating \(\beta \)-catenin signaling [39, 66].
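
The pairwise mutation counts quoted in the module descriptions above can be summarized with a simple calculation. The sketch below is illustrative only: the "exclusive fraction" (samples mutated in exactly one of the two genes divided by samples mutated in at least one) is an assumed measure and not necessarily the mutual exclusivity score used by ICDM-GEHC, and the function name `pair_stats` is hypothetical. The counts are those given in the text.

```python
def pair_stats(n_a: int, n_b: int, n_both: int) -> tuple[int, float]:
    """Return (covered, exclusive_fraction) for a gene pair.

    covered            = samples mutated in at least one of the two genes
    exclusive_fraction = samples mutated in exactly one gene / covered
    """
    covered = n_a + n_b - n_both
    exclusive = covered - n_both
    return covered, exclusive / covered


# (samples mutated in gene 1, in gene 2, in both), as reported in the text.
pairs = {
    ("TP53", "PTEN"): (1259, 317, 95),
    ("PIK3CA", "PIK3R1"): (596, 152, 18),
    ("CCNE1", "FBXW7"): (145, 86, 5),
}

for (g1, g2), counts in pairs.items():
    covered, frac = pair_stats(*counts)
    print(f"{g1}/{g2}: covered={covered}, exclusive fraction={frac:.3f}")
```

Under this assumed measure, the TP53/PTEN pair scores about 0.936, while PIK3CA/PIK3R1 and CCNE1/FBXW7 score about 0.975 and 0.978, which matches the ordering of the qualitative descriptions given for these pairs.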

The output modules of the ICDM-GEHC method were further compared with those produced by the other four methods, with Totalg=100. All eight modules produced by ICDM-GEHC contain oncogenes known in COSMIC, whereas each of the four comparison approaches generates at least one module that does not contain any gene in the COSMIC database (8 out of 19 for Hotnet2, 1 out of 12 for MEXCOwalk, 2 out of 16 for ECSWalk, and 1 out of 10 for HMCEwalk). Methods Hotnet2, MEXCOwalk, ECSWalk, HMCEwalk, and ICDM-GEHC identified 32, 48, 52, 26, and 52 known oncogenes recorded in the COSMIC database, respectively. Among the 100 genes recognized by ICDM-GEHC, 18, 44, 40, and 16 genes are also identified by Hotnet2, MEXCOwalk, ECSWalk, and HMCEwalk, respectively. Furthermore, 40 genes identified by ICDM-GEHC are omitted by the other four methods; 14 of these genes have been recorded in the COSMIC database, and 18 of them have been confirmed to be involved in the development and progression of cancers, or to be engaged in cancer-related pathways (the gene lists are given in Appendix B).
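
The overlap figures in this comparison amount to set operations on the gene lists of each method. A minimal sketch follows, assuming each method's output has been flattened into a set of gene symbols; the function name `compare_gene_sets` and the toy gene identifiers (`G1`, `G2`, ...) are hypothetical placeholders, not data from the study.

```python
def compare_gene_sets(
    target: set[str],
    others: dict[str, set[str]],
    known: set[str],
) -> tuple[dict[str, int], set[str], set[str]]:
    """Compare one method's gene set against several others.

    Returns:
      shared       - per-method count of genes also found by `target`
      unique       - genes found by `target` and by none of the others
      unique_known - subset of `unique` present in the reference set `known`
                     (e.g., a list of database-recorded cancer genes)
    """
    shared = {name: len(target & genes) for name, genes in others.items()}
    found_elsewhere = set().union(*others.values())
    unique = target - found_elsewhere
    return shared, unique, unique & known


if __name__ == "__main__":
    # Toy placeholder data, for illustration only.
    target = {"G1", "G2", "G3", "G4", "G5"}
    others = {"methodA": {"G1", "G2", "G9"}, "methodB": {"G2", "G3"}}
    known = {"G1", "G4", "G9"}
    shared, unique, unique_known = compare_gene_sets(target, others, known)
    print(shared, sorted(unique), sorted(unique_known))
```

With the real gene lists, `shared` would correspond to the per-method overlap counts, `unique` to the genes omitted by all comparison methods, and `unique_known` to those among them that are recorded in the reference database.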

Conclusion

Identifying driver modules is both challenging and significant, as it contributes to cancer research. In this study, a novel method, ICDM-GEHC, was devised. It begins by constructing a weighted PPI network with the help of somatic mutation profiles as well as gene-microRNA networks. The vertices are then represented by their extracted feature vectors and clustered into a set of gene clusters. Finally, a heuristic process is conducted to produce a group of driver gene modules. Comparison experiments were performed among methods Hotnet2, MEXCOwalk, ECSWalk, HMCEwalk, and ICDM-GEHC using real biological data. In most cases, the ICDM-GEHC method outperforms the other methods in identifying cancer-related genes and in producing modules that have relatively high coverage and mutual exclusivity and are significantly associated with specific cancer types. Most genes within the generated modules are engaged in critical cancer-related pathways, or have been verified to be oncogenes or tumor suppressors. In addition, the ICDM-GEHC method detected many cancer-related genes that were omitted by the four comparison methods. These observations are supported by a number of experiments. Consequently, the ICDM-GEHC method may be regarded as a helpful supplemental tool for recognizing cancer driver modules.

Although the ICDM-GEHC method presents good identification performance by applying advanced machine learning techniques to multi-omics data, it does have some notable limitations. In this method, only somatic mutation data are adopted; other genetic aberrations such as epigenetic changes, copy number variations, translocations, and fusions can be considered in an extended version of the method. In addition, since PPI networks may vary across different cell types, tissue types, environmental conditions, and time points, a dynamic network should be adopted to replace the static one, so as to increase the flexibility and reliability of the method. The experiments also revealed that the method incurs a high computational cost. Future efforts should be devoted to further enhancing its efficiency by optimizing parameters, simplifying the algorithm, and improving the module refinement.