Data Mining for Systems Biology, pp 1–8

# Dense Module Enumeration in Biological Networks

## Abstract

Automatic discovery of functional complexes from protein interaction data is a rewarding but challenging problem. While previous approaches use approximations to extract dense modules, our approach exactly solves the problem of dense module enumeration. Furthermore, constraints from additional information sources such as gene expression and phenotype data can be integrated, so we can systematically detect dense modules with interesting profiles. Given a weighted protein interaction network, our method discovers all protein sets that satisfy a user-defined minimum density threshold. We employ a reverse search strategy, which allows us to exploit the density criterion in an efficient way.

### Key words

Protein complex, Dense module enumeration, Reverse search, Gene expression, Protein interaction

## 1 Introduction

Today, a large number of databases provide access to experimentally observed protein–protein interactions. The analysis of the corresponding protein interaction networks can be useful for functional annotation of previously uncharacterized genes as well as for revealing additional functionality of known genes. Often, function prediction involves an intermediate step where clusters of densely interacting proteins, called modules, are extracted from the network; the dense subgraphs are likely to represent functional protein complexes (1). However, the experimental methods are not always reliable, which means that the interaction network may contain false positive edges. Therefore, confidence weights of interactions should be taken into account.

A natural criterion that combines these two aspects is the average pairwise interaction weight within a module (assuming a weight of zero for unobserved interactions, cf. (2)). We call this the module *density*, in analogy to unweighted networks (3). We present a method to enumerate all modules that exceed a given density threshold. It solves the problem efficiently via a simple and elegant reverse search algorithm, extending the unweighted network approach in (4).

In recent years, many module finding approaches which integrate protein–protein interaction networks with other gene-related data have been published. One strategy, often used in the context of partitioning methods, is to build a new network whose edge weights are determined by multiple data sources (12). Tanay et al. (13) also create one single network to analyze multiple genomic data at once; however, they use a bipartite network where each edge corresponds to one data type only. In both cases, the different data sets have to be normalized appropriately before they can be integrated. In contrast to that, other approaches keep the data sources separate and define individual constraints for each of them. Consequently, arbitrarily many data sets can be jointly analyzed without the need to take care of appropriate scaling or normalization. Within this class of approaches, there exist two main strategies to deal with profile data like gene expression measurements. In the first case, the profile information is transformed into a gene similarity network, where the strength of a link between two genes represents the global similarity of their profiles (2, 14, 15). In the second case, the condition-specific information is kept to perform a context-dependent module analysis (16, 17, 18). Our approach follows along this line, searching for modules in the protein interaction network that have consistent profiles with respect to a subset of conditions. In contrast to the previous methods, our algorithm systematically identifies all modules satisfying a density criterion and optional consistency constraints.

## 2 Materials

## 3 Methods

### 3.1 Enumeration of Dense Modules

We would like to enumerate all subsets *U* with \( \rho (U) \ge \theta \), where *θ* is a prespecified constant.
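For reference, the density criterion can be written out explicitly. The following is only a sketch derived from the verbal definition in the Introduction (average pairwise interaction weight, with weight zero for unobserved interactions); the notation \( w_{ij} \) for the entries of a symmetric weight matrix *W* is an assumption made here.

```latex
\[
  \rho_W(U) \;=\; \frac{\sum_{i,j \in U,\; i<j} w_{ij}}{|U|\,(|U|-1)/2}
\]
```

For instance, a module of three proteins with pairwise weights 1, 1, and 0.5 has density \( 2.5/3 \approx 0.83 \). When *W* is clear from the context, the subscript is dropped and the density is written \( \rho(U) \).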

A straightforward search tree over node subsets is in general not *anti-monotonic* with respect to the density. Namely, the density does not necessarily decrease monotonically when the tree is traversed from the root to a leaf. This prevents early pruning and makes the enumeration difficult. However, there does exist a search tree which is anti-monotonic (Fig. 3c). It can be constructed by reverse search.

In reverse search, the search tree is specified by defining a *reduction map* \( f(U) \) which transforms a child to its parent. In our case, the parent is created by removing the node with minimum degree from the child. Here, the degree of a node is defined as the sum of weights of all adjacent edges within *U*. If there are multiple nodes with minimum degree, the one with the smallest index is removed. It is proven that the density of a parent is at least as high as the maximum density among the children, ensuring that the search tree induced by the reduction map is anti-monotonic.
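The reduction map can be illustrated with a minimal Python sketch. It assumes a symmetric weight matrix with entries in [0, 1] stored as a list of lists, and modules represented as frozensets of node indices. The function names, the toy matrix, and the convention that sets with fewer than two nodes have density 1.0 are assumptions made for illustration and are not taken from the original implementation.

```python
from typing import FrozenSet, Sequence

Matrix = Sequence[Sequence[float]]


def degree_within(W: Matrix, U: FrozenSet[int], u: int) -> float:
    """Sum of the weights of all edges between u and the other nodes of U."""
    return sum(W[u][v] for v in U if v != u)


def reduction_map(W: Matrix, U: FrozenSet[int]) -> FrozenSet[int]:
    """Parent of a non-empty U: drop the minimum-degree node (smallest index on ties)."""
    removed = min(U, key=lambda u: (degree_within(W, U, u), u))
    return U - {removed}


def density(W: Matrix, U: FrozenSet[int]) -> float:
    """Average pairwise weight within U (assumed to be 1.0 for fewer than two nodes)."""
    n = len(U)
    if n < 2:
        return 1.0
    total = sum(W[a][b] for a in U for b in U if a < b)
    return total / (n * (n - 1) / 2)


# Hypothetical toy network on four nodes (symmetric; 0.0 = unobserved interaction).
W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [1.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]

child = frozenset({0, 1, 2})
# Nodes 1 and 2 tie at degree 1.5 within the child, so node 1 (smaller index) is removed.
parent = reduction_map(W, child)
print(sorted(parent), density(W, parent), density(W, child))
```

In this toy example the parent {0, 2} has density 1.0, while the child {0, 1, 2} has density 2.5/3, in line with the anti-monotonicity property stated above.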

In addition to the anti-monotonicity property, a valid reduction map must satisfy the following reachability condition (23): starting from any node of the search tree, we can reach a root node after applying the reduction map a finite number of times. This condition ensures that the induced search tree is indeed spanning. For the reduction map stated above, it is trivial to show that the reachability condition is satisfied, because any cluster shrinks to the empty set by removing nodes repeatedly.

To enumerate all clusters with density \( \ge \theta \), one has to traverse this implicitly defined search tree in a depth-first or a breadth-first manner. During traversal, children are generated *on demand*. As the reduction map defines how to get from children to parents and not vice versa, we cannot directly derive the children from a given parent. Instead, to generate the children of a cluster *U*, we have to consider all candidates \( U \cup \{ i \},\; i \notin U \) and apply the reduction map to every candidate (reverse search principle). Qualified candidates with \( f(U \cup \{ i \}) = U \) are then taken as children. A naive implementation of this child generation process can make the algorithm very slow. Thus, it is important to engineer this process well. As the search tree is anti-monotonic, one can prune the tree whenever the density goes below *θ*.
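The traversal described above can be sketched as follows. This is the naive child generation (every candidate is tested with the reduction map), not the engineered version that the text recommends. It assumes weights in [0, 1], treats sets with fewer than two nodes as having density 1.0 so that they are never pruned, and reports only modules with at least two nodes. All names and the toy matrix are hypothetical.

```python
from typing import FrozenSet, Iterator, Sequence

Matrix = Sequence[Sequence[float]]


def density(W: Matrix, U: FrozenSet[int]) -> float:
    """Average pairwise weight within U (assumed to be 1.0 for fewer than two nodes)."""
    n = len(U)
    if n < 2:
        return 1.0
    total = sum(W[a][b] for a in U for b in U if a < b)
    return total / (n * (n - 1) / 2)


def reduction_map(W: Matrix, U: FrozenSet[int]) -> FrozenSet[int]:
    """Parent of U: remove the minimum-degree node (smallest index on ties)."""
    removed = min(U, key=lambda u: (sum(W[u][v] for v in U if v != u), u))
    return U - {removed}


def enumerate_dense_modules(W: Matrix, theta: float) -> Iterator[FrozenSet[int]]:
    """Depth-first reverse search over all node sets with density >= theta."""
    n = len(W)
    stack = [frozenset()]  # the empty set is the root of the search tree
    while stack:
        U = stack.pop()
        if len(U) >= 2:
            yield U
        for i in range(n):
            if i in U:
                continue
            candidate = U | {i}
            # Prune by density first, then keep the candidate only if U is its parent.
            if density(W, candidate) >= theta and reduction_map(W, candidate) == U:
                stack.append(candidate)


# Hypothetical toy network on four nodes (symmetric; 0.0 = unobserved interaction).
W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [1.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]

modules = set(enumerate_dense_modules(W, theta=0.8))
print(sorted(sorted(U) for U in modules))
```

Because every node set has exactly one parent under the reduction map, each module is generated once, and no duplicate check is required.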

The definition of a search tree is not an issue in the context of frequent pattern mining (24), because frequency is anti-monotonic in any tree. Reverse search is interesting because it provides a systematic way of defining an anti-monotonic tree. Notice, however, that it is not applicable to all score functions. Cluster density is an example where reverse search can be applied most effectively.

### 3.2 Integration of Additional Constraints

The dense module enumeration (DME) framework makes it easy to incorporate and systematically exploit constraints from additional data sources. For illustration, consider the case where we have an additional data set which provides profiles of proteins or genes across different conditions (Fig. 1.1b). For simplicity, let us assume binary profiles, where an entry is 1 if the protein is positively associated with the corresponding condition and 0 otherwise. Then, dense modules where all member proteins share the same profile across a certain number of conditions are of particular interest; we call these modules *consistent*. The problem of DME with consistency constraints is formalized as follows.

### Definition 1

Given a graph with node set *V* and weight matrix *W*, a density threshold \( \theta > 0 \), a profile matrix \( (m_{ij})_{i \in V,\, j \in C} \), and nonnegative integers \( n_0 \) and \( n_1 \), find all modules \( U \subset V \) with \( \rho_W(U) \ge \theta \) such that there exist at least \( n_0 \) conditions \( c \in C \) with \( m_{uc} = 0 \;\; \forall u \in U \) and at least \( n_1 \) conditions \( c \in C \) with \( m_{uc} = 1 \;\; \forall u \in U \).

Given such a *consistency constraint*, we can stop the module extension during dense module mining as soon as the constraint is violated. This is possible because the number of consistent profile conditions cannot increase while the module is extended; more generally, this property is called *anti-monotonicity*. We therefore simply add to the module enumeration algorithm a condition that checks the consistency requirements. These are then automatically taken into account in the check for local maximality. The use of additional constraints can restrict the search space considerably, which accelerates the computation and helps to focus on biologically interesting solutions.
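The consistency check of Definition 1 can be sketched in a few lines of Python. The sketch assumes that the binary profile matrix is stored as a list of rows (one row per protein, one column per condition); the function names and the toy profiles are hypothetical.

```python
from typing import FrozenSet, Sequence, Tuple

Profiles = Sequence[Sequence[int]]


def consistent_counts(M: Profiles, U: FrozenSet[int]) -> Tuple[int, int]:
    """Number of conditions in which all members of U are 0, and in which all are 1."""
    n_conditions = len(M[0])
    zeros = sum(all(M[u][c] == 0 for u in U) for c in range(n_conditions))
    ones = sum(all(M[u][c] == 1 for u in U) for c in range(n_conditions))
    return zeros, ones


def is_consistent(M: Profiles, U: FrozenSet[int], n0: int, n1: int) -> bool:
    """True if U has at least n0 shared 0-conditions and at least n1 shared 1-conditions."""
    zeros, ones = consistent_counts(M, U)
    return zeros >= n0 and ones >= n1


# Hypothetical binary profiles: three proteins (rows) across four conditions (columns).
M = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]

small = frozenset({0, 1})
large = frozenset({0, 1, 2})
print(consistent_counts(M, small), consistent_counts(M, large))
print(is_consistent(M, small, n0=1, n1=1), is_consistent(M, large, n0=1, n1=1))
```

In this toy example, adding protein 2 to the module {0, 1} removes the only condition in which all members are 1, so the extended module violates the constraint \( n_1 = 1 \). Since the counts can never grow when a module is extended, no superset can satisfy the constraint either, and the corresponding branch of the search can be abandoned.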

We have described a method for enumerating dense modules in a network. Methodological details and experimental results are available in (25). Our framework can be extended to module detection from multiple networks; see (26) for a detailed explanation.

## 4 Notes

1. If one starts from a low density threshold, the algorithm often takes too much time. One should start from a very large threshold and gradually reduce it until one's requirements are met.
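One possible way to follow this advice is sketched below. The stopping rule (a budget `max_modules` on the number of reported modules) and all names are assumptions made for illustration; the brute-force enumerator is only a stand-in with exponential running time and would be replaced by the reverse search procedure in practice.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def brute_force_modules(W: Matrix, theta: float) -> Iterable[FrozenSet[int]]:
    """Stand-in enumerator: all sets of at least two nodes with density >= theta."""
    n = len(W)
    for size in range(2, n + 1):
        for nodes in combinations(range(n), size):
            total = sum(W[a][b] for a, b in combinations(nodes, 2))
            if total / (size * (size - 1) / 2) >= theta:
                yield frozenset(nodes)


def relax_threshold(
    W: Matrix,
    thresholds: Sequence[float],
    max_modules: int,
    enumerate_fn: Callable[[Matrix, float], Iterable[FrozenSet[int]]] = brute_force_modules,
) -> Tuple[Optional[float], List[FrozenSet[int]]]:
    """Try thresholds from high to low; keep the lowest one that stays within budget."""
    best: Tuple[Optional[float], List[FrozenSet[int]]] = (None, [])
    for theta in sorted(thresholds, reverse=True):
        modules: List[FrozenSet[int]] = []
        for U in enumerate_fn(W, theta):
            modules.append(U)
            if len(modules) > max_modules:
                return best  # budget exceeded: fall back to the previous threshold
        best = (theta, modules)
    return best


# Hypothetical toy network on four nodes (symmetric; 0.0 = unobserved interaction).
W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.5, 0.0],
    [1.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]

theta, modules = relax_threshold(W, thresholds=[1.0, 0.8, 0.5], max_modules=3)
print(theta, [sorted(U) for U in modules])
```

Here the enumeration at threshold 0.5 is aborted as soon as a fourth module appears, so the result for threshold 0.8 is returned.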

### References

1. Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Mol Syst Biol 3:88
2. Ulitsky I, Shamir R (2007) Identification of functional modules using network topology and high-throughput data. BMC Syst Biol 1:8
3. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4:2
4. Uno T (2007) An efficient algorithm for enumerating pseudo cliques. In: Proceedings of ISAAC 2007, pp 402–414
5. Chen J, Yuan B (2006) Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics 22(18):2283–2290
6. van Dongen S (2000) Graph clustering by flow simulation. PhD thesis, University of Utrecht
7. Newman ME (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582
8. Everett L, Wang LS, Hannenhalli S (2006) Dense subgraph computation via stochastic search: application to detect transcriptional modules. Bioinformatics 22(14):e117–e123
9. Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818
10. Spirin V, Mirny LA (2003) Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 100(21):12123–12128
11. Zeng Z, Wang J, Zhou L, Karypis G (2006) Coherent closed quasi-clique discovery from large dense graph databases. KDD '06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 797–802
12. Hanisch D, Zien A, Zimmer R, Lengauer T (2002) Co-clustering of biological networks and gene expression data. Bioinformatics 18(suppl 1):S145–S154
13. Tanay A, Sharan R, Kupiec M, Shamir R (2004) Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 101(9):2981–2986
14. Segal E, Wang H, Koller D (2003) Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19(suppl 1):i264–i271
15. Pei J, Jiang D, Zhang A (2005) Mining cross-graph quasi-cliques in gene expression and protein interaction data. ICDE '05: proceedings of the 21st international conference on data engineering (ICDE'05). IEEE Computer Society, Washington, DC, pp 353–354
16. Ideker T, Ozier O, Schwikowski B, Siegel AF (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18(suppl 1):S233–S240
17. Huang Y, Li H, Hu H, Yan X, Waterman MS, Huang H, Zhou XJ (2007) Systematic discovery of functional modules and context-specific functional annotation of human genome. Bioinformatics 23(13):i222–i229
18. Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ (2007) A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics 23(13):i577–i586
19. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res 32(suppl 1):D452–D455
20. Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res 35(suppl 1):D572–D574
21. Bader GD, Betel D, Hogue CWV (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 31(1):248–250
22. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 101(16):6062–6067
23. Avis D, Fukuda K (1996) Reverse search for enumeration. Discrete Appl Math 65:21–46
24. Han J, Kamber M (2006) Data mining: concepts and techniques of the Morgan Kaufmann series in data management systems, 2nd edn. Morgan Kaufmann Publishers, San Francisco
25. Georgii E, Dietmann S, Uno T, Pagel P, Tsuda K (2009) Enumeration of condition-dependent dense modules in protein interaction networks. Bioinformatics 25:933–940
26. Georgii E, Tsuda K, Schölkopf B (2011) Multi-way set enumeration in weight tensors. Mach Learn 82:123–155