Dense Module Enumeration in Biological Networks

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 939)

Abstract

Automatic discovery of functional complexes from protein interaction data is a rewarding but challenging problem. While previous approaches use approximations to extract dense modules, our approach exactly solves the problem of dense module enumeration. Furthermore, constraints from additional information sources such as gene expression and phenotype data can be integrated, so we can systematically detect dense modules with interesting profiles. Given a weighted protein interaction network, our method discovers all protein sets that satisfy a user-defined minimum density threshold. We employ a reverse search strategy, which allows us to exploit the density criterion in an efficient way.

Key words

Protein complex; Dense module enumeration; Reverse search; Gene expression; Protein interaction

1 Introduction

Today, a large number of databases provide access to experimentally observed protein–protein interactions. The analysis of the corresponding protein interaction networks can be useful for functional annotation of previously uncharacterized genes as well as for revealing additional functionality of known genes. Often, function prediction involves an intermediate step where clusters of densely interacting proteins, called modules, are extracted from the network; the dense subgraphs are likely to represent functional protein complexes (1). However, the experimental methods are not always reliable, which means that the interaction network may contain false positive edges. Therefore, confidence weights of interactions should be taken into account.

A natural criterion that combines these two aspects is the average pairwise interaction weight within a module (assuming a weight of zero for unobserved interactions, cf. (2)). We call this the module density, in analogy to unweighted networks (3). We present a method to enumerate all modules that exceed a given density threshold. It solves the problem efficiently via a simple and elegant reverse search algorithm, extending the unweighted network approach in (4).

There is a large variety of related work on module discovery in networks. The most common group are graph partitioning methods (5, 6, 7). They divide the network into a set of modules, so their approach is substantially different from dense module enumeration (DME), which provides an explicit density criterion for modules (Fig. 1a). Another group of methods define explicit module criteria, but employ heuristic search techniques to find the modules (3, 8). This contrasts with complete enumeration algorithms, which form the third line of research: they give explicit criteria and return all modules that satisfy them. For example, clique search has been frequently applied (9, 10). The enumeration of cliques can be considered as a special case of our approach, restricting it to unweighted graphs and a density threshold of one. Further enumerative approaches use different module criteria assuming unweighted graphs (11).
Fig. 1.

Dense module enumeration approach. (a) DME versus partitioning. While partitioning methods return one clustering of the network, DME discovers all modules that satisfy a minimum density threshold. (b) Combination with profile data. Integrating protein–protein interaction (PPI) data with external profile data makes it possible to focus on modules whose member proteins all behave consistently in a subset of conditions. The top module has two conditions where all nodes are positive and one condition where all nodes are negative. The arrows in the profile mark these consistent conditions. The bottom module, in contrast, has no such consistent conditions.

In recent years, many module finding approaches which integrate protein–protein interaction networks with other gene-related data have been published. One strategy, often used in the context of partitioning methods, is to build a new network whose edge weights are determined by multiple data sources (12). Tanay et al. (13) also create one single network to analyze multiple genomic data at once; however, they use a bipartite network where each edge corresponds to one data type only. In both cases, the different data sets have to be normalized appropriately before they can be integrated. In contrast to that, other approaches keep the data sources separate and define individual constraints for each of them. Consequently, arbitrarily many data sets can be jointly analyzed without the need to take care of appropriate scaling or normalization. Within this class of approaches, there exist two main strategies to deal with profile data like gene expression measurements. In the first case, the profile information is transformed into a gene similarity network, where the strength of a link between two genes represents the global similarity of their profiles (2, 14, 15). In the second case, the condition-specific information is kept to perform a context-dependent module analysis (16, 17, 18). Our approach follows along this line, searching for modules in the protein interaction network that have consistent profiles with respect to a subset of conditions. In contrast to the previous methods, our algorithm systematically identifies all modules satisfying a density criterion and optional consistency constraints.

2 Materials

  1. A protein interaction network: It can be downloaded, e.g., from the following Web sites: IntAct (19), MINT (20), and BIND (21).

  2. Gene expression data: For example, global human gene expression profiles across different tissues can be obtained from the supplementary information of (22).

3 Methods

We describe the basic idea of DME using the example graph shown in Fig. 2. First, we discuss how to enumerate dense modules in a network, and then proceed to explain how gene expression data can be incorporated.
Fig. 2.

An example graph for dense module enumeration.

3.1 Enumeration of Dense Modules

Our method is based on the reverse search paradigm (23), which is well established in the algorithms community but less widely known in the data mining community. A weighted graph is represented as a symmetric association matrix (edges that are not shown have zero weight). We denote by \( w_{ij} \) the weight between two nodes \( i \) and \( j \), and define the density of a node subset \( U \) as
$$ \rho(U) = \frac{\sum_{i,j \in U,\; i < j} w_{ij}}{|U|(|U| - 1)/2}. $$

We would like to enumerate all subsets U with \( \rho (U) \ge \theta \), where θ is a prespecified constant.
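The density definition can be sketched in a few lines of Python. This is only an illustration: the dictionary-of-pairs representation, the function name `density`, and the toy network `toy_weights` are assumptions made here, and the toy network is not the graph of Fig. 2.

```python
from itertools import combinations


def density(weights, module):
    """Average pairwise weight within `module`; unobserved pairs count as zero."""
    nodes = sorted(module)
    if len(nodes) < 2:
        raise ValueError("density is defined for modules with at least two nodes")
    total = sum(weights.get(frozenset(pair), 0.0) for pair in combinations(nodes, 2))
    return total / (len(nodes) * (len(nodes) - 1) / 2)


# Hypothetical toy network: undirected edges keyed by the pair of endpoints.
toy_weights = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.75,
    frozenset({"C", "D"}): 0.25,
}

dense_triple = density(toy_weights, {"A", "B", "C"})    # (1.0 + 0.5 + 0.75) / 3
whole_graph = density(toy_weights, {"A", "B", "C", "D"})  # 2.5 / 6
```

With a threshold of, say, θ = 0.7, the triple {A, B, C} qualifies as a dense module in this toy network, whereas the full node set does not.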

All subsets form a natural graph-shaped search space, where one can move downwards or upwards by adding or removing a node, respectively (Fig. 3a). Here, the root node corresponds to the empty set. For efficient traversal, however, one needs a spanning tree, not a graph. When a tree is made by lexicographical ordering (Fig. 3b), the search tree is not anti-monotonic with respect to the density; that is, the density does not decrease monotonically when the tree is traversed from the root to a leaf. Without anti-monotonicity, early pruning is not possible, which makes the enumeration difficult. However, an anti-monotonic search tree does exist (Fig. 3c), and it can be constructed by reverse search.
Fig. 3.

Illustration of reverse search.

In reverse search, the search tree is specified by defining a reduction map \( f(U) \) which transforms a child to its parent. In our case, the parent is created by removing the node with minimum degree from the child. Here, the degree of a node is defined as the sum of weights of all adjacent edges within \( U \). If there are multiple nodes with minimum degree, the one with the smallest index is removed. It can be proven that the density of a parent is at least as high as the maximum density among its children, ensuring that the search tree induced by the reduction map is anti-monotonic.
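The reduction map can be illustrated with the following sketch. The names `degree`, `parent`, and `path_to_root`, the toy weights, and the use of alphabetical order as the node index are assumptions for illustration only; ties are detected by exact equality of floating-point degrees, which a production implementation might want to handle more carefully.

```python
def degree(weights, module, node):
    """Sum of the weights of edges between `node` and the other members of `module`."""
    return sum(weights.get(frozenset((node, other)), 0.0)
               for other in module if other != node)


def parent(weights, module):
    """Reduction map f(U): drop the minimum-degree node, smallest index on ties."""
    if not module:
        raise ValueError("the empty set is the root and has no parent")
    # min() returns the first minimal element, so sorting first breaks ties
    # in favor of the smallest index (here: alphabetical order).
    removed = min(sorted(module), key=lambda v: degree(weights, module, v))
    return frozenset(module) - {removed}


def path_to_root(weights, module):
    """Apply the reduction map repeatedly until the empty set is reached."""
    path = [frozenset(module)]
    while path[-1]:
        path.append(parent(weights, path[-1]))
    return path


# Hypothetical toy network (not the graph of Fig. 2).
toy_weights = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.75,
    frozenset({"C", "D"}): 0.25,
}

chain = path_to_root(toy_weights, {"A", "B", "C", "D"})
```

In this toy example the chain is {A, B, C, D}, {A, B, C}, {A, B}, {B}, and finally the empty set, with densities 2.5/6, 0.75, and 1.0 for the first three sets, so the density does not decrease on the way to the root.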

In addition to the anti-monotonicity property, a valid reduction map must satisfy the following reachability condition (23): starting from any node of the search tree, we can reach a root node after applying the reduction map a finite number of times. This condition ensures that the induced search tree is indeed spanning. For the reduction map stated above, it is trivial to show that the reachability condition is satisfied, because any cluster shrinks to the empty set by removing nodes repeatedly.

To enumerate all clusters with density \( \ge \theta \), one has to traverse this implicitly defined search tree in a depth-first or a breadth-first manner. During traversal, children are generated on demand. As the reduction map defines how to get from children to parents and not vice versa, we cannot directly derive the children from a given parent. Instead, to generate the children of a cluster U, we have to consider all candidates \( U \cup \{ i\}, \;i\ \notin U \) and apply the reduction map to every candidate (reverse search principle). Qualified candidates with \( f(U \cup \{ i\} ) = U \) are then taken as children. A naive implementation of this child generation process can make the algorithm very slow. Thus, it is important to engineer this process well. As the search tree is anti-monotonic, one can prune the tree whenever the density goes below θ.
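The traversal described above can be sketched as follows. This is a deliberately naive version under stated assumptions: the function and variable names and the toy network are hypothetical, modules are reported only from size two upwards (the density is undefined for smaller sets), and child generation recomputes the reduction map for every candidate, which is exactly the step that an efficient implementation would need to engineer more carefully.

```python
from itertools import combinations


def density(weights, module):
    """Average pairwise weight; assumes at least two nodes."""
    nodes = sorted(module)
    total = sum(weights.get(frozenset(p), 0.0) for p in combinations(nodes, 2))
    return total / (len(nodes) * (len(nodes) - 1) / 2)


def degree(weights, module, node):
    return sum(weights.get(frozenset((node, other)), 0.0)
               for other in module if other != node)


def parent(weights, module):
    """Reduction map: remove the minimum-degree node, smallest index on ties."""
    removed = min(sorted(module), key=lambda v: degree(weights, module, v))
    return frozenset(module) - {removed}


def enumerate_dense_modules(nodes, weights, theta):
    """Depth-first reverse search for all modules U with density(U) >= theta."""
    nodes = sorted(nodes)
    found = []
    stack = [frozenset()]  # the root of the search tree is the empty set
    while stack:
        current = stack.pop()
        for v in nodes:
            if v in current:
                continue
            candidate = current | {v}
            # Reverse search principle: keep the candidate only if the
            # reduction map sends it back to the current set.
            if parent(weights, candidate) != current:
                continue
            if len(candidate) >= 2:
                if density(weights, candidate) < theta:
                    continue  # prune: no descendant can reach the threshold
                found.append(candidate)
            stack.append(candidate)
    return found


# Hypothetical toy network (not the graph of Fig. 2).
toy_weights = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.75,
    frozenset({"C", "D"}): 0.25,
}

modules = enumerate_dense_modules("ABCD", toy_weights, theta=0.7)
```

For this toy network and θ = 0.7 the search reports {A, B}, {B, C}, and {A, B, C}, each exactly once, because every set has a unique parent under the reduction map.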

The definition of a search tree is not an issue in the context of frequent pattern mining (24), because frequency is anti-monotonic in any tree. Reverse search is interesting because it provides a systematic way of defining an anti-monotonic tree. Notice, however, that it is not applicable to all score functions. Cluster density is an example where reverse search can be applied most effectively.

3.2 Integration of Additional Constraints

The DME framework makes it easy to incorporate and systematically exploit constraints from additional data sources. For illustration, consider the case where we have an additional data set which provides profiles of proteins or genes across different conditions (Fig. 1b). For simplicity, let us assume binary profiles, where an entry is 1 if the protein is positively associated with the corresponding condition, and 0 otherwise. Then, dense modules where all member proteins share the same profile across a certain number of conditions are of particular interest; we call these modules consistent. The problem of DME with consistency constraints is formalized as follows.

Definition 1

Given a graph with node set \( V \) and weight matrix \( W \), a density threshold \( \theta > 0 \), a profile matrix \( (m_{ij})_{i \in V, j \in C} \), and nonnegative integers \( n_0 \) and \( n_1 \), find all modules \( U \subset V \) with \( \rho_W(U) \ge \theta \) such that there exist at least \( n_0 \) conditions \( c \in C \) with \( m_{uc} = 0 \;\forall u \in U \) and at least \( n_1 \) conditions \( c \in C \) with \( m_{uc} = 1 \;\forall u \in U \).

Given such a consistency constraint, we can stop the module extension during the dense module mining as soon as the constraint is violated. This is due to the fact that the number of consistent profile conditions cannot increase while extending the module; more generally, this property is called anti-monotonicity. So we simply add to the module enumeration algorithm a condition which checks for the consistency requirements. These are then automatically taken into account in the check for local maximality. The use of additional constraints can restrict the search space considerably, so it accelerates the computation and helps to focus on biologically interesting solutions.
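A minimal sketch of the consistency check from Definition 1 is given below. The names `consistent_counts` and `is_consistent`, the row-per-protein profile layout, and the toy profiles are assumptions for illustration; in an enumeration routine, such a check would sit next to the density test so that a branch is abandoned as soon as either one fails.

```python
def consistent_counts(profiles, module):
    """Count conditions where all members are 0 and where all members are 1."""
    rows = [profiles[u] for u in module]
    if not rows:
        raise ValueError("consistency is checked for non-empty modules only")
    columns = list(zip(*rows))  # one tuple of member values per condition
    all_zero = sum(1 for column in columns if all(x == 0 for x in column))
    all_one = sum(1 for column in columns if all(x == 1 for x in column))
    return all_zero, all_one


def is_consistent(profiles, module, n0, n1):
    """True if at least n0 all-zero and at least n1 all-one conditions exist."""
    all_zero, all_one = consistent_counts(profiles, module)
    return all_zero >= n0 and all_one >= n1


# Hypothetical binary profiles over four conditions.
toy_profiles = {
    "A": (1, 1, 0, 0),
    "B": (1, 1, 0, 1),
    "C": (1, 0, 0, 1),
}

pair_counts = consistent_counts(toy_profiles, ["A", "B"])         # (1, 2)
triple_counts = consistent_counts(toy_profiles, ["A", "B", "C"])  # (1, 1)
```

Adding C to {A, B} lowers the number of all-one conditions from two to one, which illustrates why the counts can only stay the same or shrink when a module is extended.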

We have described a method for enumerating dense modules in a network. Methodological details and experimental results are available in (25). Our framework can be extended to module detection from multiple networks; see (26) for a detailed explanation.

4 Notes

  1. If one starts from a low density threshold, the algorithm often takes too much time. One should start from a very large threshold and gradually reduce it until it meets one's requirements.
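One possible way to follow this advice is sketched below. Everything here is an assumption made for illustration: the stopping rule (a cap on the number of modules, used as a stand-in for a time budget), the function names, the toy network, and the brute-force generator, which is only a placeholder for a real enumerator and is feasible for tiny graphs only.

```python
from itertools import combinations, islice


def brute_force_modules(weights, nodes, theta):
    """Placeholder generator of dense modules; exponential, for tiny graphs only."""
    for size in range(2, len(nodes) + 1):
        for subset in combinations(sorted(nodes), size):
            total = sum(weights.get(frozenset(p), 0.0)
                        for p in combinations(subset, 2))
            if total / (size * (size - 1) / 2) >= theta:
                yield frozenset(subset)


def lower_threshold_gradually(enumerate_fn, thetas, max_modules):
    """Try thresholds from high to low; keep the last one within the budget."""
    accepted = None
    for theta in sorted(thetas, reverse=True):
        # Consume at most max_modules + 1 results, so an exploding result
        # set is detected without enumerating it completely.
        modules = list(islice(enumerate_fn(theta), max_modules + 1))
        if len(modules) > max_modules:
            break
        accepted = (theta, modules)
    return accepted


# Hypothetical toy network.
toy_weights = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.75,
    frozenset({"C", "D"}): 0.25,
}

chosen = lower_threshold_gradually(
    lambda theta: brute_force_modules(toy_weights, "ABCD", theta),
    thetas=[1.0, 0.7, 0.4, 0.2],
    max_modules=3,
)
```

In this toy run the thresholds 1.0 and 0.7 stay within the budget of three modules, while 0.4 yields more, so the schedule stops and keeps the result for 0.7.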

     

References

  1. Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Mol Syst Biol 3:88
  2. Ulitsky I, Shamir R (2007) Identification of functional modules using network topology and high-throughput data. BMC Syst Biol 1:8
  3. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4:2
  4. Uno T (2007) An efficient algorithm for enumerating pseudo cliques. In: Proceedings of ISAAC 2007, pp 402–414
  5. Chen J, Yuan B (2006) Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics 22(18):2283–2290
  6. van Dongen S (2000) Graph clustering by flow simulation. PhD thesis, University of Utrecht
  7. Newman ME (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582
  8. Everett L, Wang LS, Hannenhalli S (2006) Dense subgraph computation via stochastic search: application to detect transcriptional modules. Bioinformatics 22(14):e117–e123
  9. Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818
  10. Spirin V, Mirny LA (2003) Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 100(21):12123–12128
  11. Zeng Z, Wang J, Zhou L, Karypis G (2006) Coherent closed quasi-clique discovery from large dense graph databases. KDD '06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 797–802
  12. Hanisch D, Zien A, Zimmer R, Lengauer T (2002) Co-clustering of biological networks and gene expression data. Bioinformatics 18(suppl 1):S145–S154
  13. Tanay A, Sharan R, Kupiec M, Shamir R (2004) Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 101(9):2981–2986
  14. Segal E, Wang H, Koller D (2003) Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19(suppl 1):i264–i271
  15. Pei J, Jiang D, Zhang A (2005) Mining cross-graph quasi-cliques in gene expression and protein interaction data. ICDE '05: proceedings of the 21st international conference on data engineering (ICDE'05). IEEE Computer Society, Washington, DC, pp 353–354
  16. Ideker T, Ozier O, Schwikowski B, Siegel AF (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18(suppl 1):S233–S240
  17. Huang Y, Li H, Hu H, Yan X, Waterman MS, Huang H, Zhou XJ (2007) Systematic discovery of functional modules and context-specific functional annotation of human genome. Bioinformatics 23(13):i222–i229
  18. Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ (2007) A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics 23(13):i577–i586
  19. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res 32(suppl 1):D452–D455
  20. Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res 35(suppl 1):D572–D574
  21. Bader GD, Betel D, Hogue CWV (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 31(1):248–250
  22. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 101(16):6062–6067
  23. Avis D, Fukuda K (1996) Reverse search for enumeration. Discrete Appl Math 65:21–46
  24. Han J, Kamber M (2006) Data mining: concepts and techniques of the Morgan Kaufmann series in data management systems, 2nd edn. Morgan Kaufmann Publishers, San Francisco
  25. Georgii E, Dietmann S, Uno T, Pagel P, Tsuda K (2009) Enumeration of condition-dependent dense modules in protein interaction networks. Bioinformatics 25:933–940
  26. Georgii E, Tsuda K, Schölkopf B (2011) Multi-way set enumeration in weight tensors. Mach Learn 82:123–155

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. AIST Computational Biology Research Center, Tokyo, Japan
  2. JST ERATO Minato Project, Sapporo, Japan
  3. School of Science, Helsinki Institute for Information Technology HIIT, Aalto University, Aalto, Finland
