Abstract
The ability to find and count subgraphs of a given network is an important non trivial task with multidisciplinary applicability. Discovering network motifs or computing graphlet signatures are two examples of methodologies that at their core rely precisely on the subgraph counting problem. Here we present the g-trie, a data-structure specifically designed for discovering subgraph frequencies. We produce a tree that encapsulates the structure of the entire graph set, taking advantage of common topologies in the same way a prefix tree takes advantage of common prefixes. This avoids redundancy in the representation of the graphs, thus allowing for both memory and computation time savings. We introduce a specialized canonical labeling designed to highlight common substructures and annotate the g-trie with a set of conditional rules that break symmetries, avoiding repetitions in the computation. We introduce a novel algorithm that takes as input a set of small graphs and is able to efficiently find and count them as induced subgraphs of a larger network. We perform an extensive empirical evaluation of our algorithms, focusing on efficiency and scalability on a set of diversified complex networks. Results show that g-tries are able to clearly outperform previously existing algorithms by at least one order of magnitude.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
G-Tries source code is available at: http://www.dcc.fc.up.pt/gtries/
Fanmod is available at http://theinf1.informatik.uni-jena.de/~wernicke/motifs/.
Kavosh source code is available in http://lbb.ut.ac.ir/Download/LBBsoft/Kavosh/.
The source code is available at http://sites.google.com/site/andrealancichinetti/files.
References
Adamic LA, Glance N (2005) The political blogosphere and the 2004 U.S. election: divided they blog. In: 3rd International workshop on link discovery (LinkKDD). ACM, New York, pp 36–43
Albert I, Albert R (2004) Conserved network motifs allow protein–protein interaction prediction. Bioinformatics 20(18):3346–3352
Albert R, Barabasi AL (2002) Statistical mechanics of complex networks. Rev Modern Phys 74(1):47–97. doi:10.1103/RevModPhys.74.47
Arenas A (2011) Network data sets. http://deim.urv.cat/aarenas/data/welcome.htm
Batagelj V, Mrvar A (2006) Pajek datasets. http://vlado.fmf.uni-lj.si/pub/networks/data/
Borgelt C, Berthold MR (2002) Mining molecular fragments: finding relevant substructures of molecules. In: 2nd IEEE International conference on data mining (ICDM). IEEE Computer Society Press, Washington, DC
Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, Li G, Chen R (2003) Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res 31(9):2443–2450
Cha M, Haddadi H, Benevenuto F, Gummadi KP (2010) Measuring user influence in twitter: the million follower fallacy. In: 4th International AAAI conference on weblogs and social media (ICWSM)
Chen J, Hsu W, Lee ML, Ng SK (2006) Nemofinder: dissecting genome-wide protein–protein interactions with meso-scale network motifs. In: 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 106–115
Ciriello G, Guerra C (2008) A review on models and algorithms for motif discovery in protein–protein interaction networks. Briefings Funct Genomics 7(2):147–156
Cook SA (1971) The complexity of theorem-proving procedures. In: 3rd Annual ACM symposium on theory of computing, STOC ’71. ACM, New York, pp 151–158
da Costa LF, Rodrigues FA, Travieso G, Boas PRV (2007) Characterization of complex networks: a survey of measurements. Adv Phys 56:167
Duch J, Arenas A (2005) Community detection in complex networks using extremal optimization. Phys Rev E (Stat Nonlinear Soft Matter Phys) 72:027,104
Fredkin E (1960) Trie memory. Commun ACM 3(9):490–499
Grochow J, Kellis M (2007) Network motif discovery using subgraph enumeration and symmetry-breaking. Res Comput Mol Biol 92–106
Howe D (2010) Foldoc, free online dictionary of computing. http://foldoc.org/
Huan J, Bandyopadhyay D, Prins J, Snoeyink J, Tropsha A, Wang W (2006) Distance-based identification of structure motifs in proteins using constrained frequent subgraph mining. In: IEEE Symposium on computational intelligence in bioinformatics and computational biology (CIBCB)
Huan J, Wang W, Prins J (2003) Efficient mining of frequent subgraphs in the presence of isomorphism. In: 3rd IEEE International conference on data mining (ICDM). IEEE Computer Society Press, Washington, DC, p 549
Kärkkäinen L (2008) Yet another java vs. c++ shootout. http://zi.fi/shootout/
Kashani Z, Ahrabian H, Elahi E, Nowzari-Dalini A, Ansari E, Asadi S, Mohammadi S, Schreiber F, Masoudi-Nejad A (2009) Kavosh: a new algorithm for finding network motifs. BMC Bioinform 10(1):318
Kashtan N, Itzkovitz S, Milo R, Alon U (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758
Köbler J, Schöning U, Torán J (1993) The graph isomorphism problem: its structural complexity (Progress in Theoretical Computer Science). Birkhauser Verlag, Basel
Lacroix V, Fernandes CG, Sagot MF (2006) Motif search in graphs: application to metabolic networks. IEEE/ACM Trans Comput Biol Bioinform 3(4):360–368
Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E (Stat Nonlinear Soft Matter Phys) 78(4):046,110
Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E, Dawson SM (2003) The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. Can geographic isolation explain this unique trait? Behav Ecol Sociobiol 54(4):396–405
McKay B (1981) Practical graph isomorphism. Congressus Numerantium 30:45–87
McKay B (1998) Isomorph-free exhaustive generation. J Algorithms 26(2):306–324
Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, Ayzenshtat I, Sheffer M, Alon U (2004) Superfamilies of evolved and designed networks. Science 303(5663):1538–1542
Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827
Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM SIGPLAN Notices 42:89–100
Newman M (2009) Network data. http://www-personal.umich.edu/mejn/netdata/
Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256. doi:10.1137/S003614450342480
Newman MEJ (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E (Stat Nonlinear Soft Matter Phys) 74(3):036,104
Nijssen S, Kok JN (2004) Frequent graph mining and its application to molecular databases. In: 2004 IEEE International conference on systems, man and cybernetics, vol 5. doi:10.1109/ICSMC.2004.1401252
Norlen K, Lucas G, Gebbie M, Chuang J (2002) EVA: extraction, visualization and analysis of the telecommunications and media ownership network. In: International telecommunications society 14th biennial conference (ITS). International Telecommunications Society, Seoul
Omidi S, Schreiber F, Masoudi-Nejad A (2009) Moda: an efficient algorithm for network motif discovery in biological networks. Genes Genetic Syst 84(5):385–395
Pasquier N, Bastide Y, Taouil R, Lakhal L. (1999) Discovering frequent closed itemsets for association rules. In: ICDT ’99: 7th international conference on database theory. Springer, London, pp 398–416
Pržulj N (2007) Biological network comparison using graphlet degree distribution. Bioinformatics 23:e177–e183
Reitz J (2002) Odlis: online dictionary of library and information science. http://vlado.fmf.uni-lj.si/pub/networks/data/dic/odlis/odlis.pdf
Ribeiro P, Silva F (2010) Efficient subgraph frequency estimation with g-tries. In: International workshop on algorithms in bioinformatics (WABI), LNCS. Springer, vol 6293, pp 238–249
Ribeiro P, Silva F (2010) G-tries: n efficient data structure for discovering network motifs. In: 25th ACM symposium on applied computing (SAC). ACM, pp 1559–1566
Ribeiro P, Silva F (2012) Querying subgraph sets with g-tries. In: 2nd ACM SIGMOD workshop on databases and social networks. ACM 25–30. doi:10.1145/2304536.2304541.
Ribeiro P, Silva F, Kaiser M (2009) Strategies for network motifs discovery. In: 5th IEEE international conference on e-science. IEEE Computer Society Press, Oxford, pp 80–87
Ribeiro P, Silva F, Lopes L (2010) Efficient parallel subgraph counting using g-tries. In: IEEE International conference on cluster computing (Cluster). IEEE Computer Society Press, pp 1559–1566
Ribeiro P, Silva F, Lopes L (2012) Parallel discovery of network motifs. J Parallel Distrib Comput 72:144–154
Schreiber F, Schwobbermeyer H (2004) Towards motif detection in networks: frequency concepts and flexible search. In: International workshop on network tools and applications in biology (NETTAB), pp 91–102
Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31(1):64–68
Sporns O, Kotter R (2004) Motifs in brain networks. PLoS Biol 2(11):e369. doi:10.1371/journal.pbio.0020369
Tarjan R (1971) Depth-first search and linear graph algorithms. In: Annual IEEE symposium on foundations of computer science. IEEE Computer Society, Los Alamitos, pp 114–121
Valverde S, Solé RV (2005) Network motifs in computational graphs: A case study in software architecture. Phys Rev E 72(2), 026107. doi:10.1103/PhysRevE.72.026107
Wang C, Parthasarathy S (2004) Parallel algorithms for mining frequent structural motifs in scientific data. In: ACM International conference on supercomputing (ICS)
Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):440–442
Wernicke S (2005) A faster algorithm for detecting network motifs. In: International workshop on algorithms in bioinformatics (WABI), LNCS. Springer, vol 3692, pp. 165–177
Wernicke S (2006) Efficient detection of network motifs. IEEE/ACM Trans Comput Biol Bioinform 3(4):347–359
White JG, Southgate E, Thomson JN, Brenner S (1986) The structure of the nervous system of the Nematode Caenorhabditis elegans. Philos Trans R Soc London B Biol Sci 314(1165):1–340
Yan X, Han J (2002) gspan: graph-based substructure pattern mining. In: 2nd IEEE International conference on data mining (ICDM). IEEE Computer Society Press, Washington, DC, p 721
Yan X, Yu PS, Han J (2004) Graph indexing: a frequent structure-based approach. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data, SIGMOD ’04. ACM, New York, pp 335–346
Yuan D, Mitra P (2011) A lattice-based graph index for subgraph search. In: 14th International workshop on the web and databases (WebDB)
Acknowledgments
We thank the reviewers for their constructive and helpful suggestions, which helped in improving the quality of this manuscript. Pedro Ribeiro is funded by an FCT Research Grant (SFRH/BPD/81695/2011).
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: M. J. Zaki.
Rights and permissions
About this article
Cite this article
Ribeiro, P., Silva, F. G-Tries: a data structure for storing and finding subgraphs. Data Min Knowl Disc 28, 337–377 (2014). https://doi.org/10.1007/s10618-013-0303-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-013-0303-4