Data Mining and Knowledge Discovery, Volume 28, Issue 2, pp 337–377

G-Tries: a data structure for storing and finding subgraphs

Abstract

The ability to find and count subgraphs of a given network is an important, nontrivial task with multidisciplinary applicability. Discovering network motifs and computing graphlet signatures are two examples of methodologies that rely, at their core, on the subgraph counting problem. Here we present the g-trie, a data structure specifically designed for discovering subgraph frequencies. We produce a tree that encapsulates the structure of the entire graph set, taking advantage of common topologies in the same way a prefix tree takes advantage of common prefixes. This avoids redundancy in the representation of the graphs, saving both memory and computation time. We introduce a specialized canonical labeling designed to highlight common substructures, and we annotate the g-trie with a set of conditional rules that break symmetries, avoiding repetitions in the computation. We introduce a novel algorithm that takes as input a set of small graphs and efficiently finds and counts them as induced subgraphs of a larger network. We perform an extensive empirical evaluation of our algorithms, focusing on efficiency and scalability on a diverse set of complex networks. Results show that g-tries outperform previously existing algorithms by at least one order of magnitude.
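The prefix-tree analogy in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes undirected graphs, takes the vertex ordering of each small graph as given (no canonical labeling is computed), and removes repeated matches with a set of vertex sets rather than with symmetry-breaking conditions. The names `GTrieNode`, `adjacency_rows`, `insert`, `count_nodes` and `count_induced` are hypothetical.

```python
class GTrieNode:
    """One tree node: adds a single vertex to the partial graph above it."""

    def __init__(self):
        self.children = {}      # adjacency row (tuple of 0/1) -> GTrieNode
        self.graph_name = None  # set when a stored graph ends at this node


def adjacency_rows(n, edges):
    """Row i says whether vertex i is adjacent to each of vertices 0..i-1."""
    edge_set = {frozenset(e) for e in edges}
    return [tuple(int(frozenset((i, j)) in edge_set) for j in range(i))
            for i in range(n)]


def insert(root, name, n, edges):
    """Store a graph; graphs with equal leading rows share tree nodes."""
    node = root
    for row in adjacency_rows(n, edges):
        node = node.children.setdefault(row, GTrieNode())
    node.graph_name = name


def count_nodes(node):
    """Number of tree nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def count_induced(root, adj):
    """Count induced occurrences of every stored graph in a network.

    adj maps each network vertex to the set of its neighbours.
    Assumption: duplicates caused by pattern symmetries are removed by
    keeping distinct vertex sets, not by symmetry-breaking rules.
    Graphs with no occurrence are absent from the result.
    """
    found = {}  # graph name -> set of matched vertex sets

    def extend(node, chosen):
        if node.graph_name is not None:
            found.setdefault(node.graph_name, set()).add(frozenset(chosen))
        for row, child in node.children.items():
            for v in adj:
                if v in chosen:
                    continue
                # Induced match: edges AND non-edges to earlier vertices agree.
                if all((v in adj[u]) == bool(bit)
                       for u, bit in zip(chosen, row)):
                    extend(child, chosen + [v])

    extend(root, [])
    return {name: len(sets) for name, sets in found.items()}


# Two 3-vertex graphs: a path (centre listed first) and a triangle.
root = GTrieNode()
insert(root, "path", 3, [(0, 1), (0, 2)])
insert(root, "triangle", 3, [(0, 1), (0, 2), (1, 2)])

# Network: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
network = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
counts = count_induced(root, network)
```

In this sketch the two stored graphs share their first two tree nodes, so the tree holds four nodes rather than six, and a single traversal of the network matches both graphs at once. Deduplicating by vertex set is valid only because the matches are induced: a vertex set determines its induced subgraph.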

Keywords

Complex networks, Subgraphs, Data structures, Trees, Network motifs, Graphlets

Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. CRACS & INESC-TEC, Faculdade de Ciencias, Universidade do Porto, Porto, Portugal
