Abstract
We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to false negatives. Furthermore, for scalable performance it is crucial to minimize communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge, DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.
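The false-negative risk mentioned above can be illustrated with a minimal sketch. It is not DistGraph itself: the toy edge list, the two-way split, the threshold, and the helper names are all hypothetical, and support is simplified to a plain occurrence count of single-edge patterns, which is not necessarily the support measure the paper uses. The point is only that a pattern can fall below the threshold in every partition while meeting it in the whole graph.

```python
from collections import Counter

# Hypothetical toy graph: each edge is (vertex_u, label_u, vertex_v, label_v).
EDGES = [
    (1, "A", 2, "B"),
    (3, "A", 4, "B"),
    (5, "A", 6, "B"),
    (7, "A", 8, "B"),
    (1, "A", 9, "C"),
]

# Assumed support threshold and an assumed split of the edges over two nodes.
MIN_SUPPORT = 3
PARTITIONS = [EDGES[:2], EDGES[2:]]


def edge_pattern_counts(edges):
    """Count single-edge patterns, keyed by the sorted pair of endpoint labels."""
    return Counter(tuple(sorted((lu, lv))) for _, lu, _, lv in edges)


def frequent(counts, min_support):
    """Return the patterns whose count reaches the threshold."""
    return {pattern for pattern, count in counts.items() if count >= min_support}


# Each node applies the threshold to its own partition only.
local_frequent = [frequent(edge_pattern_counts(p), MIN_SUPPORT) for p in PARTITIONS]

# Ground truth computed on the unpartitioned graph.
global_frequent = frequent(edge_pattern_counts(EDGES), MIN_SUPPORT)

# Summing the local counts before thresholding recovers the global answer
# for single-edge patterns, because the partitions hold disjoint edges.
merged_counts = sum((edge_pattern_counts(p) for p in PARTITIONS), Counter())
merged_frequent = frequent(merged_counts, MIN_SUPPORT)

print("local:", local_frequent)
print("global:", global_frequent)
print("merged:", merged_frequent)
```

Here the A-B pattern occurs twice in each partition, so neither node reports it on its own, yet it occurs four times overall. Combining counts before thresholding fixes this toy case; larger patterns whose occurrences span partition boundaries need additional information exchange, which is where minimizing communication becomes important.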
Acknowledgments
This work was supported by NSF Award IIS-1302231. We thank Chris Carothers and Bulent Yener for several discussions on the practical and theoretical aspects of our distributed algorithm.
Additional information
Responsible editor: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.
Cite this article
Talukder, N., Zaki, M.J. A distributed approach for graph mining in massive networks. Data Min Knowl Disc 30, 1024–1052 (2016). https://doi.org/10.1007/s10618-016-0466-x