Abstract
The second-order random walk has recently been shown to effectively improve the accuracy in graph analysis tasks. Existing work mainly focuses on centralized second-order random walk (SOW) algorithms. SOW algorithms rely on edge-to-edge transition probabilities to generate next random steps. However, it is prohibitively costly to store all the probabilities for large-scale graphs, and restricting the number of probabilities to consider can negatively impact the accuracy of graph analysis tasks. In this paper, we propose and study an alternative approach, SOOP (second-order random walks with on-demand probability computation), that avoids the space overhead by computing the edge-to-edge transition probabilities on demand during the random walk. However, the same probabilities may be computed multiple times when the same edge appears multiple times in SOW, incurring extra cost for redundant computation and communication. We propose two optimization techniques that reduce the complexity of computing edge-to-edge transition probabilities to generate next random steps, and reduce the cost of communicating out-neighbors for the probability computation, respectively. Our experiments on real-world and synthetic graphs show that SOOP achieves orders of magnitude better performance than baseline precompute solutions, and it can efficiently computes SOW algorithms on billion-scale graphs.
Similar content being viewed by others
References
Grover A, Leskovec J. Node2Vec: Scalable feature learning for networks. In Proc. the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, pp.855-864. https://doi.org/10.1145/2939-672.2939754.
Wu Y, Bian Y, Zhang X. Remember where you came from: On the second-order random walk based proximity measures. Proceedings of the VLDB Endowment, 2016, 10(1): 13-24. https://doi.org/10.14778/3015270.3015272.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs//1301.3781, March 2021.
Tsoumakas G, Katakis I. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 2007, 3(3): 1-13. https://doi.org/10.4018/jdwm.2007070101.
Liben-Nowell D, Kleinberg J M. The link prediction problem for social networks. In Proc. the 2003 ACM CIKM International Conference on Information and Knowledge Management, November 2003, pp.556-559. https://doi.org/10.1145/956863.956972.
Tang L, Liu H. Leveraging social media networks for classification. Data Min. Knowl. Discov., 2011, 23(3): 447-478. https://doi.org/10.1007/s10618-010-0210-x.
Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations. In Proc. the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2014, pp.701-710. https://doi.org/10.1145/2623330.2623732.
Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. LINE: Large-scale information network embedding. In Proc. the 24th International Conference on World Wide Web, May 2015, pp.1067-1077. https://doi.org/10.1145/2736277.2741093.
Malewicz G, Austern M H, Bik A J C, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In Proc. the 2010 ACM SIGMOD International Conference on Management of Data, June 2010, pp.135-146. https://doi.org/10.1145/1807167.1807184.
Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proc. the 10th USENIX Symposium on Operating Systems Design and Implementation, October 2012, pp.17-30.
Salihoglu S, Widom J. GPS: A graph processing system. In Proc. the 25th International Conference on Scientific and Statistical Database Management, July 2013, Article No. 22. https://doi.org/10.1145/2484838.2484843.
Tian Y, Balmin A, Corsten S A, Tatikonda S, McPherson J. From “think like a vertex” to “think like a graph”. Proceedings of the VLDB Endowment, 2013, 7(3): 193-204. https://doi.org/10.14778/2732232.2732238.
Xin R S, Gonzalez J E, Franklin M J, Stoica I. Graphx: A resilient distributed graph system on Spark. In Proc. the 1st International Workshop on Graph Data Management Experiences and Systems, June 2013, Article No. 2. https://doi.org/10.1145/2484425.2484427.
Yan D, Cheng J, Lu Y, Ng W. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment, 2014, 7(14): 1981-1992. https://doi.org/10.14778/2733085.2733103.
Chen R, Shi J, Chen Y, Chen H. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In Proc. the 10th European Conference on Computer Systems, April 2015, Article No. 1. https://doi.org/10.1145/2741948.2741970.
Zhu X, Chen W, Zheng W, Ma X. Gemini: A computation-centric distributed graph processing system. In Proc. the 12th USENIX Symposium on Operating Systems Design and Implementation, November 2016, pp.301-316.
Fan W, Xu J, Wu Y, Yu W, Jiang J, Zheng Z, Zhang B, Cao Y, Tian C. Parallelizing sequential graph computations. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.495-510. https://doi.org/10.1145/303-5918.3035942.
Iosup A, Hegeman T, Ngai W L, Heldens S, Prat-Pérez A, Manhardt T, Chafi H, Capotă M, Sundaram N, Anderson M J, Tanase I G, Xia Y, Nai L, Boncz P A. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. Proceedings of the VLDB Endowment, 2016, 9(13): 1317-1328. https://doi.org/10.14778/3007263.3007270.
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin M J, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. the 9th USENIX Symposium on Networked Systems Design and Implementation, April 2012, pp.15-28.
Zhou D, Niu S, Chen S. Efficient graph computation for Node2Vec. arXiv:1805.00280, 2018. https://arxiv.org/abs/1805.00280, March 2021.
Andersen R, Chung F R K, Lang K J. Local graph partitioning using PageRank vectors. In Proc. the 47th Annual IEEE Symposium on Foundations of Computer Science, October 2006, pp.475-486. https://doi.org/10.1109/FOCS.2006.44.
Yang K, Zhang M, Chen K, Ma X, Bai Y, Jiang Y. KnightKing: A fast distributed graph random walk engine. In Proc. the 27th ACM Symposium on Operating Systems Principles, October 2019, pp.524-537. https://doi.org/10.1145/334-1301.3359634.
Vose M D. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Software Eng., 1991, 17(9): 972-975. https://doi.org/10.1109/32.92917.
Niu S, Chen S. Optimizing CPU cache performance for Pregel-like graph computation. In Proc. the 31st IEEE International Conference on Data Engineering Workshops, April 2015, pp.149-154. https://doi.org/10.1109/ICDE-W.2015.7129568.
Yang J, Leskovec J. Defining and evaluating network communities based on ground-truth. In Proc. the 12th IEEE International Conference on Data Mining, December 2012, pp.745-754. https://doi.org/10.1109/ICDM.2012.138.
Boldi P, Vigna S. The WebGraph framework I: Compression techniques. In Proc. the 13th International World Wide Web Conference, May 2004, pp.595-601. https://doi.org/10.1145/988672.988752.
Chakrabarti D, Zhan Y, Faloutsos C. R-MAT: A recursive model for graph mining. In Proc. the 4th SIAM International Conference on Data Mining, April 2004, pp.442-446. https://doi.org/10.1137/1.9781611972740.43.
Park H, Kim M. TrillionG: A trillion-scale synthetic graph generator using a recursive vector model. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.913-928. https://doi.org/10.1145/3035918.3064014.
Author information
Authors and Affiliations
Corresponding author
Supplementary Information
ESM 1
(PDF 151 kb)
Rights and permissions
About this article
Cite this article
Niu, S., Zhou, D. SOOP: Efficient Distributed Graph Computation Supporting Second-Order Random Walks. J. Comput. Sci. Technol. 36, 985–1001 (2021). https://doi.org/10.1007/s11390-021-1234-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11390-021-1234-y